As Told by a Nonprofit Data Wrangler

Data wrangling is the conventional term for the process of transforming raw data into a usable form. Suppose your database stored date records as “YYYY-MM-DD.” If your analysis focused on the calendar month, extracting “MM” (and maybe converting it to an actual name) might be part of your analytical workflow. Though it hardly captures the whole practice, that’s an easy example of data wrangling.
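To make that concrete, here’s a minimal sketch in pandas, the Python data analysis library that appears later in this piece. The table and column names are invented for illustration:

```python
import pandas as pd

# A toy table with dates stored as "YYYY-MM-DD" strings.
df = pd.DataFrame({"enrollment_date": ["2024-01-15", "2024-03-02"]})

# Extract the month and convert it to an actual name.
df["month"] = pd.to_datetime(df["enrollment_date"]).dt.month_name()
print(df)
```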

Many data analysts—in the interest of showing off their competency and technical prowess—paint the process as elegant, smooth, and consistent. Of course, those are the qualities stakeholders desire for the end product. In fast-paced, evidence-based settings, insights and reporting should not only be actionable, usable, and sustainable, but they should also come in pretty packaging.

"...data wrangling can be a fairly indirect process that lands you somewhere you don’t expect. The same can be said of your early professional journey."

The spaghetti code amassed between their PostgreSQL, pandas, and Matplotlib workflows be damned! Those data scientists can make a good visualization.

As a junior data analyst over a year into the field, I take a lot of pride in producing accessible insights for dozens of different stakeholder groups. In the first half of my fellowship, this ranged from facilitating market-wide data democratization workshops to running social work-focused data trainings to building the ever-cliché Power BI dashboard. Some of this data comes from my original consumer research with participants, but the bulk of it comes from a highly mature and curated system of data collection at Year Up United. If the first year was spent exploring, the second year has so far been spent taking action.

Thus, it becomes a game of making all that raw data usable for study. According to my coworkers, it’s a game I’m good at. “Avery’s the guy for all things data,” some say! I’d like to say the game’s always easy, too. Yet Truth would come out of her well to shame this consumer insights analyst.

Frequently in my work, Truth does not come out of the data well on her own; given the usual conditions of organizational data storage, she can’t. Unconventional formatting, inconsistent styles of data entry, and deprecated categories don’t make data wrangling easier. Rather, data specialists come at Truth, buried somewhere in over a hundred relational tables, with mining explosives. And though some data professionals will tell you that Excel is archaic and limited, sometimes Truth doesn’t begin her climb out of the well until you break out the PivotTables.

Overwrought metaphors aside, the following is a zero-to-one example of my own analytical workflow, going from an ad hoc stakeholder request to an end-product visualization. Keep in mind that this can vary from project to project. This account is slightly fictionalized! The following data, though, is all real:

1.

I get a request to pull and process data from a stakeholder. This might be someone higher up in my department (in this case, the Director of Program), or it might be a stakeholder from another functional team (like the Director of Student Services, the Site Director, or even a national Growth & Strategy Lead). Here, Wilfrid Velazquez wants to explore some data regarding “firings” for a debrief meeting. “Firings” are Year Up United’s historical term for attrition: a participant leaving the program, whether willingly or by regulatory necessity. For the purposes of this demonstration, we’ll wrangle only two selections of data into a single visualization.

2.

I write a SQL query to pull the data from the relevant relational database. In this case, I’m using SOQL (Salesforce’s proprietary, SQL-like query language) to rapidly export a table from Salesforce, a CRM (customer relationship management) platform. I’m specifically looking for participants who have left the program; that is, those who count as attrition.
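The original query isn’t shown here, but a rough sketch of such a pull, written in Python with the simple_salesforce library, might look like the following. Every object and field name except Status__c is a stand-in, and the credentials are deliberately elided:

```python
from simple_salesforce import Salesforce

# Credentials elided; fill in your own org's details.
sf = Salesforce(username="...", password="...", security_token="...")

# SOQL reads a lot like SQL, except you name every field explicitly.
# The Status__c filter limits the pull to attrition records.
results = sf.query_all(
    "SELECT Id, Age_at_Enrollment__c, Attrition_Reason__c "
    "FROM Contact "
    "WHERE Status__c = 'Fired'"
)
```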

3.

I export the resulting records from this query into a usable format. (Note: I’ve removed the confidential, identifying information that’s important to my typical workflow.)
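Continuing the sketch above, pandas can flatten the returned records and write them to disk; the filename is arbitrary:

```python
import pandas as pd

# Each record carries a Salesforce metadata column ("attributes")
# that isn't needed for analysis.
attrition = pd.DataFrame(results["records"]).drop(columns="attributes")
attrition.to_csv("attrition.csv", index=False)
```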

4.

I load the resulting file (in this case a CSV, a comma-separated values file) into a data manipulation tool. For many, this can be as simple as plopping it into Excel! That is a totally valid avenue of exploration. In this specific case, however, I’m using pandas, a data analysis library written in Python (a general-purpose programming language), inside the editor VS Code. If you have the prerequisite knowledge, you can perform the same actions as in Excel with extreme efficiency and relative ease. And all these resources are free!
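Loading the file is a one-liner; the filename is the stand-in from the sketch above:

```python
import pandas as pd

# Read the export back in and eyeball the first few rows.
attrition = pd.read_csv("attrition.csv")
print(attrition.head())
```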

5.

Some exploratory data analysis is good for you. In this case, I’ve calculated the relative percentages of Attrition Reasons (referred to here as “Reasons for Status Change” and historically referred to as “Firings”). These numbers should add up to approximately 100: that is, a whole population.
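A sketch of that calculation, with a stand-in column name:

```python
# Share of each attrition reason, as a percentage of all departures.
reason_pct = attrition["Attrition_Reason__c"].value_counts(normalize=True) * 100
print(reason_pct.round(1))  # sums to roughly 100 after rounding
```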

6.

It’s become an organizational priority to investigate how our programming may impact different demographics, including age. This analysis counts how many participants of a particular age ultimately left the program. You can definitely see the higher numbers here, but it might be a little difficult to tell what’s happening from raw counts alone.
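In pandas terms, that tally might look like this (column name assumed, as before):

```python
# Number of departures at each age at enrollment.
age_counts = attrition["Age_at_Enrollment__c"].value_counts().sort_index()
print(age_counts)
```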

7.

Woah! It looks like 19-year-olds take the lead in attrition. It’s much easier to discern from this bar graph. In a statistics course, we might describe this distribution as a bit “right-skewed”: attritions lean toward younger participants!
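A chart like this one takes only a few lines, since pandas plotting wraps Matplotlib:

```python
import matplotlib.pyplot as plt

# Bar chart of departures by age; labels are illustrative.
age_counts.plot(kind="bar")
plt.xlabel("Age at enrollment")
plt.ylabel("Departures")
plt.title("Attrition by age")
plt.show()
```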

8.

That’s all good, but let’s see how this stacks up against the general population’s age distribution. That is: maybe this is just proportional to how many 19-year-olds were in this particular cohort? Let’s export another CSV using SQL, this time covering the entire population regardless of program status. The difference here is that I haven’t filtered on “Status__c = ‘Fired’”, which would otherwise limit me to attrition data. We’re no longer looking at a specific sample: we’re looking at the true population.
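Continuing the earlier sketch, the only change is dropping the WHERE clause:

```python
# Same pull, minus the attrition filter, so every participant is included.
everyone = sf.query_all(
    "SELECT Id, Age_at_Enrollment__c, Attrition_Reason__c FROM Contact"
)
pd.DataFrame(everyone["records"]).drop(columns="attributes").to_csv(
    "population.csv", index=False
)
```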

9.

I’ve now loaded the CSV into my Python environment. You might be curious what those “NaN” values mean for the general population. In its most literal sense, NaN means “Not a Number.” Contextually, it’s a “null” value: there’s absolutely nothing there. It’s an utter blank space where data could go. There isn’t any attrition data associated with that particular individual’s file because the individual completed the program in full!
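You can see the same thing programmatically; completers simply have empty cells where an attrition reason would go, which pandas reads in as NaN:

```python
population = pd.read_csv("population.csv")

# Count the rows with no attrition reason: the program completers.
print(population["Attrition_Reason__c"].isna().sum(), "rows with no attrition data")
```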

10.

I’ll use the pandas library’s native support for Matplotlib, just as I did for the attrition data. Huh, it seems like there’s an entirely different distribution, with the exception of a single (incorrectly entered) outlier calculated as 0 years old at enrollment!
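The plotting code mirrors the attrition chart:

```python
# Bar chart of the whole population by age at enrollment.
pop_counts = population["Age_at_Enrollment__c"].value_counts().sort_index()
pop_counts.plot(kind="bar")
plt.xlabel("Age at enrollment")
plt.ylabel("Participants")
plt.title("General population by age")
plt.show()
```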

11.

We’re getting more advanced now. I’ve superimposed the age distribution of attrition against that of the general population. 19-year-olds are uniquely overrepresented here! I’ve normalized the y-axis: in statistics, that means converting raw counts into proportions of each group, so that two differently sized groups can be compared fairly. The practice in steps 8 to 11 is commonly referred to as “exploratory visualization.” I’m not making any definite claims yet, beyond an observable instance of correlation.
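One way to build such a comparison, continuing the sketch:

```python
# Normalize each distribution to proportions so group sizes don't mislead.
compare = pd.DataFrame({
    "Attrition": age_counts / age_counts.sum(),
    "General population": pop_counts / pop_counts.sum(),
}).fillna(0)

ax = compare.plot(kind="bar")
ax.set_xlabel("Age at enrollment")
ax.set_ylabel("Share of group")
ax.set_title("Attrition vs. general population, by age")
```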

12.

And now we export and save as a “.png” image file. It’s now time to send our end product to our stakeholder to see if they think this topic is worth further investigation. In a social impact context, such phenomena might indicate a gap in service or the presence of something particular to this age group. As in all statistical investigation, no matter how simple, it’s important to remember that correlation is not causation. Higher attrition correlates with 19-year-olds in this particular cohort: there might be any number of factors actually causing this. Still, it’s worth flagging for our stakeholder!
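Saving the figure is the last line of the sketch; the filename is, again, invented:

```python
# Write the finished chart out as a PNG for the stakeholder.
ax.figure.savefig("attrition_by_age.png", dpi=200, bbox_inches="tight")
```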

13. 

Don’t forget to tell your stakeholder what exactly they’re looking at if you ever try this yourself! I don’t consider reports—even if they’re small, ad hoc things—to be one-and-done deals. Be ready to explain your process and what the body of data you’re working with actually is. Soft skills are part of data analytics—there’s always going to be someone with a follow-up question.

As you might gather from this 13-step process, whether you’re a data layperson or not, data wrangling and ad hoc reporting can be a fairly indirect process that lands you somewhere you don’t expect. The same can be said of your early professional journey at any organization, whether for-profit or nonprofit, social impact-oriented or driven by sales. “Indirect” doesn’t have to mean bumpy, and many of the challenges along the way can be transformed into victories.

When I first came into my role at Year Up, I imagined myself as a predominantly qualitative researcher who could flex some mixed-methods muscle. Now, I understand myself as an adept data analyst ready to tackle both qualitative and quantitative data to extract whatever story they might tell. As I reflect on my options following my two-year opportunity with the FAO Schwarz Fellowship, I find myself saying what I’ve said before: I don’t just get one “year up”; the fellowship has given me two.

Avery Trinidad

Avery Trinidad (he/him) is the Research & Insights FAO Schwarz Fellow at Year Up in New York City.
