Tidy Data Principles
Posted in Python
Tidy Data
This post was inspired by Tidy Data, a paper written by Hadley Wickham. Even though the paper presents its examples in R, I took the concepts and followed along using Python's pandas library (similar to Daniel Chen's PyData presentation on Tidying Data in Pandas).
Data tidying is a subset of the data cleaning step, where the goal is to get your data into a structure that is optimal for analysis and visualization. A dataset is considered "tidy" when it follows these three principles:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Wickham gave a simple example of patient treatment data that depicts these principles.
| name | treatmentA | treatmentB |
|---|---|---|
| John Smith | - | 2 |
| Jane Doe | 16 | 11 |
| Mary Johnson | 3 | 1 |
To tidy the above dataset, the two treatment columns should be treated as values of a single variable and combined into one column. Now the data is represented with variables in columns and observations in rows.
| name | treatment | result |
|---|---|---|
| John Smith | a | - |
| Jane Doe | a | 16 |
| Mary Johnson | a | 3 |
| John Smith | b | 2 |
| Jane Doe | b | 11 |
| Mary Johnson | b | 1 |
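This reshaping can be sketched in pandas with the melt function (the table above rebuilt inline; the "-" entry becomes a missing value):

```python
import pandas as pd

# the messy patient-treatment table, built inline
messy = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatmenta': [None, 16, 3],   # '-' in the original table is a missing value
    'treatmentb': [2, 11, 1],
})

# melt keeps 'name' as a column and stacks the treatment columns into rows
tidy = messy.melt(id_vars='name', var_name='treatment', value_name='result')
# strip the 'treatment' prefix so the column holds just 'a' or 'b'
tidy['treatment'] = tidy['treatment'].str.replace('treatment', '', regex=False)
print(tidy)
```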
More often than not, real-world data violates the three "tidy data" principles. Wickham identified five of the most common situations in which data is considered untidy:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
I will go over a few of these scenarios and clean up the data using Python.
Let's get started!
import pandas as pd
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
1. Column headers are values, not variable names
The first example of an untidy dataset is one whose column headers represent values rather than variable names. This is common in pivot tables and exhibits designed for presentations. Although this can be a valuable view, it is not an ideal shape for generating additional insights or visualizations. The dataset below depicts the relationship between religion and income in the US.
religion = pd.read_csv('../datasets/tidy-data-master/religion_inc.csv')
religion.head()
In order to clean this data, we need to transpose the income bins into rows. This can be done with pandas' melt function. In our example we pass 'religion' to the id_vars parameter, which keeps that variable as a column, while the remaining columns are "melted" (pun intended) into a single column. melt also has parameters (var_name and value_name) for naming the newly created columns. With this, the data is now in a longer form and better suited for analysis tools and further exploration.
religion_melt = religion.melt(id_vars='religion',var_name='income', value_name='count')
religion_melt.head()
2. Multiple variables stored in one column
The next scenario of a messy dataset is when multiple variables are stored in a single column. This can be seen in our next example of tuberculosis cases where the columns represent both gender and age ranges (m/f = Male/Female, age ranges are 0-4, 5-14, 15-24, etc.).
tb = pd.read_csv('../datasets/tidy-data-master/tb.csv')
tb.head()
To start, we'll do some initial cleaning work to get rid of unwanted columns using the drop function and assigning better column names with rename.
tb.drop(columns=['new_sp','new_sp_mu','new_sp_m04','new_sp_m514','new_sp_f04','new_sp_f514','new_sp_fu'],inplace=True)
tb.rename(columns={'iso2':'country'},inplace=True)
Next we will use melt in a similar way as before to reshape the data into a longer form.
tb_melt = tb.melt(id_vars = ['country','year'],var_name='column', value_name='cases')
tb_melt.head()
That handles converting the column headers into a variable; now we need to split the gender out of each record. We can strip the "new_sp_" prefix from the string, use str[0] to access the first character, and assign it to a new column called "sex". The remaining piece of the string can be accessed with str[1:] and assigned to its own "age" column. We no longer need the combined sex/age column, so we can drop it as well.
tb_melt.column = tb_melt.column.str.replace('new_sp_','')
tb_melt['sex'] = tb_melt['column'].str[0]
tb_melt['age'] = tb_melt['column'].str[1:]
tb_melt.drop(columns='column',inplace=True)
tb_melt.head()
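The same string operations can be sketched on a toy Series with values in the shape of the TB column names:

```python
import pandas as pd

# toy values mimicking the melted TB column names
col = pd.Series(['new_sp_m014', 'new_sp_f1524', 'new_sp_m65'])

# strip the literal prefix, then slice with the .str accessor
col = col.str.replace('new_sp_', '', regex=False)
sex = col.str[0]    # first character: 'm' or 'f'
age = col.str[1:]   # remainder: the age bucket

print(list(sex))  # ['m', 'f', 'm']
print(list(age))  # ['014', '1524', '65']
```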
This step is optional, but I wrote a function that evaluates the length of each age bucket and places a hyphen where appropriate. For example, it turns 014 into 0-14 and 1524 into 15-24.
def add_hyphen(row):
    if row == '65':
        return '65+'
    elif len(row) == 2:
        return f'{row[0]}-{row[1]}'
    elif len(row) == 3:
        return f'{row[0]}-{row[1:3]}'
    else:
        return f'{row[0:2]}-{row[2:4]}'
tb_melt['age']=tb_melt['age'].apply(add_hyphen)
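A quick standalone check of the function against the bucket formats that appear in the data:

```python
def add_hyphen(row):
    # same logic as above: split the bucket into lower and upper bound
    if row == '65':
        return '65+'
    elif len(row) == 2:
        return f'{row[0]}-{row[1]}'
    elif len(row) == 3:
        return f'{row[0]}-{row[1:3]}'
    else:
        return f'{row[0:2]}-{row[2:4]}'

for bucket in ['04', '514', '014', '1524', '65']:
    print(bucket, '->', add_hyphen(bucket))
# 04 -> 0-4, 514 -> 5-14, 014 -> 0-14, 1524 -> 15-24, 65 -> 65+
```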
...and some finishing touches: reordering the columns, filling NaNs with 0, and converting the cases column to int. And there you have it; this dataset is now straightened up and ready to move on to greener analysis.
col_order=['country','year','sex','age','cases']
tb_melt=tb_melt[col_order]
tb_melt = tb_melt.fillna(0)
tb_melt['cases'] = tb_melt['cases'].astype('int')
tb_melt.head()
3. Variables are stored in both rows and columns
The final messy dataset scenario I'm going to explore is when variables are stored in both rows and columns. I'll be looking at weather data which shows min and max temperatures by year/month/day.
weather = pd.read_csv('../datasets/tidy-data-master/weather.csv')
weather.head()
In this case, the day columns should be recognized as one variable and stored in a single column, while the "element" column contains two separate variables (tmin and tmax) that should be split into their own columns. To start, I will again use the melt function to transpose the day numbers into column form. After that, I create a new 'date' column by stripping the 'd' prefix, converting the day to an integer, and using pandas' to_datetime function on the year/month/day columns.
weather_melt = weather.melt(id_vars=['id','year','month','element'],var_name='day',value_name='temp')
# strip the leading 'd' from the day column names and convert to int
weather_melt['day'] = weather_melt['day'].str[1:].astype(int)
weather_melt.head()
# errors='coerce' turns impossible dates (e.g. day 31 in a 30-day month) into NaT
weather_melt['date'] = pd.to_datetime(weather_melt[['year','month','day']], errors='coerce')
weather_melt.head()
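The date assembly can be sketched on a minimal frame (values are synthetic); passing errors='coerce' to to_datetime is my own addition to guard against the impossible day/month combinations that melting every d1-d31 column produces:

```python
import pandas as pd

# minimal frame mimicking the melted weather data (synthetic values)
df = pd.DataFrame({'year': [2010, 2010], 'month': [1, 2], 'day': ['d30', 'd30']})

# strip the leading 'd' and convert to int
df['day'] = df['day'].str[1:].astype(int)

# to_datetime assembles dates from columns named year/month/day;
# errors='coerce' turns impossible combinations (Feb 30) into NaT
df['date'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
print(df['date'].tolist())  # [Timestamp('2010-01-30 00:00:00'), NaT]
```

The NaT rows can then be dropped along with the missing temperatures, as the dropna step below does.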
# cleaning up columns no longer needed
weather_melt.drop(columns=['year','month','day'],inplace=True)
weather_melt = weather_melt.dropna()
weather_melt.head()
Now to handle the element column, whose tmax and tmin variables should have their own columns. Here I will use the pivot_table method, passing columns 'id' and 'date' as the index, 'element' to the columns parameter, and 'temp' to the values parameter. Since I don't want to leave the data in a multi-indexed pivot table, I can use reset_index() to format it back into a regular DataFrame. And with that, dataset three is now tidy.
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp')
weather_melt_pvt.head()
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp',aggfunc='sum').reset_index().rename_axis(None,axis=1)
weather_melt_pvt.head(10)
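The pivot-and-flatten step can be sketched end to end on a small synthetic long-form frame (station id and temperatures are made up for illustration):

```python
import pandas as pd

# long-form frame mimicking the melted weather data (synthetic values)
long = pd.DataFrame({
    'id': ['MX17004'] * 4,
    'date': pd.to_datetime(['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02']),
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'temp': [27.8, 14.5, 27.3, 14.4],
})

# pivot the element values into their own columns, then flatten the
# MultiIndex back into regular columns and drop the 'element' axis label
wide = (long.pivot_table(index=['id', 'date'], columns='element', values='temp')
            .reset_index()
            .rename_axis(None, axis=1))
print(list(wide.columns))  # ['id', 'date', 'tmax', 'tmin']
```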
Conclusion
Prepping your data in a tidy, well-structured manner is extremely important for the later steps of any Data Analytics/Data Science project. In this post I explored some ways to reshape untidy data using pandas' melt, pivot_table, and string (str accessor) functions. Creating a tidy dataset is only part of the overall process of data cleaning, so our data wrangling adventures are just beginning. Thanks for reading!