Tidy Data Principles

Posted in Python

Tidy Data

This post was inspired by Tidy Data, a paper written by Hadley Wickham. Even though the paper presents its examples in R, I took the concepts and followed along using Python's Pandas library (similar to Daniel Chen's PyData presentation on Tidying Data in Pandas).

Data tidying is a subset of the data cleaning step, where the goal is to get your data into a structure that is optimal for analysis and visualization. A dataset is considered "tidy" when it follows these three principles:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham gave a simple example of patient treatment data that depicts these principles.

name          treatmentA  treatmentB
John Smith    -           2
Jane Doe      16          11
Mary Johnson  3           1

To tidy the above dataset, the treatment columns should be recognized as values of one variable and combined into a single column. Now the data is represented with variables in columns and observations in rows.

name treatment result
John Smith a -
Jane Doe a 16
Mary Johnson a 3
John Smith b 2
Jane Doe b 11
Mary Johnson b 1
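As a quick sketch of that reshape in Pandas (the DataFrame below is recreated by hand from the table above, not loaded from one of this post's files):

```python
import pandas as pd

# Recreate the patient table by hand (NaN stands in for the '-' entry)
patients = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatmenta': [float('nan'), 16, 3],
    'treatmentb': [2, 11, 1],
})

# Collapse the two treatment columns into a single 'treatment' variable
tidy = patients.melt(id_vars='name', var_name='treatment', value_name='result')
tidy['treatment'] = tidy['treatment'].str.replace('treatment', '', regex=False)
print(tidy)
```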

More often than not, real world data violates the three "tidy data" principles. Wickham has identified five of the most common situations where data is considered untidy:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

I will go over a few of these scenarios and clean up the data using Python.

Let's get started!

In [1]:
import pandas as pd
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Column headers are values, not variable names

The first example of an untidy dataset is one whose column headers represent values rather than variable names. This is common in pivot tables and exhibits designed for presentations. Although such a view can be valuable, it is not an ideal shape from which to generate additional insights or visualizations. The dataset below depicts the relationship between religion and income in the US.

In [2]:
religion = pd.read_csv('../datasets/tidy-data-master/religion_inc.csv')
In [3]:
religion.head()
Out[3]:
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 75 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Unsure/refused 15 14 15 11 10 35

In order to clean this data, we need to transpose the income bins into rows. This can be done with Pandas' melt method. In our example we pass 'religion' to the id_vars parameter, which keeps that variable in column form, while the remaining columns are "melted" (pun intended) into a single column. Melt also has parameters (var_name and value_name) for renaming the newly created columns. With this, the data is now in a longer form better suited for analysis tools and further exploration.

In [4]:
religion_melt = religion.melt(id_vars='religion',var_name='income', value_name='count')
In [5]:
religion_melt.head()
Out[5]:
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Unsure/refused <$10k 15
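The same long shape can also be reached with set_index plus stack, which makes a handy cross-check on melt (the inline DataFrame below is a hand-typed subset of the religion table shown above):

```python
import pandas as pd

# Hand-typed subset of the religion/income table from above
religion = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# set_index + stack produces the same long layout as melt,
# though the rows come out grouped by religion rather than by income bin
stacked = (religion.set_index('religion')
                   .stack()
                   .reset_index())
stacked.columns = ['religion', 'income', 'count']
print(stacked)
```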

2. Multiple variables stored in one column

The next scenario of a messy dataset is when multiple variables are stored in a single column. This can be seen in our next example of tuberculosis cases where the columns represent both gender and age ranges (m/f = Male/Female, age ranges are 0-4, 5-14, 15-24, etc.).

In [6]:
tb = pd.read_csv('../datasets/tidy-data-master/tb.csv')
In [7]:
tb.head()
Out[7]:
iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 ... new_sp_f04 new_sp_f514 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554 new_sp_f5564 new_sp_f65 new_sp_fu
0 AD 1989 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 AD 1990 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 AD 1991 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 AD 1992 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 AD 1993 15.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 23 columns

To start, we'll do some initial cleaning work to get rid of unwanted columns using the drop function and assigning better column names with rename.

In [8]:
tb.drop(columns=['new_sp','new_sp_mu','new_sp_m04','new_sp_m514','new_sp_f04','new_sp_f514','new_sp_fu'],inplace=True)
In [9]:
tb.rename(columns={'iso2':'country'},inplace=True)

Next we will use melt in a similar way as before to reshape the data into a longer form.

In [10]:
tb_melt = tb.melt(id_vars = ['country','year'],var_name='column', value_name='cases')
In [11]:
tb_melt.head()
Out[11]:
country year column cases
0 AD 1989 new_sp_m014 NaN
1 AD 1990 new_sp_m014 NaN
2 AD 1991 new_sp_m014 NaN
3 AD 1992 new_sp_m014 NaN
4 AD 1993 new_sp_m014 NaN

So that handles converting the variable into a column; now we need to split the gender out of each record. We can strip the "new_sp_" prefix from the string, then use the str[0] accessor to take the first character and assign it to a new "sex" column. The remaining piece of the string can be accessed with str[1:] and assigned to its own "age" column. We no longer need the combined sex/age column, so we can drop it as well.

In [12]:
tb_melt.column = tb_melt.column.str.replace('new_sp_','')
In [13]:
tb_melt['sex'] = tb_melt['column'].str[0]
tb_melt['age'] = tb_melt['column'].str[1:]
tb_melt.drop(columns='column',inplace=True)
tb_melt.head()
Out[13]:
country year cases sex age
0 AD 1989 NaN m 014
1 AD 1990 NaN m 014
2 AD 1991 NaN m 014
3 AD 1992 NaN m 014
4 AD 1993 NaN m 014
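An alternative to the two str slices is a single regular-expression extract. This is a sketch on a toy Series that follows the same new_sp_ naming scheme as the columns above:

```python
import pandas as pd

# Toy column values following the tb dataset's naming scheme
col = pd.Series(['new_sp_m014', 'new_sp_f1524', 'new_sp_m65'])

# One regex pulls the sex (m/f) and the age code out in a single pass:
# each capture group becomes a column of the resulting DataFrame
parts = col.str.extract(r'new_sp_([mf])(\d+)')
parts.columns = ['sex', 'age']
print(parts)
```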

This step is optional, but I wrote a function that evaluates the length of each age bucket and inserts a hyphen where appropriate. For example, it turns 014 into 0-14 and 1524 into 15-24.

In [14]:
def add_hyphen(row):
    if row=='65':
        return '65+'
    elif len(row)==2:
        return f'{row[0]}-{row[1]}'
    elif len(row)==3:
        return f'{row[0]}-{row[1:3]}'
    else:
        return f'{row[0:2]}-{row[2:4]}'
In [15]:
tb_melt['age']=tb_melt['age'].apply(add_hyphen)
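Since there are only a handful of age codes in this dataset, a plain lookup dictionary with Series.map is an equally valid (and arguably more readable) alternative to the function above; the mapping below covers the codes that remain after the earlier column drops:

```python
import pandas as pd

# Explicit mapping from the dataset's age codes to labeled buckets
age_labels = {
    '014': '0-14', '1524': '15-24', '2534': '25-34',
    '3544': '35-44', '4554': '45-54', '5564': '55-64', '65': '65+',
}

# map() replaces each code with its label in one vectorized pass
ages = pd.Series(['014', '1524', '65'])
labeled = ages.map(age_labels)
print(labeled)
```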

...and with some finishing touches of reordering the columns, filling NaNs with 0, and converting the cases column to int, there you have it. This dataset is now straightened up and ready for analysis.

In [16]:
col_order=['country','year','sex','age','cases']

tb_melt=tb_melt[col_order]

tb_melt = tb_melt.fillna(0)
In [17]:
tb_melt['cases'] = tb_melt['cases'].astype('int')
In [18]:
tb_melt.head()
Out[18]:
country year sex age cases
0 AD 1989 m 0-14 0
1 AD 1990 m 0-14 0
2 AD 1991 m 0-14 0
3 AD 1992 m 0-14 0
4 AD 1993 m 0-14 0

3. Variables are stored in both rows and columns

The final messy dataset scenario I'm going to explore is when variables are stored in both rows and columns. I'll be looking at weather data which shows min and max temperatures by year/month/day.

In [19]:
weather = pd.read_csv('../datasets/tidy-data-master/weather.csv')
In [20]:
weather.head()
Out[20]:
id year month element d1 d2 d3 d4 d5 d6 ... d22 d23 d24 d25 d26 d27 d28 d29 d30 d31
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 27.8 NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 14.5 NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN ... NaN 29.9 NaN NaN NaN NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN ... NaN 10.7 NaN NaN NaN NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 35 columns

In this case, the day columns should be recognized as one variable and stored in a single column, while the "element" column holds two separate variables (tmin and tmax) that should be split into two columns. To start I will again use the melt function to transpose the day numbers into column form. After that, I create a new 'date' column by stripping the 'd' prefix from the day values and passing the year/month/day columns to Pandas' to_datetime method.

In [21]:
weather_melt = weather.melt(id_vars=['id','year','month','element'],var_name='day',value_name='temp')
In [22]:
weather_melt['day'] = weather_melt['day'].str[1:] # strip the leading 'd', keeping all of the day digits
In [23]:
weather_melt.head()
Out[23]:
id year month element day temp
0 MX17004 2010 1 tmax 1 NaN
1 MX17004 2010 1 tmin 1 NaN
2 MX17004 2010 2 tmax 1 NaN
3 MX17004 2010 2 tmin 1 NaN
4 MX17004 2010 3 tmax 1 NaN
In [24]:
weather_melt['date'] = pd.to_datetime(weather_melt[['year','month','day']])
In [25]:
weather_melt.head()
Out[25]:
id year month element day temp date
0 MX17004 2010 1 tmax 1 NaN 2010-01-01
1 MX17004 2010 1 tmin 1 NaN 2010-01-01
2 MX17004 2010 2 tmax 1 NaN 2010-02-01
3 MX17004 2010 2 tmin 1 NaN 2010-02-01
4 MX17004 2010 3 tmax 1 NaN 2010-03-01
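As a standalone illustration of assembling dates this way (toy values, not the weather data itself): to_datetime accepts a DataFrame with year/month/day columns and builds a date per row, coercing digit strings along the way.

```python
import pandas as pd

# Toy year/month/day columns; 'day' is left as strings,
# mirroring what the str accessor produces above
parts = pd.DataFrame({'year': [2010, 2010], 'month': [1, 2], 'day': ['30', '2']})
dates = pd.to_datetime(parts)
print(dates)
```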
In [26]:
# cleaning up columns no longer needed
weather_melt.drop(columns=['year','month','day'],inplace=True)
weather_melt = weather_melt.dropna()
In [27]:
weather_melt.head()
Out[27]:
id element temp date
20 MX17004 tmax 29.9 2010-12-01
21 MX17004 tmin 13.8 2010-12-01
24 MX17004 tmax 27.3 2010-02-02
25 MX17004 tmin 14.4 2010-02-02
40 MX17004 tmax 31.3 2010-11-02

Now for handling the element column, whose tmax and tmin values should each get their own column. Here I will use the pivot_table method, passing columns 'id' and 'date' as the index, 'element' to the columns parameter, and 'temp' to the values parameter. Since I don't want to leave the data in a multi-indexed pivot table, I can use reset_index() to format it back into a regular DataFrame. And with that, dataset three is now tidy.

In [28]:
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp')
weather_melt_pvt.head()
Out[28]:
element tmax tmin
id date
MX17004 2010-01-30 27.8 14.5
2010-02-02 27.3 14.4
2010-02-03 24.1 14.4
2010-02-11 29.7 13.4
2010-02-23 29.9 10.7
In [29]:
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp').reset_index().rename_axis(None,axis=1)
In [30]:
weather_melt_pvt.head(10)
Out[30]:
id date tmax tmin
0 MX17004 2010-01-30 27.8 14.5
1 MX17004 2010-02-02 27.3 14.4
2 MX17004 2010-02-03 24.1 14.4
3 MX17004 2010-02-11 29.7 13.4
4 MX17004 2010-02-23 29.9 10.7
5 MX17004 2010-03-05 32.1 14.2
6 MX17004 2010-03-10 34.5 16.8
7 MX17004 2010-03-16 31.1 17.6
8 MX17004 2010-04-27 36.3 16.7
9 MX17004 2010-05-27 33.2 18.2
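pivot_table's cousin unstack reaches the same wide shape; here is a sketch on toy rows shaped like weather_melt (the two dates and temperatures are hand-copied from the weather table above):

```python
import pandas as pd

# Toy long-form rows mirroring weather_melt's columns
long = pd.DataFrame({
    'id': ['MX17004'] * 4,
    'date': pd.to_datetime(['2010-01-30', '2010-01-30',
                            '2010-02-02', '2010-02-02']),
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'temp': [27.8, 14.5, 27.3, 14.4],
})

# Move 'element' into the columns via unstack instead of pivot_table,
# then flatten the MultiIndex back to regular columns
wide = (long.set_index(['id', 'date', 'element'])['temp']
            .unstack()
            .reset_index()
            .rename_axis(None, axis=1))
print(wide)
```

Unlike pivot_table, unstack does not aggregate, so it raises an error on duplicate (id, date, element) combinations rather than silently averaging them.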

Conclusion

Prepping your data in a tidy, well-structured manner is extremely important for the later steps of any data analytics or data science project. In this post I explored some ways to reshape untidy data using Pandas' melt, pivot_table, and string (str accessor) functions. Creating a tidy dataset is only one part of the overall data cleaning process, so our data wrangling adventures are just beginning. Thanks for reading!
