Tidy Data Principles

Posted in Python

Tidy Data

This post was inspired by Tidy Data, a paper written by Hadley Wickham. Even though the paper presents its examples in R, I took the concepts and followed along using Python's Pandas library (similar to Daniel Chen's PyData presentation on Tidying Data in Pandas).

Data tidying is a subset of the data cleaning step, where the goal is to get your data into a structure that is optimal for analysis and visualization. A dataset is considered "tidy" when it follows these three principles:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham gave a simple example of patient treatment data that depicts these principles.

name          treatmentA  treatmentB
John Smith    -           2
Jane Doe      16          11
Mary Johnson  3           1

To tidy the above dataset, the treatment columns should be recognized as values of one variable and combined into a single column. Now the data is represented with variables in columns and observations in rows.

name treatment result
John Smith a -
Jane Doe a 16
Mary Johnson a 3
John Smith b 2
Jane Doe b 11
Mary Johnson b 1
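As a quick sketch of that reshape in Pandas (the DataFrame below is recreated by hand from the table above, not loaded from one of this post's files):

```python
import pandas as pd

# Recreate the patient table by hand (NaN stands in for the '-' entry)
patients = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatmenta': [float('nan'), 16, 3],
    'treatmentb': [2, 11, 1],
})

# Collapse the two treatment columns into a single 'treatment' variable
tidy = patients.melt(id_vars='name', var_name='treatment', value_name='result')
tidy['treatment'] = tidy['treatment'].str.replace('treatment', '', regex=False)
print(tidy)
```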

More often than not, real world data violates the three "tidy data" principles. Wickham has identified five of the most common situations where data is considered untidy:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

I will go over a few of these scenarios and clean up the data using Python.

Let's get started!

In [1]:
import pandas as pd
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Column headers are values, not variable names

The first example of an untidy dataset is one whose column headers represent values rather than variable names. This is common in pivot tables and exhibits designed for presentations. Although such a view can be valuable, it is not an ideal shape from which to generate additional insights or visualizations. The dataset below depicts the relationship between religion and income in the US.

In [2]:
religion = pd.read_csv('../datasets/tidy-data-master/religion_inc.csv')
In [3]:
religion.head()
Out[3]:
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 75 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Unsure/refused 15 14 15 11 10 35

In order to clean this data, we need to transpose the income bins into rows. This can be done with Pandas' melt method. In our example we pass 'religion' to the id_vars parameter, which keeps that variable in column form, while the remaining columns are "melted" (pun intended) into a single column. Melt also has parameters (var_name and value_name) for renaming the newly created columns. With this, the data is now in a longer form better suited for analysis tools and further exploration.

In [4]:
religion_melt = religion.melt(id_vars='religion',var_name='income', value_name='count')
In [5]:
religion_melt.head()
Out[5]:
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Unsure/refused <$10k 15
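The same long shape can also be reached with set_index plus stack, which makes a handy cross-check on melt (the inline DataFrame below is a hand-typed subset of the religion table shown above):

```python
import pandas as pd

# Hand-typed subset of the religion/income table from above
religion = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# set_index + stack produces the same long layout as melt,
# though the rows come out grouped by religion rather than by income bin
stacked = (religion.set_index('religion')
                   .stack()
                   .reset_index())
stacked.columns = ['religion', 'income', 'count']
print(stacked)
```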

2. Multiple variables stored in one column

The next scenario of a messy dataset is when multiple variables are stored in a single column. This can be seen in our next example of tuberculosis cases where the columns represent both gender and age ranges (m/f = Male/Female, age ranges are 0-4, 5-14, 15-24, etc.).

In [6]:
tb = pd.read_csv('../datasets/tidy-data-master/tb.csv')
In [7]:
tb.head()
Out[7]:
iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 ... new_sp_f04 new_sp_f514 new_sp_f014 new_sp_f1524 new_sp_f2534 new_sp_f3544 new_sp_f4554 new_sp_f5564 new_sp_f65 new_sp_fu
0 AD 1989 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 AD 1990 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 AD 1991 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 AD 1992 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 AD 1993 15.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 23 columns

To start, we'll do some initial cleaning work to get rid of unwanted columns using the drop function and assigning better column names with rename.

In [8]:
tb.drop(columns=['new_sp','new_sp_mu','new_sp_m04','new_sp_m514','new_sp_f04','new_sp_f514','new_sp_fu'],inplace=True)
In [9]:
tb.rename(columns={'iso2':'country'},inplace=True)

Next we will use melt in a similar way as before to reshape the data into a longer form.

In [10]:
tb_melt = tb.melt(id_vars = ['country','year'],var_name='column', value_name='cases')
In [11]:
tb_melt.head()
Out[11]:
country year column cases
0 AD 1989 new_sp_m014 NaN
1 AD 1990 new_sp_m014 NaN
2 AD 1991 new_sp_m014 NaN
3 AD 1992 new_sp_m014 NaN
4 AD 1993 new_sp_m014 NaN

So that handles converting the variable into a column; now we need to split the gender out of each record. We can strip the "new_sp_" prefix from the string, then use the str[0] accessor to take the first character and assign it to a new "sex" column. The remaining piece of the string can be accessed with str[1:] and assigned to its own "age" column. We no longer need the combined sex/age column, so we can drop it as well.

In [12]:
tb_melt.column = tb_melt.column.str.replace('new_sp_','')
In [13]:
tb_melt['sex'] = tb_melt['column'].str[0]
tb_melt['age'] = tb_melt['column'].str[1:]
tb_melt.drop(columns='column',inplace=True)
tb_melt.head()
Out[13]:
country year cases sex age
0 AD 1989 NaN m 014
1 AD 1990 NaN m 014
2 AD 1991 NaN m 014
3 AD 1992 NaN m 014
4 AD 1993 NaN m 014
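An alternative to the two str slices is a single regular-expression extract. This is a sketch on a toy Series that follows the same new_sp_ naming scheme as the columns above:

```python
import pandas as pd

# Toy column values following the tb dataset's naming scheme
col = pd.Series(['new_sp_m014', 'new_sp_f1524', 'new_sp_m65'])

# One regex pulls the sex (m/f) and the age code out in a single pass:
# each capture group becomes a column of the resulting DataFrame
parts = col.str.extract(r'new_sp_([mf])(\d+)')
parts.columns = ['sex', 'age']
print(parts)
```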

This step is optional, but I wrote a function that evaluates the length of each age bucket and inserts a hyphen where appropriate. For example, it turns 014 into 0-14 and 1524 into 15-24.

In [14]:
def add_hyphen(row):
    if row=='65':
        return '65+'
    elif len(row)==2:
        return f'{row[0]}-{row[1]}'
    elif len(row)==3:
        return f'{row[0]}-{row[1:3]}'
    else:
        return f'{row[0:2]}-{row[2:4]}'
In [15]:
tb_melt['age']=tb_melt['age'].apply(add_hyphen)
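Since there are only a handful of age codes in this dataset, a plain lookup dictionary with Series.map is an equally valid (and arguably more readable) alternative to the function above; the mapping below covers the codes that remain after the earlier column drops:

```python
import pandas as pd

# Explicit mapping from the dataset's age codes to labeled buckets
age_labels = {
    '014': '0-14', '1524': '15-24', '2534': '25-34',
    '3544': '35-44', '4554': '45-54', '5564': '55-64', '65': '65+',
}

# map() replaces each code with its label in one vectorized pass
ages = pd.Series(['014', '1524', '65'])
labeled = ages.map(age_labels)
print(labeled)
```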

...and with some finishing touches of reordering the columns, filling NaNs with 0, and converting the cases column to int, there you have it. This dataset is now straightened up and ready for analysis.

In [16]:
col_order=['country','year','sex','age','cases']

tb_melt=tb_melt[col_order]

tb_melt = tb_melt.fillna(0)
In [17]:
tb_melt['cases'] = tb_melt['cases'].astype('int')
In [18]:
tb_melt.head()
Out[18]:
country year sex age cases
0 AD 1989 m 0-14 0
1 AD 1990 m 0-14 0
2 AD 1991 m 0-14 0
3 AD 1992 m 0-14 0
4 AD 1993 m 0-14 0

3. Variables are stored in both rows and columns

The final messy dataset scenario I'm going to explore is when variables are stored in both rows and columns. I'll be looking at weather data which shows min and max temperatures by year/month/day.

In [19]:
weather = pd.read_csv('../datasets/tidy-data-master/weather.csv')
In [20]:
weather.head()
Out[20]:
id year month element d1 d2 d3 d4 d5 d6 ... d22 d23 d24 d25 d26 d27 d28 d29 d30 d31
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 27.8 NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 14.5 NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN ... NaN 29.9 NaN NaN NaN NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN ... NaN 10.7 NaN NaN NaN NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 35 columns

In this case, the day columns should be recognized as one variable and stored in a single column, while the "element" column holds two separate variables (tmin and tmax) that should be split into two columns. To start I will again use the melt function to transpose the day numbers into column form. After that, I create a new 'date' column by stripping the 'd' prefix from the day values and passing the year/month/day columns to Pandas' to_datetime method.

In [21]:
weather_melt = weather.melt(id_vars=['id','year','month','element'],var_name='day',value_name='temp')
In [22]:
weather_melt['day'] = weather_melt['day'].str[1:] # strip the leading 'd', keeping all of the day digits
In [23]:
weather_melt.head()
Out[23]:
id year month element day temp
0 MX17004 2010 1 tmax 1 NaN
1 MX17004 2010 1 tmin 1 NaN
2 MX17004 2010 2 tmax 1 NaN
3 MX17004 2010 2 tmin 1 NaN
4 MX17004 2010 3 tmax 1 NaN
In [24]:
weather_melt['date'] = pd.to_datetime(weather_melt[['year','month','day']])
In [25]:
weather_melt.head()
Out[25]:
id year month element day temp date
0 MX17004 2010 1 tmax 1 NaN 2010-01-01
1 MX17004 2010 1 tmin 1 NaN 2010-01-01
2 MX17004 2010 2 tmax 1 NaN 2010-02-01
3 MX17004 2010 2 tmin 1 NaN 2010-02-01
4 MX17004 2010 3 tmax 1 NaN 2010-03-01
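As a standalone illustration of assembling dates this way (toy values, not the weather data itself): to_datetime accepts a DataFrame with year/month/day columns and builds a date per row, coercing digit strings along the way.

```python
import pandas as pd

# Toy year/month/day columns; 'day' is left as strings,
# mirroring what the str accessor produces above
parts = pd.DataFrame({'year': [2010, 2010], 'month': [1, 2], 'day': ['30', '2']})
dates = pd.to_datetime(parts)
print(dates)
```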
In [26]:
# cleaning up columns no longer needed
weather_melt.drop(columns=['year','month','day'],inplace=True)
weather_melt = weather_melt.dropna()
In [27]:
weather_melt.head()
Out[27]:
id element temp date
20 MX17004 tmax 29.9 2010-12-01
21 MX17004 tmin 13.8 2010-12-01
24 MX17004 tmax 27.3 2010-02-02
25 MX17004 tmin 14.4 2010-02-02
40 MX17004 tmax 31.3 2010-11-02

Now for handling the element column, whose tmax and tmin values should each get their own column. Here I will use the pivot_table method, passing columns 'id' and 'date' as the index, 'element' to the columns parameter, and 'temp' to the values parameter. Since I don't want to leave the data in a multi-indexed pivot table, I can use reset_index() to format it back into a regular DataFrame. And with that, dataset three is now tidy.

In [28]:
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp')
weather_melt_pvt.head()
Out[28]:
element tmax tmin
id date
MX17004 2010-01-30 27.8 14.5
2010-02-02 27.3 14.4
2010-02-03 24.1 14.4
2010-02-11 29.7 13.4
2010-02-23 29.9 10.7
In [29]:
weather_melt_pvt = weather_melt.pivot_table(index=['id','date'],columns='element',values='temp').reset_index().rename_axis(None,axis=1)
In [30]:
weather_melt_pvt.head(10)
Out[30]:
id date tmax tmin
0 MX17004 2010-01-30 27.8 14.5
1 MX17004 2010-02-02 27.3 14.4
2 MX17004 2010-02-03 24.1 14.4
3 MX17004 2010-02-11 29.7 13.4
4 MX17004 2010-02-23 29.9 10.7
5 MX17004 2010-03-05 32.1 14.2
6 MX17004 2010-03-10 34.5 16.8
7 MX17004 2010-03-16 31.1 17.6
8 MX17004 2010-04-27 36.3 16.7
9 MX17004 2010-05-27 33.2 18.2
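pivot_table's cousin unstack reaches the same wide shape; here is a sketch on toy rows shaped like weather_melt (the two dates and temperatures are hand-copied from the weather table above):

```python
import pandas as pd

# Toy long-form rows mirroring weather_melt's columns
long = pd.DataFrame({
    'id': ['MX17004'] * 4,
    'date': pd.to_datetime(['2010-01-30', '2010-01-30',
                            '2010-02-02', '2010-02-02']),
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'temp': [27.8, 14.5, 27.3, 14.4],
})

# Move 'element' into the columns via unstack instead of pivot_table,
# then flatten the MultiIndex back to regular columns
wide = (long.set_index(['id', 'date', 'element'])['temp']
            .unstack()
            .reset_index()
            .rename_axis(None, axis=1))
print(wide)
```

Unlike pivot_table, unstack does not aggregate, so it raises an error on duplicate (id, date, element) combinations rather than silently averaging them.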

Conclusion

Prepping your data in a tidy, well-structured manner is extremely important for the later steps of any data analytics or data science project. In this post I explored some ways to reshape untidy data using Pandas' melt, pivot_table, and string (str accessor) functions. Creating a tidy dataset is only one part of the overall data cleaning process, so our data wrangling adventures are just beginning. Thanks for reading!
