Data Types in Pandas

Posted in Python

Data types (aka dtypes) tell a program how it can store and operate on a given set of data. A simple example is how Python treats integers vs. strings. When you add two integers in Python (1+2), you get back the sum of the values, while adding two strings returns a concatenation of the string values ('cat' + 'dog' = 'catdog').

In [1]:
# adding two integers
print(f'Adding integers 1 + 2 = {1+2}')

# adding two strings
print(f'Adding strings "cat" + "dog" = {"cat" + "dog"}')
Adding integers 1 + 2 = 3
Adding strings "cat" + "dog" = catdog

When looking at the list of Pandas data types, you'll find many that are fundamentally consistent with NumPy and Python (int64, float64, bool, object), along with a few that are unique to the Pandas library (datetime64, timedelta, category). Each column in a Pandas DataFrame is assigned a single data type. In this post I'm going to load in a dataset on wine reviews, explore the data types associated with the information, and make some type conversions.

Pandas dtypes and their usage:

  • object: holds any Python object, including strings
  • int64: integer numbers
  • float64: floating point numbers
  • bool: True or False values
  • datetime64: date and time values
  • timedelta[ns]: difference between two datetimes
  • category: limited list of string values
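As a quick sketch (the column names and values here are my own, not from the wine dataset), a small DataFrame can hold several of these dtypes at once:

```python
import pandas as pd

# A small, hypothetical DataFrame demonstrating several Pandas dtypes at once.
df = pd.DataFrame({
    'name': ['merlot', 'syrah'],                             # object
    'points': [87, 90],                                      # int64
    'price': [15.0, 22.5],                                   # float64
    'in_stock': [True, False],                               # bool
    'tasted': pd.to_datetime(['2020-01-05', '2020-03-09']),  # datetime64[ns]
    'region': pd.Categorical(['Douro', 'Oregon']),           # category
})
# Subtracting two datetime columns yields a timedelta64[ns] column.
df['aged'] = df['tasted'] - df['tasted'].min()

print(df.dtypes)
```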
In [2]:
import pandas as pd

wine = pd.read_csv('../datasets/wine_reviews/winemag-data-130k-v2.csv')
wine.head()
Out[2]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Finding Data types in your DataFrame

Once the data is read into a DataFrame, there are two options to view each column's data type. The first is calling .dtypes on your DataFrame object. This returns a list of column names with the corresponding data type next to each. You can chain together .dtypes.value_counts() to see how many columns there are of each data type. The second option is calling .info(). This option shows both the column data types and the per-dtype column counts, as well as some valuable details on the index, the Non-Null Count per column, and memory usage.

In [3]:
wine.dtypes
Out[3]:
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object
In [4]:
wine.dtypes.value_counts()
Out[4]:
object     11
float64     1
int64       1
dtype: int64
In [5]:
wine.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB


Selecting Columns by Data Type

Pandas offers a useful way to subset a DataFrame's columns based on the column dtypes by calling DataFrame.select_dtypes(include=None, exclude=None).

Here are some notes taken from the Pandas documentation on use cases for this method:

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
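Those rules can be sketched on a toy DataFrame (the column names here are my own illustration, not the wine data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame covering numeric, object, and datetime columns.
df = pd.DataFrame({
    'score': [87, 90],
    'price': [15.0, 22.5],
    'label': ['a', 'b'],
    'when': pd.to_datetime(['2020-01-01', '2020-02-01']),
})

print(df.select_dtypes(include=np.number).columns.to_list())   # numeric columns
print(df.select_dtypes(include='object').columns.to_list())    # object (string) columns
print(df.select_dtypes(include='datetime').columns.to_list())  # datetime columns

# include/exclude also accept lists, so you can mix selectors.
print(df.select_dtypes(include=['number', 'object']).columns.to_list())
```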

Using the select_dtypes() method, I'm able to split out the number columns and store them in a list named 'num_cols'. I can do the same with the non-number columns by using the exclude parameter and storing them in a list named 'txt_cols'.

In [6]:
num_cols = wine.select_dtypes(include='number').columns.to_list()
In [7]:
num_cols
Out[7]:
['points', 'price']
In [8]:
txt_cols = wine.select_dtypes(exclude='number').columns.to_list()
In [9]:
txt_cols
Out[9]:
['country',
 'description',
 'designation',
 'province',
 'region_1',
 'region_2',
 'taster_name',
 'taster_twitter_handle',
 'title',
 'variety',
 'winery']

Convert data types and reduce memory usage

When working with large datasets, efficiency and speed start to become areas you need to pay closer attention to. Dealing with data that is flirting with the limits of your machine's memory comes with its challenges. One way to view the biggest memory hogs is to use .memory_usage(deep=True). This method returns the memory usage of each column in bytes. Now this DataFrame is by no means considered large, but here you can see that the 'description' and 'title' columns are taking up a lot of space. This makes sense because they are storing words/sentences. You may determine that wine descriptions are not useful in your analysis and choose to remove the column using the .drop() method, passing the column name to the columns parameter.

In [10]:
og_wine_mem = wine.memory_usage(deep=True)
og_wine_mem
Out[10]:
Index                         128
country                   7989711
description              42352034
designation               8149866
points                    1039768
price                     1039768
province                  8774807
region_1                  8576597
region_2                  5993056
taster_name               8626854
taster_twitter_handle     7909446
title                    15076889
variety                   9176985
winery                    9382168
dtype: int64
In [11]:
wine.drop(columns='description', inplace=True)

Another option to free up some memory is to look at the column data types and see if there are any opportunities to convert them to a more appropriate/efficient type. A prime example is finding columns that are good candidates for conversion to the category dtype. When looking for potential columns to convert to categories, it is best to find columns that contain a low volume of unique values. You can do this by calling the .nunique() method, which returns the count of unique values for each column.

In [12]:
wine.nunique()
Out[12]:
country                      43
designation               37979
points                       21
price                       390
province                    425
region_1                   1229
region_2                     17
taster_name                  19
taster_twitter_handle        15
title                    118840
variety                     707
winery                    16757
dtype: int64
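This screening can be automated. As a sketch (the 0.5 cutoff is an arbitrary choice of mine, and the toy data below stands in for the wine reviews), flag the object columns whose share of unique values is low:

```python
import pandas as pd

# Hypothetical toy frame: 'country' repeats heavily, 'title' is all unique.
df = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'US', 'Italy', 'US'],
    'title': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Flag object columns whose unique-value ratio is below a chosen threshold.
threshold = 0.5
candidates = [
    col for col in df.select_dtypes(include='object').columns
    if df[col].nunique() / len(df) < threshold
]
print(candidates)  # columns worth considering for the 'category' dtype
```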

Whenever you want/need to change a column's type, you can use the .astype() method and pass in a type name ('int', 'float', 'category', etc.). In this example, I converted the 'country' column to the 'category' dtype. I then called the .dtypes attribute to confirm that the change was a success. Using .memory_usage(deep=True) again, we can see that the 'country' column is taking up much less space than before, and we also got rid of the bytes originally used by the 'description' column after dropping it.

In [13]:
wine['country'] = wine['country'].astype('category')
wine.dtypes
Out[13]:
country                  category
designation                object
points                      int64
price                     float64
province                   object
region_1                   object
region_2                   object
taster_name                object
taster_twitter_handle      object
title                      object
variety                    object
winery                     object
dtype: object
In [14]:
new_wine_mem = wine.memory_usage(deep=True)
new_wine_mem
Out[14]:
Index                         128
country                    133814
designation               8388072
points                    1039768
price                     1039768
province                  8818570
region_1                  8746171
region_2                  5993056
taster_name               8865600
taster_twitter_handle     8033362
title                    17003165
variety                   9314902
winery                    9674954
dtype: int64
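The same before/after comparison can be reproduced end to end on a self-contained example (synthetic data standing in for the wine reviews): convert a repetitive string column to 'category' and measure the memory saved.

```python
import pandas as pd

# Synthetic frame: a string column with only 3 distinct values repeated many times.
df = pd.DataFrame({'country': ['US', 'Italy', 'Portugal'] * 10_000})

before = df.memory_usage(deep=True)['country']
df['country'] = df['country'].astype('category')
after = df.memory_usage(deep=True)['country']

print(f'object:    {before:,} bytes')
print(f'category:  {after:,} bytes')
print(f'reduction: {1 - after / before:.1%}')
```

The savings come from storing each distinct string once and replacing the repeated values with small integer codes, which is why low-cardinality columns benefit the most.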
In [15]:
wine.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype   
---  ------                 --------------   -----   
 0   country                129908 non-null  category
 1   designation            92506 non-null   object  
 2   points                 129971 non-null  int64   
 3   price                  120975 non-null  float64 
 4   province               129908 non-null  object  
 5   region_1               108724 non-null  object  
 6   region_2               50511 non-null   object  
 7   taster_name            103727 non-null  object  
 8   taster_twitter_handle  98758 non-null   object  
 9   title                  129971 non-null  object  
 10  variety                129970 non-null  object  
 11  winery                 129971 non-null  object  
dtypes: category(1), float64(1), int64(1), object(9)
memory usage: 11.0+ MB

Conclusion

Understanding the dtypes a DataFrame is made up of is an important initial step when you're getting to know your data. In this post, we explored how to use the .dtypes attribute to see a list of a DataFrame's columns along with their dtype. We then learned how to subset columns based on their dtype by using the .select_dtypes() method. This allowed us to separate out the columns made up of number vs. non-number dtypes. We then took a look at how to convert dtypes by calling the .astype() method on a column. I showed how converting a column with a low volume of unique values to the 'category' dtype reduced the DataFrame's memory usage. That's all for this basic Pandas dtype overview.

Thanks for reading!