Data Types in Pandas
Posted in Python
Data types (aka dtypes) tell a program how it can store and operate on a given set of data. A simple example of this is how Python treats integers vs. strings. When you add integers in Python (1 + 2), you get back the sum of the values, while adding together two strings returns a concatenation of the string values ('cat' + 'dog' = 'catdog').
# adding two integers
print(f'Adding integers 1 + 2 = {1+2}')
# adding two strings
print(f'Adding strings "cat" + "dog" = {"cat" + "dog"}')
When looking at the list of Pandas data types, you'll find many that are fundamentally consistent with NumPy and Python (int64, float64, bool, object), along with a few more specialized types (datetime64, timedelta[ns], and the Pandas-specific category). Each column in a Pandas DataFrame is assigned a single data type. In this post I'm going to load a dataset of wine reviews, explore the data types associated with the information, and make some type conversions.
| Pandas dtypes | Usage |
|---|---|
| object | holds any Python object, including strings |
| int64 | integer numbers |
| float64 | floating point numbers |
| bool | True or False values |
| datetime64 | Date and Time values |
| timedelta[ns] | Difference between two datetimes |
| category | Limited list of string values |
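To make the table above concrete, here is a minimal sketch using a hypothetical toy DataFrame (the column names and values are invented for illustration) that produces most of these dtypes:

```python
import pandas as pd

# Hypothetical mini-DataFrame illustrating several of the dtypes above
df = pd.DataFrame({
    'name': ['merlot', 'syrah'],       # object (strings)
    'score': [88, 91],                 # int64
    'price': [15.99, 22.50],           # float64
    'in_stock': [True, False],         # bool
    'tasted_on': pd.to_datetime(['2021-01-05', '2021-02-10']),  # datetime64[ns]
})
# category is created explicitly, e.g. by converting an object column
df['varietal'] = df['name'].astype('category')

print(df.dtypes)
```

Pandas infers the first five dtypes from the data itself; the category dtype only appears when you ask for it via a conversion.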
import pandas as pd
wine = pd.read_csv('../datasets/wine_reviews/winemag-data-130k-v2.csv')
wine.head()
Finding Data Types in Your DataFrame
Once the data is read into a DataFrame, there are two options for viewing each column's data type. The first is calling .dtypes on your DataFrame object, which returns each column name alongside its data type. You can chain on .value_counts() (i.e., .dtypes.value_counts()) to see how many columns hold each data type. The second option is calling .info(), which shows both the column data types and the per-dtype column counts, as well as some valuable details on the index, non-null counts, and memory usage.
wine.dtypes
wine.dtypes.value_counts()
wine.info()
Selecting Columns by Data Type
Pandas offers a useful way to subset a DataFrame's columns based on the column dtypes by calling DataFrame.select_dtypes(include=None, exclude=None).
Here are some notes taken from the Pandas documentation on use cases for this method:
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
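The bullet points above can be sketched on a small hypothetical DataFrame (the column names here are invented for illustration) so you can see what each include value returns:

```python
import pandas as pd

# Hypothetical DataFrame used only to illustrate select_dtypes()
df = pd.DataFrame({
    'points': [87, 92],               # int64   -> matched by 'number'
    'price': [14.0, 30.0],            # float64 -> matched by 'number'
    'country': ['Italy', 'France'],   # object  -> matched by 'object'
    'tasted': pd.to_datetime(['2021-03-01', '2021-04-01']),  # datetime64[ns]
})

print(df.select_dtypes(include='number').columns.to_list())    # numeric columns
print(df.select_dtypes(include='object').columns.to_list())    # string/object columns
print(df.select_dtypes(include='datetime').columns.to_list())  # datetime columns
print(df.select_dtypes(exclude='number').columns.to_list())    # everything else
```

Note that include and exclude accept lists too, so you can mix selectors like ['number', 'category'] in a single call.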
Using the select_dtypes() method, I'm able to split out the numeric columns and store them in a list named 'num_cols'. I can do the same with the non-numeric columns by using the exclude parameter, storing them in a list named 'txt_cols'.
num_cols = wine.select_dtypes(include='number').columns.to_list()
num_cols
txt_cols = wine.select_dtypes(exclude='number').columns.to_list()
txt_cols
Convert Data Types and Reduce Memory Usage
When working with large datasets, efficiency and speed become areas you need to pay closer attention to, and data that is flirting with the limits of your machine's memory comes with its own challenges. One way to find the biggest memory hogs is to use .memory_usage(deep=True), which returns the memory usage of each column in bytes. This DataFrame is by no means large, but you can see that the 'description' and 'title' columns are taking up a lot of space, which makes sense because they store words and sentences. If you determine that wine descriptions are not useful to your analysis, you can remove the column with the .drop() method, passing the column name to the columns parameter.
og_wine_mem = wine.memory_usage(deep=True)
og_wine_mem
wine.drop(columns='description', inplace=True)
Another way to free up memory is to look at the column data types and see if any can be converted to a more efficient type. A great example is converting a suitable column to the category dtype. When looking for candidates, it's best to find columns that contain a low number of unique values relative to the row count. You can check this by calling the .nunique() method, which returns the count of unique values for each column.
wine.nunique()
Whenever you want or need to change a column's type, you can use the .astype() method and pass in a type name ('int', 'float', 'category', etc.). In this example, I converted the 'country' column to the 'category' dtype, then called the .dtypes attribute to confirm that the change was a success. Calling .memory_usage(deep=True) again, we can see that the 'country' column is taking up much less space than before, and we also got rid of the bytes originally used by the 'description' column after dropping it.
wine['country'] = wine['country'].astype('category')
wine.dtypes
new_wine_mem = wine.memory_usage(deep=True)
new_wine_mem
wine.info()
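The memory savings from a category conversion are easy to reproduce without the wine dataset. Here is a minimal sketch using an invented low-cardinality Series, comparing its footprint as object vs. category:

```python
import pandas as pd

# Hypothetical low-cardinality column: many rows, only three unique values
s = pd.Series(['Italy', 'France', 'US'] * 10_000)

obj_bytes = s.memory_usage(deep=True)
cat_bytes = s.astype('category').memory_usage(deep=True)

print(f'object dtype:   {obj_bytes:,} bytes')
print(f'category dtype: {cat_bytes:,} bytes')
```

The category version stores each unique string once plus a small integer code per row, which is why the savings grow with row count as long as the number of unique values stays small.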
Conclusion
Understanding the dtypes a DataFrame is made up of is an important first step when you're getting to know your data. In this post, we explored how to use the .dtypes attribute to see a DataFrame's columns along with their data types. We then learned how to subset columns based on their dtype using the .select_dtypes() method, which allowed us to separate the numeric columns from the non-numeric ones. Finally, we looked at how to convert dtypes by calling the .astype() method on a column: converting a column with a low number of unique values to the 'category' dtype resulted in noticeably lower DataFrame memory usage. That's all for this basic Pandas dtype overview.
Thanks for reading!