Understanding the Pandas DataFrame


Pandas is a popular Python library for reading, cleaning, analyzing, and plotting data. At its core is an object called the DataFrame, which is made up of an index, columns, and values. Each column is a Pandas Series, which holds a single data type (int, float, bool, object, category, etc.).
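
To make the "one data type per column" idea concrete, here is a minimal sketch on a tiny hand-made DataFrame. The data and the name df are hypothetical stand-ins, not part of the wine dataset used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data; column names mimic the wine dataset but the rows are made up
df = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US"],
        "points": [87, 87, 90],
        "price": [None, 15.0, 14.0],  # the missing value is stored as NaN
    }
)

# Selecting one column returns a Series, and each Series has exactly one dtype
country = df["country"]
print(type(country).__name__)

# dtypes lists the data type of every column in one go
print(df.dtypes)
```

Note that the missing price does not break the rule: the whole column is simply stored as a float so that NaN can represent the gap.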

Let's take a closer look at the components of a Pandas DataFrame. I'll first import pandas under its common alias, pd. Next, I'll read in some data on wine reviews with the pd.read_csv() function and store it in a variable called "wine". Calling head() shows the first five rows.

In [1]:
import pandas as pd
In [2]:
wine = pd.read_csv('../datasets/wine_reviews/winemag-data-130k-v2.csv')
wine.head()
Out[2]:
  | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks

Here you can see the DataFrame's components: the index runs vertically along the left side, the columns run horizontally across the top, and the values fill the rows and columns. Calling type(wine) shows that this variable is in fact a Pandas DataFrame. Taking it a step further, calling type(wine['country']) confirms that the "country" column is a Pandas Series.

In [3]:
type(wine)
Out[3]:
pandas.core.frame.DataFrame
In [4]:
type(wine['country'])
Out[4]:
pandas.core.series.Series
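
One detail worth seeing next to the type() calls above: the kind of object you get back depends on how you select the column. The sketch below uses a small made-up DataFrame (df is a hypothetical name, not the wine data) to show that a single label returns a Series, a list of labels returns a DataFrame, and the Series keeps the same row index as its parent.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
df = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 90]})

single = df["country"]    # one label -> Series
subset = df[["country"]]  # list of labels -> DataFrame with one column

print(type(single).__name__)  # Series
print(type(subset).__name__)  # DataFrame

# The Series carries the same row index as the DataFrame it came from
print(single.index.equals(df.index))  # True
```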

Accessing the DataFrame's Components

You can access a DataFrame's components by using the index, columns, and values attributes. In this example:

  • The index is a RangeIndex that starts at 0 and stops at 129,971. The stop value is exclusive, so the last row label is 129,970.
  • The columns attribute shows all of the column names within the DataFrame.
  • The values attribute holds the data records (the example below shows the first record in the dataset).
In [5]:
index = wine.index
columns = wine.columns
values = wine.values
In [6]:
index
Out[6]:
RangeIndex(start=0, stop=129971, step=1)
In [7]:
columns
Out[7]:
Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')
In [8]:
values[0]
Out[8]:
array(['Italy',
       "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",
       'Vulkà Bianco', 87, nan, 'Sicily & Sardinia', 'Etna', nan,
       'Kerin O’Keefe', '@kerinokeefe',
       'Nicosia 2013 Vulkà Bianco  (Etna)', 'White Blend', 'Nicosia'],
      dtype=object)
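
The RangeIndex output above follows the same convention as Python's built-in range(): start is included and stop is not. Here is a minimal sketch on a made-up four-row DataFrame (df and idx are hypothetical names) that shows how start, stop, and the number of rows relate.

```python
import pandas as pd

# Hypothetical four-row frame; no index is given, so pandas creates a RangeIndex
df = pd.DataFrame({"points": [87, 87, 90, 88]})
idx = df.index

print(idx)                            # RangeIndex(start=0, stop=4, step=1)
print(idx.start, idx.stop, idx.step)  # 0 4 1

# stop is exclusive: four rows are labelled 0, 1, 2, 3
print(len(idx))  # 4
print(idx[-1])   # 3
```

Applied to the wine data, this is why stop=129971 corresponds to 129,971 rows whose labels run from 0 to 129,970.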

Calling type() on these attributes gives some interesting output. The index and columns are similar in that both are a type of Index; it is common to refer to them as the "row index" and the "column index". The values, on the other hand, are an "ndarray", an object that comes from the NumPy library, which Pandas is built upon.

In [9]:
print(type(index))
print(type(columns))
print(type(values))
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>
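
The relationship between these types can be checked directly with isinstance(). The sketch below is an illustration on a small made-up DataFrame (df, mixed, and numeric are hypothetical names). It also shows why the wine values array reports dtype=object: an ndarray has a single dtype, so when the columns have different types the array falls back to a common one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
df = pd.DataFrame(
    {"country": ["Italy", "US"], "points": [87, 90], "price": [15.0, 14.0]}
)

# RangeIndex is a subclass of Index, so the row index passes both checks
print(isinstance(df.index, pd.RangeIndex))
print(isinstance(df.index, pd.Index))
print(isinstance(df.columns, pd.Index))

# Text mixed with numbers: the array falls back to the generic object dtype
mixed = df.values
print(mixed.dtype)

# Only int and float columns: the common dtype is float64
numeric = df[["points", "price"]].values
print(numeric.dtype)
```

The pandas documentation recommends to_numpy() over the values attribute in new code; for a frame like this one both return the same kind of ndarray.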

The final method I want to show is info(). Calling it prints a high-level overview of the DataFrame: it displays the index, lists the columns, and provides valuable information on non-null counts, data types, and memory usage.

In [10]:
wine.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB
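
The pieces of the info() report can also be pulled out individually, which is handy when you want the numbers rather than a printout. Here is a rough sketch on a made-up DataFrame with a few missing values (df and the other variable names are hypothetical, not part of the wine data).

```python
import pandas as pd

# Hypothetical three-row frame with some missing values
df = pd.DataFrame(
    {
        "country": ["Italy", None, "US"],
        "points": [87, 87, 90],
        "price": [None, 15.0, 14.0],
    }
)

# Non-null count per column, matching the "Non-Null Count" column of info()
non_null = df.count()
print(non_null)

# The complementary view: number of missing values per column
missing = df.isna().sum()
print(missing)

# Memory usage in bytes; deep=True also measures the contents of object columns
shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)
```

The "+" in "12.9+ MB" above means the figure is a lower bound, because the contents of the object columns were not measured. Calling wine.info(memory_usage="deep") asks pandas to measure them as well.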

Conclusion

In this post I explored the components of the DataFrame using the index, columns, and values attributes. Taking a closer look at the types of these attributes, I found that the index and columns are both Index objects, while the values are a NumPy ndarray. The info() method combines some of this attribute information with valuable details on non-null counts, column data types, and memory usage.

Thanks for reading!