Pandas is a very popular Python library used to read, clean, analyze, and even plot data. At the core of Pandas, the programmer interacts with an object called a DataFrame. The DataFrame is made up of an index, one or more columns, and values. Each column is a Pandas Series, which holds a single data type (int, float, bool, object, category, etc.).
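As a rough illustration of the "one data type per column" idea, here is a minimal sketch that builds a tiny DataFrame by hand. The name df and the three rows are made up for this example and are not read from the wine reviews file.

```python
import pandas as pd

# Hypothetical toy data, typed in by hand (not from the wine reviews CSV).
df = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US"],
        "points": [87, 87, 90],
        "price": [None, 15.0, 14.0],
    }
)

# Each column is its own Series and carries exactly one dtype.
points_dtype = df["points"].dtype
price_dtype = df["price"].dtype

# One line per column, showing the dtype pandas inferred for it.
print(df.dtypes)
```

Note that the missing price is stored as NaN, which is why that column is inferred as a float column rather than an integer one.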

Let's take a closer look at the components of a Pandas DataFrame. I'll first import pandas using the common alias pd. Next I will read in some data on wine reviews with the pd.read_csv() function and store it in a variable called "wine". I can take a look at the first five rows by calling head().

In [1]:

import pandas as pd

In [2]:

wine = pd.read_csv('../datasets/wine_reviews/winemag-data-130k-v2.csv')
wine.head()

Out[2]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

Here you can see the DataFrame's components: the index runs vertically along the left side, the columns run horizontally across the top, and the values fill the rows and columns. If you call type(wine), you can see that this variable is in fact a Pandas DataFrame. Taking it a step further, calling type(wine['country']) confirms that the "country" column is a Pandas Series.

In [3]:

type(wine)

Out[3]:

pandas.core.frame.DataFrame

In [4]:

type(wine['country'])

Out[4]:

pandas.core.series.Series
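A related detail that is easy to trip over is the difference between single and double brackets when selecting a column. The sketch below uses a made-up two-row frame called toy (an assumption for illustration, not the wine data) to show which selection gives a Series and which gives a DataFrame.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the wine data.
toy = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 90]})

single = toy["country"]    # one column label -> Series
double = toy[["country"]]  # a list of labels -> DataFrame with one column

print(isinstance(single, pd.Series))
print(isinstance(double, pd.DataFrame))
print(single.name)         # a Series remembers the column it came from
```

Using isinstance() instead of comparing type() output is a small design choice here: it also holds for subclasses of Series and DataFrame.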

Accessing the DataFrame's Components

You can access a DataFrame's components by using the index, columns, and values attributes. In this example:

The index is a RangeIndex that starts at 0 and stops at 129,971. The stop value is exclusive, so the last row label is 129,970.
The columns attribute shows all of the column names within the DataFrame.
The values attribute holds the data records (the example below shows the first record in the dataset).

In [5]:

index = wine.index
columns = wine.columns
values = wine.values

In [6]:

index

Out[6]:

RangeIndex(start=0, stop=129971, step=1)

In [7]:

columns

Out[7]:

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

In [8]:

values[0]

Out[8]:

array(['Italy',
       "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",
       'Vulkà Bianco', 87, nan, 'Sicily & Sardinia', 'Etna', nan,
       'Kerin O’Keefe', '@kerinokeefe',
       'Nicosia 2013 Vulkà Bianco  (Etna)', 'White Blend', 'Nicosia'],
      dtype=object)
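The start and stop values of a RangeIndex behave like Python's built-in range(), where stop is not included. Here is a small sketch on a hypothetical three-row frame (the name toy and its contents are assumptions for illustration) that makes the bounds easy to check by eye.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the wine data.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]}
)

idx = toy.index

print(idx)                   # the default index pandas creates for this frame
print(list(idx))             # the row labels themselves
print(idx[-1])               # the last label is stop - 1
print(len(toy) == idx.stop)  # with start=0 and step=1, stop equals the row count
```

The same reasoning applies to the wine data: a stop of 129971 means 129,971 rows labelled 0 through 129,970.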

Calling type() on these attributes gives some interesting output. The index and columns are similar in that both are a type of Index; it is common to refer to the index as the "row index" and the columns as the "column index". The values, on the other hand, are an "ndarray", an object that comes from the NumPy library, which Pandas is built on.

In [9]:

print(type(index))
print(type(columns))
print(type(values))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>
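One thing worth knowing about the ndarray is that it has a single dtype for the whole array, which is why the first record above came back with dtype=object. The sketch below uses a hypothetical toy frame and the to_numpy() method, which the pandas documentation recommends over the values attribute; treat it as an illustration rather than a description of what happens inside the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mixing text and numeric columns.
toy = pd.DataFrame(
    {"country": ["Italy", "US"], "points": [87, 90], "price": [15.0, 14.0]}
)

# Text and numbers together have no common numeric type, so NumPy falls back to object.
mixed = toy.to_numpy()

# Integer and float columns share a common type, so the result is a float array.
numeric = toy[["points", "price"]].to_numpy()

print(type(mixed), mixed.dtype)
print(numeric.dtype)
```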

The final method I want to show is info(). Calling it prints a high-level overview of the DataFrame: it displays the index, lists the columns, and provides valuable information on non-null counts, data types, and memory usage.

In [10]:

wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB
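If you want the numbers from info() as objects you can work with, rather than printed text, the individual pieces are available separately. The sketch below uses a hypothetical toy frame with a few missing values. The "+" in the memory figure above means the number is a lower bound, because object columns are not inspected in depth by default; passing deep=True to memory_usage() (or memory_usage='deep' to info()) asks pandas for the fuller estimate.

```python
import pandas as pd

# Hypothetical toy frame with some missing values.
toy = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", None],
        "points": [87, 87, 90],
        "price": [None, 15.0, 14.0],
    }
)

non_null = toy.count()       # non-null values per column, like info()'s count column
missing = toy.isna().sum()   # the complement: missing values per column

shallow = toy.memory_usage().sum()       # bytes, without inspecting stored objects
deep = toy.memory_usage(deep=True).sum() # bytes, including the stored strings

print(non_null)
print(missing)
print(shallow, deep)
```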

Conclusion

In this post I explored the components of the DataFrame using the index, columns, and values attributes. Taking a closer look at the types of these attributes, I found that the index and columns are both Index objects, while the values are a NumPy ndarray. The info() method combines some of this attribute information with valuable details on non-null counts, column data types, and memory usage.

Thanks for reading!