Understanding the Pandas DataFrame
Pandas is a very popular Python library used to read, clean, analyze, and even plot data. At the core of Pandas, the programmer interacts with something called a DataFrame. The DataFrame is made up of an index, column(s), and values. Each column is a Pandas Series, which holds a single data type (int, float, bool, object, category, etc.).
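To make the "one data type per column" idea concrete, here is a minimal sketch. It uses a small made-up DataFrame (the name "toy" and its values are my own invention, not the wine data used in the rest of this post) and checks the data type of each column.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US"],
        "points": [87, 87, 86],
        "price": [None, 15.0, 14.0],  # the missing value becomes NaN
    }
)

# dtypes reports one data type per column, because each column is a Series.
column_dtypes = toy.dtypes
print(column_dtypes)

# Pulling out one column gives back a Series with that single dtype.
points = toy["points"]
print(type(points).__name__, points.dtype)
```

Note that the missing price does not break the rule: the whole "price" column is stored as floats, with NaN standing in for the missing entry.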
Let's take a closer look at the components of a Pandas DataFrame. I'll first import pandas and use the common alias of pd. Next I will read in some data on wine reviews using the pd.read_csv() function and store it in a variable called "wine". I can take a look at the first five rows by calling head().
import pandas as pd
wine = pd.read_csv('../datasets/wine_reviews/winemag-data-130k-v2.csv')
wine.head()
Here you can see the DataFrame's components: the index runs vertically along the left side, the columns run horizontally across the top, and the values fill the rows and columns. If you call type(wine), you can see that this variable is in fact a Pandas DataFrame. Taking it a step further, calling type(wine['country']) confirms that the "country" column is a Pandas Series.
type(wine)
type(wine['country'])
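The same type checks can be tried without the wine file. The sketch below uses a made-up two-row DataFrame (again, "toy" is just an illustrative name) and also shows a detail that is easy to trip over: selecting with a list of column names returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 86]})

single = toy["country"]    # one column name  -> Series
subset = toy[["country"]]  # list of names    -> DataFrame with one column

print(type(toy).__name__)
print(type(single).__name__)
print(type(subset).__name__)
```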
Accessing the DataFrame's Components
You can access a DataFrame's components by using the index, columns, and values attributes. In this example:
- The index is a RangeIndex that starts at 0 and stops at 129,971. The stop value is exclusive, so the last row label is 129,970.
- The columns attribute shows all of the column names within the DataFrame.
- The values attribute displays the data records (the example below shows the first record in the dataset).
index = wine.index
columns = wine.columns
values = wine.values
index
columns
values[0]
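If you don't have the wine file handy, the three attributes can be explored on a tiny stand-in. This is only a sketch on made-up data (the "toy" DataFrame is my own), but it shows how a RangeIndex reports its start and stop, and that the stop is not itself a row label.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 86]}
)

idx = toy.index
print(idx)                 # a RangeIndex with start, stop, and step
print(idx.start, idx.stop) # stop equals the number of rows here
print(idx[-1])             # the last label is stop - 1

print(list(toy.columns))   # column names as a plain list

first_row = toy.values[0]  # first record, as a NumPy array
print(first_row)
```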
Calling type() on these attributes gives some interesting output. The index and columns are similar in that both are a type of Index. It is common to refer to the index as the "row index" and the columns as the "column index". The values, on the other hand, are an "ndarray", an object that comes from the NumPy library, which Pandas is built on.
print(type(index))
print(type(columns))
print(type(values))
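The relationship between these types can also be checked with isinstance(). The sketch below runs on a made-up numeric DataFrame (not the wine data) and uses to_numpy(), which the pandas documentation recommends over the values attribute for getting the underlying array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({"points": [87, 86], "price": [15.0, 14.0]})

# RangeIndex is a subclass of Index, so both checks pass for the row index.
print(isinstance(toy.index, pd.RangeIndex))
print(isinstance(toy.index, pd.Index))

# The column labels are stored in an Index as well.
print(isinstance(toy.columns, pd.Index))

# to_numpy() returns the data as a NumPy ndarray.
arr = toy.to_numpy()
print(type(arr).__name__, arr.dtype)
```

Because an ndarray has a single dtype, the integer "points" column is upcast to float here so that it can share one array with "price".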
The final method I want to show is info(). Calling it prints a high-level overview of the DataFrame that displays the index, lists the columns, and provides valuable information on non-null counts, data types, and memory usage.
wine.info()
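Here is a small sketch of info() on made-up data with one missing value, so the non-null count differs between columns. It passes a StringIO buffer through the buf parameter to capture the report as a string, which is only needed if you want to inspect the text in code rather than just read it.

```python
import io

import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame(
    {"country": ["Italy", None, "US"], "points": [87, 87, 86]}
)

# info() prints to stdout by default; buf redirects the report to a buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```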
Conclusion
In this post I explored the components of the DataFrame using the index, columns, and values attributes. Taking a closer look at the types of these attributes, I discovered that the index and columns are both Index objects, while the values are a NumPy ndarray. Calling the info() method combined some of this attribute information with some very valuable details on non-null counts, column data types, and memory usage.
Thanks for reading!