Inspecting DataFrames
We just learned a few ways in which we can create dataframes from various sources, but we still don’t know what to do with them or how we should start our analysis. The first thing we should do when we read in our data is inspect it; we want to make sure that our dataframe isn’t empty and that the rows look as we would expect. Our main goal is to verify that it was read in properly and that all of the data is there; this inspection will also give us ideas on where to direct our data wrangling efforts.
Let’s first set up a dataframe object:
import numpy as np
import pandas as pd
# I downloaded this CSV file from Kaggle
df = pd.read_csv('earthquakes.csv')
***Examining the Data:***
First, we want to make sure that we actually have data in our dataframe. We can check the `empty` attribute to find out:
df.empty # Displays False because it's not empty
So far, so good. Next, we should check how much data we read in; we want to know the number of observations (rows) and the number of variables (columns) we have. For this task, we use the `shape` attribute:
df.shape # Displays (23412, 21)
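The two checks above can be tried without the earthquake file. The following is a minimal sketch on a small made-up dataframe; the name `toy` and its values are hypothetical and only stand in for real data. It also shows that a dataframe with columns but no rows still counts as empty.

```python
import pandas as pd

# Hypothetical toy data standing in for the earthquake file
toy = pd.DataFrame({
    'mag': [6.7, 5.2, 4.9],
    'place': ['Indonesia', 'Chile', 'Japan'],
})

print(toy.empty)   # False, because the frame holds data
print(toy.shape)   # (3, 2): 3 rows, 2 columns

# shape is a plain tuple, so it can be unpacked
n_rows, n_cols = toy.shape

# Slicing away every row keeps the columns but leaves no data
no_rows = toy.iloc[0:0]
print(no_rows.empty)   # True
print(no_rows.shape)   # (0, 2)
```

Unpacking `shape` like this is handy when we want to compare the row count against what we expected from the source file.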
Our data has 23,412 observations of 21 variables, which matches my inspection of the file before I read it in. What does our data actually look like? For this task, we can use the `head()` and `tail()` methods to look at the top and bottom rows, respectively. Both default to five rows, but we can change this by passing a different number to the method:
df.head()
Output:
df.tail(2)
Output:
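To see how the row count argument behaves, here is a small sketch on a made-up ten-row dataframe (the `toy` frame and its `event_id` and `mag` columns are hypothetical, not part of the earthquake data):

```python
import pandas as pd

# Hypothetical ten-row frame, used only for illustration
toy = pd.DataFrame({
    'event_id': range(10),
    'mag': [4.0 + 0.1 * i for i in range(10)],
})

top = toy.head()      # first five rows by default
bottom = toy.tail(2)  # last two rows
print(top)
print(bottom)

# Asking for more rows than exist simply returns the whole frame
print(len(toy.head(50)))  # 10
```

Both methods return new dataframes, so the results can be stored and inspected further rather than only displayed.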
We know that there are 21 columns, but we can’t see them all by calling the `head()` and `tail()` methods. Let’s instead use the `columns` attribute to at least see the names of the columns that we have:
df.columns
Output:
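As a rough illustration of working with column names, the sketch below uses a hypothetical three-column frame. It also shows one optional way to ask pandas to display every column, using the `display.max_columns` option; this is an addition for convenience, not something the inspection requires.

```python
import pandas as pd

# Hypothetical frame; the column names are made up for illustration
toy = pd.DataFrame({
    'mag': [6.7, 5.2],
    'place': ['Indonesia', 'Chile'],
    'tsunami': [1, 0],
})

print(toy.columns)            # an Index object holding the column names
names = toy.columns.tolist()  # the same names as a plain Python list
print(names)                  # ['mag', 'place', 'tsunami']
print('mag' in toy.columns)   # True

# Temporarily lift the column display limit for wide frames
with pd.option_context('display.max_columns', None):
    print(toy.head())
```

Using `option_context` keeps the display change local to the `with` block, so the rest of the session is unaffected.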
We can use the `dtypes` attribute to see the data types of the columns as well. This is really useful if you suspect that columns are storing data as the wrong type.
df.dtypes
Output:
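Here is a sketch of what a wrong type can look like and one way to correct it. The frame is hypothetical: `mag` and `time` are deliberately stored as text. The conversions with `astype()` and `pd.to_datetime()` are one possible fix, assuming the strings are cleanly formatted.

```python
import pandas as pd

# Hypothetical frame where numbers and dates arrived as text
toy = pd.DataFrame({
    'mag': ['6.7', '5.2', '4.9'],
    'time': ['2018-10-13', '2018-10-14', '2018-10-15'],
    'tsunami': [1, 0, 0],
})
print(toy.dtypes)  # mag and time show up as a text type, not numeric/datetime

# Convert each column to the type we actually want
toy['mag'] = toy['mag'].astype(float)
toy['time'] = pd.to_datetime(toy['time'])
print(toy.dtypes)
```

After the conversion, numeric operations such as `toy['mag'].mean()` work as expected, which they would not on text values.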
Finally, we can use the `info()` method to see how many non-null entries each column has and to get information on our `Index` object. Null values are missing values, which pandas typically represents as `None` for objects and as `NaN` (Not a Number) in numeric columns; by default, an integer column with missing values is stored as `float` so that it can hold `NaN`.
df.info()
Output:
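The non-null counts reported by `info()` can be cross-checked by counting missing values directly. The sketch below uses a hypothetical frame with one missing value in each of two columns; `isna().sum()` is a complementary check added here, not part of the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values
toy = pd.DataFrame({
    'mag': [6.7, np.nan, 4.9],
    'place': ['Indonesia', None, 'Japan'],
    'tsunami': [1, 0, 0],
})

toy.info()  # prints the index range, per-column non-null counts, and dtypes

# Count the missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Integers mixed with a missing value are stored as floats by default
upcast = pd.Series([1, None])
print(upcast.dtype)  # float64
```

A column whose non-null count is lower than the number of rows is a natural place to start when deciding how to handle missing data.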
After this initial inspection, we know a lot about the structure of our data and can now begin to make sense of it!