Subsets of Data

So far, we have learned how to work with and summarize the dataframe as a whole. However, we will often be interested in performing operations and/or analyses on subsets of our data. There are many types of subsets we may look to isolate from our data, such as selecting only specific columns or rows as a whole or when a specific criterion is met. In order to obtain subsets of the data, we need to be familiar with selection, slicing, indexing, and filtering with pandas.

For this section, we will work with our earthquake data from previous chapters. Here is the setup that will be necessary for future analyses:

import pandas as pd

df = pd.read_csv('earthquakkes.csv')
df.columns

Output:

Selection:

With selection, we can grab entire columns of data from our dataframe. Using the dataframe from above, let's grab the `Magnitude` column, which contains the magnitudes of the earthquakes measured:

df.Magnitude

Output:

In the last example, we selected the Magnitude column as an attribute of the dataframe; however, we can also access it with dictionary-like notation:

df['Magnitude']

Output:

We aren’t limited to selecting one column at a time. By passing a list to the dictionary lookup, we can select many columns, giving us a DataFrame object that is a subset of the original dataframe.


df[['Magnitude', 'Date', 'Time']]

Output:


This gives us the full Magnitude, Data, and Time columns of the original dataframe.

Any list of column names will work for this, so we can use some of Python’s string methods to help us pick out the columns. For example, if we wanted to select all of the columns that start with Magnitude, we could do the following:

columnNames = []

for i in data.columns:
  if i.startswith('Magnitude'):
    columnNames.append(i)

data[columnNames]

Output:


Here is the full list of string methods that might interest you: https://www.w3schools.com/python/python_ref_string.asp

What if you want a particular value within a dataframe? As long as you know the row and column it can be found at, you can use the same dictionary-like indexing operator seen above:

data['Magnitude'][0] # Displays 6.0 because 6.0 is found at index 0 of the 'Magnitude' column.

Slicing:

When we want to extract certain rows (slices) from our dataframe, we use slsicing. DataFrame slicing works similarly to slicing with other Python objects, such as lists, with the first index being inclusive and the last being exclusive.
data[0:1] # Displays the first row of data in the dataframe

You can chain this with our column selections described above like so:

data[['Magnitude', 'Time', 'Date']][0:1]

Indexing:

As you can see above, we can combine column selection and row slicing to make smaller dataframes. However, it started to get a bit long and messy. If we decide to use the chaining methods discussed above to update values in a dataframe, we would actualy get a warning. Let’s trigger this warning to better understand this problem; I will try to change the magnitude of the first earthquake from a 6.0 to a 6.4.

data['Magnitude'][0:1] = 6.4

Output:


As indicated by the warning, it’s not enough to know selection and slicing – we must also master indexing. Although the update still occurs, it would be nice to be able to do this without getting a warning message. Pandas indexing operations provide us with a one-method way to select both the rows and the columns we want. We can use loc[] and iloc[] to subset our dataframe using label-based or integer-based lookups, respectively. A good way to remember the difference is to think of them as location verses integer loccation. For all indexing methods, we first provide the row indexer and then the column indexer with a comma separating them:

# Pattern -> df.loc[row, column]
data.loc[0:1, 'Magnitude'] = 6.4

In this case, the value found at row 0 and 1 of the Magnitude column would be changed to 6.4. This means that the second index value provided is inclusive within the loc[] operator!

This is not the case for iloc[]. Here is how we would change the value found in the first row of the Magnitude column to 7.0, using iloc[]:

# Pattern -> df.iloc[row, listOfColumnIndeces]
data.iloc[0:1, [8]] = 7.0

As you an see, we are no longer using the column’s name; instead we use a list of index values that represent the column we are trying to select. This is really useful if you don’t know the exact name of a column but you know where it is. You can also use slicing to grab rows or columns that are adjacent to one another:

# Pattern -> df.iloc[row range, column range]
data.iloc[0:1, 8:9] = 3.0

Finally, if you want to look up a particular value in a row or column, you can use at[] and iat[].

data.at[10, 'Magnitude'] # Displays the value in row 10 of the Magnitude column
data.iat[10, 8] # Displays the value in row 10 of the column at index 8 (the Magnitude column).