Data Structures

Python has several built-in data structures, which we learned about in Unit 2 (e.g. lists and dictionaries). We then learned about the array object, which is a data structure found in NumPy. Pandas provides two main structures to facilitate working with data: Series and DataFrame. Each of them contains another pandas data structure that is very important to be aware of: Index. To understand these data structures, it helps to keep NumPy array objects in mind, because the pandas data structures are built on top of them.

All of these data structures are created from Python classes, which are the blueprints your computer needs to actually construct an object of a specific type (e.g. a dictionary object, a NumPy array object, a Series object, etc.). When we create a new object in our program, we say that we are instantiating it, and we refer to the resulting objects as instances of a class. This distinction matters because some actions are performed using the object itself (that is, we call a method on the object using dot notation), whereas others require that we pass the object in as an argument to a function.
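As a minimal sketch of the difference between a method and a function, the snippet below instantiates a Series and then gets the same answer both ways. The magnitude values and variable names are made up for illustration.

```python
import pandas as pd

# Instantiating: calling the class creates a new object (an instance).
# The values are illustrative, not real measurements.
magnitudes = pd.Series([6.7, 5.2, 5.7], name='mag')

# isinstance() confirms which class the object was created from.
is_series = isinstance(magnitudes, pd.Series)

# A method is called ON the object using dot notation.
largest_via_method = magnitudes.max()

# A function takes the object as an argument instead.
largest_via_function = max(magnitudes)
count = len(magnitudes)

print(is_series, largest_via_method, largest_via_function, count)
```

Both calls find the largest value; the only difference is whether the object sits before the dot or inside the parentheses.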

We use a pandas function to read a CSV file into an object of the DataFrame class, but we use methods on our DataFrame objects to perform actions on them, such as dropping columns or calculating summary statistics. With pandas, we will often want to access the attributes of the object we are working with. This won’t perform an action the way a method or a function would; rather, it gives us information about our pandas object, such as its dimensions, column names, data types, and whether it is empty.
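The three kinds of access described above can be sketched as follows. To keep the example self-contained, a small in-memory string stands in for a CSV file on disk; its column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = io.StringIO(
    "mag,place,tsunami\n"
    "6.7,Place A,1\n"
    "5.2,Place B,0\n"
)

# Function: pd.read_csv() builds a new DataFrame object.
quakes = pd.read_csv(csv_text)

# Methods: called on the object to perform an action.
trimmed = quakes.drop(columns=['tsunami'])  # returns a new DataFrame
mean_mag = quakes['mag'].mean()             # summary statistic

# Attributes: no parentheses; they only report information.
dims = quakes.shape            # (number of rows, number of columns)
names = list(quakes.columns)   # column names
is_empty = quakes.empty        # False, because there are rows

print(dims, names, is_empty)
```

Note that `drop()` returns a new DataFrame here and leaves `quakes` unchanged.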

Series:

The `pandas.Series` class provides a data structure for arrays of data of a single type. It's very similar to a NumPy array, except that it comes with some additional functionality. This one-dimensional representation can be thought of as a column in a spreadsheet. We have a name for our column, and the data we hold in it is of the same type (because we are measuring the same variable):

import pandas as pd

# The following data comes from the US Geological Survey (USGS) on Earthquakes
info = {
    0: '262km NW of Ozernovskiy, Russia',
    1: '25km E of Bitung, Indonesia',
    2: '42km WNW of Sola, Vanuatu',
    3: '13km E of Nueva Concepcion, Guatemala',
    4: '128km SE of Kimbe, Papua New Guinea'
}

place = pd.Series(info, name='place')

print(place)

Output:

At the bottom of this program’s output is Name: place, dtype: object. This tells us that the Series is named place and that its values are stored with the object data type, the general-purpose type that pandas uses here for strings.

Furthermore, the row labels shown on the left of the output (0 through 4 here) are stored in an Index object, which describes how the entries of a Series object are labeled and ordered.

Index

The addition of the `Index` class makes the `Series` class significantly more powerful than a NumPy array. The Index class gives us row labels, which enable us to select data by row; depending on the type of Index, we can provide a row number, a date, or even a string to select our row. It plays a key role in identifying entries in the data and is used for a multitude of operations in pandas, as we will see throughout the remainder of this text. We access the Index object through the index attribute:

place_index = place.index
print(place_index)

Output:

The index object is built on top of a NumPy array:

place_index.values

Output:

Because the values are a NumPy array, you can use NumPy methods and attributes to work with them.
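A short sketch of what this allows, assuming a Series with the default integer row labels. The Series contents and variable names are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# A Series with the default integer row labels 0 through 4.
letters = pd.Series(['a', 'b', 'c', 'd', 'e'], name='letters')

# The row labels as a NumPy array.
labels = letters.index.values
is_ndarray = isinstance(labels, np.ndarray)

# Ordinary NumPy methods, attributes, and indexing now apply.
largest_label = labels.max()
label_count = labels.shape[0]
even_labels = labels[labels % 2 == 0]  # boolean masking

print(is_ndarray, largest_label, label_count, even_labels)
```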

Some useful attributes that are available on Index objects include the following:

Index Object Attributes

name - The name of the Index object
dtype - The data type of the Index object
shape - The dimensions of the Index object
values - The data in the Index object as a NumPy array
is_unique - Checks if the Index object has all unique values
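To see a few of these attributes in action, along with the label-based selection described at the start of this section, here is a small sketch. The station codes, dates, and measurements are hypothetical, and `.loc` is one common way to select by label.

```python
import pandas as pd

# Hypothetical station codes used as string row labels.
depth_km = pd.Series(
    [10.0, 35.5, 7.2],
    index=pd.Index(['stn_a', 'stn_b', 'stn_c'], name='station'),
    name='depth_km',
)

idx = depth_km.index
idx_name = idx.name         # name given to the Index
idx_shape = idx.shape       # one dimension holding three labels
idx_unique = idx.is_unique  # no label is repeated

# An Index with a repeated label is not unique.
has_repeats_unique = pd.Index(['stn_a', 'stn_a']).is_unique

# Select a row by its string label.
by_label = depth_km.loc['stn_b']

# With dates as row labels, a date selects the row.
daily_counts = pd.Series(
    [1, 4, 2],
    index=pd.to_datetime(['2018-09-01', '2018-09-02', '2018-09-03']),
    name='quake_count',
)
by_date = daily_counts.loc[pd.Timestamp('2018-09-02')]

print(idx_name, idx_shape, idx_unique, by_label, by_date)
```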

DataFrame:

With the Series class, we essentially had columns of a spreadsheet, with the data all being of the same type. The DataFrame class builds upon the Series class; we can think of it as representing the spreadsheet as a whole. It can have many columns, each with its own data type. We can turn the example data into a DataFrame object.

# Notice that we pass the Series object into the DataFrame constructor
df = pd.DataFrame(place)
df # In a notebook, a bare variable name on the last line of a cell displays the object

Output:

This gives us a DataFrame containing one Series object (one column of data). Each column has a single data type, though the data types do not have to be the same across columns.

df.dtypes

Output:

The following are some common attributes of DataFrame objects:

DataFrame Attributes

dtypes - Describes the data types of each column in the dataframe.
shape - Dimensions of the DataFrame object in a pair (number of rows, number of columns)
index - The Index object that is part of the DataFrame object
columns - The names of the columns (as an Index object)
values - The values in the DataFrame object as a NumPy array
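As a rough illustration of these attributes, the sketch below builds a DataFrame with three columns of different data types. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical data: each column holds a different type of value.
quakes = pd.DataFrame({
    'place': ['Place A', 'Place B', 'Place C'],
    'mag': [6.7, 5.2, 5.7],
    'tsunami': [True, False, False],
})

# Each column of a DataFrame is itself a Series.
column_is_series = isinstance(quakes['mag'], pd.Series)

# Attributes report information without changing anything.
n_rows, n_cols = quakes.shape
column_names = list(quakes.columns)   # columns is an Index object
row_labels = list(quakes.index)       # default labels 0, 1, 2
mag_dtype = quakes.dtypes['mag']          # floating-point column
tsunami_dtype = quakes.dtypes['tsunami']  # Boolean column

print(n_rows, n_cols, column_names, row_labels)
print(quakes.dtypes)
```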
