Bringing Data In: Python Objects

Now that we have seen the data structures we will be working with for the remainder of the semester, we can focus on different ways of creating (instantiating) them. To do so, let’s turn to an example, which utilizes the Python packages datetime, numpy, and pandas.

import datetime as dt
import numpy as np
import pandas as pd

Before we cover all the ways we can turn a Python object into a DataFrame, we should look at how we make a Series object. Remember that a Series object is essentially a column of our DataFrame object, so, once we know this, it should be easy to guess how to create a DataFrame object. Say we wanted to create a Series object of five random numbers between 0 and 1. We could use numpy to generate an array of random numbers and create a Series from that.

To ensure that the result is reproducible, we will set the seed here. The seed gives a starting point for the generation of pseudorandom numbers. Pseudorandom number generators are not truly random: they are deterministic algorithms, so setting this starting point means the numbers generated will be the same each time the code is run. This is useful for testing and for reproducible examples; when we want different results on every run (as in some simulations), we leave the seed unset.
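As a quick check of this determinism, here is a minimal sketch that is separate from the running example. The variable names first, second, and third are illustrative only. It reseeds the generator and confirms that the same five numbers come back, while continuing without reseeding produces different ones.

```python
import numpy as np

np.random.seed(0)             # fix the starting point
first = np.random.rand(5)

np.random.seed(0)             # reset to the same starting point
second = np.random.rand(5)

third = np.random.rand(5)     # no reseed: the generator simply continues

print(np.array_equal(first, second))  # same seed, same numbers
print(np.array_equal(first, third))   # later draws differ from the first five
```

With that in mind, the Series for our example is created as follows.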

np.random.seed(0) # set a seed for reproducibility
pd.Series(np.random.rand(5), name = 'random')

Output:

We can make a Series object with any list-like structure (such as NumPy arrays) by passing it into pd.Series(). Making a DataFrame object is an extension of making a Series object; our dataframe will be composed of one or more series, and each will be distinctly named. This should remind us of dictionary-like structures in Python: the keys are the column names, and the values are the content of the columns.
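The following sketch illustrates both points under simple assumptions: a list, a tuple, and a NumPy array each become a Series, and a dictionary of those Series becomes a DataFrame whose column names are the dictionary keys. The names from_list, from_tuple, from_array, and small_df are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Any list-like structure can be passed to pd.Series()
from_list = pd.Series([1, 2, 3], name='from_list')
from_tuple = pd.Series((1.5, 2.5, 3.5), name='from_tuple')
from_array = pd.Series(np.array([10, 20, 30]), name='from_array')

# A dictionary maps column names (keys) to column contents (values)
small_df = pd.DataFrame({
    'from_list': from_list,
    'from_tuple': from_tuple,
    'from_array': from_array,
})

print(small_df)
print(small_df.dtypes)  # each column keeps its own data type
```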

Since DataFrame columns can all be different data types, let’s get a little fancy with our next example. We are going to create a DataFrame object of three columns, with five observations each:

- random: Five random numbers between 0 and 1 as a NumPy array.
- text: A list of five strings or None.
- truth: A list of five random Booleans.

We will also create a DatetimeIndex object with the pd.date_range function. The index will be five dates (periods), all one day apart (freq='1D'), ending with September 23rd, 2021 (end), and be called date (name).
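To see what this index looks like on its own, the call can be run in isolation. This is a small sketch; the variable name dates is chosen here only for illustration.

```python
import datetime as dt
import pandas as pd

# Five daily dates ending on September 23rd, 2021, with the index named 'date'
dates = pd.date_range(end=dt.date(2021, 9, 23), freq='1D', periods=5, name='date')

print(dates)
print(dates[0])   # first date: counting back four days from the end gives 2021-09-19
print(dates[-1])  # last date: 2021-09-23
```

Because end and periods are given, pandas works backwards from the end date to find where the range starts.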

All we have to do is package the columns in a dictionary using the desired column names as the keys and pass this to pd.DataFrame().

np.random.seed(0)

# Generate a list of length 5 that contains random Boolean values
randoTruth = []
for i in range(5):
  x = np.random.choice([True, False])
  randoTruth.append(x)

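# Package the columns in a dictionary: keys are column names, values are the column contents
# (note that the name dict shadows Python's built-in dict type)
dict = {
'random': np.random.rand(5),
'text': ['hot', 'warm', 'cool', 'cold', None],
'truth': randoTruth  # the full list of five Booleans, not just the last value x
}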

# Create the dataframe given the dictionary created above
df = pd.DataFrame(dict, index = pd.date_range(end = dt.date(2021, 9, 23), freq='1D', periods=5, name='date'))


Having dates in the index makes it easy to select entries by date (or even in a date range). In the above example, the statement pd.DataFrame(dict, index = pd.date_range(end = dt.date(2021, 9, 23), freq='1D', periods=5, name='date')) passes several keyword arguments: index goes to the DataFrame constructor, while end, freq, periods, and name go to the pd.date_range function, which returns the DatetimeIndex. The pattern argumentName = value is common in functions used to instantiate data structure objects, because it lets us configure details such as the index and its name. Pandas has default values for these arguments if you choose not to pass in any values of your own. We could rewrite the statement to be:

df = pd.DataFrame(dict)

Output:

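As you can see, the output is similar, though it lacks the custom details we described above: without an index argument, pandas labels the rows with a default integer index (0 through 4) instead of dates. You can do a lot once you learn which arguments can be changed within certain function calls.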
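To make the difference concrete, the sketch below builds a smaller two-column version of the DataFrame twice, once with the date index and once with the default index, and then selects rows by date. The names data, dated, plain, one_day, and window are illustrative and not part of the example above.

```python
import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(0)
data = {
    'random': np.random.rand(5),
    'text': ['hot', 'warm', 'cool', 'cold', None],
}

# Same data, two different indexes
dated = pd.DataFrame(
    data,
    index=pd.date_range(end=dt.date(2021, 9, 23), freq='1D', periods=5, name='date'),
)
plain = pd.DataFrame(data)  # no index given, so pandas uses its default

print(type(dated.index).__name__)  # DatetimeIndex
print(type(plain.index).__name__)  # RangeIndex

# Select a single date, then an inclusive range of dates
one_day = dated.loc[pd.Timestamp('2021-09-21')]
window = dated.loc['2021-09-20':'2021-09-22']

print(one_day)
print(window)
```

Note that slicing a DatetimeIndex with date strings includes both endpoints, so window contains three rows.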