Bringing Data In: Files

The data we want to analyze will most often come from outside of Python. In many cases, we have a data dump from a database or website and need to bring it into Python to sift through it. A data dump gets its name from containing a large amount of data (possibly at a very granular level), with little to no filtering applied up front; for this reason, data dumps can often be unwieldy.

Often, these data dumps will come in the form of a text file (.txt) or a CSV file (.csv). Pandas provides many methods to read in different types of files, so it is simply a matter of looking up the one that matches our file format. For example, consider the CSV file found at: https://drive.google.com/file/d/1lFZI0iSKi9ReuAKHIVhjBCs_o2kofRdX/view?usp=sharing

If we want to bring the data into a dataframe object, we use the function: pd.read_csv(). However, we should always do some initial inspection of the file before attempting to read it in; this will inform us whether we need to pass additional arguments, such as sep to specify the delimiter or names to provide the column names ourselves in the absence of a header row in the file.
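As a quick illustration, suppose we had a pipe-delimited text file with no header row; the call might look like the following (the file name and column names here are made up for the example):

import pandas as pd

# Hypothetical example: a pipe-delimited file without a header row.
df = pd.read_csv(
    'data.txt',                       # made-up file name
    sep='|',                          # the delimiter is a pipe, not a comma
    names=['id', 'title', 'rating'],  # supply the column names ourselves
)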

We can use an IPython shell environment in Google Colab by placing ! in front of commands you would typically execute in your terminal or command line. Here, I am using the wc (word count) utility on the CSV file linked above; with no flags, it prints the line, word, and byte counts. After uploading the CSV file to Google Colab, we can run the following:

!wc top100_kdrama.csv

Output:

[screenshot showing wc's line, word, and byte counts for top100_kdrama.csv]
As you can see, there are a ton of words in this file. What if we wanted to count the number of rows in the file instead? We just add the -l flag to the shell command seen above, like so:

!wc -l top100_kdrama.csv

Output:

101 top100_kdrama.csv
We have 101 rows in the file. Please note, the flag is a lowercase L (l), not the number one (1). :)
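If you would rather stay in Python, a quick (if less efficient) way to get the same count is to iterate over the file and tally the lines:

# Count the lines in the file without leaving Python.
with open('top100_kdrama.csv') as f:
    row_count = sum(1 for _ in f)
print(row_count)  # 101, matching wc -l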

Let’s now imagine that I have uploaded a folder of data instead and would like to check the file size of our K-drama CSV. Here is what my file structure looks like now:

[screenshot of the Colab file browser: top100_kdrama.csv now sits inside the sample_data directory]
For this task, we will use ls on the sample_data directory (the folder that contains all of the datasets); this lists the files in that directory. We add the -lh flag to get information about the files in a human-readable format. Finally, we pipe this output to the grep utility, which filters the listing down to the file we want. This tells us that the file is 82K.

!ls -lh sample_data | grep top100_kdrama.csv

Output:

[screenshot of the filtered listing; the size column shows 82K for top100_kdrama.csv]
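As an aside, we could get the same size information without the shell by asking Python's standard library directly:

import os

# File size in bytes; divide by 1024 to express it in KB.
size_bytes = os.path.getsize('sample_data/top100_kdrama.csv')
print(f'{size_bytes / 1024:.0f}K')  # roughly 82K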
Now, let’s take a look at the top few rows to see if the file comes with headers. We will use the head utility and specify the number of rows with the -n flag. This tells us that the first row contains the headers for the data and that the data is delimited with commas (just because the file is a CSV file does not mean that it is comma-delimited).

!head -n 2 sample_data/top100_kdrama.csv

Output:

[screenshot of the first two lines of the file: the header row followed by the first data row]
The output of this is actually quite wide, so the screenshot is cut off; there are many more column values than shown. We can also check the bottom rows to make sure there is no extraneous data that we will need to ignore by using the `tail` utility. All you need to do is replace `head` with `tail`.
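If we would rather preview the file with pandas instead of the shell, the nrows argument (covered again in the list below) lets us read in just the first couple of rows; this is a small sketch assuming the file parses cleanly:

import pandas as pd

# Read only the first two rows to inspect the headers and a sample record.
preview = pd.read_csv('sample_data/top100_kdrama.csv', nrows=2)
print(preview)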

Lastly, we may be interested in seeing the column count in our data. While we could just count the fields in the first row of the result of head, we have the option of using the awk utility (for pattern scanning and processing) to count our columns. The -F flag allows us to specify the delimiter (a comma, in this case). Then, we specify what to do for each record in the file. We choose to print NF, which is a predefined variable whose value is the number of fields in the current record. Here, we exit immediately after the print so that we print the number of fields for the first row of the file and then stop. This may look a little complicated, but it is by no means something that you need to memorize:

!awk -F',' '{print NF; exit}' sample_data/top100_kdrama.csv # The output is 14 for this file because there are 14 columns
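We could reproduce this count in Python as well; using the csv module has the added benefit of correctly handling quoted fields that contain commas, which the awk one-liner would miscount:

import csv

# Count the fields in the first row (the header) of the file.
with open('sample_data/top100_kdrama.csv') as f:
    first_row = next(csv.reader(f))
print(len(first_row))  # 14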

The ability to run shell commands directly from our Colab notebook dramatically streamlines our workflow. However, if you do not have prior experience with the command line, these commands may feel complicated at first. Check out this resource for more information on shell commands: https://ipython.readthedocs.io/en/stable/interactive/reference.html#system-shell-access

We have 14 columns and 101 rows, with the first one being the header. The file is 82KB and is comma-delimited. This means we can use pd.read_csv() to create our dataframe object.

import pandas as pd

df = pd.read_csv('sample_data/top100_kdrama.csv')
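If the file was read correctly, the dataframe's shape should reflect what we learned from the shell: 100 data rows (101 lines minus the header) and 14 columns.

df.shape   # expect (100, 14)
df.head()  # peek at the first five rows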

There are many arguments that you may need to add to this call. Here are some of the most popular arguments you can pass to the read_csv() function (a short sketch using a few of them follows the list):

Arguments

  1. sep - Specifies the delimiter

  2. header - Row number where the column names are located; the default option has pandas infer what it thinks is present

  3. names - List of column names to use as the header

  4. index_col - Column to use as the index

  5. usecols - Specifies which columns to read in

  6. dtype - Specifies the data types of the columns

  7. converters - Specifies functions for converting data into certain columns

  8. skiprows - Rows to skip

  9. nrows - Number of rows to read at a time (combine with skiprows to read a file bit by bit)

  10. parse_dates - Automatically parse columns containing dates into datetime objects

  11. chunksize - For reading the file in chunks

  12. compression - For reading in compressed files without extracting beforehand

  13. encoding - Specifies the file encoding
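To make a few of these concrete, here is a sketch that combines several arguments; the column names ('Name', 'Rating') are placeholders for illustration, not confirmed fields of this file:

df = pd.read_csv(
    'sample_data/top100_kdrama.csv',
    index_col='Name',            # hypothetical column to use as the index
    usecols=['Name', 'Rating'],  # read in only these (assumed) columns
    dtype={'Rating': 'float64'}, # force the (assumed) rating column to floats
)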

Finally, we should also learn how to save our dataframe to a file to share with others. Here, we have to be careful; if our dataframe’s index is just row numbers, we probably don’t want to write that to our file (it has no meaning to consumers of the data), but writing the index is the default:

df.to_csv('output.csv', index=False)
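As a quick sanity check, we can read the file back in and confirm that its shape matches the dataframe we saved:

pd.read_csv('output.csv').shape  # should match df.shape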