Data Analysis With Python
It’s time to put everything we’ve learned together and start doing some data analysis with Python!
Our dataset
The dataset we’ll be working with today is one that was published in Peter H. Rossi, Richard A. Berk and Kenneth J. Lenihan’s 1980 book, Money, Work, and Crime: Some Experimental Results (https://doi.org/10.1016/C2013-0-11412-2). The dataset is made available in the R carData package, and is described as follows:
This data set is originally from Rossi et al. (1980), and is used as an example in Allison (1995). The data pertain to 432 convicts who were released from Maryland state prisons in the 1970s and who were followed up for one year after release. Half the released convicts were assigned at random to an experimental treatment in which they were given financial aid; half did not receive aid.
The documentation page describes the columns that are present in the dataset.
Let’s start by downloading this dataset into our Python project directory.
The Pandas Package
To do our data analysis we’ll be using the wildly popular data analysis package, pandas.
If you’re familiar with the R programming language, you’ll feel right at home with pandas, which draws a lot of inspiration from R, particularly its data frame.
Let’s start creating our analysis in the analysis.py file.
Before we can use the package, make sure you’ve installed it by running the following command in the PyCharm terminal tab:
conda install pandas
Importing Data With Pandas
Start by importing the pandas library at the top of our file with an import statement:
import pandas as pd
To import a dataset, pandas provides a set of functions covering a variety of formats:
read_csv
read_json
read_excel
read_html
read_sas
read_stata
Our data is in CSV format, so we’ll import it using the read_csv function:
data = pd.read_csv('Rossi.csv')
Objects in Python
We’ve already seen that variables in Python can have a type like int, float, or string. But there’s also a whole host of more complex types, like the pandas data frame. Often called objects, these more complex variables hold your data just like a regular variable, but they can also carry extra data of their own.
For example, the data frame keeps a record of the number of columns and rows in our dataset.
print(data.shape)
(432, 62)
In the above code, shape refers to some extra data that’s stored in our object. We call this data an attribute, though you might also hear people call it a property or member.
In addition to data, objects like this can come with their own functions. For example, the pandas data frame comes with a head() function, which returns the first few rows of the dataset.
print(data.head())
In all these instances, we use the same dotted notation for an object’s attributes that we used for the functions in libraries, because the relationship is the same: the thing after the dot belongs to the thing before it.
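To make the distinction concrete, here’s a small sketch contrasting the two, using the data frame we loaded above:

print(data.shape)     # an attribute: stored data, accessed without parentheses
print(data.columns)   # another attribute: the column names
print(data.head())    # a function (method): called with parentheses
print(data.head(10))  # and it can take arguments, here the number of rows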
Exploring Our Data
To make sure our data has been imported correctly, we can use the data frame’s info() function:
print(data.info())
This gives us a variety of information about our data frame:
- It has 432 entries and 62 columns of data, matching the shape we printed earlier
- For each column, we can see its name, the number of non-empty (non-null) values, and the data type that pandas has given to that column.
- The amount of memory that our data consumes
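If you’d rather work with any of these pieces individually instead of reading a printed report, pandas also exposes each of them programmatically. A small sketch:

print(data.dtypes)                         # the data type of each column
print(data.isna().sum())                   # the number of missing (null) values per column
print(data.memory_usage(deep=True).sum())  # total memory consumed, in bytes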
Generating Summary Statistics
Another common way to get a feel for the data is to generate summary statistics for each column. The DataFrame.describe() function generates summary statistics for every column that holds numerical data. All other columns are ignored, unless you pass the argument include='all', in which case describe() also reports details like the number of unique values and the most common value for the non-numeric columns.
print(data.describe())
This prints the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.
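To summarize the non-numeric columns as well, pass include='all' as described above:

print(data.describe(include='all'))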
Subsetting and Filtering Data with Pandas
Now that we’ve had a chance to explore our data a little bit, we can start to dig a little deeper.
Filtering Using a Boolean Mask
Let’s start by trying to answer a simple question: Did the financial support program have an effect on recidivism?
Let’s create two subsets of our data, one where the individuals received financial support, and one where they did not:
One of the most common ways you’ll see filtering done is by using a boolean mask. A boolean mask is a list of boolean values, one for each row in our dataset. If the value is True, we know the row satisfied our condition; if it is False, we know that it didn’t:
support_mask = data["fin"] == "yes"
Once we have our mask, we can then apply it to our data frame:
data_support = data[support_mask]
print(data_support.head())
This works because the pandas data frame knows that if we index it using a list of booleans the same length as the data frame, it should return all of the rows where the boolean value is True.
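If you’re curious what the mask actually looks like, you can print it; it’s simply a pandas Series of booleans, one per row:

print(support_mask.head())          # True/False for the first few rows
print(support_mask.value_counts())  # how many rows are True vs. False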
This same strategy works with more complex conditions as well, though it can get a bit finicky: each condition has to be wrapped in parentheses, and you have to combine them with & rather than the usual and:
support_over_50_mask = (data["fin"] == "yes") & (data["age"] >= 50)
data_support_over_50 = data[support_over_50_mask]
print(data_support_over_50.head())
In those cases, it’s often easier to use the built-in query() method:
data_support_over_50 = data.query("fin == 'yes' and age >= 50")
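One handy feature of query() is that you can reference ordinary Python variables inside the query string by prefixing them with @. A small sketch (min_age is just an illustrative name):

min_age = 50
data_support_over_50 = data.query("fin == 'yes' and age >= @min_age")
print(data_support_over_50.head())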
Once we’ve got a subset of our data, it’s fairly easy to generate statistics on it. For example, if we want to know how many people re-offended in each of our groups, we can sum the arrest column:
financial_support_mask = data["fin"] == "yes"
data_financial_support = data[financial_support_mask]
support_arrest_count = data_financial_support['arrest'].sum()
print(support_arrest_count)
And if we want to calculate the proportion of recidivism, we can divide the number of arrests by the number of cases in the subset:
financial_support_mask = data["fin"] == "yes"
data_financial_support = data[financial_support_mask]
support_total_count = len(data_financial_support)
support_arrest_count = data_financial_support['arrest'].sum()
support_recidivism_rate = support_arrest_count / support_total_count
print("Recidivism rate - Financial support: ", support_recidivism_rate)
This gives us roughly 0.222: 48 arrests out of the 216 people who received support. If we do the same for the no-support group:
no_financial_support_mask = data["fin"] == "no"
data_no_financial_support = data[no_financial_support_mask]
no_support_total_count = len(data_no_financial_support)
no_support_arrest_count = data_no_financial_support['arrest'].sum()
no_support_recidivism_rate = no_support_arrest_count / no_support_total_count
print("Recidivism rate - No financial support: ", no_support_recidivism_rate)
This gives us roughly 0.306: 66 arrests out of 216 people. So the group that received financial support was rearrested at a noticeably lower rate over the follow-up year.
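As an aside: because the arrest column is coded with 0s and 1s (which is what makes summing it count the arrests), the recidivism rate is simply the mean of that column, so each calculation above collapses to a single line:

print("Recidivism rate - Financial support: ", data[data["fin"] == "yes"]["arrest"].mean())
print("Recidivism rate - No financial support: ", data[data["fin"] == "no"]["arrest"].mean())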
There’s a whole suite of functions we can use to calculate aggregations like these, including:
| Function Name | NaN-safe Version | Description |
|---------------|------------------|-------------|
| np.sum | np.nansum | Compute sum of elements |
| np.prod | np.nanprod | Compute product of elements |
| np.mean | np.nanmean | Compute mean of elements |
| np.std | np.nanstd | Compute standard deviation |
| np.var | np.nanvar | Compute variance |
| np.min | np.nanmin | Find minimum value |
| np.max | np.nanmax | Find maximum value |
| np.argmin | np.nanargmin | Find index of minimum value |
| np.argmax | np.nanargmax | Find index of maximum value |
| np.median | np.nanmedian | Compute median of elements |
| np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
| np.any | N/A | Evaluate whether any elements are true |
| np.all | N/A | Evaluate whether all elements are true |
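These are NumPy functions, but they work directly on pandas columns. A small sketch, assuming NumPy is available (it’s installed automatically alongside pandas):

import numpy as np

print(np.mean(data["age"]))        # the average age at release
print(np.nanmedian(data["prio"]))  # the median number of prior convictions, ignoring missing values

pandas columns also offer equivalent methods such as .mean(), .median(), and .max(), which skip missing values by default.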
Putting everything together so far, our code looks like this:
import pandas as pd

data = pd.read_csv('Rossi.csv')

financial_support_mask = data["fin"] == "yes"
data_financial_support = data[financial_support_mask]
support_total_count = len(data_financial_support)
support_arrest_count = data_financial_support['arrest'].sum()
support_recidivism_rate = support_arrest_count / support_total_count
print("Recidivism rate - Financial support: ", support_recidivism_rate)

no_financial_support_mask = data["fin"] == "no"
data_no_financial_support = data[no_financial_support_mask]
no_support_total_count = len(data_no_financial_support)
no_support_arrest_count = data_no_financial_support['arrest'].sum()
no_support_recidivism_rate = no_support_arrest_count / no_support_total_count
print("Recidivism rate - No financial support: ", no_support_recidivism_rate)
You might have noticed that our code above isn’t particularly DRY (Don’t Repeat Yourself). We’ve already repeated ourselves a couple of times here, so we have some opportunities for improvement. Primarily, there are two places where we’re doing almost the exact same thing:
- Subsetting our data
- Calculating the recidivism rate
How might we create a more DRY version of this code? Functions!
First let’s define a function that will do the work of filtering our data:
def subset_data(data, column, value):
    mask = data[column] == value
    return data[mask]
Then, we can create another function to handle calculating the recidivism rate:
def calculate_recidivism_rate(data):
    total_count = len(data)
    arrest_count = data['arrest'].sum()
    return arrest_count / total_count
Now that we’ve got our functions defined, all we have to do is call them:
data_financial_support = subset_data(data, "fin", "yes")
data_no_financial_support = subset_data(data, "fin", "no")
support_recidivism_rate = calculate_recidivism_rate(data_financial_support)
no_support_recidivism_rate = calculate_recidivism_rate(data_no_financial_support)
print("Recidivism rate - Financial support: ", support_recidivism_rate)
print("Recidivism rate - No financial support: ", no_support_recidivism_rate)
Analyzing Multiple Files
Our dataset only has one file, but suppose we had a second replication of this study, conducted in another country, and wanted to repeat our analysis for each file. The glob module makes it easy to analyze multiple files.
Let’s create a second file to process by making a copy of our dataset. To keep everything neat and tidy, I’ll move them both into a new data folder inside our project directory.
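If you’d prefer to do the copying in Python rather than in your file browser, here’s a minimal sketch (the file names are the ones that appear in the glob output below):

import shutil
from pathlib import Path

Path('data').mkdir(exist_ok=True)                  # create the data folder if it doesn't exist
shutil.copy('Rossi.csv', 'data/Rossi-US.csv')      # our original dataset
shutil.copy('Rossi.csv', 'data/Rossi-Canada.csv')  # a copy standing in for the replication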
The glob module provides a glob() function, which produces a list of the files in a directory that match a pattern:
import glob
files = glob.glob('data/*.csv')
print(files)
This gives us a list containing our two files: ['data/Rossi-US.csv', 'data/Rossi-Canada.csv']
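Note that glob makes no promises about the order of its results, so if you want the files processed in a consistent order, sort the list first:

files = sorted(glob.glob('data/*.csv'))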
Now that we’ve got that, all we’ve got to do is loop through our files one at a time and process them using our functions:
import glob
import pandas as pd


def subset_data(data, column, value):
    mask = data[column] == value
    return data[mask]


def calculate_recidivism_rate(data):
    total_count = len(data)
    arrest_count = data['arrest'].sum()
    return arrest_count / total_count


files = glob.glob('data/*.csv')
for file in files:
    data = pd.read_csv(file)
    data_financial_support = subset_data(data, "fin", "yes")
    data_no_financial_support = subset_data(data, "fin", "no")
    support_recidivism_rate = calculate_recidivism_rate(data_financial_support)
    no_support_recidivism_rate = calculate_recidivism_rate(data_no_financial_support)
    print(file)
    print(support_recidivism_rate)
    print(no_support_recidivism_rate)
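Since our second file is just a copy of the first, both files will report identical rates here; with a genuine replication dataset, the same loop would report each study’s rates without any extra code.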