Introduction to PandasΒΆ

Date: April 4th, 2016

Class: Code 401: Software Development with Python

Pandas is a great tool for reading, storing, and analyzing data. Let’s walk through it.

In [1]:
import pandas as pd

It’s simple to read data into pandas if you know what the format of your dataset looks like. If it’s comma-separated (csv), then you can use the pandas read_csv module with its default settings. The name of the data file is “titanic_data.csv”. Change the path to the data depending on where it’s saved on your machine.

In [2]:
infile = "../downloads/titanic_data.csv"
your_data = pd.read_csv(infile)

Just like in the command line, you can use “head” to look at the first few rows of data.

In [3]:
your_data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Pandas preserves the column names in the data file to go along with every column. read_csv can also load files delimited in other ways by setting the sep parameter. For tab-separated data, it’s sep='\t'. For pipes, it’s sep='|'. And so on.

Each row in this data set corresponds to a passenger, and each column contains details for that passenger (Sex, Age, etc.). The object that the data is read into is called a DataFrame. The column furthest to the left is the index of each row in this DataFrame.

One of the great things about DataFrames is that you can return data from the DataFrame by the name of the column. So, if I wanted to look specifically at the genders of all the passengers I can access it like I would in a dictionary.

In [4]:
your_data["Sex"][:10]
Out[4]:
0      male
1    female
2    female
3    female
4      male
5      male
6      male
7      male
8    female
9    female
Name: Sex, dtype: object

You can do the same thing using dot notation. The difference is that the dictionary notation allows for column names with spaces and dashes. Dot notation is best for simple column names, and in the case of this data set works here.

In [5]:
your_data.Age[:10]
Out[5]:
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

You can retrieve multiple columns at once using something similar to dictionary notation, though instead of providing keys, you provide a list of keys.

In [6]:
your_data[["Sex", "Age"]][:10]
Out[6]:
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
5 male NaN
6 male 54.0
7 male 2.0
8 female 27.0
9 female 14.0

As you’ve seen thus far, you can slice DataFrames like you can slice lists. You can also slice the DataFrame using boolean criteria. For example, we can look at data for just male passengers on the Titanic.

In [7]:
your_data[your_data.Sex == "male"][:10]
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S

Notice how only the indices that you’re interested remain. Similarly, you can filter on numeric criteria. Let’s get all passengers older than 22

In [8]:
your_data[your_data.Age > 22][:10]
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S

I can combine multiple criteria together with boolean operators

In [9]:
your_data[(your_data.Age > 22) & (your_data.Sex == "male")][:10]
Out[9]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
34 35 0 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C
35 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S

Awesome. Note that any sliced or filtered DataFrame is also a DataFrame. So all of the same methods that were with the parent are now with the sliced/filtered child.

Now that we can access the data that we want, we can do stuff to it. Let’s find the total Fare spent on tickets for the Titanic.

In [10]:
import numpy as np
np.sum(your_data.Fare)
Out[10]:
28693.949299999967

What’s the Sex of the passenger that paid the most?

In [11]:
your_data[your_data.Age == np.max(your_data.Age)].Sex
Out[11]:
630    male
Name: Sex, dtype: object

There’s plenty more that a DataFrame can do. Dig into the documentation like so (this isn’t Pandas-specific)

In [12]:
? your_data

If you haven’t run the above cell yet, it pops up a window at the bottom of your screen containing all the documentation that you can also access from the .doc method. Dive in, and Slack me your questions.