Introduction to Pandas¶

Date: April 4th, 2016

Class: Code 401: Software Development with Python

Pandas is a great tool for reading, storing, and analyzing data. Let’s walk through it.

In [1]:

import pandas as pd

It’s simple to read data into pandas if you know what the format of your dataset looks like. If it’s comma-separated (csv), then you can use the pandas read_csv module with its default settings. The name of the data file is “titanic_data.csv”. Change the path to the data depending on where it’s saved on your machine.

In [2]:

infile = "../downloads/titanic_data.csv"
your_data = pd.read_csv(infile)

Just like in the command line, you can use “head” to look at the first few rows of data.

In [3]:

your_data.head()

Out[3]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Pandas preserves the column names in the data file to go along with every column. read_csv can also load files delimited in other ways by setting the sep parameter. For tab-separated data, it’s sep='\t'. For pipes, it’s sep='|'. And so on.

Each row in this data set corresponds to a passenger, and each column contains details for that passenger (Sex, Age, etc.). The object that the data is read into is called a DataFrame. The column furthest to the left is the index of each row in this DataFrame.

One of the great things about DataFrames is that you can return data from the DataFrame by the name of the column. So, if I wanted to look specifically at the genders of all the passengers I can access it like I would in a dictionary.

In [4]:

your_data["Sex"][:10]

Out[4]:

    male
  female
  female
  female
    male
    male
    male
    male
  female
  female
Name: Sex, dtype: object

You can do the same thing using dot notation. The difference is that the dictionary notation allows for column names with spaces and dashes. Dot notation is best for simple column names, and in the case of this data set works here.

In [5]:

your_data.Age[:10]

Out[5]:

  22.0
  38.0
  26.0
  35.0
  35.0
   NaN
  54.0
   2.0
  27.0
  14.0
Name: Age, dtype: float64

You can retrieve multiple columns at once using something similar to dictionary notation, though instead of providing keys, you provide a list of keys.

In [6]:

your_data[["Sex", "Age"]][:10]

Out[6]:

	Sex	Age
0	male	22.0
1	female	38.0
2	female	26.0
3	female	35.0
4	male	35.0
5	male	NaN
6	male	54.0
7	male	2.0
8	female	27.0
9	female	14.0

As you’ve seen thus far, you can slice DataFrames like you can slice lists. You can also slice the DataFrame using boolean criteria. For example, we can look at data for just male passengers on the Titanic.

In [7]:

your_data[your_data.Sex == "male"][:10]

Out[7]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
20	21	0	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S

Notice how only the indices that you’re interested remain. Similarly, you can filter on numeric criteria. Let’s get all passengers older than 22

In [8]:

your_data[your_data.Age > 22][:10]

Out[8]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S

I can combine multiple criteria together with boolean operators

In [9]:

your_data[(your_data.Age > 22) & (your_data.Sex == "male")][:10]

Out[9]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
20	21	0	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S
21	22	1	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S
23	24	1	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S
30	31	0	1	Uruchurtu, Don. Manuel E	male	40.0	0	0	PC 17601	27.7208	NaN	C
33	34	0	2	Wheadon, Mr. Edward H	male	66.0	0	0	C.A. 24579	10.5000	NaN	S
34	35	0	1	Meyer, Mr. Edgar Joseph	male	28.0	1	0	PC 17604	82.1708	NaN	C
35	36	0	1	Holverson, Mr. Alexander Oskar	male	42.0	1	0	113789	52.0000	NaN	S

Awesome. Note that any sliced or filtered DataFrame is also a DataFrame. So all of the same methods that were with the parent are now with the sliced/filtered child.

Now that we can access the data that we want, we can do stuff to it. Let’s find the total Fare spent on tickets for the Titanic.

In [10]:

import numpy as np
np.sum(your_data.Fare)

Out[10]:

28693.949299999967

What’s the Sex of the passenger that paid the most?

In [11]:

your_data[your_data.Age == np.max(your_data.Age)].Sex

Out[11]:

630    male
Name: Sex, dtype: object

There’s plenty more that a DataFrame can do. Dig into the documentation like so (this isn’t Pandas-specific)

In [12]:

? your_data

If you haven’t run the above cell yet, it pops up a window at the bottom of your screen containing all the documentation that you can also access from the .doc method. Dive in, and Slack me your questions.