Problem Set - Statistics 1

Data Ref: https://www.kaggle.com/c/titanic/data

Tasks

Describe the data.

  • How big?
  • What are the columns and what do they mean?

What’s the average age of...

  • any Titanic passenger
  • a survivor
  • a non-surviving first-class passenger
  • a male survivor older than 30 who embarked anywhere but Queenstown

For each of the groups above, how far (in years) is the average age from the median age?

What’s the most common...

  • passenger class
  • port of embarkation
  • number of siblings or spouses aboard for survivors

Within what range of standard deviations from the mean (0-1, 1-2, 2-3) is the median ticket price? Is it above or below the mean?
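One way to read this question is as a signed distance: subtract the mean from the median and divide by the standard deviation. The sketch below shows that arithmetic with Python's standard `statistics` module. The fares are made-up numbers for illustration, not values from the Titanic data, and the choice of the sample standard deviation is an assumption.

```python
import statistics

# Made-up fares for illustration only; NOT taken from the Titanic data.
fares = [7, 8, 8, 10, 13, 26, 30, 80]

mean_fare = statistics.mean(fares)
median_fare = statistics.median(fares)
std_fare = statistics.stdev(fares)  # sample standard deviation (an assumption)

# Signed distance of the median from the mean, in standard deviations.
# A negative value means the median is below the mean.
z = (median_fare - mean_fare) / std_fare

side = "below" if z < 0 else "above"
print(f"median is {abs(z):.2f} standard deviations {side} the mean")
```

The absolute value of `z` tells you which range (0-1, 1-2, 2-3) applies, and its sign tells you whether the median is above or below the mean.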

How much more expensive was the 90th percentile ticket than the 5th percentile ticket? Are the two tickets in the same passenger class?

The highest average ticket price was paid by passengers from which port? Null ports don’t count.

Which port’s passengers are the most similar to one another in passenger class?

What fraction of surviving 1st-class males paid less than the overall median ticket price?

How much older/younger was the average surviving passenger with family members than the average non-surviving passenger without them?

Display the relationship (i.e. make a plot) between survival rate and the quantile of the ticket price for 20 integer quantiles.

  • To be clearer, what I want is for you to specify 20 quantiles, and for each of those quantiles divide the number of survivors in that quantile by the total number of people in that quantile. That’ll give you the survival rate in that quantile.
  • Then plot a line of the survival rate against the ticket fare quantiles.
  • Make sure you label your axes.
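The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference solution: the function names `survival_rate_by_quantile` and `plot_rates` are hypothetical, the data is made up, and bins are formed by ranking passengers by fare and cutting the ranking into equal-sized groups (ties are broken by input order, which is a simplification). In a notebook, pandas' `qcut` is a common alternative for the binning step.

```python
def survival_rate_by_quantile(fares, survived, n_bins=20):
    """Return the survival rate in each of n_bins equal-sized fare bins.

    fares and survived are parallel lists; survived holds 1 (survived)
    or 0 (did not). Passengers are ranked by fare and split into n_bins
    groups of nearly equal size. An empty bin yields None.
    """
    order = sorted(range(len(fares)), key=lambda i: fares[i])
    n = len(order)
    rates = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            rates.append(None)
            continue
        # Survivors in this bin divided by everyone in this bin.
        rates.append(sum(survived[i] for i in members) / len(members))
    return rates


def plot_rates(rates):
    """Plot survival rate against fare quantile with labeled axes."""
    # Imported lazily; assumes matplotlib is installed.
    import matplotlib.pyplot as plt

    plt.plot(range(1, len(rates) + 1), rates, marker="o")
    plt.xlabel("Ticket fare quantile (1 = cheapest)")
    plt.ylabel("Survival rate")
    plt.show()


# Made-up data for illustration only: 40 passengers with fares 1..40,
# where everyone who paid more than 20 survived.
demo_fares = list(range(1, 41))
demo_survived = [1 if fare > 20 else 0 for fare in demo_fares]
demo_rates = survival_rate_by_quantile(demo_fares, demo_survived, n_bins=4)
```

Calling `plot_rates(demo_rates)` draws the line; for the assignment you would pass the real fare and survival columns and keep `n_bins=20`.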

*STRETCH GOAL* For each of the following characteristics, find the median in the data.

  • age
  • ticket price
  • # siblings/spouses
  • # parents/children

If you were to use these medians to draw numerical boundaries separating survivors from non-survivors, which of these characteristics would be the best choice and why?

Submitting Your Work

Report your answers in an IPython/Jupyter notebook with either print statements or markdown. If you write any functions, include a docstring describing what that function does. Note: You are not writing tests for any functions. If you come to any conclusions via math (and you will), make sure your code matches what you say.
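As an illustration of the docstring requirement, here is a small sketch of a documented helper. The name `mean_median_gap` and the decision to skip missing ages are assumptions made for this example, not part of the assignment.

```python
import statistics


def mean_median_gap(ages):
    """Return the mean age minus the median age, in years.

    Missing ages (None) are skipped before either statistic is
    computed. A positive result means the mean is above the median.
    """
    known = [age for age in ages if age is not None]
    return statistics.mean(known) - statistics.median(known)
```

For example, `mean_median_gap([20, None, 30, 70])` compares a mean of 40 with a median of 30.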

When your work is complete, push it to GitHub and open a pull request against your master branch. Submit the URL for your pull request. After this is complete, you may merge your work to master.

As usual, use the comments function in Canvas to submit questions, comments, and reflections on this work.