Problem Set - Statistics 4

Data Ref: http://www.seanlahman.com/baseball-archive/statistics/ - get the comma-delimited version for 2014

Note: Data is also available in the class downloads

Tasks

Univariate

  • With the baseball data linked above (some of it was explored in Statistics 4), create a univariate Linear Regression model predicting player salary using some player stat.
  • You will need to join a second table to the “Salaries.csv” table.
  • Cross-validate your model, and produce 68%, 95%, and 99.7% confidence intervals (that is, 1, 2, and 3 standard deviations about the mean) for your “slope”.
  • Report the \(R^2\) score for your univariate model.
  • Make a scatter plot of your data, with your model predictions overlaid in red.
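
One possible reading of the cross-validation and confidence-interval bullets is sketched below. It is not a required approach. It fits a line on each training split of a k-fold partition, then treats the fold-to-fold spread of the slope as roughly normal and reports mean ± 1, 2, and 3 standard deviations. The data here is synthetic stand-in data, not Lahman values. The names `hits`, `salary`, `fit_line`, `r_squared`, and `cross_validated_slopes` are hypothetical, as are the choice of 10 folds and the use of NumPy alone. In your notebook, the two arrays would come from the table you joined to “Salaries.csv”.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def r_squared(y_true, y_pred):
    """Return the coefficient of determination R^2 for the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_validated_slopes(x, y, n_splits=10, seed=0):
    """Fit on each k-fold training split; return slopes and held-out R^2 scores."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_splits)
    slopes, scores = [], []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = fit_line(x[train_idx], y[train_idx])
        slopes.append(slope)
        scores.append(r_squared(y[test_idx], slope * x[test_idx] + intercept))
    return np.array(slopes), np.array(scores)

# Synthetic stand-in data (ASSUMPTION: not real player stats or salaries).
rng = np.random.default_rng(42)
hits = rng.uniform(0, 200, size=500)
salary = 20_000 * hits + 500_000 + rng.normal(0, 1_000_000, size=500)

slopes, scores = cross_validated_slopes(hits, salary)
mean_slope, std_slope = slopes.mean(), slopes.std(ddof=1)

# 68 / 95 / 99.7% intervals correspond to 1 / 2 / 3 standard deviations.
for level, k in [(68, 1), (95, 2), (99.7, 3)]:
    low, high = mean_slope - k * std_slope, mean_slope + k * std_slope
    print(f"{level}% interval for slope: [{low:.1f}, {high:.1f}]")
print(f"mean held-out R^2: {scores.mean():.3f}")
```

Because k-fold training splits overlap heavily, the spread of slopes across folds is only a rough measure of uncertainty. Repeated random splits or a bootstrap would be reasonable alternatives if you want to justify the intervals more carefully.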

Multivariate

  • Using no more than 4 characteristics, create a multivariate Linear Regression model predicting player salary.
  • Report the \(R^2\) score for your multivariate model. Aim for \(R^2 > 0.5\).
  • Make a scatter plot of your data side-by-side with a scatter plot of your model predictions.
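
A minimal sketch of a multivariate fit follows, assuming ordinary least squares via `numpy.linalg.lstsq` on a design matrix with an intercept column. The three feature columns and their labels ("hits", "home runs", "age") are hypothetical and the data is synthetic, so the printed \(R^2\) says nothing about what the real baseball data will give. The helper names `fit_multivariate`, `predict`, and `r_squared` are also illustrative.

```python
import numpy as np

def fit_multivariate(X, y):
    """Return least-squares coefficients [intercept, b1, ..., bk] for y ~ X."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict(X, coeffs):
    """Apply fitted coefficients to a feature matrix X of shape (n, k)."""
    return coeffs[0] + X @ coeffs[1:]

def r_squared(y_true, y_pred):
    """Return the coefficient of determination R^2 for the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (ASSUMPTION: not real player stats or salaries).
rng = np.random.default_rng(7)
n = 400
X = np.column_stack([
    rng.uniform(0, 200, n),  # hypothetical "hits"
    rng.uniform(0, 40, n),   # hypothetical "home runs"
    rng.uniform(20, 40, n),  # hypothetical "age"
])
true_coeffs = np.array([250_000.0, 15_000.0, 60_000.0, 40_000.0])
y = predict(X, true_coeffs) + rng.normal(0, 800_000, n)

coeffs = fit_multivariate(X, y)
r2 = r_squared(y, predict(X, coeffs))
print("coefficients:", np.round(coeffs, 1))
print(f"R^2: {r2:.3f}")
```

The \(R^2\) computed here is measured on the same rows used for fitting. Scoring on held-out rows gives a more honest estimate of how well the model generalizes.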

Submitting Your Work

Report your answers in an IPython/Jupyter notebook, using either print statements or markdown cells. If you write any functions, include a docstring for each one describing what it does. Note: You are not required to write tests for your functions. If you reach any conclusions via math (and you will), make sure your code supports what you say.

When your work is complete, push it to GitHub and open a pull request against your master branch. Submit the URL of your pull request. After that, you may merge your work into master.

As usual, use the comments function in Canvas to submit questions, comments, and reflections on this work.