Overview¶
Install¶
Conda is like virtualenv for data science: it manages both packages and environments.
It will install all the packages necessary for our foray into machine learning & stats and build our environments.
To help prevent network slowdowns during lectures on Monday, please install your root environment beforehand.
Here is a workflow (from a data scientist): http://stiglerdiet.com/blog/2015/Nov/24/my-python-environment-workflow-with-conda/
Here is the documentation for conda: http://conda.pydata.org/docs/
Here is the cheatsheet: http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf
Download the Anaconda installer for your OS from https://www.continuum.io/downloads
bash [AnacondaFile].sh
cd /path/to/anaconda/bin; source activate root;cd
conda list -e > [MY_SPEC_FILE].txt
Remove the lines containing conda and anaconda from [MY_SPEC_FILE].txt
conda create --name [ENV_NAME] --file [MY_SPEC_FILE].txt
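The "remove the conda and anaconda lines" step can also be scripted. Below is a minimal sketch, assuming the spec file uses the `name=version=build` layout that `conda list -e` writes and that "lines containing conda" means package lines whose package name contains that word. The function name `filter_spec_lines` and the version strings in the example are hypothetical, made up for illustration.

```python
def filter_spec_lines(lines, blocked=("conda", "anaconda")):
    """Drop package lines whose package name contains any blocked word."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Keep comments and blank lines untouched (e.g. the "# platform:" header).
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        # Assumed layout: one package per line, written as name=version=build.
        name = stripped.split("=", 1)[0].lower()
        if any(word in name for word in blocked):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical spec-file contents; the versions are illustrative only.
    example = [
        "# platform: linux-64",
        "anaconda=2.4.1=np110py27_0",
        "conda=3.19.0=py27_0",
        "conda-env=2.4.5=py27_0",
        "numpy=1.10.1=py27_0",
    ]
    for line in filter_spec_lines(example):
        print(line)
```

Note that this also drops related packages such as `conda-env`, which matches a literal reading of the step; narrow the `blocked` check if you only want exact name matches.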
Welcome¶
- Kaggle: https://www.kaggle.com/c/titanic
 - Pods
 - Create your team
 - Set up your team repo and associate
 
Structure¶
Data Analysis¶
Data Structure¶
- Trees
 
Algorithms¶
- KNN / K-Means (Vector Quantization)
 - Decision Trees
 - Linear/Logistic Regression
 - Multivariate Regression
 
Tools¶
- numpy, scipy
 - scikit-learn
 - jupyter notebook
 - matplotlib, seaborn
 - pandas
 - pytest-ipynb
 - sql
 
Lab¶
- pods (recitation)
 - pairs (alternate daily)
 
Experiment¶
1. Hypothesis | Aim | Objectives¶
- Predict survival of Titanic passengers.
 - Which features most accurately predict the outcome?
 - Which machine learning algorithms are most accurate?
 
2. Data Analysis¶
cleaning¶
- nulls/unknowns
 - aggregate fields
 - noise
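As a rough illustration of the first two cleaning items, here is a dependency-free sketch. It assumes rows are dictionaries keyed by the Kaggle column names `Age`, `SibSp`, and `Parch`, that missing ages are filled with the median of the known ages (one common choice, not the only one), and that `FamilySize` is a hypothetical aggregate field. The set of "missing" markers is also an assumption. In practice pandas would do the same job with less code.

```python
from statistics import median

# Assumed spellings of "no value"; extend to match the actual data.
MISSING = {"", "na", "nan", "null", "unknown", "?"}


def is_missing(value):
    """True for None or any string that looks like a null/unknown marker."""
    return value is None or str(value).strip().lower() in MISSING


def clean_rows(rows):
    """Fill missing ages with the median age and add an aggregate field."""
    known_ages = [float(r["Age"]) for r in rows if not is_missing(r.get("Age"))]
    fallback_age = median(known_ages) if known_ages else None
    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left unchanged
        # nulls/unknowns: replace a missing age with the median of known ages
        row["Age"] = fallback_age if is_missing(r.get("Age")) else float(r["Age"])
        # aggregate fields: siblings/spouses + parents/children + the passenger
        row["FamilySize"] = int(r["SibSp"]) + int(r["Parch"]) + 1
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    # Made-up rows for illustration, not real passenger records.
    sample = [
        {"Age": "22", "SibSp": "1", "Parch": "0"},
        {"Age": "", "SibSp": "0", "Parch": "0"},
        {"Age": "38", "SibSp": "1", "Parch": "2"},
    ]
    for row in clean_rows(sample):
        print(row)
```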
 
Pre-processing¶
- Formatting, Sampling
 
codebook¶
- Princeton Codebook: http://dss.princeton.edu/online_help/analysis/codebook.htm
 - CDC: http://www.cdc.gov/hiv/pdf/library_software_answr_codebook.pdf
 - McGill Medicine: http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/CodebookCookbook.html
 - Kaggle Titanic: https://www.kaggle.com/c/titanic/data
 
statistics & shape¶
Nick
3. Selection of Features¶
input X¶
{independent (causality) | predictor (correlated) | explanatory (statistically dependent) | feature}
- Pclass, Sex, Age, SibSp, Parch, Embarked, Cabin
 
4. Experiment Heuristics (Design)¶
Evaluation¶
- Titanic: https://www.kaggle.com/c/titanic/details/evaluation
 - ROC: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
 - RMSE: https://www.kaggle.com/wiki/RootMeanSquaredError
 - Log Loss: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/details/evaluation
 - Mean F Score: https://www.kaggle.com/wiki/MeanFScore
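To make the metrics above concrete, here is a small sketch of four of them in plain Python. It assumes binary 0/1 labels, uses the standard textbook definitions, and shows a single-set F1 (the building block that a mean F score averages). The function names are chosen for illustration; scikit-learn provides tested equivalents.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up labels for illustration.
    truth = [1, 0, 1, 1]
    guess = [1, 0, 0, 1]
    print(accuracy(truth, guess), f1_score(truth, guess))
```

The Titanic competition scores submissions by accuracy, so that is the one to watch on the leaderboard; the others are useful for comparing models that output probabilities or continuous values.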
 
Representation¶
- Data: 60% Train, 10% Validation, 30% Test
 - Algorithms
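The 60/10/30 data split can be sketched as follows. This is one simple approach under stated assumptions: rows are shuffled once with a fixed seed and then sliced, with no stratification by class. The function name `split_indices` and its parameters are hypothetical.

```python
import random


def split_indices(n_rows, train_pct=60, validation_pct=10, seed=0):
    """Shuffle row indices and slice them into train/validation/test lists.

    Whatever is left after train and validation becomes the test set,
    so the default 60/10 leaves 30% for testing.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_train = n_rows * train_pct // 100
    n_val = n_rows * validation_pct // 100
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, validation, test


if __name__ == "__main__":
    train, validation, test = split_indices(100)
    print(len(train), len(validation), len(test))
```

Keeping the seed fixed means every pod member works with the same partition, which makes results comparable across pairs.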
 
Optimization¶
- off-the-rack
 - consignment
 - thrift-store
 
5. Experiment¶
\(Learning = Representation + Evaluation + Optimization\)¶
~Pedro Domingos https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
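Domingos's paper treats evaluation as a third component alongside representation and optimization. A toy sketch of how the three pieces fit together is below. It assumes a single numeric feature, a threshold rule as the representation, accuracy as the evaluation, and exhaustive search as the optimization; all names and data are made up for illustration and are not taken from the paper.

```python
def predict(threshold, xs):
    """Representation: a one-feature threshold rule (a decision stump)."""
    return [1 if x >= threshold else 0 for x in xs]


def accuracy(y_true, y_pred):
    """Evaluation: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit(xs, ys):
    """Optimization: try every observed value as a threshold, keep the best."""
    candidates = sorted(set(xs))
    return max(candidates, key=lambda th: accuracy(ys, predict(th, xs)))


if __name__ == "__main__":
    # Made-up, perfectly separable toy data.
    xs = [1, 2, 3, 4, 5, 6]
    ys = [0, 0, 0, 1, 1, 1]
    best = fit(xs, ys)
    print(best, accuracy(ys, predict(best, xs)))
```

Swapping any one piece (a tree instead of a stump, log loss instead of accuracy, gradient descent instead of exhaustive search) gives a different learner, which is the point of the decomposition.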
6. Conclusions¶
- A well measured experiment...
 
7. Recommendation¶
- Success comes from failures too...