Overview

Install

Conda is like virtualenv for data science: it manages both packages and isolated environments.

It will install all the packages necessary for our foray into machine learning and statistics, and build our environments.

To help prevent network slowdowns during lectures on Monday, it would be great if you could install your root environment beforehand.

Here is a workflow (from a data scientist): http://stiglerdiet.com/blog/2015/Nov/24/my-python-environment-workflow-with-conda/

Here is the documentation for conda: http://conda.pydata.org/docs/

Here is the cheatsheet: http://conda.pydata.org/docs/_downloads/conda-cheatsheet.pdf

  • Download https://www.continuum.io/downloads for your OS

  • bash [AnacondaFile].sh

  • cd /path/to/anaconda/bin; source activate root; cd

  • conda list -e > [MY_SPEC_FILE].txt
    
  • Remove the lines containing conda and anaconda from [MY_SPEC_FILE].txt

  • conda create --name [ENV_NAME] --file [MY_SPEC_FILE].txt
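
The step of removing the conda and anaconda lines from the spec file can be done by hand, or with a few lines of Python. The following is a minimal sketch, not part of the official workflow. It assumes the exported file uses the name=version=build layout that conda list -e writes. The function name filter_spec_lines and the sample package versions are hypothetical.

```python
def filter_spec_lines(lines):
    """Drop package lines whose name contains 'conda' (this also covers 'anaconda')."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Keep comments and blank lines untouched.
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        # Assumed layout of a package line: name=version=build
        package_name = stripped.split("=", 1)[0]
        if "conda" not in package_name.lower():
            kept.append(line)
    return kept


# Hypothetical sample of an exported spec file.
sample = [
    "# platform: linux-64",
    "anaconda=2.4.0=np110py35_0",
    "conda=3.18.8=py35_0",
    "conda-env=2.4.5=py35_0",
    "numpy=1.10.1=py35_0",
    "pandas=0.17.1=np110py35_0",
]
filtered = filter_spec_lines(sample)
print("\n".join(filtered))
```

To use it on a real file, read the lines of [MY_SPEC_FILE].txt, pass them through the function, and write the result back before running conda create.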
    

Welcome

Structure

Data Analysis

Data Structure

  • Trees

Algorithms

  • KNN / K-Means (Vector Quantization)
  • Decision Trees
  • Linear/Logistic Regression
  • Multivariate Regression
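
As one concrete instance of the algorithms listed above, k-nearest neighbours can be sketched in plain Python: classify a query point by majority vote among its k closest training points. This is an illustrative sketch only; in practice scikit-learn would likely be used instead. The function name knn_predict and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest points wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated clusters labelled 0 and 1.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(points, labels, (2, 2))
near_second_cluster = knn_predict(points, labels, (6, 5))
print(near_first_cluster, near_second_cluster)
```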

Tools

  • numpy, scipy
  • scikit-learn
  • jupyter notebook
  • matplotlib, seaborn
  • pandas
  • pytest-ipynb
  • sql

Lab

  • pods (recitation)
  • pairs (alternate daily)

Experiment

1. Hypothesis | Aim | Objectives

  • Predict survival of Titanic passengers.
  • Which features most accurately predict the outcome?
  • Which machine learning algorithms are most accurate?
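
To compare algorithms on "most accurate", we need a number to compare. A minimal sketch, assuming accuracy means the fraction of passengers whose predicted Survived value matches the actual one; the function name and the sample labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical outcomes: 1 = survived, 0 = did not survive.
actual = [1, 0, 0, 1, 1]
predicted = [1, 0, 1, 1, 0]
score = accuracy(predicted, actual)
print(score)
```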

2. Data Analysis

cleaning

  • nulls/unknowns
  • aggregate fields
  • noise
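
Handling nulls/unknowns can be sketched as follows. This is one possible approach, not the only one: fill a missing numeric value with the median of the known values, and a missing category with an explicit "Unknown" marker. The records and the helper name fill_missing are hypothetical.

```python
from statistics import median

# Hypothetical passenger records; None marks a missing value.
passengers = [
    {"Name": "A", "Age": 22.0, "Embarked": "S"},
    {"Name": "B", "Age": None, "Embarked": "C"},
    {"Name": "C", "Age": 38.0, "Embarked": None},
    {"Name": "D", "Age": 30.0, "Embarked": "S"},
]


def fill_missing(records, column, fill_value):
    """Return new records with None in `column` replaced by `fill_value`."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in records
    ]


# Median of the ages that are actually known.
known_ages = [p["Age"] for p in passengers if p["Age"] is not None]
median_age = median(known_ages)

cleaned = fill_missing(passengers, "Age", median_age)
cleaned = fill_missing(cleaned, "Embarked", "Unknown")
print(cleaned)
```

The original records are left unchanged, so the raw data stays available for comparison.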

Pre-processing

  • Formatting, Sampling

statistics & shape

Nick

3. Selection of Features

input X

{independent(causality)|predictor(correlated)|explanatory(statistically dependent)|Feature}

  • Class, Sex, Age, Siblings, ParCh, SibSp, Embarked, Cabin

output Y

{dependent|predicted|response|Outcome}

  • Survived

factors and indicators

{categorical feature} and {dummy variables}

  • Class, Sex, Cabin, Embarked
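
Turning a categorical feature into dummy variables means replacing one column with a 0/1 column per category. A minimal sketch in plain Python follows; the helper name one_hot and the sample rows are hypothetical. pandas offers get_dummies for the same job on a DataFrame.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 indicator column per observed category."""
    categories = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with two categorical features.
rows = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
    {"Sex": "female", "Embarked": "S"},
]
encoded = one_hot(one_hot(rows, "Sex"), "Embarked")
print(encoded[0])
```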

4. Experiment Heuristics (Design)

Representation

  • Data: 60% Train, 10% Validation, 30% Test
  • Algorithms
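
The 60% / 10% / 30% split above can be sketched as a shuffle followed by slicing. This is a minimal sketch: the function name split_dataset, the fixed seed, and the stand-in data are assumptions for illustration.

```python
import random


def split_dataset(rows, train=0.6, validation=0.1, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    train_rows = shuffled[:n_train]
    validation_rows = shuffled[n_train:n_train + n_validation]
    # Whatever remains (30% with the defaults) becomes the test set.
    test_rows = shuffled[n_train + n_validation:]
    return train_rows, validation_rows, test_rows


# Stand-in for 100 passenger records.
rows = list(range(100))
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before slicing matters when the file is ordered (for example by class or ticket number); otherwise the three sets would not be comparable.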

Optimization

  • off-the-rack
  • consignment
  • thrift-store

5. Experiment

\(Learning = Representation + Evaluation + Optimization\)

~Pedro Domingos https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

6. Conclusions

  • A well measured experiment...

7. Recommendation

  • Success comes from failures too...