K-Means Experiment

Objectives

Predict survival of Titanic passangers.

Which features most accurately predict the outcome?

Data Analysis

In [31]:
import pandas
import numpy
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans
from pprint import pprint

MY_TITANIC_TRAIN = '/media/removable/data/train_titanic.csv'
MY_TITANIC_TEST = '/media/removable/data/test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
  • fix missing values
In [32]:
titanic_dataframe = titanic_dataframe.dropna()
  • statistics & shape

Selection of Features

  • Must have a mean –> Remove categorical data
In [33]:
titanic_dataframe.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'Sex'], axis=1, inplace=True)
#print('length: {0} '.format(len(titanic_dataframe)))
#print(titanic_dataframe.head(5))
length: 183
    PassengerId  Survived  Pclass  Age  SibSp  Parch     Fare
1             2         1       1   38      1      0  71.2833
3             4         1       1   35      1      0  53.1000
6             7         0       1   54      0      0  51.8625
10           11         1       3    4      1      1  16.7000
11           12         1       1   58      0      0  26.5500
  • discrete vs. continuous
In [34]:
print(2.2 * 3.0 == 6.6)
print(3.3 * 2.0 == 6.6)
False
True

Oh look –floats bite.

Experiment Heueristics (Design)

Evaluation

Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix

Confusion Matrix Clarification: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Mean F Score: https://www.kaggle.com/wiki/MeanFScore

  • \(F_1 = 2 * \frac{precision * recall}{ precision + recall}\)
  • \(precision = \frac{tp}{tp+fp}\)
  • \(recall = \frac{tp}{tp+fn}\)

Representation

K-means

Distance = Euclidean (yes I mispelled this in KNN.ipynb)

Data: 60% Train, 10% Validation, 30% Test

In [35]:
train, test = train_test_split(titanic_dataframe, test_size = 0.2)
y = train['Survived']
X = train[2:]

Optimization

  • vary numerical features used
  • vary K
  • vary initialization

Experiment

In [36]:
k = 2
kmeans = KMeans(n_clusters=k)
results = kmeans.fit_predict(X.values, y.values)
print(results)
[0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 1
 0 0 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 0 0
 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1
 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1]
  • Prepare and upload to Kaggle
In [38]:
#1. open test.csv & clean
#2. predict on test data
#3. convert predictions to datframe
'''df_result = pandas.DataFrame(results[:,0:2], columns=['PassengerId', 'Survived'])'''
#4. dump csv
'''df_result.to_csv('titanic.csv', index=False) '''
Out[38]:
"df_result.to_csv('titanic.csv', index=False) "

Conclusions

Recommendation