K-Means Experiment¶

Objectives¶

Predict survival of Titanic passangers.

Which features most accurately predict the outcome?

Data Analysis¶

In [31]:

import pandas
import numpy
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans
from pprint import pprint

MY_TITANIC_TRAIN = '/media/removable/data/train_titanic.csv'
MY_TITANIC_TEST = '/media/removable/data/test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)

fix missing values

In [32]:

titanic_dataframe = titanic_dataframe.dropna()

statistics & shape

Selection of Features¶

Must have a mean –> Remove categorical data

In [33]:

titanic_dataframe.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'Sex'], axis=1, inplace=True)
#print('length: {0} '.format(len(titanic_dataframe)))
#print(titanic_dataframe.head(5))

length: 183
    PassengerId  Survived  Pclass  Age  SibSp  Parch     Fare
1             2         1       1   38      1      0  71.2833
3             4         1       1   35      1      0  53.1000
6             7         0       1   54      0      0  51.8625
10           11         1       3    4      1      1  16.7000
11           12         1       1   58      0      0  26.5500

discrete vs. continuous

In [34]:

print(2.2 * 3.0 == 6.6)
print(3.3 * 2.0 == 6.6)

False
True

Oh look –floats bite.

Experiment Heueristics (Design)¶

Evaluation¶

Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix

Confusion Matrix Clarification: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Mean F Score: https://www.kaggle.com/wiki/MeanFScore

\(F_1 = 2 * \frac{precision * recall}{ precision + recall}\)
\(precision = \frac{tp}{tp+fp}\)
\(recall = \frac{tp}{tp+fn}\)

Representation¶

K-means

Distance = Euclidean (yes I mispelled this in KNN.ipynb)

Data: 60% Train, 10% Validation, 30% Test

In [35]:

train, test = train_test_split(titanic_dataframe, test_size = 0.2)
y = train['Survived']
X = train[2:]

Optimization¶

vary numerical features used
vary K
vary initialization

Experiment¶

In [36]:

k = 2
kmeans = KMeans(n_clusters=k)
results = kmeans.fit_predict(X.values, y.values)
print(results)

[0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 1
 0 0 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 0 0
 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1
 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1]

Prepare and upload to Kaggle

In [38]:

#1. open test.csv & clean
#2. predict on test data
#3. convert predictions to datframe
'''df_result = pandas.DataFrame(results[:,0:2], columns=['PassengerId', 'Survived'])'''
#4. dump csv
'''df_result.to_csv('titanic.csv', index=False) '''

Out[38]:

"df_result.to_csv('titanic.csv', index=False) "

K-Means Experiment¶

Objectives¶

Data Analysis¶

Selection of Features¶

Experiment Heueristics (Design)¶

Evaluation¶

Representation¶

Optimization¶

Experiment¶

Conclusions¶

Recommendation¶

Table Of Contents

Previous topic

Next topic