K-Means Experiment¶
Objectives¶
Predict survival of Titanic passangers.
Which features most accurately predict the outcome?
Data Analysis¶
In [31]:
import pandas
import numpy
from sklearn.cross_validation import train_test_split
from sklearn.cluster import KMeans
from pprint import pprint
MY_TITANIC_TRAIN = '/media/removable/data/train_titanic.csv'
MY_TITANIC_TEST = '/media/removable/data/test_titanic.csv'
titanic_dataframe = pandas.read_csv(MY_TITANIC_TRAIN, header=0)
- fix missing values
 
In [32]:
titanic_dataframe = titanic_dataframe.dropna()
- statistics & shape
 
Selection of Features¶
- Must have a mean –> Remove categorical data
 
In [33]:
titanic_dataframe.drop(['Name', 'Ticket', 'Cabin', 'Embarked', 'Sex'], axis=1, inplace=True)
#print('length: {0} '.format(len(titanic_dataframe)))
#print(titanic_dataframe.head(5))
length: 183
    PassengerId  Survived  Pclass  Age  SibSp  Parch     Fare
1             2         1       1   38      1      0  71.2833
3             4         1       1   35      1      0  53.1000
6             7         0       1   54      0      0  51.8625
10           11         1       3    4      1      1  16.7000
11           12         1       1   58      0      0  26.5500
- discrete vs. continuous
 
In [34]:
print(2.2 * 3.0 == 6.6)
print(3.3 * 2.0 == 6.6)
False
True
Oh look –floats bite.
Experiment Heueristics (Design)¶
Evaluation¶
Confusion Matrix: https://en.wikipedia.org/wiki/Confusion_matrix
Confusion Matrix Clarification: http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
Mean F Score: https://www.kaggle.com/wiki/MeanFScore
- \(F_1 = 2 * \frac{precision * recall}{ precision + recall}\)
 - \(precision = \frac{tp}{tp+fp}\)
 - \(recall = \frac{tp}{tp+fn}\)
 
Representation¶
K-means
Distance = Euclidean (yes I mispelled this in KNN.ipynb)
Data: 60% Train, 10% Validation, 30% Test
In [35]:
train, test = train_test_split(titanic_dataframe, test_size = 0.2)
y = train['Survived']
X = train[2:]
Optimization¶
- vary numerical features used
 - vary K
 - vary initialization
 
Experiment¶
In [36]:
k = 2
kmeans = KMeans(n_clusters=k)
results = kmeans.fit_predict(X.values, y.values)
print(results)
[0 0 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 1
 0 0 0 1 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 0 0
 1 0 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 1
 1 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 1 0 1 0 1]
- Prepare and upload to Kaggle
 
In [38]:
#1. open test.csv & clean
#2. predict on test data
#3. convert predictions to datframe
'''df_result = pandas.DataFrame(results[:,0:2], columns=['PassengerId', 'Survived'])'''
#4. dump csv
'''df_result.to_csv('titanic.csv', index=False) '''
Out[38]:
"df_result.to_csv('titanic.csv', index=False) "