DAY 30

## [Day31] Kaggle Problem-Solving Challenge, 2018 Edition - Kaggle in Practice: Titanic

### Titanic

• Download the data
Download URL:
https://www.kaggle.com/c/titanic/data

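• How to train
To be honest, after reading the tutorial in the references below, I am not trying to improve accuracy for now; I just want to solve this problem in the simplest way possible. So how do we do that? We use the same Random Forest as the tutorial to analyze the data and get a training result. It also looks like data cleaning cannot be skipped.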
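One cleaning step of this kind is flagging values by how far they sit from the mean, which the code further down does for `Fare` and `Family_Size`. The following is a minimal sketch in plain Python, assuming a threshold of 4 standard deviations; the function name `split_by_deviation` and the sample fares are hypothetical and not taken from the tutorial.

```python
from statistics import mean, stdev


def split_by_deviation(values, n_std=4.0):
    """Split values into (inliers, outliers) by distance from the mean.

    A value counts as an outlier when |value - mean| > n_std * std,
    the same test pandas expresses as np.abs(s - s.mean()) > n_std * s.std().
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like pandas' default
    inliers, outliers = [], []
    for v in values:
        if abs(v - mu) > n_std * sigma:
            outliers.append(v)
        else:
            inliers.append(v)
    return inliers, outliers


# Hypothetical fares: 30 ordinary tickets and one very expensive one.
fares = [8.0] * 30 + [512.0]
inliers, outliers = split_by_deviation(fares)
print(len(inliers), outliers)
```

Which of the two lists a model should be trained on is a separate decision; removing outliers means keeping the first list, not the second.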

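• Code
This is basically the code from the references, tidied up a bit. I inspect the data in the bottom-most cell and clear that cell once I am done, which keeps the notebook clean for now. Anything more complex still calls for separate notes, but I find that keeping the inspection code apart from the data-processing code makes things less likely to get mixed up.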

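``````
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn import metrics
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# their contents now live in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')

# Load the competition files. Assumption: train.csv, test.csv and
# gender_submission.csv from the download page sit next to the notebook.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
submit = pd.read_csv('gender_submission.csv')

PassengerId = test['PassengerId']

# Combine train and test so both receive identical feature engineering.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
data = pd.concat([train, test], ignore_index=True)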

data['Family_Size'] = data['Parch'] + data['SibSp']
data['Title1'] = data['Name'].str.split(", ", expand = True)[1]
data['Title1'] = data['Title1'].str.split(".", expand = True)[0]
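# Collapse rare titles into the common ones. The start of this statement was lost
# in the source; the list of rare titles is a reconstruction (an assumption),
# paired positionally with the replacement list that did survive.
data['Title2'] = data['Title1'].replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','the Countess','Jonkheer','Col','Rev','Capt','Sir','Don','Dona'],
    ['Miss','Mrs','Miss','Mr','Mr','Mrs','Mrs','Mr','Mr','Mr','Mr','Mr','Mr','Mrs'])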
data['Ticket_info'] = data['Ticket'].apply(lambda x : x.replace(".","").replace("/","").strip().split(' ')[0] if not x.isdigit() else 'X')
data['Embarked'] = data['Embarked'].fillna('S')
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
data["Cabin"] = data['Cabin'].apply(lambda x : str(x)[0] if not pd.isnull(x) else 'NoCabin')

data['Sex'] = data['Sex'].astype('category').cat.codes
data['Embarked'] = data['Embarked'].astype('category').cat.codes
data['Pclass'] = data['Pclass'].astype('category').cat.codes
data['Title1'] = data['Title1'].astype('category').cat.codes
data['Title2'] = data['Title2'].astype('category').cat.codes
data['Cabin'] = data['Cabin'].astype('category').cat.codes
data['Ticket_info'] = data['Ticket_info'].astype('category').cat.codes

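# Split on whether Age is known; .copy() so the later assignment does not hit a view.
dataAgeNull = data[data["Age"].isnull()].copy()
dataAgeNotNull = data[data["Age"].notnull()].copy()
# Note: despite its name, this mask selects the rows whose Fare or Family_Size lies
# more than 4 standard deviations from the mean, so the age model below is fit on
# those rows only. Negate the mask with ~( ... ) to drop them instead.
remove_outlier = dataAgeNotNull[
    (np.abs(dataAgeNotNull["Fare"] - dataAgeNotNull["Fare"].mean()) > (4 * dataAgeNotNull["Fare"].std())) |
    (np.abs(dataAgeNotNull["Family_Size"] - dataAgeNotNull["Family_Size"].mean()) > (4 * dataAgeNotNull["Family_Size"].std()))
]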
rfModel_age = RandomForestRegressor(n_estimators=2000,random_state=42)
ageColumns = ['Embarked', 'Fare', 'Pclass', 'Sex', 'Family_Size', 'Title1', 'Title2','Cabin','Ticket_info']
rfModel_age.fit(remove_outlier[ageColumns], remove_outlier["Age"])

# Predict the missing ages, write them back, then recombine both halves.
ageNullValues = rfModel_age.predict(X=dataAgeNull[ageColumns])
dataAgeNull.loc[:, "Age"] = ageNullValues
data = pd.concat([dataAgeNull, dataAgeNotNull], ignore_index=True)

dataTrain = data[pd.notnull(data['Survived'])].sort_values(by=["PassengerId"])
dataTest = data[~pd.notnull(data['Survived'])].sort_values(by=["PassengerId"])

dataTrain = dataTrain[['Survived', 'Age', 'Embarked', 'Fare',  'Pclass', 'Sex', 'Family_Size', 'Title2','Ticket_info','Cabin']]
dataTest = dataTest[['Age', 'Embarked', 'Fare', 'Pclass', 'Sex', 'Family_Size', 'Title2','Ticket_info','Cabin']]

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion='gini',
n_estimators=1000,
min_samples_split=12,
min_samples_leaf=1,
oob_score=True,
random_state=1,
n_jobs=-1)

rf.fit(dataTrain.iloc[:, 1:], dataTrain.iloc[:, 0])
# print("%.4f" % rf.oob_score_)

rf_res =  rf.predict(dataTest)
submit['Survived'] = rf_res
submit['Survived'] = submit['Survived'].astype(int)
submit.to_csv('submit.csv', index= False)
``````
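To make the two string features in the code above easier to follow, here is a minimal plain-Python sketch of what the `Title1` and `Ticket_info` expressions do to a single value. The helper names `extract_title` and `ticket_prefix` are hypothetical, and the sample inputs are only illustrative values in the dataset's format.

```python
def extract_title(name):
    """Return the honorific between ', ' and the first '.' in a passenger name."""
    after_comma = name.split(", ")[1]
    return after_comma.split(".")[0]


def ticket_prefix(ticket):
    """Return the cleaned first token of a ticket, or 'X' if it is purely numeric."""
    if ticket.isdigit():
        return "X"
    # Drop '.' and '/' so variants of the same prefix collapse together,
    # then keep only the part before the first space.
    return ticket.replace(".", "").replace("/", "").strip().split(" ")[0]


print(extract_title("Braund, Mr. Owen Harris"))
print(ticket_prefix("A/5 21171"))
print(ticket_prefix("STON/O2. 3101282"))
print(ticket_prefix("113803"))
```

In the notebook these results are then turned into integer codes with `astype('category').cat.codes`, so the model sees one number per distinct title or ticket prefix.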
• Upload
The code ends by producing a file named submit.csv. Upload this prediction file to Kaggle.
Upload URL:
https://www.kaggle.com/c/titanic/submit
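Before uploading, it can help to sanity-check the file locally. The sketch below uses only the standard library and rests on assumptions: that the file should have the header `PassengerId,Survived`, one row per test passenger (418 for this competition), and only 0 or 1 in `Survived`. The function `check_submission` is a hypothetical helper, not part of Kaggle's tooling.

```python
import csv
import os
import tempfile


def check_submission(path, expected_rows=418):
    """Return the row count, or raise ValueError if the file looks malformed."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Survived"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    if any(row["Survived"] not in ("0", "1") for row in rows):
        raise ValueError("Survived must contain only 0 or 1")
    return len(rows)


# Demo on a tiny stand-in file (3 rows instead of 418).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submit.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        writer.writerows([[892, 0], [893, 1], [894, 0]])
    demo_rows = check_submission(demo_path, expected_rows=3)
print(demo_rows)
```

For the real file the call would be `check_submission('submit.csv')`. A float column such as `1.0` would fail the last check, which is why the notebook casts `Survived` to `int` before writing.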

• Result

Rank 1547, which seems OK, roughly the top 16%. Since this is someone else's code, though, there is nothing to brag about.