[R語言]資料分析實作分享-----鐵達尼號存活分析(下集)

r r語言資料分析 titanic.csv

df568923 2020-08-24 00:02:29 ‧ 1993 瀏覽

分享至

接續上次(中集)我們所做的事情，我們已經把Age的資料都給補齊了，我們來回顧一下上次最後的資料:

str(full)

現在我們想要新增一個新的變數， Child 和 Mother 。

在這之前，想先來看一下性別與存活率之間的關係。

# First we'll look at the relationship between age & survival
ggplot(full[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram() + 
  # I include Sex since we know (a priori) it's a significant predictor
  facet_grid(.~Sex) + 
  theme_few()

以及創建 Child 和 Mother ，我們定義所謂 Child 就是小於18歲的人，反之就是成人。

# Create the column child, and indicate whether child or adult
full$Child[full$Age < 18] <- 'Child'
full$Child[full$Age >= 18] <- 'Adult'

# Show counts
table(full$Child, full$Survived)

接著是 Mother ，這邊定義母親是女性，且直系親屬大於1人，且不是Miss的稱謂，並且年齡大於18歲。

# Adding Mother variable
full$Mother <- 'Not Mother'
full$Mother[full$Sex == 'female' & full$Parch > 0 & full$Age > 18 & full$Title != 'Miss'] <- 'Mother'

# Show counts
table(full$Mother, full$Survived)

把剛剛做好的新變數轉換成 factor

# Finish by factorizing our two new factor variables
full$Child  <- factor(full$Child)
full$Mother <- factor(full$Mother)

如此一來，我們所需要的變數大致都完成了，接下來我們要預測存活率！還記得我們在上集的時候，資料是分成訓練集以及測試集，現在我們把他們拆解回來。

# Split the data back into a train set and a test set
train <- full[1:891,]
test <- full[892:1309,]

並且選好一個seed，開始機器學習吧！
我們選上我們覺得較為重要的變數，先用訓練集做出我們的model:

# Set a random seed
set.seed(754)

# Build the model (note: not all possible variables are used)
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                                            Fare + Embarked + Title + 
                                            FsizeD + Child + Mother,
                                            data = train)

# Show model error
plot(rf_model, ylim=c(0,0.36))
legend('topright', colnames(rf_model$err.rate), col=1:3, fill=1:3)

由上圖，我們黑線表示我們的錯誤率大概低於20%左右，所以算是一個還可以的模型。

接著我們可以試著看一下各個變數的重要性。

# Get importance
importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
    y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
    hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

看來之前所製作的 Title 是一個蠻有用的變數，不過後來製作的其他變數就沒有那麼成功了。也有可能只是model並不是做的那麼完善，但是還是很值得可以欣賞 Megan 所做的這些嘗試。

最後我們當然要來預測我們的測試集的存活率:

# Predict using the test set
prediction <- predict(rf_model, test)

# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)

# Write the solution to file
write.csv(solution, file = 'rf_mod_Solution.csv', row.names = F)

最後我們來看一下我們預測的結果(solution)和標準答案的差異，並且算一下預測成功率。先把標準答案的資料給讀進來。

gender_submission <- read.csv('F:/Users/yueh/Desktop/titanic08/gender_submission.csv', stringsAsFactors = F)

solution$submission<-as.factor(gender_submission$Survived)
table(solution$Survived, solution$submission)

因此，我們知道預測成功率是88.9%，這次的分享就到這邊，謝謝大家。

熱門推薦

{{ item.channelVendor }} | {{ item.webinarstarted }} |

直播中

尚未有邦友留言

立即登入留言

參賽組數

902 組

團體組數

37 組

累計文章數

19777 篇

完賽人數

529 人

15th鐵人賽 16th鐵人賽 13th鐵人賽 14th鐵人賽 17th鐵人賽 12th鐵人賽 11th鐵人賽鐵人賽 2019鐵人賽 javascript 2018鐵人賽 python 2017鐵人賽 windows php c# linux windows server css react

IT邦幫忙

[R語言]資料分析實作分享-----鐵達尼號存活分析(下集)

尚未有邦友留言

標記使用者