
## [R] Data Analysis in Practice: Titanic Survival Analysis (Part 1)

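Kaggle is a global data science community site that hosts data analysis competitions, analysis write-ups, and downloadable datasets. This post walks through one of those write-ups: the author, Meg Risdal, analyzes a dataset of Titanic passenger information, and the data used can be downloaded from Kaggle. Let's follow the author's steps through the analysis.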

``````
library('ggplot2')      # visualization
library('ggthemes')     # visualization
library('scales')       # visualization
library('dplyr')        # data wrangling and transformation
library('mice')         # imputation
library('randomForest') # machine learning
``````

``````
# Load the training and test sets (adjust the paths to where you saved the Kaggle files)
train <- read.csv('F:/Users/yueh/Desktop/titanic08/train.csv', stringsAsFactors = FALSE)
test  <- read.csv('F:/Users/yueh/Desktop/titanic08/test.csv', stringsAsFactors = FALSE)
``````

``````
# Inspect the structure (column names and types) of both data sets
str(train)
str(test)
``````

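1. PassengerId: passenger ID
2. Survived: survival status (0 = died, 1 = survived)
3. Pclass: ticket class (1 = highest, ..., 3 = lowest)
4. Name: passenger name
5. Sex: sex
6. Age: age
7. SibSp: number of siblings and spouses aboard (collateral relatives)
8. Parch: number of parents and children aboard (direct relatives)
9. Ticket: ticket number
10. Fare: ticket fare
11. Cabin: cabin number
12. Embarked: port of embarkation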

``````
full <- bind_rows(train, test) # bind training & test data
str(full)
``````
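To see what combining the two files does, here is a minimal sketch on made-up data. The frames `toy_train` and `toy_test` are hypothetical stand-ins, not the Kaggle files. It assumes, as the column list suggests, that only the training set carries `Survived`; `bind_rows()` then fills that column with `NA` for the test rows.

```r
library(dplyr)

# Hypothetical toy frames standing in for train.csv and test.csv
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L))
toy_test  <- data.frame(PassengerId = 3:4)   # no Survived column

# bind_rows() matches columns by name and pads missing ones with NA
toy_full <- bind_rows(toy_train, toy_test)
print(toy_full)

# Rows that came from the test set have no survival label
print(is.na(toy_full$Survived))
```

This is presumably why the later plot restricts itself to `full[1:891,]`, the rows that came from the training set.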

``````
# Grab title from passenger names
full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)

# Show title counts by sex
table(full$Sex, full$Title)
``````
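The regular expression above is easier to follow on a few concrete strings. The sketch below applies the same `gsub()` call to three sample names written in the "Surname, Title. Given names" pattern; the vector `sample_names` is introduced here only for illustration.

```r
# Sample names in the "Surname, Title. Given names" format (illustrative input)
sample_names <- c('Braund, Mr. Owen Harris',
                  'Heikkinen, Miss. Laina',
                  'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)')

# '(.*, )' strips everything up to and including the comma and space;
# '(\\..*)' strips the first period and everything after it
titles <- gsub('(.*, )|(\\..*)', '', sample_names)
print(titles)
```

What remains is the text between the comma and the first period, which is why a multi-word title such as `the Countess` survives intact.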

``````
# Titles with very low cell counts to be combined to "rare" level
rare_title <- c('Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don',
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

# Also reassign mlle, ms, and mme accordingly
full$Title[full$Title == 'Mlle']        <- 'Miss'
full$Title[full$Title == 'Ms']          <- 'Miss'
full$Title[full$Title == 'Mme']         <- 'Mrs'
full$Title[full$Title %in% rare_title]  <- 'Rare Title'

# Show title counts by sex again
table(full$Sex, full$Title)
``````

``````
# Grab surname from passenger name (the text before the first comma or period)
full$Surname <- sapply(full$Name,
                       function(x) strsplit(x, split = '[,.]')[[1]][1])
``````
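As a quick check of the surname step, the sketch below runs the same `strsplit()` logic on two sample names. `sample_names` is an illustrative vector, and `USE.NAMES = FALSE` is added here only so the printed result is a plain character vector.

```r
# Illustrative input in the "Surname, Title. Given names" format
sample_names <- c('Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina')

# Split each name on commas and periods, then keep the first piece
surnames <- sapply(sample_names,
                   function(x) strsplit(x, split = '[,.]')[[1]][1],
                   USE.NAMES = FALSE)
print(surnames)
```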

``````
# Create a family size variable including the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1

# Create a family variable
full$Family <- paste(full$Surname, full$Fsize, sep = '_')
``````
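The two derived columns can be tried out on a small made-up table. The data frame `toy` below is hypothetical and only mirrors the column names used above.

```r
# Hypothetical passengers: one travelling with a sibling or spouse, one alone
toy <- data.frame(Surname = c('Braund', 'Heikkinen'),
                  SibSp   = c(1, 0),
                  Parch   = c(0, 0),
                  stringsAsFactors = FALSE)

# Family size counts relatives aboard plus the passenger
toy$Fsize <- toy$SibSp + toy$Parch + 1

# Family label: surname and family size joined by an underscore
toy$Family <- paste(toy$Surname, toy$Fsize, sep = '_')
print(toy)
```

Because the label is built only from surname and size, two unrelated families that share both would receive the same `Family` value.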

``````
# Use ggplot2 to visualize the relationship between family size & survival
ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
  geom_bar(stat = 'count', position = 'dodge') +
  scale_x_continuous(breaks = c(1:11)) +
  labs(x = 'Family Size') +
  theme_few()
``````

``````
# Discretize family size
full$FsizeD[full$Fsize == 1] <- 'singleton'
full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'small'
full$FsizeD[full$Fsize > 4] <- 'large'

# Show family size by survival using a mosaic plot
mosaicplot(table(full$FsizeD, full$Survived), main = 'Family Size by Survival', shade = TRUE)
``````
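The three assignments above amount to binning `Fsize` into the ranges 1, 2 to 4, and 5 or more. As an alternative sketch (not the original author's approach), the same bins can be produced in one call with base R's `cut()`; the vector `fsize` is made-up input.

```r
# Made-up family sizes covering each bin and its boundaries
fsize <- c(1, 2, 4, 5, 11)

# Intervals are right-closed by default: (0,1], (1,4], (4,Inf]
fsize_d <- cut(fsize,
               breaks = c(0, 1, 4, Inf),
               labels = c('singleton', 'small', 'large'))
print(as.character(fsize_d))
```

One difference to keep in mind: `cut()` returns a factor whose levels follow the order of `labels`, whereas the assignments above build a character column.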

``````
# Create a Deck variable from the first character of the cabin number
full$Deck <- factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
``````
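To show how the deck letter is derived, the sketch below applies the same expression to a few made-up cabin strings, including an empty one. `cabins` is illustrative input, and `USE.NAMES = FALSE` is added only to keep the result unnamed.

```r
# Made-up cabin values: a single cabin, a missing cabin, and several cabins
cabins <- c('C85', '', 'B57 B59 B63 B66')

# strsplit(x, NULL) breaks a string into single characters; [1] keeps the first.
# For an empty string there are no characters, so the result is NA.
deck <- factor(sapply(cabins,
                      function(x) strsplit(x, NULL)[[1]][1],
                      USE.NAMES = FALSE))
print(deck)
```

Passengers with an empty `Cabin` therefore end up with a missing `Deck` rather than a level of their own.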

``````
# Check the combined data again, now including the derived columns
str(full)
``````