
## [Day 25] Machine Learning (5): Ensemble Learning

> The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
>
> — 1.11. Ensemble methods, scikit-learn 0.18.1 documentation

## Bagging

Bagging is short for Bootstrap Aggregating. It uses statistical bootstrap sampling to obtain different training sets, then fits a base classifier on each of them. If the algorithm produces 5 base classifiers whose predictions for some observation are 1, 0, 1, 1, 1, then the output of the Bagging algorithm is 1. This process is called voting among the base classifiers.
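The two ingredients described above, bootstrap sampling and majority voting, can be sketched directly in a few lines (a minimal toy illustration, not part of the original article):

```python
import numpy as np

# Majority voting: predictions of 5 hypothetical base classifiers for one observation
votes = np.array([1, 0, 1, 1, 1])
prediction = np.bincount(votes).argmax()
print(prediction)  # 1

# Bootstrap sampling: draw n observations with replacement from a toy dataset,
# so some observations repeat while others are left out
rng = np.random.default_rng(0)
data = np.arange(10)
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```

Each base classifier is trained on a different bootstrap sample, and the final prediction is the class that receives the most votes.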

### Python

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, preprocessing, metrics
from sklearn.model_selection import train_test_split

# Load data (titanic_train is assumed to have been read in beforehand)

# Fill missing values
age_median = np.nanmedian(titanic_train["Age"])
new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
titanic_train["Age"] = new_Age

# Create dummy variables
label_encoder = preprocessing.LabelEncoder()
encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])

# Build training and test data
titanic_X = pd.DataFrame([titanic_train["Pclass"],
                          encoded_Sex,
                          titanic_train["Age"]
]).T
titanic_y = titanic_train["Survived"]
train_X, test_X, train_y, test_y = train_test_split(titanic_X, titanic_y, test_size = 0.3)

# Build the bagging model
bag = ensemble.BaggingClassifier(n_estimators = 100)
bag_fit = bag.fit(train_X, train_y)

# Predict
test_y_predicted = bag.predict(test_X)

# Performance
accuracy = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy)
```
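A practical aside: because each base classifier is trained on a bootstrap sample, the observations it never saw (the out-of-bag observations) provide a free validation estimate. `BaggingClassifier` exposes this via the `oob_score` parameter; a minimal sketch on synthetic data standing in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic two-class data (a stand-in for the Titanic features)
X, y = make_classification(n_samples=200, random_state=0)

# oob_score=True scores each observation with the estimators that did not see it
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0)
bag.fit(X, y)
print(bag.oob_score_)  # out-of-bag accuracy estimate, no separate test set needed
```

The out-of-bag estimate is convenient for quick model checks, though the article's explicit train/test split remains the more general approach.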

### R

```r
library(adabag)
library(rpart)

titanic_train$Survived <- factor(titanic_train$Survived)

# Fill Age missing values with the median
age_median <- median(titanic_train$Age, na.rm = TRUE)
new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
titanic_train$Age <- new_Age

# Split training and test data
n <- nrow(titanic_train)
shuffled_titanic <- titanic_train[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_titanic <- shuffled_titanic[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_titanic <- shuffled_titanic[test_indices, ]

# Build the model
bag_fit <- bagging(Survived ~ Pclass + Age + Sex, data = train_titanic, mfinal = 100)

# Predict
test_titanic_predicted <- predict(bag_fit, test_titanic)

# Performance
accuracy <- 1 - test_titanic_predicted$error
accuracy
```

## AdaBoost

AdaBoost is likewise an ensemble learning algorithm built on several base classifiers. What sets it apart from the Bagging algorithm above is that, when forming each base classifier, it does not only sample randomly: it raises the sampling weight of the observations that were misclassified by the previous base classifier, so those observations are more likely to be drawn when the next base classifier is formed, which in turn improves their chance of being classified correctly. In short, it is an advanced variant of Bagging that adaptively adjusts the sampling weights of the observations.
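The weight update described above can be sketched for a single boosting round. This is a hand-rolled illustration of the classic AdaBoost update rule on toy numbers, not the internals of adabag or scikit-learn:

```python
import numpy as np

# One boosting round on 6 observations with uniform initial sampling weights
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])   # the base classifier gets observations 3 and 6 wrong
w = np.full(6, 1 / 6)

miss = y_true != y_pred
err = w[miss].sum()                      # weighted error rate: here 1/3
alpha = 0.5 * np.log((1 - err) / err)    # weight of this base classifier in the final vote

# Misclassified observations are up-weighted, correct ones down-weighted,
# then the weights are renormalized to sum to 1
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()
print(w)  # misclassified observations: 0.25 each; correct ones: 0.125 each
```

After the update, the two misclassified observations carry twice the weight of the others, so they are twice as likely to appear in the next bootstrap sample.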

### Python

```python
import numpy as np
import pandas as pd
from sklearn import ensemble, preprocessing, metrics
from sklearn.model_selection import train_test_split

# Load data (titanic_train is assumed to have been read in beforehand)

# Fill missing values
age_median = np.nanmedian(titanic_train["Age"])
new_Age = np.where(titanic_train["Age"].isnull(), age_median, titanic_train["Age"])
titanic_train["Age"] = new_Age

# Create dummy variables
label_encoder = preprocessing.LabelEncoder()
encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])

# Build training and test data
titanic_X = pd.DataFrame([titanic_train["Pclass"],
                          encoded_Sex,
                          titanic_train["Age"]
]).T
titanic_y = titanic_train["Survived"]
train_X, test_X, train_y, test_y = train_test_split(titanic_X, titanic_y, test_size = 0.3)

# Build the boosting model
boost = ensemble.AdaBoostClassifier(n_estimators = 100)
boost_fit = boost.fit(train_X, train_y)

# Predict
test_y_predicted = boost.predict(test_X)

# Performance
accuracy = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy)
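After fitting, scikit-learn's `AdaBoostClassifier` keeps the weight of each base classifier in the `estimator_weights_` attribute, which makes the adaptive behaviour described above visible. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data (a stand-in for the Titanic features)
X, y = make_classification(n_samples=200, random_state=0)

boost = AdaBoostClassifier(n_estimators=5, random_state=0)
boost.fit(X, y)
print(boost.estimator_weights_)  # one weight per base classifier
```

Base classifiers with lower weighted error receive larger weights and therefore count for more in the final vote.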

### R

```r
library(adabag)
library(rpart)

titanic_train$Survived <- factor(titanic_train$Survived)

# Fill Age missing values with the median
age_median <- median(titanic_train$Age, na.rm = TRUE)
new_Age <- ifelse(is.na(titanic_train$Age), age_median, titanic_train$Age)
titanic_train$Age <- new_Age

# Split training and test data
n <- nrow(titanic_train)
shuffled_titanic <- titanic_train[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_titanic <- shuffled_titanic[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_titanic <- shuffled_titanic[test_indices, ]

# Build the model
boost_fit <- boosting(Survived ~ Pclass + Age + Sex, data = train_titanic, mfinal = 100)

# Predict
test_titanic_predicted <- predict(boost_fit, test_titanic)

# Performance
accuracy <- 1 - test_titanic_predicted$error
accuracy
```
