DAY 23
Big Data

## Building a Decision Tree Classifier

• It can handle both continuous and categorical variables.
• It requires little data preprocessing. For example, yesterday we had to encode the `Pclass` and `Sex` variables as dummy variables before building the logistic regression model, but a decision tree classifier does not need this step.
```python
# ... (preceding code omitted)

# Create dummy variables
label_encoder = preprocessing.LabelEncoder()
encoded_Sex = label_encoder.fit_transform(titanic_train["Sex"])
encoded_Pclass = label_encoder.fit_transform(titanic_train["Pclass"])

# ... (remaining code omitted)
```

Decision Tree Classifiers are a non-parametric supervised learning method used for classification that are capable of performing multi-class classification on a dataset. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
1.10. Decision Trees - scikit-learn 0.18.1 documentation
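The "simple decision rules" mentioned in the quote can be inspected directly. A minimal sketch, assuming scikit-learn 0.21 or later (which added `tree.export_text`; it is not in the 0.18.1 release cited above), with `max_depth = 2` chosen only to keep the printout short:

```python
from sklearn.datasets import load_iris
from sklearn import tree

# Fit a shallow tree so the learned rules stay readable
iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth = 2)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented plain text
print(tree.export_text(clf, feature_names = list(iris.feature_names)))
```

Each branch of the printout is a threshold test on one feature, and each leaf reports the predicted class.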

### Python

```python
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split

# Load the iris data
iris = load_iris()
iris_X = iris.data
iris_y = iris.target

# Split into training and test sets
train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)

# Build the classifier
clf = tree.DecisionTreeClassifier()
iris_clf = clf.fit(train_X, train_y)

# Predict
test_y_predicted = iris_clf.predict(test_X)
print(test_y_predicted)

# Ground truth
print(test_y)
```

### R

```r
library(rpart)

# Split into training and test sets
n <- nrow(iris)
shuffled_iris <- iris[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_iris <- shuffled_iris[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_iris <- shuffled_iris[test_indices, ]

# Build the classifier
iris_clf <- rpart(Species ~ ., data = train_iris, method = "class")

# Predict
test_iris_predicted <- predict(iris_clf, test_iris, type = "class")
test_iris_predicted

# Ground truth
test_iris$Species
```

## Performance of the Decision Tree Classifier

### Python

```python
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the iris data
iris = load_iris()
iris_X = iris.data
iris_y = iris.target

# Split into training and test sets
train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)

# Build the classifier
clf = tree.DecisionTreeClassifier()
iris_clf = clf.fit(train_X, train_y)

# Predict
test_y_predicted = iris_clf.predict(test_X)

# Performance
accuracy = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy)
```
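Accuracy summarizes performance in a single number. The same `metrics` module can also produce a confusion matrix, mirroring the `table()` approach used in the R version. A minimal sketch (`random_state = 0` is an added assumption, used only to make the split and tree reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics

# Load and split the iris data
iris = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size = 0.3, random_state = 0)

# Fit the tree and predict
clf = tree.DecisionTreeClassifier(random_state = 0)
test_y_predicted = clf.fit(train_X, train_y).predict(test_X)

# Rows are true classes, columns are predicted classes
conf_mat = metrics.confusion_matrix(test_y, test_y_predicted)
print(conf_mat)
```

The diagonal counts correct predictions, so `conf_mat.diagonal().sum() / conf_mat.sum()` reproduces the accuracy score.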

### R

```r
library(rpart)

# Split into training and test sets
n <- nrow(iris)
shuffled_iris <- iris[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_iris <- shuffled_iris[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_iris <- shuffled_iris[test_indices, ]

# Build the classifier
iris_clf <- rpart(Species ~ ., data = train_iris, method = "class")

# Predict
test_iris_predicted <- predict(iris_clf, test_iris, type = "class")

# Performance
conf_mat <- table(test_iris$Species, test_iris_predicted)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

## Building a k-Nearest Neighbors Classifier

The k-Nearest Neighbors classifier is another algorithm that can handle multi-class classification problems. Because it assigns an unknown data point to a class based on distance, categorical variables must be converted to dummy variables and all numeric variables standardized, so that differences in units do not distort the distance calculation.
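The standardization step can be sketched with scikit-learn's `preprocessing.StandardScaler` (the toy matrix below is an invented example, not part of the Titanic or iris data):

```python
from sklearn import preprocessing
import numpy as np

# Toy data: height in cm and income in dollars, on very different scales
X = np.array([[150.0, 60000.0],
              [160.0, 55000.0],
              [170.0, 80000.0]])

# Rescale each column to mean 0 and unit variance
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, neither column dominates the distance computation
print(X_scaled.mean(axis = 0))
print(X_scaled.std(axis = 0))
```

Without this step, the income column would dominate every Euclidean distance simply because its numbers are thousands of times larger.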

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning).
1.6. Nearest Neighbors - scikit-learn 0.18.1 documentation

### Python

```python
from sklearn.datasets import load_iris
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# Load the iris data
iris = load_iris()
iris_X = iris.data
iris_y = iris.target

# Split into training and test sets
train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)

# Build the classifier
clf = neighbors.KNeighborsClassifier()
iris_clf = clf.fit(train_X, train_y)

# Predict
test_y_predicted = iris_clf.predict(test_X)
print(test_y_predicted)

# Ground truth
print(test_y)
```

### R

```r
library(class)

# Split into training and test sets
n <- nrow(iris)
shuffled_iris <- iris[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_iris <- shuffled_iris[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_iris <- shuffled_iris[test_indices, ]

# Separate X and y
train_iris_X <- train_iris[, -5]
test_iris_X <- test_iris[, -5]
train_iris_y <- train_iris[, 5]
test_iris_y <- test_iris[, 5]

# Predict
test_y_predicted <- knn(train = train_iris_X, test = test_iris_X, cl = train_iris_y, k = 5)
test_y_predicted

# Ground truth
print(test_iris_y)
```

## How to Choose k

### Python

```python
from sklearn.datasets import load_iris
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

# Load the iris data
iris = load_iris()
iris_X = iris.data
iris_y = iris.target

# Split into training and test sets
train_X, test_X, train_y, test_y = train_test_split(iris_X, iris_y, test_size = 0.3)

# Choose k
ks = np.arange(1, round(0.2 * train_X.shape[0]) + 1)
accuracies = []

for k in ks:
    clf = neighbors.KNeighborsClassifier(n_neighbors = k)
    iris_clf = clf.fit(train_X, train_y)
    test_y_predicted = iris_clf.predict(test_X)
    accuracy = metrics.accuracy_score(test_y, test_y_predicted)
    accuracies.append(accuracy)

# Visualize
plt.scatter(ks, accuracies)
plt.show()
appr_k = accuracies.index(max(accuracies)) + 1
print(appr_k)
```

The model's accuracy is highest when k is between 8 and 12.
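Judging k on a single train/test split, as above, is sensitive to how the split happens to fall. A common alternative is to average accuracy over several folds with `cross_val_score` from `sklearn.model_selection`; a minimal sketch (the candidate range 1 to 20 and the 5-fold setting are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
import numpy as np

iris = load_iris()

# Mean 5-fold cross-validated accuracy for each candidate k
ks = np.arange(1, 21)
cv_scores = []
for k in ks:
    clf = neighbors.KNeighborsClassifier(n_neighbors = k)
    cv_scores.append(cross_val_score(clf, iris.data, iris.target, cv = 5).mean())

# Pick the k with the best cross-validated accuracy
best_k = int(ks[np.argmax(cv_scores)])
print(best_k)
```

Because every fold serves as the test set once, the chosen k depends far less on the luck of one particular split.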

### R

```r
library(class)

# Split into training and test sets
n <- nrow(iris)
shuffled_iris <- iris[sample(n), ]
train_indices <- 1:round(0.7 * n)
train_iris <- shuffled_iris[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test_iris <- shuffled_iris[test_indices, ]

# Separate X and y
train_iris_X <- train_iris[, -5]
test_iris_X <- test_iris[, -5]
train_iris_y <- train_iris[, 5]
test_iris_y <- test_iris[, 5]

# Choose k
ks <- 1:round(0.2 * nrow(train_iris_X))
accuracies <- rep(NA, length(ks))

for (i in ks) {
  test_y_predicted <- knn(train = train_iris_X, test = test_iris_X, cl = train_iris_y, k = i)
  conf_mat <- table(test_iris_y, test_y_predicted)
  accuracies[i] <- sum(diag(conf_mat)) / sum(conf_mat)
}

# Visualize
plot(ks, accuracies, xlab = "k")
which.max(accuracies)
```

The model's accuracy is highest when k equals 14 or 18.