# Day 24: Big Data

## K-Means

The K-Means algorithm completes a clustering task very quickly, but when the observations contain noise or extreme values, the resulting clusters are easily distorted by them. It is therefore best suited to large samples whose distributions are compact.
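That sensitivity to extreme values is easy to see directly (a minimal sketch on synthetic data; the two cluster locations and the single outlier at (100, 100) are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight clusters around (0, 0) and (5, 5)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

# Without the outlier, the centroids sit near (0, 0) and (5, 5)
centers_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_

# Adding a single extreme observation hijacks one centroid:
# the outlier tends to get a cluster to itself while the two
# real clusters are merged under the other centroid
X_noisy = np.vstack([X, [[100.0, 100.0]]])
centers_noisy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_noisy).cluster_centers_
```

One extreme point is enough to change what the centroids represent, which is why compact, noise-free samples suit K-Means best.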

### Quick implementation

#### Python

```python
from sklearn import cluster, datasets

# Load the iris data
iris = datasets.load_iris()
iris_X = iris.data

# KMeans algorithm
kmeans_fit = cluster.KMeans(n_clusters = 3).fit(iris_X)

# Print the clustering result
cluster_labels = kmeans_fit.labels_
print("Clustering result:")
print(cluster_labels)
print("---")

# Print the true species for comparison
iris_y = iris.target
print("True species:")
print(iris_y)
```

#### R

```r
# Load the iris data
iris_kmeans <- iris[, -5]

# KMeans algorithm
kmeans_fit <- kmeans(iris_kmeans, nstart = 20, centers = 3)

# Print the clustering result
kmeans_fit$cluster

# Print the true species for comparison
iris$Species
```

### Performance

#### Python

> Compute the mean Silhouette Coefficient of all samples. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). The best value is 1 and the worst value is -1.
>
> sklearn.metrics.silhouette_score - scikit-learn 0.18.1 documentation
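As a sanity check, the (b - a) / max(a, b) formula quoted above can be computed by hand and compared against `metrics.silhouette_score` (a sketch; the helper `silhouette_one` is our own name, not a scikit-learn function):

```python
import numpy as np
from sklearn import cluster, datasets, metrics

iris = datasets.load_iris()
iris_X = iris.data
labels = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_X).labels_

def silhouette_one(i):
    """Silhouette coefficient of sample i, following the definition above."""
    dists = np.linalg.norm(iris_X - iris_X[i], axis=1)
    own = labels == labels[i]
    a = dists[own].sum() / (own.sum() - 1)  # mean intra-cluster distance, excluding the point itself
    b = min(dists[labels == k].mean()       # mean distance to the nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual_avg = np.mean([silhouette_one(i) for i in range(len(iris_X))])
print(manual_avg)
```

The hand-computed average should agree with `metrics.silhouette_score(iris_X, labels)` up to floating-point error.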

```python
from sklearn import cluster, datasets, metrics

# Load the iris data
iris = datasets.load_iris()
iris_X = iris.data

# KMeans algorithm
kmeans_fit = cluster.KMeans(n_clusters = 3).fit(iris_X)
cluster_labels = kmeans_fit.labels_

# Print the performance measure
silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
print(silhouette_avg)
```

#### R

```r
# Load the iris data
iris_kmeans <- iris[, -5]

# KMeans algorithm
kmeans_fit <- kmeans(iris_kmeans, nstart = 20, centers = 3)
ratio <- kmeans_fit$tot.withinss / kmeans_fit$totss
ratio
```

### How to choose k

#### Python

```python
from sklearn import cluster, datasets, metrics
import matplotlib.pyplot as plt

# Load the iris data
iris = datasets.load_iris()
iris_X = iris.data

# Loop over candidate k values
silhouette_avgs = []
ks = range(2, 11)
for k in ks:
    kmeans_fit = cluster.KMeans(n_clusters = k).fit(iris_X)
    cluster_labels = kmeans_fit.labels_
    silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
    silhouette_avgs.append(silhouette_avg)

# Plot and print the performance for k = 2 to 10
plt.bar(ks, silhouette_avgs)
plt.show()
print(silhouette_avgs)
```

The K-Means algorithm performs best at k = 2 and k = 3, which confirms our earlier observation: the setosa species differs markedly from the other two in both petal length and width and sepal length and width, so when clustering with K-Means, setosa is likely to end up in one cluster while versicolor and virginica are grouped together.
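One way to check that interpretation is to cross-tabulate the k = 3 cluster labels against the true species (a sketch; `random_state` is fixed only for reproducibility):

```python
import numpy as np
from sklearn import cluster, datasets

iris = datasets.load_iris()
kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Contingency table: rows = true species (setosa, versicolor, virginica),
# columns = cluster labels assigned by K-Means
table = np.zeros((3, 3), dtype=int)
for species, label in zip(iris.target, kmeans.labels_):
    table[species, label] += 1

# The setosa row should concentrate entirely in a single column,
# while versicolor and virginica share some overlap
print(table)
```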

#### R

```r
# Load the iris data
iris_kmeans <- iris[, -5]

# Loop over candidate k values
ratio <- rep(NA, times = 10)
for (k in 2:length(ratio)) {
  kmeans_fit <- kmeans(iris_kmeans, centers = k, nstart = 20)
  ratio[k] <- kmeans_fit$tot.withinss / kmeans_fit$betweenss
}
plot(ratio, type = "b", xlab = "k")
```
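The R loop above tracks the within-cluster sum of squares; scikit-learn exposes the same quantity on a fitted model as the `inertia_` attribute, so the equivalent elbow-style check can be sketched in Python as well:

```python
from sklearn import cluster, datasets

iris = datasets.load_iris()

# Within-cluster sum of squares (sklearn's inertia_) for k = 2 to 10;
# it shrinks as k grows, and the "elbow" suggests a reasonable k
wss = [cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(iris.data).inertia_
       for k in range(2, 11)]
print(wss)
```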

## Hierarchical Clustering
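Hierarchical (agglomerative) clustering starts with every observation in its own cluster and repeatedly merges the two closest clusters until a single tree remains; cutting that tree at any height yields the desired number of clusters. A minimal sketch on four made-up points, using SciPy's `linkage` and `fcluster`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Ward linkage: each merge joins the two clusters whose union
# has the smallest increase in within-cluster variance
Z = linkage(X, method="ward")

# Cutting the tree into 2 clusters recovers the two pairs
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```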

### Quick implementation

#### Python

```python
from sklearn import cluster, datasets

# Load the iris data
iris = datasets.load_iris()
iris_X = iris.data

# Hierarchical clustering algorithm
hclust = cluster.AgglomerativeClustering(linkage = 'ward', affinity = 'euclidean', n_clusters = 3)

# Print the clustering result
hclust.fit(iris_X)
cluster_labels = hclust.labels_
print(cluster_labels)
print("---")

# Print the true species for comparison
iris_y = iris.target
print(iris_y)
```

#### R

```r
# Load the iris data
iris_hclust <- iris[, -5]

# Hierarchical clustering algorithm
dist_matrix <- dist(iris_hclust)
hclust_fit <- hclust(dist_matrix, method = "single")
hclust_fit_cut <- cutree(hclust_fit, k = 3)

# Print the clustering result
hclust_fit_cut

# Print the true species for comparison
iris$Species
```

### Performance

#### Python

```python
from sklearn import cluster, datasets, metrics

# Load the iris data
iris = datasets.load_iris()
iris_X = iris.data

# Hierarchical clustering algorithm
hclust = cluster.AgglomerativeClustering(linkage = 'ward', affinity = 'euclidean', n_clusters = 3)

# Print the performance measure
hclust.fit(iris_X)
cluster_labels = hclust.labels_
silhouette_avg = metrics.silhouette_score(iris_X, cluster_labels)
print(silhouette_avg)
```

#### R

```r
library(GMD)

# Load the iris data
iris_hclust <- iris[, -5]

# Hierarchical clustering algorithm
dist_matrix <- dist(iris_hclust)
hclust_fit <- hclust(dist_matrix)
hclust_fit_cut <- cutree(hclust_fit, k = 3)

# Print the performance measure
hc_stats <- css(dist_matrix, clusters = hclust_fit_cut)
hc_stats$totwss / hc_stats$totbss

# Dendrogram
plot(hclust_fit)
rect.hclust(hclust_fit, k = 2, border = "red")
rect.hclust(hclust_fit, k = 3, border = "green")
```