[Day20] Evaluating the Number of Clusters

2022 iThome Ironman Contest

DAY 20

AI & Data

Human Behavior Data Analysis: Implementation in R and Python, part 20 of the series

14th Ironman Contest

anonymous9007

Team NTUEPM_STAT LIFE

2022-10-01 09:11:44

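The goal of clustering is to minimize the total variation within clusters and maximize the total variation between clusters, so finding an appropriate number of clusters (k) is an important part of any clustering task.
Two common ways to evaluate the number of clusters are the elbow method and the average silhouette method.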

Elbow Method

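The elbow method computes the sum of squared errors (SSE), that is, the sum of squared distances between every data point and the center of its assigned cluster. SSE is largest at k=1 and decreases as k increases. Around the most suitable k, the rate of decrease levels off, which produces the "elbow" in the curve.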
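
To make the SSE definition concrete, here is a minimal pure-Python sketch. The toy points, the hand-assigned cluster labels, and the helper names `centroid` and `sse` are assumptions for illustration only; they are not taken from the dataset used in this series, and the labels are not the output of an actual k-means run.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sse(points, labels):
    """Sum of squared distances from each point to its own cluster's centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        for p in members:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# Hypothetical toy data: two tight groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

sse_k1 = sse(points, [0, 0, 0, 0])  # one cluster            -> 101.0
sse_k2 = sse(points, [0, 0, 1, 1])  # the two natural groups -> 1.0
sse_k3 = sse(points, [0, 0, 1, 2])  # one group split        -> 0.5
sse_k4 = sse(points, [0, 1, 2, 3])  # every point alone      -> 0.0
print(sse_k1, sse_k2, sse_k3, sse_k4)
```

In this toy case SSE falls sharply from k=1 to k=2 and barely changes afterwards, so the elbow sits at k=2.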

Average Silhouette Method

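The silhouette coefficient (silhouette index) evaluates a clustering result by how tightly each data point sits within its own cluster (cohesion) and how far it is from the other clusters (separation).
Method: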

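Compute a(i), the average distance from sample i to the other samples in its own cluster C_i. The smaller a(i) is, the better sample i fits this cluster, so a(i) can be read as the within-cluster dissimilarity of sample i.
For every other cluster C_j, compute b_ij, the average distance from sample i to all samples in C_j. Because there are several other clusters, the between-cluster dissimilarity of sample i is the smallest of these: b(i) = min(b_i1, b_i2, …, b_ik).
Compute s(i) from a(i) and b(i): s(i) = (b(i) - a(i)) / max(a(i), b(i)). s(i) is called the silhouette coefficient of sample i.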

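When s(i) is close to 1, sample i fits its cluster well; when s(i) is close to -1, sample i should have been assigned to another cluster; when s(i) is close to 0, sample i lies on the boundary between two clusters.
Averaging s(i) over all samples gives the silhouette coefficient of the clustering result as a whole.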
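
The steps above can be sketched in plain Python using the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)). The toy points, the two labelings, and the helper names `mean_dist` and `silhouette_values` are assumptions for illustration. Assigning s(i) = 0 to a sample that is alone in its cluster is a common convention, not something stated in this article.

```python
import math

def mean_dist(p, others):
    """Average Euclidean distance from point p to every point in others."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_values(points, labels):
    """Return s(i) for every sample; assumes at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [points[j] for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for a singleton cluster
            continue
        a = mean_dist(points[i], own)  # within-cluster dissimilarity a(i)
        b = min(                       # nearest other cluster gives b(i)
            mean_dist(points[i], [points[j] for j in idxs])
            for other, idxs in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores

# Hypothetical toy data: two tight groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

good = silhouette_values(points, [0, 0, 1, 1])  # matches the natural groups
bad = silhouette_values(points, [0, 1, 0, 1])   # mixes the groups

avg_good = sum(good) / len(good)  # close to 0.90
avg_bad = sum(bad) / len(bad)     # -0.45
print(avg_good, avg_bad)
```

The labeling that follows the natural groups scores close to 1, while the mixed labeling scores below 0, which matches the interpretation of s(i) given above.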

Elbow Method Implementation

R: the `tot.withinss` component returned by `kmeans()` is the total within-cluster sum of squares

library(purrr)
# Work on a smaller random sample to keep the run time down
subset_testing <- testing[sample(1:length(testing$Activity),10000),]
# Total within-cluster sum of squares for a given k (columns 1-12 are the features)
wss <- function(k){kmeans(subset_testing[,1:12],k,nstart = 20)$tot.withinss}

k.value <- 1:20
wss_value <- map_dbl(k.value,wss)

plot(k.value,wss_value,type= 'l',xlab = 'Number of clusters K',ylab = 'Total within-clusters sum of squares')

Python: after fitting, the `inertia_` attribute of `KMeans` (`KMeans().fit(X).inertia_`) is the within-cluster sum of squares

import random
import matplotlib.pyplot as plt
from sklearn import cluster
## Work on a smaller random sample
index = list(range(X_test.shape[0]))
sample_index = random.sample(index, 10000)
X_sample = X_test[sample_index,:]

wss_avg = []
for i in range(2,20):
    # fit() alone is enough; inertia_ is available on the fitted model
    kmeans_fit = cluster.KMeans(n_clusters = i,algorithm="elkan").fit(X_sample)
    wss_avg.append(kmeans_fit.inertia_)
plt.plot(range(2,20), wss_avg)

Silhouette Index Implementation

Continuing from the previous day's KMeans model, we now evaluate the clustering.

R: `silhouette` from the `cluster` package

library(cluster)
## Work on a smaller random sample
## subset_testing[,13] is Activity (the label), so only columns 1-12 are used
subset_testing <- testing[sample(1:length(testing$Activity),10000),]
silhouette_score <- function(k){
  km <- kmeans(subset_testing[,1:12], centers = k, nstart=25)
  # Distances are computed on the same features that were clustered
  ss <- silhouette(km$cluster, dist(subset_testing[,1:12]))
  mean(ss[, 3])
}
k <- 2:20
avg_sil <- sapply(k, silhouette_score)
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)

Python: `silhouette_score` from `sklearn.metrics`

from sklearn import cluster
from sklearn.metrics import silhouette_score
# KMeans
pred_kmean = cluster.KMeans(n_clusters = 13,algorithm="elkan")
pred_kmean.fit_predict(X_test)

pred_labels = pred_kmean.labels_

score = silhouette_score(X_test, pred_labels, metric='euclidean')
print('Silhouette Score: %.3f' % score)
## Silhouette Score: 0.244

Compute the silhouette score for different values of K

import random
import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.metrics import silhouette_score
## Reduce the amount of data
index = list(range(X_test.shape[0]))
sample_index = random.sample(index, 10000)
X_sample = X_test[sample_index,:]

silhouette_avg = []
for i in range(2,20):
    kmeans_fit = cluster.KMeans(n_clusters = i,algorithm="elkan").fit(X_sample)
    # Score the labels against the same features that were clustered
    silhouette_avg.append(silhouette_score(X_sample, kmeans_fit.labels_,metric='euclidean'))
plt.plot(range(2,20), silhouette_avg)