Day 12 K-Nearest Neighbors (KNN) -- Building a Model in Python

2022 iThome Ironman Contest

DAY 12

AI & Data

A Human Learning Machine Learning: Study Notes with Python, series part 12

14th Ironman Contest, machine learning, python, series articles

liaochenpo

Team NTUEPM_STAT LIFE

2022-09-23 00:09:06

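Introduction

Today we build a KNN model in Python, including how to choose a suitable value of K. Using the iris dataset as an example, we treat Species, which has three classes, as the response variable (outcome) and try to build a predictive model with KNN.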

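Importing the iris dataset

import pandas as pd

# Rdatasets copy of R's built-in iris data
urlprefix = 'https://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'datasets/iris.csv'
iris = pd.read_csv(urlprefix + dataname)
# The first CSV column is only a row index, so drop it whatever its header is
iris = iris.drop(columns=iris.columns[0])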

View the first five rows (head() returns five rows by default):

iris.head()

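Standardizing the data:
KNN looks for the neighbors closest to the input, so the features need to be standardized to put every variable on the same scale. Otherwise the variables with the largest numeric ranges would dominate the distance.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the four feature columns (everything except Species)
scaler = StandardScaler()
scaler.fit(iris.drop('Species', axis=1))
# Rescale each feature to mean 0 and standard deviation 1
scaled_features = scaler.transform(iris.drop('Species', axis=1))
# Wrap the scaled array in a DataFrame, reusing the original feature names
iris_feat = pd.DataFrame(scaled_features, columns=iris.columns[:-1])
iris_feat.head()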
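To make the standardization step concrete, here is a minimal sketch showing that StandardScaler amounts to a column-wise z-score: subtract the mean, then divide by the population standard deviation. The small matrix X is made up purely for illustration and is not taken from iris.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made up): two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual z-score: subtract each column's mean and divide by its
# population standard deviation (ddof=0, NumPy's default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation done by scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither one dominates the Euclidean distance just because of its units.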

Split the data into a training set and a test set:

from sklearn.model_selection import train_test_split

X = iris_feat
y = iris['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
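The split above is random, so each run produces a different partition. As an optional variation (not what the article's code does), the sketch below fixes random_state for reproducibility and passes stratify so the three species stay balanced in both sets. It loads scikit-learn's bundled copy of iris through load_iris so that it runs without a network connection. The seed value 42 is an arbitrary assumption.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# scikit-learn's bundled iris data: 150 rows, 4 features, 3 classes of 50 rows each
X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify=y keeps class proportions
# the same in the training and test sets (42 is an arbitrary seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(Counter(y_test.tolist()))
```

With 150 rows and test_size=0.3, the test set holds 45 rows, and stratification puts 15 rows of each species in it.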

Run the KNN algorithm (starting with K = 1):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
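To show what the classifier does conceptually, here is a from-scratch sketch of KNN prediction in plain Python: compute the Euclidean distance from the query to every training point, keep the K closest, and take a majority vote. The function knn_predict and the toy points are hypothetical names and data made up for illustration. This is not how scikit-learn implements KNeighborsClassifier, which can use faster neighbor-search structures.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to each training point, paired with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest training points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated clusters labeled 'a' and 'b'
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (5.5, 5.5), k=3))
```

With K = 1 the prediction is simply the label of the single closest training point, which is why small K values tend to be sensitive to noise.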

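Inspect the model's results with a confusion matrix:
The response variable Species has three classes. Entries on the diagonal of the matrix are correct classifications, and off-diagonal entries are misclassifications.

from sklearn.metrics import classification_report, confusion_matrix
# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, pred))
# Per-class precision, recall and F1 score
print(classification_report(y_test, pred))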
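As a small worked example of reading a confusion matrix, the sketch below derives overall accuracy and per-class recall from the matrix itself. The y_true and y_pred lists are made-up toy labels, not output from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['setosa', 'versicolor', 'virginica']

# Toy labels (made up): one versicolor sample is misclassified as virginica
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by the row total (all true samples of that class)
recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(accuracy)
print(recall)
```

The single off-diagonal entry sits in the versicolor row and the virginica column, which says which true class was mistaken for which predicted class.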

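Try other values of K and observe the error rate:
The plot shows that the error rate is lowest at K = 13, so K = 13 is a suitable choice for this dataset. Because train_test_split is called without a random_state, the split, and therefore the best K, can change from run to run.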

import numpy as np
import matplotlib.pyplot as plt

error_rate = []

# Fit a KNN model for each K from 1 to 29 and record its test-set error rate
for i in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

# Plot the error rate against K
plt.figure(figsize=(10, 6))
plt.plot(range(1, 30), error_rate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=8)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
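Choosing K from the test-set error rate means the test set is also used for tuning, so the reported error can look better than it really is. One common alternative, sketched below as an assumption and not as the article's method, is to pick K by cross-validation. Putting StandardScaler inside a pipeline also means the scaler is fitted only on each training fold. The sketch uses scikit-learn's bundled iris data, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

k_values = list(range(1, 30))
cv_error = []

for k in k_values:
    # Scaling happens inside the pipeline, so each fold is scaled
    # using statistics from its own training portion only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 stratified folds (5 is an arbitrary choice)
    scores = cross_val_score(model, X, y, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest cross-validated error (the first one if several tie)
best_k = k_values[int(np.argmin(cv_error))]
print(best_k, min(cv_error))
```

The cv_error list can be plotted against k_values in the same way as error_rate above to compare the two curves.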