- Today we'll use the Iris dataset provided by sklearn and hand-write a logistic regression to classify it; through this hands-on exercise we'll walk through the complete basic workflow of training and testing a model from scratch
Getting the data, preprocessing, and setting the goal
- First, of course, we get the data, and then take a look at what our goal is
- So here we load the Iris dataset and do some basic inspection of the data
from sklearn import datasets
iris = datasets.load_iris()
# print(iris.DESCR)
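# a quick look at the raw data: 150 samples, 4 features, 3 classes
# print(iris.data.shape)     # (150, 4)
# print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']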
- We can see that the data has three classes and four different feature attributes
- To keep the demo simple, we'll take just two of the classes and keep only two of the features for classification
- So the operations below will select two classes and two features
import pandas as pd
import numpy as np
# use pandas as dataframe and merge features and targets
feature = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.DataFrame(iris.target, columns=['target'])
iris_data = pd.concat([feature, target], axis=1)
# keep only sepal length in cm, sepal width in cm and target
iris_data = iris_data[['sepal length (cm)', 'sepal width (cm)', 'target']]
# keep only Iris-Setosa and Iris-Versicolour classes
iris_data = iris_data[iris_data.target <= 1]
# print(iris_data.head(5))
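# sanity check: the two remaining classes are balanced at 50 samples each
# print(iris_data['target'].value_counts())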
- Our goal, then, is to predict the species based on those two features
- First we split the data into training and test sets; here we can use the model_selection utilities provided by sklearn to divide the data into two groups, train and test
- Note that for Logistic Regression we will also scale the features first (to avoid unnecessary efficiency problems during gradient descent caused by features with very different numeric ranges; you can think of it as normalization), so we will use sklearn's StandardScaler
from sklearn.model_selection import train_test_split
train_feature, test_feature, train_target, test_target = train_test_split(
    iris_data[['sepal length (cm)', 'sepal width (cm)']], iris_data[['target']], test_size=0.3, random_state=4
)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit the scaler on the training set only, then reuse the same statistics
# for the test set (fit_transform on the test set would leak information)
train_feature = sc.fit_transform(train_feature)
test_feature = sc.transform(test_feature)
train_target = np.array(train_target)
test_target = np.array(test_target)
# print(train_feature, test_feature)
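# after scaling, each training feature column has mean ~0 and std ~1
# print(train_feature.mean(axis=0), train_feature.std(axis=0))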
Logistic Regression
- Now that we have our data, let's start writing our Logistic Regression~
- We mentioned earlier that Logistic Regression really only differs from Linear Regression by the sigmoid function, so let's see what the code looks like
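- Written as a formula, the model is simply y_hat = sigmoid(w·x + b), where sigmoid(z) = 1/(1 + e^(-z)) squashes the linear output into a probability between 0 and 1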
# 1) model
# f = wx + b, sigmoid at the end
class LogisticRegression():
    def __init__(self):
        super(LogisticRegression, self).__init__()

    def linear(self, x, w, b):
        return np.dot(x, w) + b

    def sigmoid(self, x):
        return 1/(1 + np.exp(-x))

    def forward(self, x, w, b):
        y_pred = self.sigmoid(self.linear(x, w, b)).reshape(-1, 1)
        return y_pred

model = LogisticRegression()
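- As a quick sanity check (purely illustrative, not part of the original flow): with all-zero parameters the linear output is 0, so the sigmoid should return exactly 0.5 for every sample

# quick smoke test: zero weights -> sigmoid(0) = 0.5 for every row
# print(model.forward(train_feature[:3], np.zeros(2), np.zeros(1)))  # expect all 0.5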
Checking the loss & updating the parameters
- We mentioned earlier that the loss check for Logistic Regression can't be MSE, so here we use Cross Entropy as the loss function. I don't trust myself to explain the math all that well QQ, so if you want the detailed derivation, I'm afraid you'll have to look it up yourself
- As for the parameter update (from here on we'll call it optimization, since we really are hoping to update the parameters to be better and better), we pick Gradient Descent, the same as for Linear Regression; the gradient formulas the code relies on are sketched right below
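- For reference, a quick sketch of the standard results the update code implements (stated without derivation): the binary cross entropy loss is L = -mean(y*log(y_hat) + (1-y)*log(1-y_hat)), and combined with the sigmoid its gradients simplify to dL/dw = mean(x*(y_hat - y)) and dL/db = mean(y_hat - y), which is exactly what GradientDescent.forward below computes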
# 2) loss and optimizer
learning_rate = 0.01
# CrossEntropy
class BinaryCrossEntropy():
    def __init__(self):
        super(BinaryCrossEntropy, self).__init__()

    def cross_entropy(self, y_pred, target):
        x = target*np.log(y_pred) + (1-target)*np.log(1-y_pred)
        return -(np.mean(x))

    def forward(self, y_pred, target):
        return self.cross_entropy(y_pred, target)

# GradientDescent
class GradientDescent():
    def __init__(self, lr=0.1):
        super(GradientDescent, self).__init__()
        self.lr = lr

    def forward(self, w, b, y_pred, target, data):
        w = w - self.lr * np.mean(data * (y_pred - target), axis=0)
        b = b - self.lr * np.mean((y_pred - target), axis=0)
        return w, b

criterion = BinaryCrossEntropy()
optimizer = GradientDescent(lr=learning_rate)
Start training
- As before, we follow the pattern of check the error rate -> update the parameters to correct the model, repeating for as many iterations as we like
# 3) training loop
w = np.array([0.0, 0.0])  # start from zero weights and bias (floats)
b = np.array([0.0])
num_epochs = 100
for epoch in range(num_epochs):
    for i, data in enumerate(train_feature):
        # forward pass and loss
        y_pred = model.forward(data, w, b)
        loss = criterion.forward(y_pred, train_target[i])
        # update
        w, b = optimizer.forward(w, b, y_pred, train_target[i], data)
    if (epoch+1) % 10 == 0:
        # note: this prints the loss of the last sample seen in the epoch
        print(f'epoch {epoch + 1}: loss = {loss}')

# checking testing accuracy
y_pred = model.forward(test_feature, w, b)
y_pred_cls = y_pred.round()
acc = np.equal(y_pred_cls, test_target).sum() / float(test_target.shape[0])
print(f'accuracy = {acc:.4f}')
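- (Optional) as a final sanity check on the hand-written version, you can compare against sklearn's built-in logistic regression; a minimal sketch, reusing the variables above (the import is aliased so it doesn't clash with our own LogisticRegression class):

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
clf = SklearnLogisticRegression()
clf.fit(train_feature, train_target.ravel())
# mean test-set accuracy, to compare against the accuracy printed above
print(clf.score(test_feature, test_target))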
Daily recap
- As you can see from the above, writing everything by hand is not that simple, so let's save our appreciation of the power of a Framework for later~
- Tomorrow, let's talk about deep learning~