In this article, we implement a simple multilayer perceptron (MLP) and train it on the MNIST dataset. MNIST consists of images of handwritten digits, which we treat as a multi-class classification problem. The implementation relies only on numpy for the numerical work and deliberately avoids deep learning frameworks such as TensorFlow or PyTorch.
We build an MLP with a single hidden layer and update its parameters via backpropagation. The process breaks down into the following steps:
Initializing the weights
We start by randomly initializing the weight matrices connecting the input, hidden, and output layers. The width of the hidden layer is set by the hidden_size parameter.
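In the implementation below, each weight matrix is drawn from a standard normal distribution and scaled by the inverse square root of its fan-in, while the biases start at zero:
$$
W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}}}\right), \qquad b = 0
$$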
Activation functions
We use the sigmoid function as the hidden-layer activation, defined as:
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$
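Backpropagation will also need the derivative of the sigmoid, which has a convenient closed form in terms of the function itself; this is exactly what the sigmoid_derivative helper in the code computes:
$$
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
$$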
The output layer uses the softmax function so that the outputs form a valid probability distribution over the classes:
$$
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$
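Because $e^{z_i}$ can overflow for large logits, the implementation subtracts the per-row maximum before exponentiating. This leaves the result unchanged, since softmax is invariant to adding a constant to every logit:
$$
\text{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}
$$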
Forward propagation
We first compute the hidden layer's pre-activation and activation values, and then the output-layer activation, which is the prediction (the full equations are written out below).
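Written out, the forward pass of this one-hidden-layer network is:
$$
z_1 = X W_1 + b_1, \qquad a_1 = \sigma(z_1), \qquad z_2 = a_1 W_2 + b_2, \qquad \hat{y} = \text{softmax}(z_2)
$$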
Loss function
We use the cross-entropy loss, which measures the discrepancy between the predicted probabilities and the true labels:
$$
\text{Loss} = - \frac{1}{n} \sum_{i=1}^{n} \sum_{k} y_{ik} \log(\hat{y}_{ik})
$$
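Since the labels are one-hot encoded, the inner sum over classes keeps only the predicted probability of the true class $c_i$ of each sample, so the loss reduces to
$$
\text{Loss} = - \frac{1}{n} \sum_{i=1}^{n} \log(\hat{y}_{i,\,c_i})
$$
which is exactly what compute_loss evaluates by indexing y_pred at np.argmax(y_true, axis=1) and averaging the negative logs.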
Backpropagation
During backpropagation we apply the chain rule to compute the derivative of the loss with respect to each layer's weights and biases, and then update the weight matrices; the gradients used by the code are derived below.
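For softmax outputs combined with the cross-entropy loss, the per-sample gradient with respect to the output pre-activation takes a particularly simple form, $\partial L / \partial z_2 = \hat{y} - y$. Averaging over the $n$ samples in the batch and applying the chain rule gives the quantities computed as dz2, dW2, db2, dz1, dW1, and db1 in backward_pass:
$$
\frac{\partial L}{\partial W_2} = \frac{1}{n}\, a_1^{\top} (\hat{y} - y), \qquad
\frac{\partial L}{\partial b_2} = \frac{1}{n} \sum_{i} (\hat{y} - y)_i
$$
$$
\frac{\partial L}{\partial z_1} = \bigl[(\hat{y} - y)\, W_2^{\top}\bigr] \odot \sigma'(z_1), \qquad
\frac{\partial L}{\partial W_1} = \frac{1}{n}\, X^{\top} \frac{\partial L}{\partial z_1}, \qquad
\frac{\partial L}{\partial b_1} = \frac{1}{n} \sum_{i} \left(\frac{\partial L}{\partial z_1}\right)_i
$$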
Training the model
Training runs for a set number of iterations (epochs); in each epoch we compute the current loss value and then update the model parameters.
In Chapter 11 we explored multilayer artificial neural networks (MLPs) in detail and implemented the backpropagation algorithm step by step. The key points are:
Forward and backward propagation: data is passed through the network layer by layer to compute the output, and each node applies an activation function to produce a non-linear mapping. Backpropagation then computes the gradient of the loss function, attributes the error to each neuron, and adjusts the weights accordingly.
The MNIST dataset: MNIST contains 60,000 training images and 10,000 test images, each 28x28 pixels. We flatten every image into a 784-dimensional vector for training.
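The listing below imports fetch_openml, train_test_split, and OneHotEncoder but does not show the data preparation itself. The following is a minimal sketch of how the data could be loaded and prepared, assuming the mnist_784 dataset on OpenML, rescaling of the pixel values to [0, 1], and a 10,000-image test split; note that scikit-learn versions before 1.2 use sparse=False instead of sparse_output=False in OneHotEncoder.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Download MNIST: 70,000 images, each already flattened to 784 pixel values.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X = mnist.data.astype(np.float64) / 255.0      # scale pixels to [0, 1]
y = mnist.target.astype(int).reshape(-1, 1)

# One-hot encode the digit labels (10 classes).
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(y)

# Hold out 10,000 images as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=10000, random_state=42)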
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def initialize_weights(input_size, hidden_size, output_size):
    # Scale the random weights by 1/sqrt(fan-in); biases start at zero.
    W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
    b1 = np.zeros((1, hidden_size))
    W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)
    b2 = np.zeros((1, output_size))
    return W1, b1, W2, b2


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))


def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # For numerical stability
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)


def forward_pass(X, W1, b1, W2, b2):
    # Hidden layer: affine transform followed by sigmoid.
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)
    # Output layer: affine transform followed by softmax.
    z2 = np.dot(a1, W2) + b2
    a2 = softmax(z2)
    return z1, a1, z2, a2


def compute_loss(y_true, y_pred):
    # Average cross-entropy: pick the predicted probability of the true class.
    n_samples = y_true.shape[0]
    logp = -np.log(y_pred[range(n_samples), np.argmax(y_true, axis=1)])
    loss = np.sum(logp) / n_samples
    return loss


def backward_pass(X, y_true, z1, a1, z2, a2, W1, W2):
    n_samples = X.shape[0]
    # Gradient of cross-entropy w.r.t. the output pre-activation: y_hat - y.
    dz2 = a2 - y_true
    dW2 = np.dot(a1.T, dz2) / n_samples
    db2 = np.sum(dz2, axis=0, keepdims=True) / n_samples
    # Propagate the error back through W2 and the sigmoid non-linearity.
    dz1 = np.dot(dz2, W2.T) * sigmoid_derivative(z1)
    dW1 = np.dot(X.T, dz1) / n_samples
    db1 = np.sum(dz1, axis=0, keepdims=True) / n_samples
    return dW1, db1, dW2, db2


def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate):
    # Plain gradient descent step on every parameter.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    return W1, b1, W2, b2


def train(X_train, y_train, hidden_size=64, epochs=10, learning_rate=0.1):
    input_size = X_train.shape[1]
    output_size = y_train.shape[1]
    W1, b1, W2, b2 = initialize_weights(input_size, hidden_size, output_size)
    for epoch in range(epochs):
        # Full-batch forward pass, loss, gradients, and parameter update.
        z1, a1, z2, a2 = forward_pass(X_train, W1, b1, W2, b2)
        loss = compute_loss(y_train, a2)
        dW1, db1, dW2, db2 = backward_pass(X_train, y_train, z1, a1, z2, a2, W1, W2)
        W1, b1, W2, b2 = update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate)
        print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss:.4f}')
    return W1, b1, W2, b2
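With the functions above in place, training and evaluating the network could look like the following sketch. The predict helper and the use of X_train, y_train, X_test, y_test from the data-preparation step are illustrative additions, not part of the original listing.

def predict(X, W1, b1, W2, b2):
    # Run a forward pass and take the most probable class for each row.
    _, _, _, a2 = forward_pass(X, W1, b1, W2, b2)
    return np.argmax(a2, axis=1)

# Train on the prepared MNIST training set (hyperparameters are illustrative).
W1, b1, W2, b2 = train(X_train, y_train, hidden_size=64, epochs=10, learning_rate=0.1)

# Accuracy on the held-out test set.
test_pred = predict(X_test, W1, b1, W2, b2)
test_acc = np.mean(test_pred == np.argmax(y_test, axis=1))
print(f'Test accuracy: {test_acc:.4f}')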