[Day29] Building a Spam Classification Model with Naive Bayes: From Training to Prediction

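2024 iThome Ironman Contest, Day 29 (AI/ML & Data)
Series: A Deep Learning Journey: From Theory to Practice, part 29
Tags: 16th Ironman Contest, machine learning, artificial intelligence, naivebayes, ai
Author: arbin (team: NUTC imac)
Published: 2024-10-07 01:33:54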

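Introduction

Yesterday I introduced the background and finished the basic setup. Today we reach the final and most important part: training and prediction. Without further ado, let's go straight to the code.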

Training function

This part trains and evaluates the spam classification model. Everything is wrapped in a single function, train_model.

# Full code
def train_model():
    messages = pd.read_csv('spam.csv', encoding='ISO-8859-1')
    messages = messages[['v1', 'v2']].rename(columns={'v1': 'label', 'v2': 'message'})
    
    corpus = [preprocess_text(message) for message in messages['message']]
    
    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(corpus).toarray()
    
    y = pd.get_dummies(messages['label'], drop_first=True).values.ravel()
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    with open(MODEL_FILE, 'wb') as f:
        pickle.dump(model, f)
    with open(VECTORIZER_FILE, 'wb') as f:
        pickle.dump(vectorizer, f)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred, average='binary')
    print(f"Model training completed, accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {fscore:.3f}")

    from sklearn.metrics import roc_curve, auc

    y_prob = model.predict_proba(X_test)[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

1. Load the data

messages = pd.read_csv('spam.csv', encoding='ISO-8859-1')
messages = messages[['v1', 'v2']].rename(columns={'v1': 'label', 'v2': 'message'})

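pd.read_csv loads the CSV file spam.csv. The two columns used here are v1, which holds each message's label (ham or spam), and v2, which holds the message text.
Selecting messages[['v1', 'v2']] keeps only those two columns, and rename gives them the more descriptive names label and message.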
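
To make the select-and-rename step concrete, here is a minimal sketch that runs without the real dataset. The in-memory CSV and its extra column are made up for illustration; only the v1/v2 column names come from the article.

```python
import io
import pandas as pd

# Hypothetical stand-in for spam.csv: the two useful columns plus an extra one.
raw = io.StringIO(
    "v1,v2,extra\n"
    "ham,See you at lunch,\n"
    "spam,WIN a FREE prize now!!!,\n"
)

messages = pd.read_csv(raw)
# Keep only the label and text columns, then give them descriptive names.
messages = messages[["v1", "v2"]].rename(columns={"v1": "label", "v2": "message"})

print(messages.columns.tolist())
print(messages["label"].tolist())
```

Any column other than v1 and v2 is simply dropped by the selection, so the rest of the pipeline only ever sees label and message.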

2. Preprocess the messages

corpus = [preprocess_text(message) for message in messages['message']]

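preprocess_text cleans the text of each message: it removes punctuation, applies stemming (reducing a word to its stem, for example turning a plural into its singular form), and drops common stopwords.
The cleaned messages are collected in corpus, ready for vectorization.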
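
preprocess_text itself is not shown in this post, so the following is only a simplified stand-in for the kind of cleaning described above, not the author's implementation. The stopword set, the crude_stem helper, and the name preprocess_text_sketch are all hypothetical; the suffix rule is a rough substitute for a real stemmer.

```python
import string

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "at", "for", "and"}

def crude_stem(word):
    # Very rough stand-in for a real stemmer: strip a couple of common suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess_text_sketch(text):
    # Lowercase, remove punctuation, drop stopwords, then stem each token.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(crude_stem(t) for t in tokens)

print(preprocess_text_sketch("WIN free prizes now!!!"))
print(preprocess_text_sketch("The meetings are at noon."))
```

The output is a single space-separated string per message, which is the form TfidfVectorizer expects for each document.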

3. Vectorize the features

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus).toarray()

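TfidfVectorizer converts the preprocessed text into numeric features using TF-IDF (term frequency-inverse document frequency), which weights each term by how important it is to a message relative to the rest of the corpus.
max_features=5000 keeps only the 5000 terms with the highest frequency across the corpus when building the feature matrix. The result is converted to a dense array with .toarray() and stored in X.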
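
The effect of max_features is easier to see on a tiny corpus. This is a sketch with made-up sentences and max_features=3 instead of 5000; only the three most frequent terms survive as columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "free" appears 3 times, "meeting" and "tomorrow" twice, the rest once.
corpus = [
    "free prize free cash",
    "free meeting tomorrow",
    "lunch meeting tomorrow",
]

vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus).toarray()

print(sorted(vectorizer.vocabulary_))  # the terms that were kept
print(X.shape)                         # one row per message, one column per kept term
print(X.round(2))
```

Words that did not make the cut ("prize", "cash", "lunch") contribute nothing to X, which is the trade-off max_features makes to keep the matrix small.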

4. Encode the labels

y = pd.get_dummies(messages['label'], drop_first=True).values.ravel()

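get_dummies converts the ham and spam labels into a binary indicator (ham=0, spam=1). drop_first=True drops the redundant ham column so that only the spam column remains, and .values.ravel() flattens the result into a one-dimensional array stored in y. Depending on the pandas version, the values may be booleans (False/True) rather than 0/1 integers.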
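
A small sketch of the encoding on four made-up labels. The explicit astype(int) is an addition that is not in the article's code; it is one way to get 0/1 integers whichever dtype get_dummies produces.

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham", "spam"])

# Columns are created in sorted order (ham, spam); drop_first removes "ham".
dummies = pd.get_dummies(labels, drop_first=True)
print(dummies.columns.tolist())

# Cast explicitly so y holds 0/1 regardless of the pandas default dtype.
y = dummies.values.ravel().astype(int)
print(y.tolist())
```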

5. Split into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_test_split splits the data into a training set (80%) and a test set (20%). Setting random_state=42 makes the split identical on every run.
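
The 80/20 split and the effect of random_state can be checked on a toy array. The data below is made up; the call itself mirrors the one in train_model.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)                 # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Repeating the call with the same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```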

6. Train the model

model = MultinomialNB()
model.fit(X_train, y_train)

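The model is a Multinomial Naive Bayes classifier. This algorithm is a good fit for text classification, especially when the features are word counts or TF-IDF values.
As for what a Bayes classifier actually is, I will save that explanation for tomorrow.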
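
Before the full explanation, a toy example may help show what fit and predict do. The two "word count" columns and the four training rows are invented for illustration; the model is used with its default settings, as in the article.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the words "free" and "meeting" (hypothetical features).
X_train = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
model.fit(X_train, y_train)

# A new message that contains "free" once and "meeting" not at all.
new_message = np.array([[1, 0]])
print(model.predict(new_message))
print(model.predict_proba(new_message).round(3))  # columns follow model.classes_
```

With smoothing, "free" gets probability 6/7 under spam and 1/7 under ham, and the two classes are equally frequent, so the spam probability for this message works out to 6/7.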

7. Save the model and vectorizer

with open(MODEL_FILE, 'wb') as f:
    pickle.dump(model, f)
with open(VECTORIZER_FILE, 'wb') as f:
    pickle.dump(vectorizer, f)

The trained model and the vectorizer are saved to separate files, so they can be loaded directly later without retraining.
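
The save-then-load round trip can be sketched with only the standard library. The dictionary below is a stand-in for the trained model, and the file name model.pkl is hypothetical (the article's actual MODEL_FILE value is not shown here).

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
fake_model = {"classes": [0, 1], "alpha": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # 'wb' / 'rb': pickle files are binary, so open them in binary mode.
    with open(model_path, "wb") as f:
        pickle.dump(fake_model, f)

    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

One caveat worth keeping in mind: pickle.load can execute arbitrary code, so it should only be used on files you created or otherwise trust.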

8. Evaluate the model

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred, average='binary')
print(f"Model training completed, accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1-Score: {fscore:.3f}")

model.predict runs the trained model on the test set.
accuracy_score and precision_recall_fscore_support then compute the accuracy, precision, recall, and F1 score, which are printed.
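
To see how the four numbers relate, here is a sketch on eight made-up predictions with 2 true positives, 1 false positive, 1 false negative, and 4 true negatives.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_test = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_test, y_pred)  # (TP + TN) / total = 6 / 8
precision, recall, fscore, support = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)

print(f"accuracy={accuracy:.3f}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={fscore:.3f}")
print(support)  # with average='binary' the support value is None
```

Precision is TP / (TP + FP) and recall is TP / (TP + FN), so both are 2/3 here. On a spam dataset where most messages are ham, these two tell you more than accuracy alone.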

9. Plot the ROC curve

from sklearn.metrics import roc_curve, auc

y_prob = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

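The ROC (Receiver Operating Characteristic) curve is a tool for evaluating a classifier: it plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.
predict_proba returns the predicted probability of each class, and column 1 is the probability of spam. roc_curve uses those probabilities to compute FPR and TPR, and auc computes the area under the ROC curve (AUC), an overall measure of how well the model separates the two classes.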
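
The numbers behind the plot can be inspected without drawing anything. This sketch uses four made-up scores in place of the model's predict_proba output and prints each point on the curve.

```python
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # stand-in for predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Each (fpr, tpr) pair is one point on the ROC curve.
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC={roc_auc:.2f}")
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and 1.0 to a perfect ranking of spam above ham.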

Prediction function

# Full code
def predict_spam(input_message):
    if not os.path.exists(MODEL_FILE) or not os.path.exists(VECTORIZER_FILE):
        print("Model not found, training in progress...")
        train_model()
    
    with open(MODEL_FILE, 'rb') as f:
        model = pickle.load(f)
    with open(VECTORIZER_FILE, 'rb') as f:
        vectorizer = pickle.load(f)
    
    preprocessed_message = preprocess_text(input_message)
    
    input_features = vectorizer.transform([preprocessed_message]).toarray()
    
    prediction = model.predict(input_features)
    
    if prediction[0] == 1:
        return "This is a spam email"
    else:
        return "This is a normal email"

1. Check whether the model and vectorizer files exist

if not os.path.exists(MODEL_FILE) or not os.path.exists(VECTORIZER_FILE):
    print("Model not found, training in progress...")
    train_model()

os.path.exists(MODEL_FILE) and os.path.exists(VECTORIZER_FILE) check whether the saved model and vectorizer files exist.
If either file is missing, train_model() is called to train the model and save both files.
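
The "train only if something is missing" pattern can be tried in isolation. In this sketch train_stub is a hypothetical stand-in for train_model() that just writes two placeholder files, and ensure_trained and the file names are illustrative, not part of the article's code.

```python
import os
import pickle
import tempfile

def train_stub(model_path, vectorizer_path):
    # Hypothetical stand-in for train_model(): writes two placeholder files.
    with open(model_path, "wb") as f:
        pickle.dump({"kind": "model"}, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump({"kind": "vectorizer"}, f)

def ensure_trained(model_path, vectorizer_path):
    # Returns True if training had to run, False if both files already existed.
    if not os.path.exists(model_path) or not os.path.exists(vectorizer_path):
        train_stub(model_path, vectorizer_path)
        return True
    return False

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    first_call = ensure_trained(model_path, vectorizer_path)
    second_call = ensure_trained(model_path, vectorizer_path)

print(first_call, second_call)
```

The first call trains because nothing exists yet; the second finds both files and skips training. Using `or` matters: a model without its matching vectorizer is not usable, so a single missing file triggers a full retrain.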

2. Load the model and vectorizer

with open(MODEL_FILE, 'rb') as f:
    model = pickle.load(f)
with open(VECTORIZER_FILE, 'rb') as f:
    vectorizer = pickle.load(f)

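pickle.load restores the trained model from MODEL_FILE and the vectorizer from VECTORIZER_FILE.
pickle is Python's module for serializing objects (writing them to a file) and deserializing them (restoring them from the file).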

3. Preprocess the input message

preprocessed_message = preprocess_text(input_message)

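The previously defined preprocess_text function preprocesses the user's message input_message. It has to be the same function used during training, so that the message matches the text the vectorizer was fitted on. Preprocessing may include:

Removing punctuation.
Converting case.
Stemming and removing stopwords.

The preprocessed message is simpler and more uniform, which makes it easier for the model to classify.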

4. Convert the message into a feature vector

input_features = vectorizer.transform([preprocessed_message]).toarray()

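The fitted TfidfVectorizer converts the preprocessed message into a feature vector.
vectorizer.transform() produces the TF-IDF vector as a sparse matrix, and .toarray() converts it to a dense NumPy array.
The message is wrapped in a one-element list ([preprocessed_message]) because transform() expects an iterable of documents, not a single string.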
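
A short sketch of why the list wrapper matters. The two-sentence training corpus is made up; the point is the shape of the output and what happens when a bare string is passed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["free prize now", "meeting tomorrow"])  # vocabulary of 5 words

# One message in a list gives one row; "lottery" is not in the vocabulary
# and is ignored.
features = vectorizer.transform(["free lottery now"]).toarray()
print(features.shape)
print(features.round(2))

# Passing the bare string instead of a list is rejected.
try:
    vectorizer.transform("free lottery now")
    raised = False
except ValueError:
    raised = True
print(raised)
```

Note that transform, not fit_transform, is used at prediction time: the vocabulary and IDF weights stay exactly as they were learned during training.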

5. Predict with the model

prediction = model.predict(input_features)

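The trained MultinomialNB model classifies the feature vector. predict returns an array with one label per input row, so for a single message the result is read from prediction[0]:
1 means spam.
0 means a normal message (ham).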

6. Return the prediction

if prediction[0] == 1:
    return "This is a spam email"
else:
    return "This is a normal email"

If prediction[0] is 1, the message is spam and the function returns "This is a spam email".
If it is 0, the message is normal and the function returns "This is a normal email".
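
Putting the prediction steps together, here is a self-contained sketch that can be run without spam.csv or saved files. The four training sentences and the classify helper are made up for illustration, preprocess_text is skipped, and the sparse matrix is passed straight to the model (MultinomialNB accepts sparse input) instead of calling .toarray().

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training set: 1 = spam, 0 = ham.
train_texts = [
    "win free prize now",
    "free cash prize claim now",
    "are we meeting for lunch",
    "see you at the meeting tomorrow",
]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

def classify(message):
    # Same flow as predict_spam: vectorize one message, predict, map to text.
    features = vectorizer.transform([message])
    prediction = model.predict(features)
    if prediction[0] == 1:
        return "This is a spam email"
    return "This is a normal email"

print(classify("free prize"))
print(classify("meeting tomorrow"))
```

With so little data this only demonstrates the flow; the real model's behavior depends on the full dataset and on the preprocessing step.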

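Conclusion

Today's post is getting long, so I will leave the rest for tomorrow. Tomorrow I will cover the results, the testing, some extra notes, and a few side stories. I hope you will come back for my final article. See you tomorrow!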
