[Day 6] Naive Bayes: Hands-On Implementation

2023 iThome Ironman Contest

DAY 6

AI & Data

ML From Scratch series, part 6

15th Ironman Contest, machine learning, python

whoami

2023-09-06 10:57:32

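Last time we saw that Naive Bayes applies Bayes' theorem to classification problems in machine learning.

Below, a simple email classification task shows how a Naive Bayes classifier is implemented.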

Implementation

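sklearn currently provides, among others, the following Naive Bayes variants:

Multinomial Naive Bayes
Assumes that the features of each class follow a multinomial distribution. It is suited to classification with discrete features such as word counts.
Gaussian Naive Bayes
Assumes that the likelihood of each feature given the class is Gaussian. It is suited to continuous features.
Bernoulli Naive Bayes (BernoulliNB)
Like Multinomial Naive Bayes, it works on discrete data, but it is designed for binary (boolean) features, such as whether a word occurs at all, rather than for counts.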
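
As a rough illustration of the list above, each variant can be fit on toy data of the matching kind. The arrays below are invented for this sketch and have nothing to do with the article's dataset; only the three class names come from sklearn.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy data, invented for illustration: 4 samples, 2 classes.
labels = np.array([1, 0, 1, 0])
counts = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 0], [0, 3, 1]])  # word counts
measurements = np.array([[5.1, 0.2], [1.0, 3.3], [4.8, 0.4], [1.2, 3.0]])  # real values

# Discrete counts: MultinomialNB
multinomial_pred = MultinomialNB().fit(counts, labels).predict(counts)

# Presence/absence features: BernoulliNB (binarize=0.0 maps any count > 0 to 1)
bernoulli_pred = BernoulliNB(binarize=0.0).fit(counts, labels).predict(counts)

# Continuous features: GaussianNB
gaussian_pred = GaussianNB().fit(measurements, labels).predict(measurements)

print(multinomial_pred, bernoulli_pred, gaussian_pred)
```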

Prepare data

This example uses the Spam email Dataset.

Preprocessing

Here CountVectorizer converts the text into a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary and build a sparse document-term count matrix
vectorizer = CountVectorizer()
spamham_countVectorizer = vectorizer.fit_transform(spam_dataframe['text'])
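
The later snippets use X_train, X_test, y_train, and y_test, which the article does not define. One plausible way to obtain them is sklearn's train_test_split. In the sketch below, the toy DataFrame, the label column name `spam`, the 25% test size, and the random seed are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset; the column name 'spam' is assumed.
spam_dataframe = pd.DataFrame({
    "text": [
        "win money now",
        "meeting at noon",
        "cheap money offer",
        "lunch at noon tomorrow",
    ],
    "spam": [1, 0, 1, 0],
})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(spam_dataframe["text"])  # sparse matrix, one row per email
y = spam_dataframe["spam"].values

# Hold out a quarter of the emails for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X.shape, X_train.shape, X_test.shape)
```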

Using Naive Bayes Classifier

Because the features are discrete counts, MultinomialNB is used here.

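from sklearn.naive_bayes import MultinomialNB

# Fit the classifier on the training split of the count matrix
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)

# Predict labels for the held-out test split
y_predict_test = NB_classifier.predict(X_test)
y_predict_test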

Evaluation

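classification_report summarizes how well the predictions match the true labels: per-class precision, recall, and F1-score, plus the overall accuracy.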

from sklearn.metrics import classification_report

print(classification_report(y_test, y_predict_test))
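
To make the columns of the report concrete, precision, recall, and F1 for one class can be computed by hand from the confusion counts. The label arrays below are invented for the example, and class 1 is treated as the positive (spam) class by assumption.

```python
import numpy as np

# Invented labels: 1 = spam, 0 = ham
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # spam correctly flagged
fp = np.sum((y_pred == 1) & (y_true == 0))  # ham wrongly flagged as spam
fn = np.sum((y_pred == 0) & (y_true == 1))  # spam that was missed

precision = tp / (tp + fp)  # of everything flagged as spam, how much really was
recall = tp / (tp + fn)     # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.mean(y_pred == y_true)

print(precision, recall, f1, accuracy)  # 0.75 0.75 0.75 0.75
```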

From Scratch

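The from-scratch part uses only numpy, plus Python's built-in re module for text cleanup.

Preprocessing has to produce a token count matrix and a vocabulary list, so we need to implement the following functions ourselves.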

preprocess_text

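This function preprocesses the text: it converts the text to lowercase, removes punctuation, and collapses repeated whitespace. The version shown here does not remove stop words.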

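import re

def preprocess_text(text):
    text = text.lower()
    # Drop every character that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Split on whitespace and rejoin to collapse repeated spaces
    tokens = text.split()
    return ' '.join(tokens)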
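
The description of preprocess_text mentions stop words, but the function itself only lowercases the text and strips punctuation. A possible extension that also filters stop words is sketched below. The function name preprocess_text_no_stopwords and the tiny STOP_WORDS set are hypothetical and chosen only for illustration; a real stop list would be much longer.

```python
import re

# Tiny illustrative stop list (an assumption, not from the article)
STOP_WORDS = {"a", "an", "the", "is", "to", "and"}

def preprocess_text_no_stopwords(text, stop_words=STOP_WORDS):
    text = text.lower()
    # Remove punctuation: anything that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Keep only the tokens that are not in the stop list
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)

cleaned = preprocess_text_no_stopwords("WIN a FREE prize!!!  Click here, now.")
print(cleaned)  # win free prize click here now
```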

build_vocabulary

This function builds the vocabulary from the texts. It returns the vocabulary as a list of unique tokens.

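def build_vocabulary(texts):
    # Collect every distinct whitespace-separated token across all texts
    vocabulary = set()
    for text in texts:
        tokens = text.split()
        vocabulary.update(tokens)
    # Sort so that the word-to-column order is the same on every run
    return sorted(vocabulary)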

create_bow

This function converts the texts into bag-of-words vectors. It returns one count vector per text, with one entry per vocabulary word.

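import numpy as np

def create_bow(texts, vocabulary):
    bow_matrix = []
    for text in texts:
        tokens = text.split()
        # One count per vocabulary word, in vocabulary order
        bow_vector = [tokens.count(word) for word in vocabulary]
        bow_matrix.append(bow_vector)
    # Return an array so that rows can be selected by class with NumPy indexing
    return np.array(bow_matrix)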
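
The two helpers can be tried together on a few toy sentences. The sketch below restates them in compact form so that it runs on its own. The sentences are invented, and the vocabulary is sorted here so that the column order is reproducible.

```python
import numpy as np

def build_vocabulary(texts):
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.split())
    return sorted(vocabulary)  # sorted for a stable column order

def create_bow(texts, vocabulary):
    # One row per text, one column per vocabulary word
    return np.array(
        [[text.split().count(word) for word in vocabulary] for text in texts]
    )

texts = ["win money now", "money money offer", "meeting at noon"]
vocabulary = build_vocabulary(texts)
bow = create_bow(texts, vocabulary)

print(vocabulary)  # ['at', 'meeting', 'money', 'noon', 'now', 'offer', 'win']
print(bow)
```

Each row of bow is one text, and the value in a column is how many times that vocabulary word occurs in it, so the second row has a 2 in the money column.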

MultinomialNB

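import numpy as np

class CustomMultinomialNB:
    def __init__(self, alpha=1):
        # alpha is the additive (Laplace) smoothing strength
        self.alpha = alpha

    def fit(self, X, y):
        # Accept lists as well as arrays; X must be a dense count matrix
        X = np.asarray(X)
        y = np.asarray(y)
        self.classes = np.unique(y)
        self.parameters = {}
        for c in self.classes:
            X_c = X[y == c]
            # Class prior: fraction of training samples that belong to class c
            self.parameters["phi_" + str(c)] = len(X_c) / len(X)
            # Smoothed word likelihoods: (count + alpha) / (total count + alpha * vocabulary size)
            word_counts = X_c.sum(axis=0) + self.alpha
            self.parameters["theta_" + str(c)] = word_counts / word_counts.sum()
        return self

    def predict(self, X):
        predictions = []
        for x in np.asarray(X):
            scores = []
            for c in self.classes:
                # Work in log space so that many small probabilities do not underflow
                log_prior = np.log(self.parameters["phi_" + str(c)])
                log_likelihood = np.sum(np.log(self.parameters["theta_" + str(c)]) * x)
                scores.append(log_prior + log_likelihood)
            # Pick the class with the highest log posterior score
            predictions.append(self.classes[np.argmax(scores)])
        return predictions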
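
A quick way to check the class is to run it on a count matrix small enough to verify by hand. The sketch below restates the class in compact form so that it runs on its own. The four-word vocabulary and the counts are invented for the example.

```python
import numpy as np

class CustomMultinomialNB:
    def __init__(self, alpha=1):
        self.alpha = alpha

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(y)
        self.parameters = {}
        for c in self.classes:
            X_c = X[y == c]
            self.parameters["phi_" + str(c)] = len(X_c) / len(X)
            word_counts = X_c.sum(axis=0) + self.alpha
            self.parameters["theta_" + str(c)] = word_counts / word_counts.sum()
        return self

    def predict(self, X):
        predictions = []
        for x in np.asarray(X):
            scores = [
                np.log(self.parameters["phi_" + str(c)])
                + np.sum(np.log(self.parameters["theta_" + str(c)]) * x)
                for c in self.classes
            ]
            predictions.append(self.classes[np.argmax(scores)])
        return predictions

# Invented counts; columns are the words: money, noon, offer, meeting
X_train = np.array([[2, 0, 1, 0], [1, 0, 2, 0], [0, 1, 0, 2], [0, 2, 0, 1]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = CustomMultinomialNB(alpha=1).fit(X_train, y_train)
predictions = model.predict(np.array([[1, 0, 1, 0], [0, 1, 0, 1]]))

print(model.parameters["theta_1"])  # [0.4 0.1 0.4 0.1]
print(predictions)
```

For the spam class the raw word counts are [3, 0, 3, 0]. Adding alpha = 1 to each gives [4, 1, 4, 1], and dividing by their sum of 10 gives the likelihoods printed above. The first new email (money, offer) is therefore classified as spam and the second (noon, meeting) as ham.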

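Training phase

fit trains the Naive Bayes classifier and stores the estimated probabilities in the self.parameters dictionary: the class priors under the phi_ keys and the per-word likelihoods under the theta_ keys.

Mathematically, the probabilities are computed as follows: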

$P(y) = \frac{\text{count}(y)}{\text{count}(Y)}$

$P(x_i \mid y) = \frac{\text{count}(x_i, y)}{\text{count}(y)}$
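
The fit method does not use the raw ratio for the word likelihood. It adds alpha to every word count before normalizing, so a word that never appears in a class still gets a small nonzero probability. Written out, and reading count(x_i, y) as the total number of occurrences of word i in the training texts of class y, the stored value is roughly the following. The symbols theta_{y,i} and n (the vocabulary size) are introduced here for the sketch.

```latex
\theta_{y,i} = \frac{\text{count}(x_i, y) + \alpha}{\sum_{j=1}^{n} \text{count}(x_j, y) + \alpha n}
```

For example, with alpha = 1, a vocabulary of n = 4 words, a word that occurs 3 times in class y, and 6 word occurrences in that class overall, the estimate is (3 + 1) / (6 + 4) = 0.4.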

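Prediction phase

predict assigns a class to each sample and returns a list of predictions.

Mathematically, the probabilities are computed as follows: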

$P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y) \times P(x_1 \mid y) \times P(x_2 \mid y) \times \ldots \times P(x_n \mid y)}{P(x_1, x_2, \ldots, x_n)}$

The denominator is the same for every class, so the prediction is simply the class y that maximizes the numerator.
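
The predict method compares sums of logarithms instead of evaluating the product in the formula directly. A small numeric sketch suggests why: multiplying many small probabilities underflows to zero in floating point, while the sum of their logs stays finite and can still be compared across classes. The probability value and the number of words below are arbitrary.

```python
import numpy as np

# 1000 word likelihoods of 1e-4 each (arbitrary values for illustration)
probs = np.full(1000, 1e-4)

product = np.prod(probs)         # true value is 1e-4000, far below float64 range
log_sum = np.sum(np.log(probs))  # 1000 * log(1e-4), comfortably representable

print(product)  # 0.0
print(log_sum)
```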