[Day 7] Naive Bayes: Solving a Real Problem

2023 iThome Ironman Contest

DAY 7

AI & Data

ML From Scratch series, part 7

15th Ironman Contest, machine learning, python

whoami

2023-09-07 11:18:50

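Day 7!

Today we will use a Naive Bayes classifier to complete the Digit Recognizer task.

The task is digit classification: each image is a row of integer pixel intensities, which are discrete features. Either Multinomial Naive Bayes or Bernoulli Naive Bayes (BernoulliNB) can therefore be applied.

Here we use Multinomial Naive Bayes.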
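To make the difference between the two variants concrete, here is a small sketch of how the same pixel row could be presented to each model. The toy pixel values and the threshold of 127 are assumptions for illustration only; the post does not specify a binarization rule.

```python
import numpy as np

# Hypothetical row of four pixel intensities in the 0-255 range
pixels = np.array([0, 12, 200, 255])

# Multinomial NB consumes the non-negative integers as-is, treating them like counts
multinomial_features = pixels

# Bernoulli NB expects binary features, so intensities are thresholded first
# (127 is an assumed cut-off chosen only for this example)
bernoulli_features = (pixels > 127).astype(int)

print(multinomial_features.tolist())  # [0, 12, 200, 255]
print(bernoulli_features.tolist())    # [0, 0, 1, 1]
```

The multinomial variant keeps the intensity information, which is one reason it can be used on this data without extra preprocessing.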

The implementation follows.

Import libraries

import pandas as pd
import numpy as np

Data preprocessing

# Load the training data from Kaggle
train_data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test_data = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')

# Split the data into features (X) and labels (y)
X_train = train_data.drop(columns=['label'])
y_train = train_data['label']

# The test set has no label column, so all of its columns are features
X_test = test_data

Inspecting train.csv shows that its columns are label plus pixel0 through pixel783.

So we separate label from the remaining columns, which makes it easier to feed the data into the model later.
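The split can be tried on a tiny stand-in table before touching the real file. The DataFrame below is made up for illustration (three pixel columns instead of 784, and the names `toy_train`, `toy_X`, `toy_y` are hypothetical), but the two operations mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for train.csv: one label column plus three pixel columns
# (the real file has 784 pixel columns; these values are made up)
toy_train = pd.DataFrame({
    'label':  [3, 7, 3],
    'pixel0': [0, 0, 0],
    'pixel1': [128, 5, 90],
    'pixel2': [255, 0, 17],
})

toy_X = toy_train.drop(columns=['label'])  # features only
toy_y = toy_train['label']                 # target only

print(toy_X.shape)     # (3, 3)
print(toy_y.tolist())  # [3, 7, 3]
```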

Library from scratch

class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter
        self.classes_ = None
        self.class_prior_ = None
        self.feature_log_prob_ = None

    def fit(self, X, y):
        # Work on plain NumPy arrays so DataFrame/Series inputs behave the same
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)

        # Calculate class priors
        self.classes_, class_counts = np.unique(y, return_counts=True)
        self.class_prior_ = class_counts / len(y)

        # Calculate conditional probabilities using Laplace smoothing
        num_classes = len(self.classes_)
        num_features = X.shape[1]
        self.feature_log_prob_ = np.zeros((num_classes, num_features))

        for i, c in enumerate(self.classes_):
            # Total count of each feature over the samples of class c
            term_counts = X[y == c].sum(axis=0)
            # Normalize by the total count of all features in class c
            # (not by the number of samples), so each row sums to 1
            total_count = term_counts.sum()
            self.feature_log_prob_[i] = np.log(
                (term_counts + self.alpha) / (total_count + self.alpha * num_features)
            )
        return self

    def predict(self, X):
        # Calculate the unnormalized log posterior of each class for each sample
        X = np.asarray(X, dtype=np.float64)
        log_probs = X @ self.feature_log_prob_.T + np.log(self.class_prior_)

        # Map the best-scoring index back to its class label
        return self.classes_[np.argmax(log_probs, axis=1)]

This code contains the following methods:

__init__(self, alpha)
fit(self, X, y)
predict(self, X)

`__init__(self, alpha)`

Initializes the parameters of the Multinomial Naive Bayes classifier.

alpha is the Laplace smoothing parameter, which handles the zero-frequency problem in probability estimation.
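A small numeric sketch may help show what the zero-frequency problem looks like and how smoothing removes it. The counts below are hypothetical, and alpha is set to 1.0 to match the default used above.

```python
import numpy as np

# Hypothetical per-feature counts observed for one class
counts = np.array([0, 3, 1])
alpha = 1.0

# Without smoothing, the first feature gets probability 0, and log(0) is -inf
unsmoothed = counts / counts.sum()

# With Laplace smoothing, alpha is added to every count and the denominator
# grows by alpha * number_of_features, so the result still sums to 1
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(unsmoothed.tolist())             # [0.0, 0.75, 0.25]
print(np.round(smoothed, 3).tolist())  # [0.143, 0.571, 0.286]
```

A feature that never appeared in a class now gets a small but nonzero probability, so a single unseen pixel cannot drive the whole class score to negative infinity.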

`fit(self, X, y)`

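This method first computes the class priors, i.e. the fraction of training samples that belong to each class.

Next, it computes the conditional probability of each feature given each class, using Laplace smoothing.

These conditional probabilities describe how likely each feature is to occur when the class is given.

They are stored as logarithms in self.feature_log_prob_.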
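The two quantities computed during fitting can be traced by hand on a tiny dataset. This is a sketch under assumptions: the data and the names `X_toy`, `y_toy`, `feature_prob` are made up, and the conditional probabilities are normalized by the total feature count of each class, which is the standard multinomial estimate.

```python
import numpy as np

# Hypothetical toy data: six samples, two count features, three classes
X_toy = np.array([[2, 0], [0, 3], [1, 1], [0, 0], [0, 2], [4, 1]])
y_toy = np.array([0, 1, 1, 2, 1, 0])
alpha = 1.0

# Class priors: fraction of samples per class
classes, class_counts = np.unique(y_toy, return_counts=True)
class_prior = class_counts / len(y_toy)

# Smoothed conditional probabilities, one row per class
num_features = X_toy.shape[1]
feature_prob = np.zeros((len(classes), num_features))
for i, c in enumerate(classes):
    term_counts = X_toy[y_toy == c].sum(axis=0)
    feature_prob[i] = (term_counts + alpha) / (term_counts.sum() + alpha * num_features)

print(np.round(class_prior, 3).tolist())   # [0.333, 0.5, 0.167]
print(np.round(feature_prob, 3).tolist())  # [[0.778, 0.222], [0.222, 0.778], [0.5, 0.5]]
```

Class 2 has only an all-zero sample, yet smoothing still gives it a valid uniform distribution over the features.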

`predict(self, X)`

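This method predicts labels for new input data.

First, it computes a log probability score for every class and every input sample: the log prior plus the sum of the feature log probabilities weighted by the feature values.

Then, for each sample, it selects the class with the highest log probability as the prediction.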
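The scoring step can be illustrated with hand-picked parameters. Everything in this sketch is hypothetical (two classes, two features, made-up probabilities); it only shows how the matrix product and argmax combine. Working in log space also avoids the numerical underflow that multiplying hundreds of small probabilities would cause.

```python
import numpy as np

# Hypothetical fitted parameters for two classes and two features
class_prior = np.array([0.5, 0.5])
feature_log_prob = np.log(np.array([[0.8, 0.2],
                                    [0.3, 0.7]]))

# Two hypothetical samples of feature counts
X_new = np.array([[3, 1],
                  [0, 4]])

# Score = log prior + sum over features of count * log P(feature | class)
log_scores = X_new @ feature_log_prob.T + np.log(class_prior)
predicted = np.argmax(log_scores, axis=1)

print(predicted.tolist())  # [0, 1]
```

The first sample is dominated by feature 0, which class 0 favors; the second contains only feature 1, which class 1 favors.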

Prediction

model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Create a DataFrame for the submission
submission = pd.DataFrame({'ImageId': range(1, len(y_pred) + 1), 'Label': y_pred})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

The final predictions are stored in the submission DataFrame and written to submission.csv.

The format is two columns, ImageId and Label.
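As a quick illustration of that layout, the sketch below builds the same kind of table from three made-up predictions. The values and the names `toy_pred` and `toy_submission` are hypothetical; only the column names and the 1-based ImageId come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for three test images
toy_pred = np.array([2, 0, 9])

toy_submission = pd.DataFrame({
    'ImageId': range(1, len(toy_pred) + 1),  # ImageId starts at 1, not 0
    'Label': toy_pred,
})

# Print the CSV text instead of writing a file
print(toy_submission.to_csv(index=False))
```

Each row pairs the position of a test image with its predicted digit, and `index=False` keeps the pandas row index out of the file.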