Day 28: Text Classification

2025 iThome Ironman Contest

Software Development

Part 28 of the series "A Software Development Learning Journey"

17th Ironman Contest

Liu Po Yi

2025-09-29 00:14:09

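In NLP, text classification is one of the most common and practical applications. Its goal is to assign a piece of text to one of a set of predefined categories, for example:
Spam detection: decide whether an email is spam
Sentiment analysis: decide whether a review is positive or negative
News categorization: sort news articles into categories such as politics, sports, and technology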

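Today we will try a simple text classification task in Python.
1. The text classification workflow
Data collection: prepare labeled text data
Data cleaning and tokenization: remove noise and split the text into tokens
Feature representation: turn the tokens into numeric features with Bag-of-Words, TF-IDF, or word embeddings
Model training: choose a suitable classifier (Naive Bayes, Logistic Regression, SVM, etc.)
Model evaluation: check accuracy, recall, and F1-score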
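
To make the evaluation step a bit more concrete, here is a minimal sketch of how these metrics can be computed with scikit-learn. The labels in `y_true` and `y_pred` are made up for illustration and are not results from a real model. Precision is included as well, since F1-score is built from precision and recall.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground-truth labels and model predictions for eight reviews.
y_true = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative",
          "positive", "positive", "positive", "negative"]

# Accuracy: fraction of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)

# For precision, recall, and F1 we treat "positive" as the class of interest.
precision = precision_score(y_true, y_pred, pos_label="positive")
recall = recall_score(y_true, y_pred, pos_label="positive")
f1 = f1_score(y_true, y_pred, pos_label="positive")

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In a real project these numbers should come from data the model did not see during training, for example a held-out test set.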

2. Hands-on: sentiment classification with TF-IDF + Naive Bayes
We use scikit-learn to quickly build a classifier, trained on a handful of simple movie review sentences.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample data
texts = [
    "I love this movie, it is amazing!",
    "This film was terrible and boring.",
    "Absolutely fantastic acting and story.",
    "Worst movie I have ever seen.",
    "I really enjoyed this film, great experience.",
    "The plot was dull and predictable."
]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Build the TF-IDF + Naive Bayes model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(texts, labels)

# Try the model on new sentences
test_texts = [
    "The movie was wonderful and inspiring!",
    "I hated this film, it was awful."
]

predictions = model.predict(test_texts)

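for text, label in zip(test_texts, predictions):
    print(f"Text: {text} -> Prediction: {label}")

3. Program output (example)
Text: The movie was wonderful and inspiring! -> Prediction: positive
Text: I hated this film, it was awful. -> Prediction: negative

Note that this output is illustrative. With only six training sentences, the key sentiment words in the test sentences (wonderful, inspiring, hated, awful) never appear in the training data, so the model has to rely on common words such as "the", "was", and "and", and the predictions you get when running the code may differ.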
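
One way to check how much the model really knows about a new sentence is to look at which of its tokens are in the TF-IDF vocabulary. The sketch below is not part of the original walkthrough: it refits a `TfidfVectorizer` on the same six training sentences and uses a hypothetical helper, `split_known_unknown`, to separate tokens seen during training from tokens that were never seen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The same six training sentences used in the walkthrough above.
train_texts = [
    "I love this movie, it is amazing!",
    "This film was terrible and boring.",
    "Absolutely fantastic acting and story.",
    "Worst movie I have ever seen.",
    "I really enjoyed this film, great experience.",
    "The plot was dull and predictable.",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# The analyzer applies the same lowercasing and tokenization as the vectorizer.
analyze = vectorizer.build_analyzer()
vocabulary = vectorizer.vocabulary_


def split_known_unknown(sentence):
    """Hypothetical helper: split tokens into seen-in-training and never-seen."""
    tokens = analyze(sentence)
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


for sentence in ["The movie was wonderful and inspiring!",
                 "I hated this film, it was awful."]:
    known, unknown = split_known_unknown(sentence)
    print(sentence)
    print("  seen in training:", known)
    print("  never seen:", unknown)
```

Tokens that are not in the vocabulary are simply ignored at prediction time, which is why a larger and more varied training set matters so much for a reliable classifier.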

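4. Further applications
Multi-class classification, for example news categorization (politics, entertainment, technology, sports)
Multilingual support, using multilingual word embeddings or Transformer models
Advanced methods, replacing traditional approaches with deep learning models such as BERT and GPT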
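
As a rough idea of what the multi-class case could look like, here is a minimal sketch that reuses the same pipeline approach with four news categories. The headlines are hypothetical and made up for illustration, and `LogisticRegression` is swapped in as the classifier here purely as an assumption. Naive Bayes or an SVM would fit into the pipeline the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical headlines, two per category, made up for illustration.
news_texts = [
    "Parliament passes the new budget bill",
    "The president meets foreign ministers",
    "New film breaks box office records",
    "Pop singer announces world tour",
    "Chip maker unveils faster processor",
    "Startup releases open source AI model",
    "Home team wins the championship final",
    "Star striker scores twice in the derby",
]
news_labels = [
    "politics", "politics",
    "entertainment", "entertainment",
    "technology", "technology",
    "sports", "sports",
]

# Same idea as the sentiment example: TF-IDF features plus a classifier.
news_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
news_model.fit(news_texts, news_labels)

new_headline = ["The team scores in the final minute"]
predicted = news_model.predict(new_headline)[0]
probabilities = news_model.predict_proba(new_headline)[0]

print("predicted category:", predicted)
# One probability per category, in the order given by classes_.
for category, p in zip(news_model.classes_, probabilities):
    print(f"  {category}: {p:.2f}")
```

Nothing in the pipeline has to change when going from two classes to four. The classifier simply learns one score per category, and the data set needs enough examples of each.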

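Today we covered the text classification workflow and a simple implementation.
Tomorrow I'll take you a step further into topic modeling, where we learn how to automatically uncover hidden topics in text!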