I was about ready to quit writing this, because Kaggle Notebook is genuinely painful to use.
The CPU is dead slow, and the pile of submission limits put me in a foul mood.
This isn't training a model; it's training patience.
Colab absolutely flattens Kaggle Notebook.
😡😡😡😡😡 😡😡😡😡😡
Yesterday we finished the Word2Vec (Google News pretrained, 300-dim) + XGBoost baseline.
It reached about 0.736 Accuracy on the local validation set, but dropped to 0.535 on the Kaggle leaderboard.
In other words, the model is strong while "sparring at home" but loses badly at the "tournament".
Word2Vec + mean pooling is just too simple; the model isn't smart enough yet.
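To see why, here is a toy sketch (with made-up 2-dim vectors, invented purely for illustration, not real embeddings): mean pooling throws away word order entirely, so two sentences with opposite meanings can land on the exact same sentence vector.
import numpy as np
# Toy 2-dim "embeddings" (invented for illustration only)
toy = {"dog": np.array([1.0, 0.0]),
       "bites": np.array([0.0, 1.0]),
       "man": np.array([1.0, 1.0])}
def mean_pool(sentence):
    return np.mean([toy[w] for w in sentence.split()], axis=0)
print(mean_pool("dog bites man"))  # ≈ [0.667, 0.667]
print(mean_pool("man bites dog"))  # identical vector -> word order is lost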
In the NLP world, word embeddings are the first step in turning text into numbers, and different embedding models actually embody different philosophies:
Word2Vec: learns vectors from local context windows, i.e. from which words tend to appear next to each other.
GloVe (Global Vectors): built from global word co-occurrence statistics over the whole corpus.
FastText: additionally breaks each word into character n-grams, e.g. "apple" → "app", "ppl", "ple". (In English, some roots and affixes carry meaning, so taking them into account lets rarer words be inferred from their parts.) In other words:
Word2Vec is "street experience", GloVe is "big-data analysis", and FastText is a "DNA analyst".
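As a quick illustration of the FastText idea, a minimal sketch (not the actual library code; real FastText uses n-grams of lengths 3-6 plus "<" and ">" word-boundary markers):
# Character trigrams of a word, FastText-style, with boundary markers
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']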
To compare Word2Vec fairly against the other two, I keep the pipeline identical: convert the body column into sentence vectors the same way every time. On the implementation side, 4.1-4.3 are written as three separate versions as a control. This matters because submission.csv has a fixed format, so each embedding has to be run separately.
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
# Load Word2Vec (Google News, 300 dimensions)
w2v_path = "/kaggle/input/googlenews-vectors-negative300/GoogleNews-vectors-negative300.bin"
w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
# Turn a sentence into one vector by mean-pooling its word vectors
def sentence_to_vec(sentence, model, dim=300):
    words = [w for w in sentence.split() if w in model]
    if len(words) == 0:
        return np.zeros(dim)
    return np.mean(model[words], axis=0)
# train, test and submission DataFrames are assumed to be loaded earlier
# in the notebook from the competition's csv files
X_text = train["body"].astype(str).tolist()
X_vec = np.vstack([sentence_to_vec(text, w2v) for text in X_text])
y = train["rule_violation"].values
# Train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    X_vec, y, test_size=0.2, random_state=42, stratify=y
)
# XGBoost (use_label_encoder is deprecated and ignored in recent xgboost versions)
clf = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, n_jobs=-1, use_label_encoder=False,
    eval_metric="logloss"
)
clf.fit(X_tr, y_tr)
# Validate
y_pred = clf.predict(X_val)
print("Word2Vec + XGBoost Accuracy:", accuracy_score(y_val, y_pred))
Retrain the model on the full training set, predict on test.csv → finally produce the submission.
Note that the file a Kaggle notebook uploads to the leaderboard must be named submission.csv.
final_clf = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, n_jobs=-1, use_label_encoder=False,
    eval_metric="logloss"
)
final_clf.fit(X_vec, y)
X_text_test = test["body"].astype(str).tolist()
X_vec_test = np.vstack([sentence_to_vec(text, w2v) for text in X_text_test])
test_pred = final_clf.predict(X_vec_test)
# Fill the predictions into the sample-submission table and save
submission = submission.copy()
submission.iloc[:, 1] = test_pred
submission.to_csv("submission.csv", index=False)
print("Word2Vec_submission.csv")
The score: same as yesterday, 0.535.
First, download GloVe in the Kaggle notebook:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("thanakomsn/glove6b300dtxt")
print("Path to dataset files:", path)
# Train the model
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
# Convert GloVe's text format into Word2Vec format first
# (in gensim >= 4.0 you can skip this and load the GloVe file directly with
# KeyedVectors.load_word2vec_format(glove_input, binary=False, no_header=True))
glove_input = "/kaggle/input/glove6b300dtxt/glove.6B.300d.txt"
word2vec_output = "glove.6B.300d.word2vec.txt"
glove2word2vec(glove_input, word2vec_output)
glove = KeyedVectors.load_word2vec_format(word2vec_output, binary=False)
# Sentence vectors (reusing sentence_to_vec defined in 4.1)
X_text = train["body"].astype(str).tolist()
X_vec_glove = np.vstack([sentence_to_vec(text, glove) for text in X_text])
y = train["rule_violation"].values
# 訓練 & 驗證
X_tr, X_val, y_tr, y_val = train_test_split(
X_vec_glove, y, test_size=0.2, random_state=42, stratify=y
)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_val)
print("GloVe + XGBoost Accuracy:", accuracy_score(y_val, y_pred))
# Retrain on the full training set and build the submission
final_clf = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, n_jobs=-1, use_label_encoder=False,
    eval_metric="logloss"
)
final_clf.fit(X_vec_glove, y)
# Vectorize test.csv
X_text_test = test["body"].astype(str).tolist()
X_vec_test_glove = np.vstack([sentence_to_vec(text, glove) for text in X_text_test])
# Predict on the test set
test_pred = final_clf.predict(X_vec_test_glove)
# Write the submission
submission = submission.copy()
submission.iloc[:, 1] = test_pred
submission.to_csv("submission.csv", index=False)
print("GloVe_submission.csv Done")
First, download FastText in the Kaggle notebook:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("facebook/fasttext-wikinews")
print("Path to dataset files:", path)
Then the same steps as before...
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb
# Load FastText (Wiki News 300d). Note: the .vec file only holds precomputed
# full-word vectors, so FastText's subword n-gram trick can't help with unseen
# words here; that would require the .bin model loaded via
# gensim.models.fasttext.load_facebook_vectors.
ft_path = "/kaggle/input/fasttext-wikinews/wiki-news-300d-1M.vec"
fasttext = KeyedVectors.load_word2vec_format(ft_path)
# Sentence vectors for the training set
X_text = train["body"].astype(str).tolist()
X_vec_ft = np.vstack([sentence_to_vec(text, fasttext) for text in X_text])
# 訓練 & 驗證
X_tr, X_val, y_tr, y_val = train_test_split(
X_vec_ft, y, test_size=0.2, random_state=42, stratify=y
)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_val)
print("FastText + XGBoost Accuracy:", accuracy_score(y_val, y_pred))
final_clf = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, n_jobs=-1, use_label_encoder=False,
    eval_metric="logloss"
)
final_clf.fit(X_vec_ft, y)
# Vectorize test.csv
X_text_test = test["body"].astype(str).tolist()
X_vec_test_ft = np.vstack([sentence_to_vec(text, fasttext) for text in X_text_test])
# Predict on the test set
test_pred = final_clf.predict(X_vec_test_ft)
# Write the submission
submission = submission.copy()
submission.iloc[:, 1] = test_pred
submission.to_csv("submission.csv", index=False)
print("FastText submission.csv Done")
I'll add the observations tomorrow. I've actually finished running all three embeddings, but I kept getting blocked when uploading to the Kaggle leaderboard; I'm up to version 21 and thoroughly unsatisfied. Back at it tomorrow.
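Since uploads kept getting rejected, a small self-check like the sketch below can catch format problems before burning a submission attempt (assuming the test and submission DataFrames from the code above are still in scope):
# Self-check the written file before submitting: Kaggle rejects uploads whose
# file name, columns, or row count don't match the expected format.
out = pd.read_csv("submission.csv")
assert list(out.columns) == list(submission.columns), "columns must match the sample submission"
assert len(out) == len(test), "need exactly one prediction per test row"
assert out.notna().all().all(), "no missing predictions"
print("submission.csv looks valid:", out.shape)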