Let's Enter a Kaggle Competition - Building Hands-On Experience 16 (Pooling Showdown: Mean? Max? Or Both?)

17th Ironman Contest

yuhua__

2025-09-29 23:44:26

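Recap of yesterday

I tweaked yesterday's code slightly to make sure the output format meets the competition's requirements. GloVe gave the highest score, so today we go one step further and try different pooling strategies to see which one makes the model smarter!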

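1. What is pooling?

Pooling does one thing: it compresses the word vectors of a whole sentence into a single fixed-length vector that can be fed to a model.

We first turn each word into a vector, but XGBoost cannot consume a variable-length sequence of word vectors, so we need a way to condense a sentence into one vector. That condensing step is pooling.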
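To make the shapes concrete, here is a minimal sketch. The 4-dimensional random vectors and the two toy sentences are assumptions for illustration only (real GloVe or word2vec vectors would have 300 dimensions). It shows that sentences of different lengths all end up as rows of the same width.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, made up for this illustration.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=4) for w in ["this", "post", "is", "great", "spam"]}

sentences = ["this post is great", "spam"]

pooled_rows = []
for s in sentences:
    seq = np.array([toy_vectors[w] for w in s.split()])  # shape: (n_words, 4)
    pooled = seq.mean(axis=0)                             # shape: (4,)
    print(repr(s), seq.shape, "->", pooled.shape)
    pooled_rows.append(pooled)

# One fixed-length row per sentence: this is what a tabular model can consume.
X = np.vstack(pooled_rows)
print(X.shape)
```

Mean is used here only as a stand-in; any reduction over the word axis gives the same fixed-length result.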

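2. Our three contenders

2.1 Mean Pooling

Average the vectors of all the words in the sentence.

Pros: smooth and stable; not easily swayed by extreme words
Cons: flattens sentiment, like averaging a class's grades, where even the top student's score gets diluted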

def mean_pooling(texts, model, dim=300):
    return np.vstack([
        np.mean([model[w] for w in t.split() if w in model] or [np.zeros(dim)], axis=0)
        for t in texts
    ])

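2.2 Max Pooling

Take the largest value in each dimension across all the words in the sentence.

Pros: picks out the strongest semantic signal in the sentence
Cons: can be skewed by one or two extreme words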

def max_pooling(texts, model, dim=300):
    return np.vstack([
        np.max([model[w] for w in t.split() if w in model] or [np.zeros(dim)], axis=0)
        for t in texts
    ])
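
A small worked example may help show how the two reductions differ on the same input. The 3-dimensional vectors below are invented for illustration and do not come from any real embedding.

```python
import numpy as np

# Hypothetical word vectors for a three-word sentence (values are made up).
word_vecs = np.array([
    [0.1,  0.2, -0.3],
    [0.0,  0.1, -0.1],
    [0.9, -0.6,  0.2],   # one "strong" word
])

mean_vec = word_vecs.mean(axis=0)  # averages each dimension over the words
max_vec = word_vecs.max(axis=0)    # keeps the largest value per dimension

print("mean:", mean_vec)
print("max: ", max_vec)
```

In the first dimension the strong word's 0.9 is diluted to about 0.33 by the mean but kept intact by the max. Note also that the max discards the large negative value (-0.6) in the second dimension, since only the largest value per dimension survives.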

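2.3 Mean + Max Concat Pooling (both at once)

Compute both the mean and the max, then concatenate them into one feature vector.

Pros: captures the overall trend and the strongest signal at the same time
Cons: the feature dimension doubles, so training takes longer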

def mean_max_concat_pooling(texts, model, dim=300):
    features = []
    for t in texts:
        word_vecs = [model[w] for w in t.split() if w in model] or [np.zeros(dim)]
        mean_vec = np.mean(word_vecs, axis=0)
        max_vec  = np.max(word_vecs, axis=0)
        features.append(np.concatenate([mean_vec, max_vec]))
    return np.vstack(features)
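
As a quick sanity check of the doubled feature width and the zero-vector fallback, the sketch below runs the same logic on a plain dict standing in for the embedding model. The dict, its 2-dimensional vectors, and the function name `mean_max_concat` are assumptions for illustration.

```python
import numpy as np

def mean_max_concat(texts, model, dim):
    """Same logic as the function above, restated so this block runs on its own."""
    rows = []
    for t in texts:
        # Fall back to a single zero vector when no word is in the vocabulary.
        vecs = [model[w] for w in t.split() if w in model] or [np.zeros(dim)]
        rows.append(np.concatenate([np.mean(vecs, axis=0), np.max(vecs, axis=0)]))
    return np.vstack(rows)

# Hypothetical 2-dimensional "embedding model": any object supporting `in` and [] works.
toy_model = {"good": np.array([1.0, 0.0]), "bad": np.array([-1.0, 2.0])}

X = mean_max_concat(["good bad", "unknownword"], toy_model, dim=2)
print(X.shape)  # (2, 4): two sentences, 2 * dim features each
print(X)
```

The second sentence contains only out-of-vocabulary words, so its row is all zeros. With 300-dimensional vectors the same function would produce 600 features per sentence.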

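3. Workflow

Start from yesterday's best-scoring embedding (GloVe)
Apply Mean Pooling, Max Pooling, and Mean+Max Concat Pooling in turn
Feed each result into the same XGBoost model
Record the validation accuracy and see which pooling works best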

4. Code

import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

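# Load the chosen word-vector model.
# Note: the text refers to GloVe, but this path points to the GoogleNews word2vec
# vectors; swap in the GloVe file if that is the embedding you intend to use.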
w2v_path = "/kaggle/input/googlenews-vectors-negative300/GoogleNews-vectors-negative300.bin"
w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)

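# Prepare the texts and labels (the pooling functions defined above are used below).
# Assumes `train` is the competition training DataFrame, already loaded.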
X_text = train["body"].astype(str).tolist()
y = train["rule_violation"].values

# Pick one pooling strategy
X_vec = mean_pooling(X_text, w2v, dim=300)  # or max_pooling / mean_max_concat_pooling

# Split the data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_vec, y, test_size=0.2, random_state=42, stratify=y
)

# Train XGBoost
clf = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, n_jobs=-1, use_label_encoder=False,
    eval_metric="logloss"
)
clf.fit(X_tr, y_tr)

# Validate
y_pred = clf.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred))