[Day 10] Head Mask Pooling, the Magic of Pooling 🪄🦄: Breaking Down the 2nd- and 3rd-Place Solutions, Head Mask Pooling and Multi-Task Learning

2024 iThome Ironman Contest

DAY 10

AI/ ML & Data

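How Is a Kaggle Gold-Medal Solution Born? Following the Discussions of Top Kaggle NLP Competitors to Trace Their Problem-Solving Process, Part 10 of the series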

16th Ironman Contest kaggle nlp ai data science

壓縮甜

2024-09-24 22:31:56

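After reading through a pile of solution write-ups, the second-place approach really made my eyes light up 🤩, and it is absolutely something you can cheaply borrow and carry over to other competitions!

Let's get straight to the point:

🥈 2nd Solution

The second-place solution [1] is actually very, very simple: simple and effective!

💡 Head Mask Pooling

In short, he found a pooling method that worked like magic for this competition 🪄🦄.

First, he used only DeBERTa as the prediction model. Here is the format of his input data: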

'Think through this step by step : ' + prompt_question + [SEP] + 'Pay attention to the content and wording : ' + text + [SEP] + prompt_text

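He raised DeBERTa's maximum input length to 2048 tokens, then concatenated prompt_question, text, and prompt_text in the format above to form the input.

Next, he defined a specific head mask for this input, as follows: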
Input : [TOKEN] [TOKEN] [SEP] [TOKEN] [TOKEN] [SEP] [TOKEN] [TOKEN]
Head Mask : [0] [0] [1] [1] [1] [0] [0] [0]

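When we use BERT for classification or related tasks, we usually build an attention mask containing only 0s and 1s: positions that hold real input text get 1, and the [PAD] tokens added to align sequence lengths get 0. At pooling time we can then multiply each token's hidden representation by this attention mask, so the [PAD] positions are multiplied by 0 and their representations are ignored, which keeps the model from being disturbed by [PAD].

Here, in addition to the attention mask above, the author designed an extra head mask that is 1 only over the input text and 0 over prompt_question and prompt_text. As a result, pooling over the last layer only considers the hidden states of text, forcing the model to put more weight on the student's summary.

But this does not mean that feeding in prompt_question and prompt_text is useless!

Throughout the earlier layers, self-attention lets every text token attend to prompt_question and prompt_text (the MLP sublayers then process each position's already-mixed representation), so the hidden representation of text is itself computed from the relationships among all three.

It is only at the final pooling step that we restrict the computation to the hidden states of text, so that the last NN layer, when predicting the content and wording scores, focuses on the pooled last hidden representation of text.

Below, we refactor the author's code (see [2]) to look at how head_mask is created and how it is then used in the pooling computation!

Creating a Dataset that includes head_mask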


import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, df, tokenizer, use_prompt_text=True):
        """
        Tokenize the text in a DataFrame and build a head_mask and attention_mask per sample.

        :param df: DataFrame with prompt_question, text and prompt_text columns
        :param tokenizer: tokenizer used to convert text into token ids
        :param use_prompt_text: whether to append prompt_text to the input
        """
        self.df = df
        self.tokenizer = tokenizer
        self.use_prompt_text = use_prompt_text
        # Separator placed between the prompt parts and the student's answer
        self.separator = " " + self.tokenizer.sep_token + " "

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        """
        Build input_ids, attention_mask and head_mask for one sample.
        head_mask marks the student's answer and ignores the other parts
        (prompt_question and prompt_text).
        """
        row = self.df.iloc[index]

        # Append prompt_text only when use_prompt_text is set
        prompt_text = (self.separator + row.prompt_text) if self.use_prompt_text else ''
        input_text = (
            'Think through this step by step : ' + row.prompt_question +
            self.separator +
            'Pay attention to the content and wording : ' + row.text +
            prompt_text
        )

        # Convert the input text to token ids (the [SEP] tokens are already in the string)
        tokenized_output = self.tokenizer(input_text, add_special_tokens=False)

        input_ids = tokenized_output.input_ids
        attention_mask = tokenized_output.attention_mask

        # Build head_mask so that it covers the student's answer
        head_mask = []
        is_student_answer = False
        for token in input_ids:
            if token == self.tokenizer.sep_token_id:
                # Every [SEP] toggles whether we are inside the student's answer
                is_student_answer = not is_student_answer

            # 1 from the first [SEP] (inclusive) up to the second [SEP] (exclusive), 0 elsewhere
            head_mask.append(1 if is_student_answer else 0)

        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            'head_mask': torch.tensor(head_mask)
        }
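
The Dataset above returns tensors of different lengths, so batching them requires padding all three fields consistently. The following is a minimal sketch of one way to do that, not the author's code: `collate_with_head_mask` and its `pad_token_id` argument are hypothetical names, and the toy token ids are made up.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_head_mask(batch, pad_token_id=0):
    """Pad a list of samples (dicts of 1-D tensors) to the longest sequence in the batch."""
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch], batch_first=True, padding_value=pad_token_id
    )
    # Padding positions get 0 in both masks, so attention and pooling ignore them.
    attention_mask = pad_sequence(
        [item["attention_mask"] for item in batch], batch_first=True, padding_value=0
    )
    head_mask = pad_sequence(
        [item["head_mask"] for item in batch], batch_first=True, padding_value=0
    )
    return {"input_ids": input_ids, "attention_mask": attention_mask, "head_mask": head_mask}


# Toy batch with made-up token ids; 2 plays the role of [SEP] here.
batch = [
    {
        "input_ids": torch.tensor([5, 6, 2, 7, 8, 2, 9]),
        "attention_mask": torch.ones(7, dtype=torch.long),
        "head_mask": torch.tensor([0, 0, 1, 1, 1, 0, 0]),
    },
    {
        "input_ids": torch.tensor([5, 2, 7, 2]),
        "attention_mask": torch.ones(4, dtype=torch.long),
        "head_mask": torch.tensor([0, 1, 1, 0]),
    },
]
padded = collate_with_head_mask(batch, pad_token_id=0)
print(padded["input_ids"].shape)  # torch.Size([2, 7])
```

With a real tokenizer, this could be passed to a DataLoader as `collate_fn=lambda b: collate_with_head_mask(b, tokenizer.pad_token_id)`.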

Pooling according to head_mask

import torch
import torch.nn as nn


class MeanPooling(nn.Module):
    def __init__(self, clamp_min=1e-9):
        """
        Masked mean pooling; clamp_min guards against division by zero.
        """
        super().__init__()
        self.clamp_min = clamp_min

    def forward(self, hidden_states, mask):
        """
        Average hidden_states over the positions where mask is 1.

        :param hidden_states: encoder output of shape (batch_size, seq_len, hidden_dim)
        :param mask: head_mask or attention_mask of shape (batch_size, seq_len),
                     marking which tokens take part in the pooling
        """
        # Expand the mask to the shape of hidden_states for element-wise multiplication
        mask_expanded = mask.unsqueeze(-1).expand(hidden_states.size()).float()

        # Sum the token representations, keeping only tokens whose mask is 1
        sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)

        # Count the selected tokens and avoid dividing by zero
        sum_mask = mask_expanded.sum(dim=1)
        sum_mask = torch.clamp(sum_mask, min=self.clamp_min)

        # Masked mean
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings
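
To make the difference between the two masks concrete, here is a small worked example with made-up numbers. It is only an illustrative sketch: `masked_mean` is a hypothetical functional rewrite of the pooling above, and the hidden states are toy values rather than real DeBERTa outputs.

```python
import torch


def masked_mean(hidden_states, mask, clamp_min=1e-9):
    """Functional version of masked mean pooling: average over positions where mask is 1."""
    mask_expanded = mask.unsqueeze(-1).expand(hidden_states.size()).float()
    summed = (hidden_states * mask_expanded).sum(dim=1)
    counts = mask_expanded.sum(dim=1).clamp(min=clamp_min)
    return summed / counts


# One sequence of 4 tokens with hidden size 2 (made-up values).
hidden_states = torch.tensor([[[1.0, 1.0], [3.0, 5.0], [5.0, 9.0], [7.0, 1.0]]])
attention_mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding
head_mask = torch.tensor([[0, 1, 1, 0]])       # only positions 1 and 2 are student text

print(masked_mean(hidden_states, attention_mask))  # tensor([[3., 5.]])
print(masked_mean(hidden_states, head_mask))       # tensor([[4., 7.]])
```

With attention_mask the prompt token at position 0 contributes to the average; with head_mask only the two student-text positions do.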

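Defining the model
After obtaining the hidden states from DeBERTa's output, we call the pooling method written above and pass in head_mask. (The usual practice is to pass in attention_mask; here we pass the head_mask computed by the dataset instead, and this change is the author's main innovation.)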

import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self, base_model, hidden_size):
        """
        Wrap a pretrained Transformer (e.g. DeBERTa) with a pooling layer and an output layer.

        :param base_model: pretrained Transformer model
        :param hidden_size: dimension of the encoder's hidden states
        """
        super().__init__()
        self.base_model = base_model  # pretrained Transformer model
        self.pooling = MeanPooling()  # the MeanPooling module defined above
        self.classifier = nn.Linear(hidden_size, 1)  # final linear head, one output per sample

    def forward(self, input_ids, attention_mask, head_mask):
        """
        Forward pass; head_mask controls which tokens are pooled.

        :param input_ids: input token ids
        :param attention_mask: marks which tokens are real (non-padding) tokens
        :param head_mask: marks which tokens take part in the pooling
        """
        # Get the hidden states from the pretrained Transformer
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state  # (batch_size, seq_len, hidden_dim)

        # Masked mean pooling with head_mask, i.e. over the student's answer only
        pooled_output = self.pooling(hidden_states, head_mask)

        # Final prediction from the output layer
        logits = self.classifier(pooled_output)
        return logits

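At this point, you might be wondering:
Didn't we talk earlier about hand-computing linguistic features, such as the n-gram overlap with the source text, and feeding them to LGBM to see whether that could lift the score on wording, the metric dragging everyone down? Now the second-place solution drops LGBM and also makes the model focus on text itself. Won't the model really perform worse?

To strengthen the model's prediction of the wording score, the author trained with auxiliary classes.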

Multi-Task Learning on Auxiliary Classes

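The second-place solution also used auxiliary classes to strengthen training. Specifically:

Auxiliary classes: these are the target classes of the Feedback 3.0 competition, namely:

'cohesion'
'syntax'
'vocabulary'
'phraseology'
'grammar'
'conventions'

Creating the auxiliary labels: the competitor used a model trained on the Feedback 3.0 dataset to run inference on the text column of this competition's data, producing pseudo-labels. These labels are not part of the official competition data; they are derived from an external dataset.

Use in the loss function: the auxiliary classes are integrated into the loss function, so the loss now has two objectives: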

(loss * 0.5) + (aux_loss * 0.5)

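Here the main loss and the auxiliary loss each carry half of the weight in guiding the model. This can improve the model's ability to pick up on these linguistic properties (syntax, vocabulary, and so on).

Used every other step: the auxiliary classes are only used once every two steps, possibly to keep the auxiliary loss from having too much influence and to preserve the priority of the main task's loss.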
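
The loss schedule described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the write-up does not say which loss function was used (MSE is assumed here), which steps carry the auxiliary term (even steps are assumed), or what the loss looks like on the other steps (the unweighted main loss is assumed). `multitask_loss` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # assumption: the source does not name the loss function


def multitask_loss(main_pred, main_target, aux_pred, aux_target, step):
    """Return the training loss for one step.

    Assumption: auxiliary targets contribute on even steps only; on odd
    steps the loss is the main loss alone.
    """
    loss = mse(main_pred, main_target)
    if step % 2 == 0:
        aux_loss = mse(aux_pred, aux_target)
        return (loss * 0.5) + (aux_loss * 0.5)
    return loss


# Toy tensors: 2 main targets (content, wording) and 6 auxiliary pseudo-labels.
main_pred, main_target = torch.zeros(4, 2), torch.ones(4, 2)       # main MSE = 1.0
aux_pred, aux_target = torch.zeros(4, 6), torch.full((4, 6), 2.0)  # aux MSE = 4.0

print(multitask_loss(main_pred, main_target, aux_pred, aux_target, step=0).item())  # 2.5
print(multitask_loss(main_pred, main_target, aux_pred, aux_target, step=1).item())  # 1.0
```

In practice the model would need an output head with 2 + 6 units (or two separate heads) so that both sets of predictions come from the same pooled representation.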

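I personally find this approach quite creative ✨✨!

Most other competitors relied on feature engineering: they defined and dug up features related to 'syntax', 'vocabulary', and 'grammar' themselves, then explicitly told the model to score the student's summary based on those features. Here, by contrast, a model that has already learned all of these evaluation dimensions scores each student summary along them, and the competition model learns to attend to these linguistic features in the process of predicting those scores.

In other words, instead of explicitly telling the model, "length, n_gram_concurrence_ratio, n_gram overlap, spelling error rate, and similar features are related to wording! Find a way to predict the wording score from them", this approach is more like telling the model, "This summary does well on grammar and vocabulary, and its wording score is xxx. Work out for yourself which characteristics of the summary relate to grammar and vocabulary; it is because the grammar and vocabulary are good that wording gets a score of xxx."

That is roughly the idea: the model does not start out knowing that it should pay attention to length, n_gram_concurrence_ratio, n_gram overlap, spelling error rate, and so on. It is guided by the loss computed from the 'syntax', 'vocabulary', and wording scores, and learns on its own to attend to the features related to these metrics.

The author also used an LLM to rewrite the only 4 prompts in the original train set. During training the model was given only the LLM-rewritten prompt_question, prompt_text, and so on, and evaluation was then done on the prompts from the real training data.

Of all these techniques, though, the one with the most significant gain was Head Mask Pooling, which is why many competitors called it the real magic of this competition!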

3rd Solution

Reverse Autocorrect

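Remember how we used Levenshtein distance earlier and found that part of the train set consists of summaries with exactly identical content? Some of those identical summaries were given extremely low scores, while others were not. Still, there is a common trend: summaries with similar content (small distance) receive similar scores.
The third-place author also noticed that there were many similar essays. He proposed a way to generate synthetic data: for texts with a very small Levenshtein distance, randomly replace some words to simulate spelling errors or similar text variations. This technique can be called "reverse autocorrect". He found that it increases data diversity and further improves the model's robustness.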
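
The write-up does not spell out how the replacement words were chosen, so the following is only a simple stand-in for the idea, not the third-place author's procedure: it injects typos by swapping two adjacent inner letters in randomly selected words. `inject_typos`, `prob`, and `seed` are hypothetical names introduced for illustration.

```python
import random


def inject_typos(text, prob=0.1, seed=None):
    """Return a copy of `text` where some words have two adjacent inner letters swapped."""
    rng = random.Random(seed)
    noisy = []
    for word in text.split():
        # Only perturb purely alphabetic words long enough to have inner letters to swap.
        if len(word) > 3 and word.isalpha() and rng.random() < prob:
            i = rng.randrange(1, len(word) - 2)  # keep the first and last letters fixed
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)


original = "The students summarized the passage clearly"
augmented = inject_typos(original, prob=0.5, seed=0)
print(augmented)
```

An augmented copy would presumably keep the score of the summary it was derived from, which matches the observation that near-duplicate summaries receive similar scores.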

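The third-place author [3] also designed his own pooling method:

Custom pooling:
He combined two kinds of pooling:

CLS token: the CLS token is a special token in Transformer models, typically used as a representation of the whole sequence.
Mean pooling over the student text: he also applied mean pooling to the student's answer and concatenated the two feature vectors. This custom pooling method lets the model consider both global and detailed features.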
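
The combination of the two poolings can be sketched as below. This is a minimal illustration under assumptions, not the third-place code: `ClsPlusMeanPooling` is a hypothetical name, the CLS token is assumed to sit at position 0, and the hidden states are random toy values.

```python
import torch
import torch.nn as nn


class ClsPlusMeanPooling(nn.Module):
    """Concatenate the first-token state with the mean of the student-text states."""

    def forward(self, hidden_states, student_mask):
        cls_state = hidden_states[:, 0]            # (batch, hidden), assumes CLS at position 0
        mask = student_mask.unsqueeze(-1).float()  # (batch, seq_len, 1)
        mean_state = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return torch.cat([cls_state, mean_state], dim=-1)  # (batch, 2 * hidden)


hidden_states = torch.randn(2, 8, 16)
student_mask = torch.tensor([[0, 0, 1, 1, 1, 0, 0, 0],
                             [0, 1, 1, 1, 1, 1, 0, 0]])
pooled = ClsPlusMeanPooling()(hidden_states, student_mask)
print(pooled.shape)  # torch.Size([2, 32])
```

Because the two vectors are concatenated, the output layer that follows needs an input size of twice the encoder's hidden size.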

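It works on much the same idea as the second-place Head Mask Pooling!

He also used a few tricks when training the model, though my personal guess is that they only bring marginal gains:

EMA (Exponential Moving Average): he mentioned that the model was unstable without EMA, so EMA was a must for him. EMA keeps an exponentially weighted running average of the model parameters as training progresses, and using these averaged weights makes training and evaluation more stable.

To use EMA, you can install the torch-ema library (imported as torch_ema). The following shows how to apply EMA when training DeBERTa: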

import torch
from torch_ema import ExponentialMovingAverage
from transformers import DebertaForSequenceClassification, DebertaTokenizer

# Load the DeBERTa model and tokenizer
model = DebertaForSequenceClassification.from_pretrained('microsoft/deberta-base', num_labels=2)
tokenizer = DebertaTokenizer.from_pretrained('microsoft/deberta-base')

# Create the EMA instance from the model's parameters
ema = ExponentialMovingAverage(model.parameters(), decay=0.999)

# Set up a mock training run
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()

# Build a tiny dataset
texts = ["This is a positive example.", "This is a negative example."]
labels = torch.tensor([1, 0])

# Tokenization
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

# Mock training epochs
model.train()
for epoch in range(3):  # 3 epochs, just as an example
    optimizer.zero_grad()

    # Forward pass
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits

    # Compute the loss
    loss = criterion(logits, labels)

    # Backward pass
    loss.backward()

    # Update the model parameters
    optimizer.step()

    # Update the EMA weights after each optimizer step
    ema.update()

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

# Inference:
# at inference time we predict with the EMA parameters
model.eval()        # switch off dropout for inference
ema.store()         # save the current model parameters
ema.copy_to()       # copy the EMA parameters into the model

# Run inference (the model now uses the EMA weights)
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    print(f"Predictions using EMA: {predictions}")

# Restore the model's original parameters
ema.restore()

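Differential learning rates: he used different learning rates for different layers, meaning the parameters of each layer are updated with their own learning rate. This is a very common deep learning technique, and it is especially helpful when fine-tuning pretrained models.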
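
In PyTorch, differential learning rates are usually set through optimizer parameter groups. The sketch below is an assumption-laden illustration rather than the third-place configuration: `ToyRegressor` is a hypothetical stand-in for a pretrained encoder plus a new head, and the learning-rate values are arbitrary.

```python
import torch
import torch.nn as nn


class ToyRegressor(nn.Module):
    """Stand-in for a pretrained encoder plus a freshly initialised head."""

    def __init__(self, hidden_size=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, x):
        return self.head(self.encoder(x))


model = ToyRegressor()
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 2e-5},  # small steps for pretrained weights
        {"params": model.head.parameters(), "lr": 1e-3},     # larger steps for the new head
    ],
    weight_decay=0.01,
)
print([group["lr"] for group in optimizer.param_groups])  # [2e-05, 0.001]
```

Layer-wise schemes use the same mechanism with one parameter group per Transformer layer, typically with smaller learning rates for the layers closer to the input.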

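Summary

That wraps up the series of posts on the "CommonLit - Evaluate Student Summaries" competition!
I really enjoyed watching everyone dig into the phenomena behind this competition's training data. Through the three earlier posts ([Day 6] Don't Rush to Train a Model, Finding Good Features Is Half the Battle: Hands-On EDA (Part 1), (Part 2), (Part 3)), I hope I have told the story of this dataset well. In addition, the first-place approach of augmenting data with an LLM, along with the multimodal strategy of combining text and numerical data, and the custom pooling methods used by both the second- and third-place solutions covered today all strike me as very practical. I hope they give you some inspiration in your work or research!

Tomorrow we finally move on to the LLM competitions I have been looking forward to so much!

See you tomorrow!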