[Day5] Self-Attention, Masked Self-Attention, Multi-Head Attention

2025 iThome Ironman Contest

DAY 5

Generative AI

From Context Engineering to Agents: A 30-Day Learning Log on Generative AI and LLMs, Part 5

17th Ironman Contest

ruiyang0630

Team: nutc imac

2025-09-19 22:57:50


In Day 4 we used a simple example to understand how Attention works and the roles of Query, Key, and Value. In practical Transformer models, however, Attention appears in three important variants:

1. Self-Attention

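In cross-attention, the general encoder-decoder form of Attention, Q (Query) comes from one sequence while K (Key) and V (Value) come from another. In translation, for example, the English sentence is one sequence that has to be aligned with the Chinese sentence. In many NLP tasks, however, we only need the words within a single sentence to be compared against each other; this is Self-Attention.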

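In Self-Attention, Q, K, and V are all derived from the same input sequence. The model computes how strongly each word relates to every word in the sentence (itself included) and uses those weights to build a context-aware representation of each word.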
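The computation described above is usually written as scaled dot-product attention. The following is a sketch in which X denotes the input sequence and W_Q, W_K, W_V denote the learned projection matrices; these symbol names are assumptions chosen to mirror the code variables W_q, W_k, and W_v used in this post.

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The softmax is applied row by row, so each row of the resulting weight matrix is a probability distribution over the positions of the sequence.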

Simulating Self-Attention (the weights are untrained, so the output values carry no meaning)

First, define the embedding

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Toy vocabulary
vocab = {"我":0, "愛":1, "鐵人賽":2}
sentence = ["我", "愛", "鐵人賽"]

# Convert the tokens to indices
ids = torch.tensor([[vocab[w] for w in sentence]])  # [[0,1,2]]

# Define the embedding layer
embed_dim = 8
embedding = nn.Embedding(len(vocab), embed_dim)

x = embedding(ids)  # (batch=1, seq_len=3, embed_dim=8)
print("Input tensor shape:", x.shape)

Output
Each word is represented by an 8-dimensional vector.

Input tensor shape: torch.Size([1, 3, 8])
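Conceptually, an embedding layer is a lookup table: each token index selects one row of a weight matrix. The following plain-Python sketch (no PyTorch) illustrates that idea only. The name `table`, the fixed seed, and the Gaussian values standing in for randomly initialized weights are all assumptions made for illustration.

```python
import random

vocab = {"我": 0, "愛": 1, "鐵人賽": 2}
sentence = ["我", "愛", "鐵人賽"]
embed_dim = 8

# Hypothetical lookup table: one random 8-dimensional row per vocabulary entry.
rng = random.Random(0)
table = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in vocab]

# Embedding lookup is plain row indexing: token -> index -> row of the table.
ids = [vocab[w] for w in sentence]
x = [table[i] for i in ids]

print(len(x), len(x[0]))  # 3 8
```

In a real model the rows of the table are trainable parameters, so training moves the vectors of related words toward useful positions.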

Self-Attention

d_k = embed_dim
W_q = nn.Linear(d_k, d_k, bias=False)
W_k = nn.Linear(d_k, d_k, bias=False)
W_v = nn.Linear(d_k, d_k, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
attn = F.softmax(scores, dim=-1)
output = torch.matmul(attn, V)

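print("Self-Attention weights:\n", attn)

attn has shape (1, 3, 3): row i is the attention distribution of word i over all words in the sequence, so each row sums to 1. Because the weights are randomly initialized, the exact numbers differ on every run.
Output

Self-Attention weights: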
 tensor([[[0.3417, 0.2756, 0.3828],
         [0.3028, 0.3760, 0.3212],
         [0.2962, 0.3034, 0.4003]]], grad_fn=<SoftmaxBackward0>)
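To make the arithmetic behind the PyTorch calls visible, here is a minimal plain-Python sketch of scaled dot-product self-attention. It assumes identity projections (Q = K = V = x) instead of learned W_q, W_k, W_v, and the 2-dimensional input vectors are hypothetical, so it shows the mechanics rather than the exact computation above.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    d_k = len(x[0])
    # scores[i][j] = (x_i . x_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(xi, xj)) / math.sqrt(d_k) for xj in x]
        for xi in x
    ]
    weights = [softmax(row) for row in scores]
    # output[i] = sum_j weights[i][j] * x_j
    output = [
        [sum(w * xj[c] for w, xj in zip(row, x)) for c in range(d_k)]
        for row in weights
    ]
    return weights, output

# Hypothetical 2-dimensional vectors for a 3-token sentence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(x)

for row in weights:
    print([round(w, 4) for w in row], "row sum =", round(sum(row), 4))
```

Every row of `weights` sums to 1, which is the same property the (1, 3, 3) tensor printed above has.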

2. Masked Self-Attention

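In a language model such as GPT, the goal is to predict the next word one step at a time: given the words 「我愛」 ("I love"), the model predicts that the next word is 「鐵人賽」 ("Ironman Contest"), then keeps predicting from there.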

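If, during training, the model could already "peek" at 「鐵人賽」 while processing 「愛」, the prediction would be cheating: the model would be copying the answer rather than inferring it from the preceding text. Masked Self-Attention prevents this by setting the scores of all future positions to negative infinity before the softmax, so their attention weights become exactly 0.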
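The next-word setup described above can be pictured as a list of (context, target) training pairs, where each target may depend only on the tokens to its left. A small plain-Python sketch follows; the variable name `pairs` is an assumption introduced for illustration.

```python
sentence = ["我", "愛", "鐵人賽"]

# For each position i >= 1, the context is everything before i
# and the target is the token at position i.
pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]

for context, target in pairs:
    print(context, "->", target)
```

A causal mask lets a Transformer train on all of these pairs in one forward pass while still hiding each target from its own context.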

Simulating Masked Self-Attention

The input continues from the example above

seq_len = x.size(1)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
mask = mask.masked_fill(mask==1, float("-inf"))

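masked_scores = scores + mask  # Hide future tokens: -inf becomes weight 0 after softmax
attn_masked = F.softmax(masked_scores, dim=-1)

print("Masked Self-Attention weights:\n", attn_masked)

Output
In the output below, the first row attends only to itself, the second row attends to itself and the one preceding word, and the third row sees the whole sentence. Under Masked Self-Attention every token can only "look left" and cannot peek at future words on its right.

Masked Self-Attention weights: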
 tensor([[[1.0000, 0.0000, 0.0000],
         [0.4460, 0.5540, 0.0000],
         [0.2962, 0.3034, 0.4003]]], grad_fn=<SoftmaxBackward0>)
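One way to read this output: masking a row is equivalent to taking the unmasked weights, keeping only the visible positions, and renormalizing them. With the numbers printed in this post, 0.3028 / (0.3028 + 0.3760) ≈ 0.4461, which agrees with the 0.4460 shown above up to rounding of the displayed digits. The plain-Python sketch below checks the same identity on hypothetical scores (not the values from this post).

```python
import math

def softmax(row):
    # exp(-inf) evaluates to 0.0, so masked positions get zero weight.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [0.2, -0.1, 0.5],
    [0.1, 0.3, 0.0],
    [-0.2, 0.4, 0.6],
]
seq_len = len(scores)
neg_inf = float("-inf")

# Causal mask: query i may only see key positions j <= i.
masked_scores = [
    [s if j <= i else neg_inf for j, s in enumerate(row)]
    for i, row in enumerate(scores)
]
masked = [softmax(row) for row in masked_scores]
unmasked = [softmax(row) for row in scores]

for i in range(seq_len):
    visible = unmasked[i][: i + 1]
    renormalized = [w / sum(visible) for w in visible]
    print(i, [round(w, 4) for w in masked[i]], [round(w, 4) for w in renormalized])
```

For each row the nonzero part of the masked weights matches the renormalized prefix, which is why the last row is unchanged by masking.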

3. Multi-Head Attention

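A single Attention computation produces only one set of attention weights per token, which limits how many kinds of relationships it can capture at once. Relationships in language are varied: 「蘋果」 ("apple") is related to 「水果」 ("fruit") semantically, but also to 「公司」 ("company") as a named entity. Multi-Head Attention is designed to address this limitation.
Multi-Head Attention uses different linear projection matrices to project the input into several subspaces (the Heads). Each Head computes Attention independently and can learn a different kind of relationship. The outputs of all Heads are then concatenated and passed through a final linear projection to form a richer representation.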
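The split, attend, and concatenate steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: it omits the learned input and output projections that a real Multi-Head Attention layer applies, splits each vector by simple slicing, and uses hypothetical 4-dimensional inputs with 2 heads. The helper names `split_heads` and `multi_head_self_attention` are made up for this sketch.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention on lists of vectors.
    d_k = len(q[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return weights, out

def split_heads(x, num_heads):
    # Each head receives a contiguous slice of size embed_dim // num_heads.
    head_dim = len(x[0]) // num_heads
    return [
        [tok[h * head_dim:(h + 1) * head_dim] for tok in x]
        for h in range(num_heads)
    ]

def multi_head_self_attention(x, num_heads):
    results = [attention(h, h, h) for h in split_heads(x, num_heads)]
    per_head_weights = [w for w, _ in results]
    # Concatenate the head outputs token by token (no output projection here).
    concat = [
        [value for _, out in results for value in out[t]]
        for t in range(len(x))
    ]
    return per_head_weights, concat

# Hypothetical input: 3 tokens, embed_dim = 4, split into 2 heads of size 2.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
per_head_weights, concat = multi_head_self_attention(x, num_heads=2)

for h, w in enumerate(per_head_weights):
    print("head", h, [[round(v, 3) for v in row] for row in w])
print("concatenated shape:", len(concat), len(concat[0]))
```

Because each head sees a different slice of the vectors, the two weight matrices differ even without any training; the embedding dimension must be divisible by the number of heads for the split to work.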

Simulating Multi-Head Attention

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        # Q, K, and V are all the same input (Self-Attention)
        out, attn_weights = self.attention(
            x, x, x, need_weights=True, average_attn_weights=False
        )
        return out, attn_weights

# Simulated input: batch=1, seq_len=4, embed_dim=8
x = torch.randn(1, 4, 8)
mha = MultiHeadAttention(embed_dim=8, num_heads=2)

out, attn_weights = mha(x)

print("Multi-Head Attention output shape:", out.shape)          # (1, 4, 8)
print("Per-head attention weight shape:", attn_weights.shape)  # (1, 2, 4, 4)

# Inspect Head 0 and Head 1 of batch 0
head0 = attn_weights[0, 0]
head1 = attn_weights[0, 1]

print("Batch 0 - Head 0 attention matrix:\n", head0)
print("Batch 0 - Head 1 attention matrix:\n", head1)

Output
The output below shows that the two Heads distribute attention over the tokens differently. The module is untrained and the input is random, so the differences come from random initialization rather than from learned behavior.

Multi-Head Attention output shape: torch.Size([1, 4, 8])
Per-head attention weight shape: torch.Size([1, 2, 4, 4])
Batch 0 - Head 0 attention matrix:
 tensor([[0.2394, 0.2477, 0.2485, 0.2644],
        [0.2908, 0.2376, 0.2168, 0.2548],
        [0.1838, 0.2790, 0.3089, 0.2283],
        [0.3772, 0.2338, 0.1994, 0.1896]], grad_fn=<SelectBackward0>)
Batch 0 - Head 1 attention matrix:
 tensor([[0.2823, 0.2339, 0.2452, 0.2385],
        [0.3475, 0.2127, 0.2290, 0.2107],
        [0.1924, 0.2706, 0.2590, 0.2780],
        [0.3969, 0.2027, 0.2168, 0.1836]], grad_fn=<SelectBackward0>)