On Day4 we got to know the attention mechanism, and on Day5 we went a step further and saw how Self-Attention, Masked Self-Attention, and Multi-Head Attention differ. So how do these modules fit together to build a complete Transformer? The answer is the Encoder-Decoder architecture!
The Transformer was originally designed to solve Seq2Seq (sequence-to-sequence) problems such as translation: given the input 「我愛鐵人賽」, the model should output "I love Ironman Contest". Here the Encoder is responsible for "understanding the input", while the Decoder is responsible for "generating the output".
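For a bird's-eye view before we build the pieces ourselves, here is a minimal sketch using PyTorch's built-in nn.Transformer (the sizes are arbitrary, and random tensors stand in for real embeddings) just to show the encoder-in, decoder-out data flow:
import torch
import torch.nn as nn

model = nn.Transformer(d_model=16, nhead=2, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
src = torch.randn(1, 6, 16)   # stands in for the embedded input sentence (batch, src_len, d_model)
tgt = torch.randn(1, 4, 16)   # stands in for the shifted target sequence (batch, tgt_len, d_model)
out = model(src, tgt)         # Encoder reads src, Decoder produces one vector per target position
print(out.shape)              # torch.Size([1, 4, 16])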
Besides Attention, the Transformer relies on three other very important components:
Attention helps the model capture the relationships between tokens, but its output is essentially just a weighted sum of the value vectors, so in every Encoder/Decoder layer we add an FFN (feed-forward network) to provide a non-linear transformation and let the model learn more complex representations. The FFN is applied to each position in the sequence independently and never mixes information across tokens, which is one of the Transformer's distinctive traits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand -> non-linearity -> dropout -> project back to dim
        return self.fc2(self.dropout(F.relu(self.fc1(x))))
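As a quick sanity check (the sizes below are arbitrary, chosen only for illustration), the FFN maps every position back to the original dimension, so the output shape matches the input shape:
ffn = FFN(dim=8, hidden_dim=32)
x = torch.randn(2, 5, 8)   # (batch=2, seq_len=5, dim=8)
print(ffn(x).shape)        # torch.Size([2, 5, 8]) -- applied position by position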
Before the input enters the Attention and FFN sub-layers, it is first passed through LayerNorm (the Pre-LN arrangement used in the code below). This keeps the value distributions consistent across layers, makes training more stable, speeds up convergence, and helps avoid exploding or vanishing gradients.
LayerNorm differs from BatchNorm: it normalizes each individual sample over the last dimension and does not depend on the batch size, which makes it better suited to NLP (variable sentence lengths, non-fixed batch sizes).
ln = nn.LayerNorm(8)  # assume hidden_dim = 8
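A small sketch (shapes chosen only for illustration, reusing the ln defined above) showing what that normalization does: every position of every sample is normalized on its own over the last dimension, regardless of the batch:
x = torch.randn(2, 4, 8)                 # (batch, seq_len, hidden_dim)
out = ln(x)
print(out.mean(dim=-1))                  # ~0 for every (sample, position)
print(out.std(dim=-1, unbiased=False))   # ~1 for every (sample, position)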
The Residual Connection provides a shortcut so that information and gradients can flow through directly. If a sub-layer (Attention/FFN) fails to learn a useful transformation, the model can at least keep its original input, avoiding the degradation problem that appears as networks get deeper.
class ResidualBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):
        # sub-layer output + original input
        return self.ffn(x) + x
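A short check (with made-up sizes) of the two properties we care about: the block keeps the input shape, and the shortcut alone already carries the input through, so even if the sub-layer contributes nothing the block behaves as an identity:
block = ResidualBlock(dim=8, hidden_dim=32)
x = torch.randn(2, 5, 8)
print(block(x).shape)            # torch.Size([2, 5, 8]) -- shape preserved

# zero out the sub-layer: only the shortcut remains, so the block returns x unchanged
with torch.no_grad():
    for p in block.ffn.parameters():
        p.zero_()
print(torch.equal(block(x), x))  # True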
With these basics in place, we can start building the Transformer Encoder. The Encoder is a stack of several Encoder Layers (we will stack them right after the layer code below), and each Encoder Layer contains:
a. a Multi-Head Self-Attention
b. an FFN
c. a LayerNorm + Residual Connection around each sub-layer
EncoderLayer implementation:
# Feed Forward Network
class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.fc2(self.dropout(F.relu(self.fc1(x))))

# Encoder Layer (Pre-LN)
class EncoderLayer(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_dropout = nn.Dropout(dropout)

        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = FFN(dim, hidden_dim=4 * dim, dropout=dropout)
        self.ffn_dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ---- Multi-Head Self-Attention + Residual ----
        x_norm = self.attn_norm(x)            # normalize once, reuse for Q/K/V
        h, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + self.attn_dropout(h)

        # ---- Feed Forward + Residual ----
        h = self.ffn(self.ffn_norm(x))
        x = x + self.ffn_dropout(h)
        return x
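To go from a single layer to the full Encoder, a minimal sketch (my own stacking code, assuming the imports and EncoderLayer above) simply runs N EncoderLayers in sequence and, as is common in Pre-LN Transformers, applies one final LayerNorm:
class Encoder(nn.Module):
    def __init__(self, dim, num_heads, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderLayer(dim, num_heads, dropout) for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(dim)   # final norm, usual in Pre-LN designs

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)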
Each Decoder layer has one more attention sub-layer than an Encoder layer. The Decoder is likewise a stack of several Decoder Layers, and each Decoder Layer contains (an end-to-end usage sketch, including the causal mask, follows the layer code):
a. Masked Self-Attention (so the model cannot peek at future tokens)
b. Encoder-Decoder Attention (Q comes from the Decoder, K/V come from the Encoder)
c. an FFN
d. a LayerNorm + Residual Connection around each sub-layer
DecoderLayer implementation:
class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.fc2(self.dropout(F.relu(self.fc1(x))))

class DecoderLayer(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        # Masked Self-Attention
        self.self_norm = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_dropout = nn.Dropout(dropout)

        # Encoder-Decoder Attention
        self.cross_norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_dropout = nn.Dropout(dropout)

        # FFN
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = FFN(dim, hidden_dim=4 * dim, dropout=dropout)
        self.ffn_dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None):
        # ---- Masked Self-Attention ----
        x_norm = self.self_norm(x)            # normalize once, reuse for Q/K/V
        h, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=tgt_mask)
        x = x + self.self_dropout(h)

        # ---- Encoder-Decoder Attention (Q from decoder, K/V from encoder output) ----
        h, _ = self.cross_attn(self.cross_norm(x), enc_out, enc_out)
        x = x + self.cross_dropout(h)

        # ---- FFN ----
        h = self.ffn(self.ffn_norm(x))
        x = x + self.ffn_dropout(h)
        return x
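Finally, a hedged end-to-end sketch (the helper name causal_mask, the layer counts, and the tensor sizes are all my own choices for illustration) that ties the pieces together: build a boolean causal mask so each target position can only attend to itself and earlier positions, run the source through a stack of EncoderLayers, then let a stack of DecoderLayers attend to that encoder output:
def causal_mask(size):
    # True above the diagonal = "may not attend" for nn.MultiheadAttention boolean masks
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

dim, num_heads = 8, 2
enc_layers = nn.ModuleList([EncoderLayer(dim, num_heads) for _ in range(2)])
dec_layers = nn.ModuleList([DecoderLayer(dim, num_heads) for _ in range(2)])

src = torch.randn(1, 6, dim)   # stands in for the embedded source sequence (batch=1, src_len=6)
tgt = torch.randn(1, 4, dim)   # stands in for the embedded target sequence (batch=1, tgt_len=4)

enc_out = src
for layer in enc_layers:
    enc_out = layer(enc_out)

out = tgt
mask = causal_mask(tgt.size(1))
for layer in dec_layers:
    out = layer(out, enc_out, tgt_mask=mask)

print(out.shape)               # torch.Size([1, 4, 8]) -- one vector per target position
In a real model the src and tgt tensors would come from an embedding layer rather than torch.randn.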