On Day4 we got to know the attention mechanism, and on Day5 we went a step further and saw how Self-Attention, Masked Self-Attention, and Multi-Head Attention differ. So how do these modules fit together to build a complete Transformer? The answer is the Encoder-Decoder architecture!
The Transformer was originally designed to solve Seq2Seq (sequence-to-sequence) problems such as translation: given the input 「我愛鐵人賽」, the model should output "I love Ironman Contest". Here the Encoder is responsible for "understanding the input", while the Decoder is responsible for "generating the output".
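For a bird's-eye view before we build the pieces ourselves, here is a minimal sketch using PyTorch's built-in nn.Transformer (the sizes are arbitrary, and random tensors stand in for real embeddings) just to show the encoder-in, decoder-out data flow:
import torch
import torch.nn as nn

model = nn.Transformer(d_model=16, nhead=2, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
src = torch.randn(1, 6, 16)   # stands in for the embedded input sentence (batch, src_len, d_model)
tgt = torch.randn(1, 4, 16)   # stands in for the shifted target sequence (batch, tgt_len, d_model)
out = model(src, tgt)         # Encoder reads src, Decoder produces one vector per target position
print(out.shape)              # torch.Size([1, 4, 16])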
Besides Attention, the Transformer relies on three other very important components:
Attention helps the model capture the relationships between tokens, but its output is essentially just a weighted sum of the value vectors, so in every Encoder/Decoder layer we add an FFN (feed-forward network) to provide a non-linear transformation and let the model learn more complex representations. The FFN is applied to each position in the sequence independently and never mixes information across tokens, which is one of the Transformer's distinctive traits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand -> non-linearity -> dropout -> project back to dim
        return self.fc2(self.dropout(F.relu(self.fc1(x))))
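As a quick sanity check (the sizes below are arbitrary, chosen only for illustration), the FFN maps every position back to the original dimension, so the output shape matches the input shape:
ffn = FFN(dim=8, hidden_dim=32)
x = torch.randn(2, 5, 8)   # (batch=2, seq_len=5, dim=8)
print(ffn(x).shape)        # torch.Size([2, 5, 8]) -- applied position by position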
Before the input enters the Attention and FFN sub-layers, it is first passed through LayerNorm (the Pre-LN arrangement used in the code below). This keeps the value distributions consistent across layers, makes training more stable, speeds up convergence, and helps avoid exploding or vanishing gradients.
LayerNorm differs from BatchNorm: it normalizes each individual sample over the last dimension and does not depend on the batch size, which makes it better suited to NLP (variable sentence lengths, non-fixed batch sizes).
ln = nn.LayerNorm(8)  # assume hidden_dim = 8
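A small sketch (shapes chosen only for illustration, reusing the ln defined above) showing what that normalization does: every position of every sample is normalized on its own over the last dimension, regardless of the batch:
x = torch.randn(2, 4, 8)                 # (batch, seq_len, hidden_dim)
out = ln(x)
print(out.mean(dim=-1))                  # ~0 for every (sample, position)
print(out.std(dim=-1, unbiased=False))   # ~1 for every (sample, position)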
The Residual Connection provides a shortcut so that information and gradients can flow through directly. If a sub-layer (Attention/FFN) fails to learn a useful transformation, the model can at least keep its original input, avoiding the degradation problem that appears as networks get deeper.
class ResidualBlock(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)
        )

    def forward(self, x):
        # sub-layer output + original input
        return self.ffn(x) + x
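A short check (with made-up sizes) of the two properties we care about: the block keeps the input shape, and the shortcut alone already carries the input through, so even if the sub-layer contributes nothing the block behaves as an identity:
block = ResidualBlock(dim=8, hidden_dim=32)
x = torch.randn(2, 5, 8)
print(block(x).shape)            # torch.Size([2, 5, 8]) -- shape preserved

# zero out the sub-layer: only the shortcut remains, so the block returns x unchanged
with torch.no_grad():
    for p in block.ffn.parameters():
        p.zero_()
print(torch.equal(block(x), x))  # True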
With these basics in place, we can start building the Transformer Encoder. The Encoder is a stack of several Encoder Layers (we will stack them right after the layer code below), and each Encoder Layer contains:
a. a Multi-Head Self-Attention
b. an FFN
c. a LayerNorm + Residual Connection around each sub-layer
EncoderLayer implementation:
# Feed Forward Network
class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.fc2(self.dropout(F.relu(self.fc1(x))))

# Encoder Layer (Pre-LN)
class EncoderLayer(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_dropout = nn.Dropout(dropout)

        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = FFN(dim, hidden_dim=4 * dim, dropout=dropout)
        self.ffn_dropout = nn.Dropout(dropout)

    def forward(self, x):
        # ---- Multi-Head Self-Attention + Residual ----
        x_norm = self.attn_norm(x)            # normalize once, reuse for Q/K/V
        h, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + self.attn_dropout(h)

        # ---- Feed Forward + Residual ----
        h = self.ffn(self.ffn_norm(x))
        x = x + self.ffn_dropout(h)
        return x
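To go from a single layer to the full Encoder, a minimal sketch (my own stacking code, assuming the imports and EncoderLayer above) simply runs N EncoderLayers in sequence and, as is common in Pre-LN Transformers, applies one final LayerNorm:
class Encoder(nn.Module):
    def __init__(self, dim, num_heads, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderLayer(dim, num_heads, dropout) for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(dim)   # final norm, usual in Pre-LN designs

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)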
Each Decoder layer has one more attention sub-layer than an Encoder layer. The Decoder is likewise a stack of several Decoder Layers, and each Decoder Layer contains (an end-to-end usage sketch, including the causal mask, follows the layer code):
a. Masked Self-Attention (so the model cannot peek at future tokens)
b. Encoder-Decoder Attention (Q comes from the Decoder, K/V come from the Encoder)
c. an FFN
d. a LayerNorm + Residual Connection around each sub-layer
DecoderLayer implementation:
class FFN(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.fc2(self.dropout(F.relu(self.fc1(x))))

class DecoderLayer(nn.Module):
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        # Masked Self-Attention
        self.self_norm = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_dropout = nn.Dropout(dropout)

        # Encoder-Decoder Attention
        self.cross_norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_dropout = nn.Dropout(dropout)

        # FFN
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = FFN(dim, hidden_dim=4 * dim, dropout=dropout)
        self.ffn_dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None):
        # ---- Masked Self-Attention ----
        x_norm = self.self_norm(x)            # normalize once, reuse for Q/K/V
        h, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=tgt_mask)
        x = x + self.self_dropout(h)

        # ---- Encoder-Decoder Attention (Q from decoder, K/V from encoder output) ----
        h, _ = self.cross_attn(self.cross_norm(x), enc_out, enc_out)
        x = x + self.cross_dropout(h)

        # ---- FFN ----
        h = self.ffn(self.ffn_norm(x))
        x = x + self.ffn_dropout(h)
        return x
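Finally, a hedged end-to-end sketch (the helper name causal_mask, the layer counts, and the tensor sizes are all my own choices for illustration) that ties the pieces together: build a boolean causal mask so each target position can only attend to itself and earlier positions, run the source through a stack of EncoderLayers, then let a stack of DecoderLayers attend to that encoder output:
def causal_mask(size):
    # True above the diagonal = "may not attend" for nn.MultiheadAttention boolean masks
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

dim, num_heads = 8, 2
enc_layers = nn.ModuleList([EncoderLayer(dim, num_heads) for _ in range(2)])
dec_layers = nn.ModuleList([DecoderLayer(dim, num_heads) for _ in range(2)])

src = torch.randn(1, 6, dim)   # stands in for the embedded source sequence (batch=1, src_len=6)
tgt = torch.randn(1, 4, dim)   # stands in for the embedded target sequence (batch=1, tgt_len=4)

enc_out = src
for layer in enc_layers:
    enc_out = layer(enc_out)

out = tgt
mask = causal_mask(tgt.size(1))
for layer in dec_layers:
    out = layer(out, enc_out, tgt_mask=mask)

print(out.shape)               # torch.Size([1, 4, 8]) -- one vector per target position
In a real model the src and tgt tensors would come from an embedding layer rather than torch.randn.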