AI咒術迴戰~LLM絕對領域展開 series, Part 5

Day 5 - How LLMs Work (Gojo Satoru)

https://ithelp.ithome.com.tw/upload/images/20250802/20163257B5kW5eOXdx.jpg

How LLMs Work

The core of a large language model (LLM) is the Transformer architecture; understanding how the Transformer works is the key to understanding LLMs.

🔴Transformer Architecture

  • Embedding layer: converts the input text into vector representations the computer can work with (word embeddings); a small sketch follows the architecture figure below
  • Encoder: responsible for understanding the semantics of the input sequence
  • Decoder: responsible for generating the target sequence
  • Self-Attention mechanism: lets the model attend to the relationships between words at different positions in the input sequence
  • Multi-Head Attention: runs multiple attention mechanisms in parallel so the model can attend to different kinds of relationships
  • Feed-Forward Network: applies a non-linear transformation to each word's vector representation
  • Normalization: helps the model converge faster and more reliably

https://ithelp.ithome.com.tw/upload/images/20250802/20163257ajiekUAzE2.png
Image source: https://aiml.com/explain-the-transformer-architecture/
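To make the Embedding layer concrete, here is a minimal sketch (the vocabulary size and dimensions are arbitrary illustration values, not those of the model later in this post): tf.keras.layers.Embedding maps integer token ids to dense vectors.

import tensorflow as tf

# Minimal sketch: token ids -> dense vectors
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)

token_ids = tf.constant([[5, 42, 7]])   # a "sentence" of 3 token ids
vectors = embedding(token_ids)
print(vectors.shape)                    # (1, 3, 8): one 8-dimensional vector per token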

🟠Self-Attention Mechanism

The core of the Transformer architecture: it allows the model, while processing one word, to take the influence of every other word in the sequence into account.

  • Query (Q), Key (K), Value (V): for every word in the input sequence, the model computes three vectors: a Query, a Key, and a Value
  • Computing attention scores: each Query is compared against all Keys to measure similarity (usually with a dot product), producing the attention scores
  • Weighted sum: the attention scores are applied to the Values to obtain a weighted representation of each word (a minimal worked example follows this list)
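The three steps above can be written out directly. The sketch below uses tiny, arbitrary dimensions and untrained Dense projections, so the numbers are only illustrative; it is not the layer defined later in this post.

import tensorflow as tf

# A batch of 1 sequence with 3 tokens, embedding dimension 4
x = tf.random.uniform((1, 3, 4))

# Q, K, V come from (here untrained) linear projections of the input
wq, wk, wv = tf.keras.layers.Dense(4), tf.keras.layers.Dense(4), tf.keras.layers.Dense(4)
q, k, v = wq(x), wk(x), wv(x)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(4.0)
weights = tf.nn.softmax(scores, axis=-1)   # each row sums to 1
output = tf.matmul(weights, v)             # weighted sum of the Values
print(weights.shape, output.shape)         # (1, 3, 3) (1, 3, 4)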

🟡Multi-Head Attention

An extension of Self-Attention: it runs several independent attention mechanisms (each with its own Query, Key, and Value matrices) so the model can attend to different kinds of relationships in the input sequence, and then concatenates the outputs of all the heads.
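Keras also ships a ready-made tf.keras.layers.MultiHeadAttention layer; the short sketch below (with arbitrary sizes) shows how self-attention is obtained by passing the same tensor as both query and value, and how each head produces its own attention-score matrix.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)

x = tf.random.uniform((1, 10, 32))      # (batch, seq_len, embed_dim)
out, attn_scores = mha(x, x, return_attention_scores=True)
print(out.shape)                        # (1, 10, 32)
print(attn_scores.shape)                # (1, 2, 10, 10): one score matrix per head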

🟢Encoder and Decoder

  • Encoder: processes the input sequence and turns it into a vector representation that captures its semantics. It is usually built from a stack of identical layers, each containing Multi-Head Attention and a Feed-Forward Network
  • Decoder: generates the target sequence. It is also a stack of identical layers, each containing Masked Multi-Head Attention (which prevents the model from seeing future words while generating; see the mask sketch after this list), Encoder-Decoder Attention (which lets the Decoder attend to the Encoder's output), and a Feed-Forward Network
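The example model later in this post only stacks Encoder blocks. For reference, the look-ahead (causal) mask used by Masked Multi-Head Attention can be sketched as follows; this is a standalone illustration, not part of the example model.

import tensorflow as tf

def causal_mask(seq_len):
    # Lower-triangular matrix: position i may only attend to positions <= i
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

print(causal_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
# Entries where the mask is 0 are set to a large negative value before the
# softmax, so those attention weights become (effectively) 0.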

Code (Python, using TensorFlow/Keras)

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Embedding

# Multi-head self-attention layer (scaled dot-product attention split across heads)
class SelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(SelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)

    def attention(self, query, key, value):
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output

    def separate_heads(self, x, batch_size):
        # Reshape (batch, seq, embed_dim) -> (batch, num_heads, seq, projection_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        query = self.separate_heads(query, batch_size)
        key = self.separate_heads(key, batch_size)
        value = self.separate_heads(value, batch_size)
        attention = self.attention(query, key, value)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)
        return output

# A simple Transformer Encoder layer
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = SelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential(
            [Dense(ff_dim, activation="relu"), Dense(embed_dim),]
        )
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# A simple Transformer model
class TransformerModel(tf.keras.Model):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, vocab_size, maxlen, rate=0.1):
        super(TransformerModel, self).__init__()
        self.embedding = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        # Learned positional embeddings: one vector for each position up to maxlen
        self.pos_embedding = Embedding(input_dim=maxlen, output_dim=embed_dim)
        self.transformer_blocks = [
            TransformerBlock(embed_dim, num_heads, ff_dim, rate) for _ in range(num_layers)
        ]
        self.dense = Dense(vocab_size)

    def call(self, inputs, training):
        x = self.embedding(inputs)
        positions = tf.range(start=0, limit=tf.shape(inputs)[1], delta=1)
        x = x + self.pos_embedding(positions)
        for block in self.transformer_blocks:
            x = block(x, training=training)
        x = self.dense(x)
        return x

# Example usage
vocab_size = 10000
maxlen = 100
embed_dim = 32
num_heads = 2
ff_dim = 32
num_layers = 2

model = TransformerModel(num_layers, embed_dim, num_heads, ff_dim, vocab_size, maxlen)
inputs = tf.random.uniform((1, maxlen), minval=0, maxval=vocab_size, dtype=tf.int32)
output = model(inputs, training=False)
print(output.shape)  # (1, 100, 10000): one logit per vocabulary entry at every position

Code Explanation

  • Defines the SelfAttention layer, which implements the (multi-head) self-attention mechanism
  • Defines the TransformerBlock layer, which combines self-attention, a feed-forward network, and normalization
  • Defines the TransformerModel class, which combines the Embedding layer, the positional embedding layer, and a stack of TransformerBlock layers
  • Example usage: builds a small Transformer model and runs random input data through it as a test (see the note after this list)
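As a quick follow-up (continuing from the output variable in the example above), the logits can be turned into greedy token predictions with an argmax over the vocabulary axis; since the model is untrained, the predicted ids are meaningless and only illustrate the output format.

predicted_ids = tf.argmax(output, axis=-1)   # (1, 100): one token id per position
print(predicted_ids.shape)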
