This is the foundation of every machine learning project, and it is especially critical for LLMs: a model's knowledge and capabilities come entirely from the data it has "read".
The development team crawls massive amounts of text and code from the web. The sources are extremely broad, including:
- Public web pages: Wikipedia, news sites, blogs, forums, and so on
- Books: digitized books covering fiction, non-fiction, textbooks, etc.
- Code: open-source repositories such as GitHub
- Academic papers: preprint sites such as ArXiv

The data typically reaches terabyte (TB) or even petabyte (PB) scale, containing hundreds of billions or even trillions of tokens.
The raw data is full of noise and must be cleaned, for example by deduplicating documents, stripping leftover HTML markup, and filtering out low-quality or harmful content.
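A minimal, hypothetical sketch of such a cleaning pass (the helper name and thresholds are invented for illustration; real pipelines use far stronger heuristics such as MinHash deduplication and learned quality classifiers):

```python
import hashlib, re

def clean_documents(docs):
    """Toy cleaning pass: strip HTML tags, drop near-empty docs, remove exact duplicates."""
    seen, cleaned = set(), []
    for doc in docs:
        text = re.sub(r"<[^>]+>", " ", doc)       # strip leftover HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if len(text) < 50:                        # drop near-empty documents
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                        # exact-duplicate removal
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```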
This is the most time-consuming and compute-intensive stage of the entire training process. The goal is for the model to learn the general rules of language, factual knowledge, and reasoning ability.
Most of today's LLMs use the Transformer architecture. Its core is the self-attention mechanism, which lets the model weigh the importance of every other token in the sequence while processing a given token, yielding a much better understanding of context.
During pretraining the model does not learn from manually labeled data; instead it uses self-supervised learning. The most common objectives are listed here (a tiny next-token illustration follows the list):

- Masked Language Model: randomly mask some tokens in the input sentence and have the model predict what they were. This forces the model to learn the syntactic and semantic relationships between words.
- Next Token Prediction: given a passage of text, have the model predict the most likely next token. This teaches the model to generate fluent, coherent text.
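As a tiny illustration of the next-token objective, assuming a toy list of token ids (the MiniGPT code later in this post implements the same idea with a real loss):

```python
ids = [5, 17, 42, 8, 99]   # toy token ids produced by some tokenizer

# each position's training target is simply the token that follows it
inputs  = ids[:-1]          # [5, 17, 42, 8]
targets = ids[1:]           # [17, 42, 8, 99]

for x, y in zip(inputs, targets):
    print(f"given token {x}, the model is trained to predict {y}")
```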
🟧 Massive Parallel Computing
With parameter counts easily in the hundreds of billions and an equally enormous volume of data, pretraining must run on supercomputer clusters built from thousands of GPUs, typically taking weeks or even months.
To make the base model more useful, safer, and better at understanding and following human instructions, an "alignment" fine-tuning stage is needed. Its goal is to make the model's behavior consistent with human intent and values. This usually involves the following steps:
Goal: teach the model how to follow instructions.
Human annotators (with AI assistance) create a batch of high-quality instruction-response pairs; for example, the instruction "write me a poem about autumn" paired with a poem that meets the request (a sample record follows this step). These pairs are then used to fine-tune the pretrained base model so it learns to generate the right output for a given instruction.
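A hypothetical SFT record in JSONL form, matching the {"instruction", "input", "output"} format used later in this post (the poem is invented for illustration):

```python
import json

sample = {
    "instruction": "幫我寫一首關於秋天的詩",
    "input": "",
    "output": "秋風起兮白雲飛,草木黃落兮雁南歸。",
}
print(json.dumps(sample, ensure_ascii=False))  # one line of sft_train.jsonl
```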
Goal: based on human preferences, further improve the quality, helpfulness, and harmlessness of the model's answers.
Process: this is a more complex step, made up of three sub-stages (a minimal reward-model loss sketch follows this list):

(a) Reward Model Training: have the model generate several different answers to the same instruction. Human annotators then rank the answers from best to worst. This ranking data is used to train a "reward model" that learns to judge which answers humans prefer (high score) and which are poor (low score).

(b) Reinforcement Learning Fine-tuning: use the SFT model as the policy and have it generate new answers. The reward model trained in the previous step scores these answers, and the score serves as the reward signal for reinforcement learning.

(c) Policy Optimization: use a reinforcement learning algorithm (such as PPO) to update the LLM's parameters according to the reward signal, so that the LLM's answers earn higher scores from the reward model.
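A minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train the reward model in step (a), assuming r_chosen and r_rejected are the scalar scores the reward model assigns to the preferred and rejected answers:

```python
import torch
import torch.nn.functional as F

# hypothetical reward-model scores for one preference pair
r_chosen = torch.tensor([1.8])    # score of the human-preferred answer
r_rejected = torch.tensor([0.3])  # score of the rejected answer

# push the chosen score above the rejected one
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())  # small when r_chosen >> r_rejected
```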
Throughout training and fine-tuning, the model must be evaluated continuously to ensure its performance, safety, and reliability.
- Benchmarking: evaluate the model on standard datasets recognized by academia and industry, across tasks such as question answering, reasoning, translation, and code generation
- Red Teaming: a dedicated team (or users) deliberately probes the model with tricky, malicious, or unexpected inputs to find weaknesses and vulnerabilities, especially around safety and ethics
- Iterative improvement: based on evaluation and red-teaming results, the team returns to earlier stages, adjusts the datasets, refines the fine-tuning strategy, then retrains and re-evaluates, iterating continuously

Key libraries for the hands-on part: transformers, datasets, tokenizers or sentencepiece, accelerate, peft, trl
Data formats:

- Pretraining: plain text (one passage per line), or JSONL lines of {"text": "..."}
- SFT: {"instruction": "...", "input": "...", "output": "..."} or {"messages": [{"role": "user", ...}, ...]}
If your corpus is corpus.txt:

```bash
pip install sentencepiece

spm_train --input=corpus.txt \
  --model_prefix=zh_spm --vocab_size=32000 \
  --character_coverage=0.9995 --model_type=bpe
# produces zh_spm.model / zh_spm.vocab
```
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('zh_spm.model')

ids = sp.encode('今天天氣不錯,我們去喝咖啡吧!', out_type=int)  # text -> token ids
text = sp.decode(ids)                                            # ids -> text round-trip
```
This code is teaching-oriented: it can run end-to-end on a small corpus to make the concepts concrete; for real projects, Transformers/PEFT is still recommended.
Loss: causal LM (shifted cross-entropy).

```python
# minimal_gpt.py
import math, torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadSelfAttention(nn.Module):
def __init__(self, d_model: int, n_heads: int, dropout: float = 0.0):
super().__init__()
assert d_model % n_heads == 0
self.d_model = d_model
self.n_heads = n_heads
self.d_head = d_model // n_heads
self.qkv = nn.Linear(d_model, 3 * d_model)
self.o = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
        # lower-triangular (causal) mask so a position cannot attend to future tokens
self.register_buffer("mask", None, persistent=False)
def _causal_mask(self, T):
if self.mask is None or self.mask.size(0) < T:
m = torch.tril(torch.ones(T, T, dtype=torch.bool))
self.mask = m
return self.mask[:T, :T]
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv(x).view(B, T, 3, self.n_heads, self.d_head)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]  # unpack in q, k, v order
        # shape after permute: (B, heads, T, d_head)
q = q.permute(0, 2, 1, 3)
k = k.permute(0, 2, 1, 3)
v = v.permute(0, 2, 1, 3)
att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
mask = self._causal_mask(T).to(att.device)
att = att.masked_fill(~mask, float('-inf'))
att = F.softmax(att, dim=-1)
att = self.dropout(att)
y = att @ v # (B, heads, T, d_head)
y = y.transpose(1, 2).contiguous().view(B, T, C)
return self.o(y)
class FeedForward(nn.Module):
def __init__(self, d_model: int, mult: int = 4, dropout: float = 0.0):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d_model, mult * d_model),
nn.GELU(),
nn.Linear(mult * d_model, d_model),
nn.Dropout(dropout),
)
def forward(self, x):
return self.net(x)
class Block(nn.Module):
def __init__(self, d_model, n_heads, dropout):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = MultiHeadSelfAttention(d_model, n_heads, dropout)
self.ln2 = nn.LayerNorm(d_model)
self.ff = FeedForward(d_model, dropout=dropout)
def forward(self, x):
x = x + self.attn(self.ln1(x))
x = x + self.ff(self.ln2(x))
return x
class MiniGPT(nn.Module):
def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, max_len=1024, dropout=0.1):
super().__init__()
self.tok_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(max_len, d_model)
self.blocks = nn.ModuleList([
Block(d_model, n_heads, dropout) for _ in range(n_layers)
])
self.ln_f = nn.LayerNorm(d_model)
self.head = nn.Linear(d_model, vocab_size, bias=False)
self.max_len = max_len
def forward(self, idx, targets=None):
B, T = idx.shape
assert T <= self.max_len
pos = torch.arange(0, T, device=idx.device).unsqueeze(0)
x = self.tok_emb(idx) + self.pos_emb(pos)
for blk in self.blocks:
x = blk(x)
x = self.ln_f(x)
logits = self.head(x)
loss = None
        if targets is not None:
            # targets are already shifted by one in the training dataset
            # (y = ids[i+1 : i+block+1]), so no extra shift is applied here
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens=64, temperature=1.0, top_k=None):
for _ in range(max_new_tokens):
idx_cond = idx[:, -self.max_len:]
logits, _ = self.forward(idx_cond)
logits = logits[:, -1, :] / max(temperature, 1e-6)
if top_k is not None:
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float('inf')
probs = F.softmax(logits, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
idx = torch.cat([idx, next_id], dim=1)
        return idx
```

```python
# train_minigpt.py
import torch, random
from torch.utils.data import Dataset, DataLoader
from minimal_gpt import MiniGPT
class TextDataset(Dataset):
def __init__(self, ids, block_size=256):
self.ids = ids
self.block = block_size
def __len__(self):
return max(1, len(self.ids) - self.block - 1)
def __getitem__(self, i):
x = torch.tensor(self.ids[i:i+self.block], dtype=torch.long)
y = torch.tensor(self.ids[i+1:i+self.block+1], dtype=torch.long)
return x, y
# assumes you already have encoded ids (e.g., from SentencePiece)
ids = [random.randint(0, 31999) for _ in range(200000)]  # demo: random ids stand in for real data
train_ds = TextDataset(ids, block_size=256)
loader = DataLoader(train_ds, batch_size=16, shuffle=True, drop_last=True)
model = MiniGPT(vocab_size=32000, d_model=384, n_layers=4, n_heads=6, max_len=256, dropout=0.1)
model = model.cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()
for step, (x, y) in enumerate(loader, start=1):
x, y = x.cuda(), y.cuda()
opt.zero_grad(set_to_none=True)
with torch.cuda.amp.autocast(dtype=torch.float16):
_, loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.unscale_(opt)  # unscale gradients before clipping so the norm is measured correctly
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)
    scaler.update()
if step % 100 == 0:
print(f"step {step} loss {loss.item():.4f}")
if step == 2000:
break
# quick generation test
start = torch.randint(0, 32000, (1, 1), device='cuda')
out = model.generate(start, max_new_tokens=50, temperature=0.8, top_k=50)
print(out)
```
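To see actual text rather than ids, decode the output with the same tokenizer used for encoding; a sketch assuming the zh_spm SentencePiece model trained earlier:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('zh_spm.model')
print(sp.decode(out[0].tolist()))  # token ids back to text
```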
Key points: the loss is the next-token cross-entropy; the loop uses AMP mixed precision, gradient clipping, and AdamW.
The same causal-LM objective, using Hugging Face Transformers:

```python
from datasets import load_dataset
from transformers import (
AutoTokenizer, AutoModelForCausalLM,
DataCollatorForLanguageModeling, TrainingArguments, Trainer
)
model_id = "gpt2" # 可換成適合中文的基座,例如 Qwen1.5-0.5B、TinyLlama 等
# 你的中文資料集(例)
dataset = load_dataset("json", data_files={
"train": "train.jsonl", # 每行 {"text": "..."}
"validation": "val.jsonl"
})
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
block_size = 1024
def tokenize_fn(ex):
out = tokenizer(ex["text"], truncation=True, max_length=block_size)
return out
dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset["train"].column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
args = TrainingArguments(
output_dir="./ckpt",
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=2e-5,
num_train_epochs=2,
weight_decay=0.1,
logging_steps=50,
evaluation_strategy="steps",
eval_steps=200,
save_steps=200,
save_total_limit=2,
    bf16=True,                    # if your GPU supports it
lr_scheduler_type="cosine",
warmup_ratio=0.03,
report_to="none",
)
trainer = Trainer(
model=model,
args=args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
data_collator=collator,
)
trainer.train()
# inference
text = "請用三點總結 Transformer 的核心概念:"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
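Validation perplexity falls directly out of the Trainer's eval loss; a small check using the setup above:

```python
import math

metrics = trainer.evaluate()  # evaluates on dataset["validation"]
print("perplexity:", math.exp(metrics["eval_loss"]))
```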
If your data instead looks like {"instruction": "...", "input": "...", "output": "..."}, concatenate it into a single text field before tokenize_fn:

```python
TEMPLATE = """你是助理。\n# 指令\n{instruction}\n# 輸入\n{input}\n# 回覆\n{output}\n"""
def to_text(ex):
inst = ex.get("instruction", "")
inp = ex.get("input", "")
out = ex.get("output", "")
return {"text": TEMPLATE.format(instruction=inst, input=inp, output=out)}
sft_ds = load_dataset("json", data_files={"train": "sft_train.jsonl", "validation": "sft_val.jsonl"})
sft_ds = sft_ds.map(to_text)
sft_ds = sft_ds.map(tokenize_fn, batched=True, remove_columns=sft_ds["train"].column_names)
```

LoRA-based SFT with PEFT:

```python
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
base = "Qwen1.5-0.5B" # 範例,可替換
sft = load_dataset("json", data_files={"train": "sft_train.jsonl", "validation": "sft_val.jsonl"})
tokenizer = AutoTokenizer.from_pretrained(base, use_fast=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
TEMPLATE = "[INST]{instruction}\n{input}[/INST]\n{output}"
def to_text(ex):
return {"text": TEMPLATE.format(**{k: ex.get(k, "") for k in ["instruction","input","output"]})}
def tok(ex):
return tokenizer(ex["text"], truncation=True, max_length=1024)
sft = sft.map(to_text)
sft = sft.map(tok, batched=True, remove_columns=sft["train"].column_names)
model = AutoModelForCausalLM.from_pretrained(base)
lora_cfg = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16, lora_alpha=32, lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # verify only the LoRA weights are trainable
args = TrainingArguments(
output_dir="./lora_ckpt",
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=2e-4,
num_train_epochs=2,
logging_steps=50,
save_steps=200,
evaluation_strategy="no",
bf16=True,
)
trainer = Trainer(
model=model,
args=args,
train_dataset=sft["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # pads batches and builds labels
)
trainer.train()
# optionally merge the LoRA weights (see below), or load the LoRA adapter at inference time
```
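Merging is straightforward with PEFT's merge_and_unload; a sketch assuming the trained model above:

```python
# fold the LoRA deltas into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```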
Tip: if GPU memory is tight, combine 8-bit/4-bit quantization (**bitsandbytes**) with prepare_model_for_kbit_training, as sketched below.
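A minimal sketch of the 4-bit (QLoRA-style) setup, reusing the base and lora_cfg names from the example above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_cfg)
model = prepare_model_for_kbit_training(model)  # casts norms, enables input grads
model = get_peft_model(model, lora_cfg)         # reuse the LoRA config from above
```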
Preference optimization with TRL's DPOTrainer:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
base = "Qwen1.5-0.5B"
prefs = load_dataset("json", data_files={"train": "pref_train.jsonl", "eval": "pref_eval.jsonl"})
# each line: {"prompt": "...", "chosen": "...", "rejected": "..."}
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
args = TrainingArguments(
output_dir="./dpo_ckpt",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=1e-6,
num_train_epochs=1,
bf16=True,
)
trainer = DPOTrainer(  # note: recent TRL versions take a DPOConfig (which carries beta) as args
model,
    ref_model=None,  # a frozen reference model can also be passed explicitly
args=args,
beta=0.1,
train_dataset=prefs["train"],
eval_dataset=prefs["eval"],
tokenizer=tok,
)
trainer.train()
```
ORPO is similar to DPO but folds the preference signal into a single objective, without a separate reference model or explicit KL term; TRL provides an API for it as well, which can be applied analogously (see the sketch below).
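A minimal sketch with TRL's ORPOTrainer, reusing the preference dataset from the DPO example (exact argument names vary across TRL versions, so treat this as a template):

```python
from trl import ORPOConfig, ORPOTrainer

orpo_args = ORPOConfig(
    output_dir="./orpo_ckpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=8e-6,
    num_train_epochs=1,
    beta=0.1,  # weight of the odds-ratio preference term
)
trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=prefs["train"],
    tokenizer=tok,
)
trainer.train()
```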
Evaluation:

- Perplexity: compute exp(loss) over a validation set
- Instruction benchmarks: a self-built instruction set (covering Chinese tasks), with scoring rules (ROUGE/BLEU/exact match) plus human evaluation
- Safety: tests for privilege escalation, privacy leakage, and hallucination

```python
import math, torch
from torch.utils.data import DataLoader
@torch.no_grad()
def eval_ppl(model, dataset, batch_size=2):
model.eval()
dl = DataLoader(dataset, batch_size=batch_size)
losses, n = 0.0, 0
    device = next(model.parameters()).device  # plain nn.Modules have no .device attribute
    for x, y in dl:
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)
losses += loss.item()
n += 1
    return math.exp(losses / max(1, n))
```

- `accelerate config && accelerate launch train.py`: launches distributed training automatically
- Trainer/accelerate integrate directly; relevant TrainingArguments settings:

```python
gradient_checkpointing=True,
fp16=False, bf16=True,
optim="adamw_torch",
deepspeed="ds_config.json", # 若使用 DeepSpeed
ds_config.json
範例(簡化)
{
"zero_optimization": {"stage": 2},
"gradient_accumulation_steps": 16,
"train_batch_size": 32,
"bf16": {"enabled": true}
}
```
Practical tips:

- Data quality matters more than model size: deduplicate, denoise, normalize punctuation and full-width/half-width characters
- Sequence length: for long Chinese documents, raise block_size and pair it with FlashAttention or another efficient-attention library (in practice)
- Learning rate: roughly 1e-3 to 3e-4 for small models, 2e-5 to 5e-6 for fine-tuning large models; use warmup (3%-5%)
- Drift/collapse: if the loss suddenly spikes, lower the LR, increase weight decay, and enable gradient clipping
- Overfitting: early stopping, more data, regularization (dropout/weight decay)
- Tokenizer mismatch: when switching base models, always re-process the data with the matching tokenizer
- Quantization: 8-bit/4-bit (bnb), GPTQ, AWQ; mind the precision-throughput trade-off
- Serving: vLLM, TGI, Text-Generation-WebUI; request batching and KV cache
- Monitoring: latency, throughput, failure rate; offline/online evaluation and feedback-data collection

A successful language model is the combination of vast knowledge (pretraining) and careful guidance (alignment). It is not merely a statistical word-prediction machine but a complex system, carefully crafted, that collaborates with and serves humans.