2025 iThome 鐵人賽, 學習 LLM series, Day 11
Day 11: Preparing a Chinese Dataset (2)

6. Convert the data into a Hugging Face Dataset

  • Sentiment (CSV → HF dataset)
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("sentiment.csv")  # id,text,label
train_val, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)
train, val = train_test_split(train_val, test_size=0.1111, stratify=train_val["label"], random_state=42)  # -> 0.8/0.1/0.1
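The splits above are still pandas DataFrames; a minimal sketch of wrapping them into a Hugging Face DatasetDict with Dataset.from_pandas (variable names as in the snippet above) might look like this:
from datasets import Dataset, DatasetDict

# Wrap the three pandas splits into one DatasetDict so the usual .map()/.filter() API applies
dataset = DatasetDict({
    "train": Dataset.from_pandas(train.reset_index(drop=True)),
    "validation": Dataset.from_pandas(val.reset_index(drop=True)),
    "test": Dataset.from_pandas(test.reset_index(drop=True)),
})
print(dataset)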
  • FAQ documents (JSONL, one document per line)
from datasets import load_dataset
# CSV splits can be loaded directly:
data = load_dataset('csv', data_files={'train':'train.csv','validation':'val.csv','test':'test.csv'})
# For JSONL documents (one JSON object per line), use the 'json' loader instead:
# data = load_dataset('json', data_files={'train':'train.jsonl','validation':'val.jsonl','test':'test.jsonl'})
print(data)

7. Tokenize / chunk / store embeddings (for RAG or retrieval)

  1. Tokenize / preprocess
from transformers import AutoTokenizer

model_name = "bert-base-chinese"  # or a Chinese DistilBERT / RoBERTa variant
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized = data.map(preprocess_fn, batched=True)
# Encode labels as integers (if the label column is stored as strings)
def label_fn(example):
    example["label"] = int(example["label"])  # works for numeric strings; for text labels see the mapping sketch below
    return example
tokenized = tokenized.map(label_fn)
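If the labels are text rather than numeric strings, int(...) will raise a ValueError. A minimal sketch, assuming hypothetical label names "negative"/"neutral"/"positive", maps them through a label2id dict instead:
# Assumed label names; replace with whatever actually appears in your dataset
label2id = {"negative": 0, "neutral": 1, "positive": 2}

def encode_label_fn(example):
    example["label"] = label2id[example["label"]]
    return example

tokenized = tokenized.map(encode_label_fn)  # use this instead of label_fn when labels are text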
  2. FAQ / RAG specifics: chunking and metadata
  • RAG documents usually need to be chunked by token count, with some overlap (e.g. chunk_size=512, overlap=50)
  • Save the chunks as JSONL, with each chunk carrying id, text, title, meta (see the sketch after the code below)
    Split by token count with the tokenizer:
def chunk_text(text, tokenizer, chunk_size=512, overlap=50):
    # Tokenize once, then slide a window of chunk_size tokens with the given overlap
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for i in range(0, len(ids), chunk_size - overlap):
        chunk_ids = ids[i:i + chunk_size]
        decoded = tokenizer.decode(chunk_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        chunks.append(decoded)
    return chunks

# Usage example
chunks = chunk_text(long_doc_text, tokenizer, chunk_size=512, overlap=50)
# Then save each chunk to JSONL, tagged with the original doc id + chunk index
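A minimal sketch of that last step; the file name faq_chunks.jsonl, the doc id doc001, and the title string are placeholders, and only chunks comes from the snippet above:
import json

doc_id = "doc001"  # placeholder id of the source document
with open("faq_chunks.jsonl", "w", encoding="utf-8") as f:
    for idx, chunk in enumerate(chunks):
        record = {
            "id": f"{doc_id}-{idx}",        # original doc id + chunk index
            "text": chunk,
            "title": "FAQ document title",  # placeholder metadata
            "meta": {"doc_id": doc_id, "chunk_index": idx},
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")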
  3. Create embeddings (sentence-transformers) and store them
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
docs = [d["text"] for d in docs_list]  # docs_list holds the chunked documents
embeddings = embedder.encode(docs, convert_to_numpy=True, show_progress_bar=True)

# Save to disk
np.save("doc_embeddings.npy", embeddings)
# Each embedding plus its doc metadata can also be written to sqlite/json or uploaded to a vector DB (FAISS/Chroma/Weaviate)
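As one concrete option, a minimal FAISS sketch (reusing embedder from the block above; the query string is just an example) can index the chunk embeddings and retrieve the top-k most similar chunks:
import faiss
import numpy as np

embeddings = np.load("doc_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)                  # normalize so inner product behaves like cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])  # flat (exact) inner-product index
index.add(embeddings)

# Embed a query with the same model, then fetch the 5 most similar chunks
query = embedder.encode(["如何申請退貨?"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, idxs = index.search(query, 5)           # idxs maps back into docs / docs_list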

8. Evaluation

  • Sentiment classification: Accuracy, Precision, Recall, F1 (macro/micro), confusion matrix
  • FAQ/retrieval: Recall@k (whether the gold passage appears in the top-k), MAP, MRR (see the sketch after the code below)
  • QA (span): EM (exact match), F1 (token-level)
from sklearn.metrics import classification_report
y_true = [...]  # gold labels
y_pred = [...]  # model predictions
print(classification_report(y_true, y_pred, digits=4))
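Retrieval metrics are also easy to compute by hand; a minimal sketch of Recall@k and MRR over ranked retrieval results (the ids and gold answers below are made-up toy data):
def recall_at_k(ranked_ids, gold_id, k):
    # 1 if the gold passage id appears in the top-k retrieved ids, else 0
    return int(gold_id in ranked_ids[:k])

def mean_reciprocal_rank(all_ranked_ids, all_gold_ids):
    # Reciprocal rank of the gold passage per query (0 if not retrieved), averaged over queries
    total = 0.0
    for ranked_ids, gold_id in zip(all_ranked_ids, all_gold_ids):
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(all_gold_ids)

# Toy example: two queries, each with a ranked list of retrieved chunk ids
ranked = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
gold = ["d1", "d9"]
print(sum(recall_at_k(r, g, k=3) for r, g in zip(ranked, gold)) / len(gold))  # Recall@3 = 1.0
print(mean_reciprocal_rank(ranked, gold))  # MRR = (1/2 + 1/3) / 2 ≈ 0.42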
