2025 iThome 鐵人賽, 學習 LLM series, Day 11
Day 11: Preparing a Chinese Dataset (2)

6. Convert the data into a Hugging Face Dataset

  • Sentiment (CSV → HF dataset)
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("sentiment.csv")  # id,text,label
train_val, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)
train, val = train_test_split(train_val, test_size=0.1111, stratify=train_val["label"], random_state=42)  # -> 0.8/0.1/0.1
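The splits above are still pandas DataFrames; a minimal sketch of wrapping them into a Hugging Face DatasetDict with Dataset.from_pandas (variable names as in the snippet above) might look like this:
from datasets import Dataset, DatasetDict

# Wrap the three pandas splits into one DatasetDict so the usual .map()/.filter() API applies
dataset = DatasetDict({
    "train": Dataset.from_pandas(train.reset_index(drop=True)),
    "validation": Dataset.from_pandas(val.reset_index(drop=True)),
    "test": Dataset.from_pandas(test.reset_index(drop=True)),
})
print(dataset)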
  • FAQ documents (JSONL, one document per line)
from datasets import load_dataset
# CSV splits can be loaded directly:
data = load_dataset('csv', data_files={'train':'train.csv','validation':'val.csv','test':'test.csv'})
# For JSONL documents (one JSON object per line), use the 'json' loader instead:
# data = load_dataset('json', data_files={'train':'train.jsonl','validation':'val.jsonl','test':'test.jsonl'})
print(data)

7. Tokenize / chunk / store embeddings (for RAG or retrieval)

  1. Tokenize / preprocess
from transformers import AutoTokenizer

model_name = "bert-base-chinese"  # or a Chinese DistilBERT / RoBERTa variant
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_fn(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized = data.map(preprocess_fn, batched=True)
# Encode labels as integers (if the label column is stored as strings)
def label_fn(example):
    example["label"] = int(example["label"])  # works for numeric strings; for text labels see the mapping sketch below
    return example
tokenized = tokenized.map(label_fn)
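If the labels are text rather than numeric strings, int(...) will raise a ValueError. A minimal sketch, assuming hypothetical label names "negative"/"neutral"/"positive", maps them through a label2id dict instead:
# Assumed label names; replace with whatever actually appears in your dataset
label2id = {"negative": 0, "neutral": 1, "positive": 2}

def encode_label_fn(example):
    example["label"] = label2id[example["label"]]
    return example

tokenized = tokenized.map(encode_label_fn)  # use this instead of label_fn when labels are text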
  2. FAQ / RAG specifics: chunking and metadata
  • RAG documents usually need to be chunked by token count, with some overlap (e.g. chunk_size=512, overlap=50)
  • Save the chunks as JSONL, with each chunk carrying id, text, title, meta (see the sketch after the code below)
    Split by token count with the tokenizer:
def chunk_text(text, tokenizer, chunk_size=512, overlap=50):
    # Tokenize once, then slide a window of chunk_size tokens with the given overlap
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for i in range(0, len(ids), chunk_size - overlap):
        chunk_ids = ids[i:i + chunk_size]
        decoded = tokenizer.decode(chunk_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        chunks.append(decoded)
    return chunks

# Usage example
chunks = chunk_text(long_doc_text, tokenizer, chunk_size=512, overlap=50)
# Then save each chunk to JSONL, tagged with the original doc id + chunk index
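A minimal sketch of that last step; the file name faq_chunks.jsonl, the doc id doc001, and the title string are placeholders, and only chunks comes from the snippet above:
import json

doc_id = "doc001"  # placeholder id of the source document
with open("faq_chunks.jsonl", "w", encoding="utf-8") as f:
    for idx, chunk in enumerate(chunks):
        record = {
            "id": f"{doc_id}-{idx}",        # original doc id + chunk index
            "text": chunk,
            "title": "FAQ document title",  # placeholder metadata
            "meta": {"doc_id": doc_id, "chunk_index": idx},
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")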
  3. Create embeddings (sentence-transformers) and store them
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
docs = [d["text"] for d in docs_list]  # docs_list holds the chunked documents
embeddings = embedder.encode(docs, convert_to_numpy=True, show_progress_bar=True)

# Save to disk
np.save("doc_embeddings.npy", embeddings)
# Each embedding plus its doc metadata can also be written to sqlite/json or uploaded to a vector DB (FAISS/Chroma/Weaviate)
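As one concrete option, a minimal FAISS sketch (reusing embedder from the block above; the query string is just an example) can index the chunk embeddings and retrieve the top-k most similar chunks:
import faiss
import numpy as np

embeddings = np.load("doc_embeddings.npy").astype("float32")
faiss.normalize_L2(embeddings)                  # normalize so inner product behaves like cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])  # flat (exact) inner-product index
index.add(embeddings)

# Embed a query with the same model, then fetch the 5 most similar chunks
query = embedder.encode(["如何申請退貨?"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, idxs = index.search(query, 5)           # idxs maps back into docs / docs_list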

8. Evaluation

  • Sentiment classification: Accuracy, Precision, Recall, F1 (macro/micro), confusion matrix
  • FAQ/retrieval: Recall@k (whether the gold passage appears in the top-k), MAP, MRR (see the sketch after the code below)
  • QA (span): EM (exact match), F1 (token-level)
from sklearn.metrics import classification_report
y_true = [...]  # gold labels
y_pred = [...]  # model predictions
print(classification_report(y_true, y_pred, digits=4))
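Retrieval metrics are also easy to compute by hand; a minimal sketch of Recall@k and MRR over ranked retrieval results (the ids and gold answers below are made-up toy data):
def recall_at_k(ranked_ids, gold_id, k):
    # 1 if the gold passage id appears in the top-k retrieved ids, else 0
    return int(gold_id in ranked_ids[:k])

def mean_reciprocal_rank(all_ranked_ids, all_gold_ids):
    # Reciprocal rank of the gold passage per query (0 if not retrieved), averaged over queries
    total = 0.0
    for ranked_ids, gold_id in zip(all_ranked_ids, all_gold_ids):
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(all_gold_ids)

# Toy example: two queries, each with a ranked list of retrieved chunk ids
ranked = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
gold = ["d1", "d9"]
print(sum(recall_at_k(r, g, k=3) for r, g in zip(ranked, gold)) / len(gold))  # Recall@3 = 1.0
print(mean_reciprocal_rank(ranked, gold))  # MRR = (1/2 + 1/3) / 2 ≈ 0.42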
