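**Why prepare your own dataset:**
Specialized domains and task-specific work need a dedicated Chinese dataset. English data is comparatively plentiful, but Chinese data often has to be prepared yourself.
Workflow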
1. Data sources and collection
Sentiment analysis (CSV)

    id,text,label
    1,"這台手機很棒,電池很耐用",1
    2,"介面亂七八糟,功能有 bug",0
    3,"還可以,不過價格稍高",2

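A file in this layout can be read with Python's standard `csv` module. The following is a minimal sketch rather than a required loader: the inline `SAMPLE` string stands in for a real file, and the helper name `load_sentiment_rows` is hypothetical. It casts `id` and `label` to integers and leaves the meaning of each label value to your own annotation scheme.

```python
import csv
import io

# Inline sample standing in for a real sentiment.csv file (assumption for illustration).
SAMPLE = '''id,text,label
1,"這台手機很棒,電池很耐用",1
2,"介面亂七八糟,功能有 bug",0
3,"還可以,不過價格稍高",2
'''


def load_sentiment_rows(handle):
    """Parse id,text,label rows from an open text handle (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(
            {
                "id": int(record["id"]),
                "text": record["text"],
                "label": int(record["label"]),
            }
        )
    return rows


rows = load_sentiment_rows(io.StringIO(SAMPLE))
print(len(rows), [r["label"] for r in rows])
```

For a file on disk, opening it with `encoding="utf-8"` and `newline=""` before passing the handle in is the usual choice for Chinese text.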
FAQ

    {"id":"doc_001","title":"退貨政策","text":"本商店的退貨政策是...","meta":{"source":"helpcenter","category":"shipping"}}

    {
      "version": "1.0",
      "data": [
        {
          "title": "退貨政策",
          "paragraphs": [
            {
              "context": "我們的退貨政策是 ...",
              "qas": [
                {"id": "q1", "question": "如何申請退貨?", "answers": [{"text": "填寫退貨表單", "answer_start": 12}], "is_impossible": false}
              ]
            }
          ]
        }
      ]
    }

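The nested record above resembles the SQuAD v2.0 JSON layout, where `answer_start` is a character offset into `context`. A quick consistency check can catch misaligned annotations early. This is a sketch under assumptions: `find_misaligned_answers` is a hypothetical helper, and the small `example` dataset is made up for illustration because the context in the record above is elided.

```python
def find_misaligned_answers(dataset):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible"):
                    continue  # unanswerable questions carry no span to verify
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
                        break
    return bad_ids


# Made-up data for illustration only; offsets count Unicode characters, not bytes.
example = {
    "version": "1.0",
    "data": [
        {
            "title": "退貨政策",
            "paragraphs": [
                {
                    "context": "如需退貨,請先填寫退貨表單。",
                    "qas": [
                        {
                            "id": "aligned",
                            "question": "如何申請退貨?",
                            "answers": [{"text": "填寫退貨表單", "answer_start": 7}],
                            "is_impossible": False,
                        },
                        {
                            "id": "shifted",
                            "question": "如何申請退貨?",
                            "answers": [{"text": "填寫退貨表單", "answer_start": 9}],
                            "is_impossible": False,
                        },
                    ],
                }
            ],
        }
    ],
}

misaligned = find_misaligned_answers(example)
print(misaligned)
```

Running the check after every cleaning pass is worthwhile, since any step that changes the context length (whitespace collapsing, tag removal) shifts the offsets.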

    id,question,answer,topic
    q1,"如何退款?","請聯絡客服並提供訂單編號。","payment"

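- Remove HTML tags, URLs, and email addresses.
- Convert between full-width and half-width characters, and normalize numbers and units (1000 → 1000 or 一千, depending on the case).
- Convert between Simplified and Traditional Chinese (decide which one to standardize on); OpenCC can do this.
- Remove or keep emoji, depending on the task; sentiment analysis can keep them.
- Remove extra whitespace and duplicate sentences (deduplicate).
- Enforce a minimum and maximum length; text that is too short or too long can be filtered out or chunked.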
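Several of these steps need only the standard library. The sketch below covers width normalization, deduplication, length filtering, and chunking. The function names and the thresholds (2 and 512 characters) are placeholders chosen for illustration, not values from this guide. Note that NFKC also folds full-width punctuation such as `,` into its ASCII form, so skip it if you want to keep Chinese punctuation unchanged.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin letters and digits into half-width forms via NFKC."""
    return unicodedata.normalize("NFKC", text)


def dedupe_and_filter(texts, min_len=2, max_len=512):
    """Normalize, drop out-of-range lengths, and drop exact duplicates (order kept)."""
    seen = set()
    kept = []
    for raw in texts:
        text = normalize_width(raw).strip()
        if not (min_len <= len(text) <= max_len):
            continue
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def chunk_text(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


samples = ["好", "ABC 很棒", "ABC 很棒", "x" * 600]
cleaned = dedupe_and_filter(samples)
print(cleaned)
print(chunk_text("abcdefg", 3))
```

Exact-match deduplication only catches identical strings after normalization; near-duplicates would need a fuzzier comparison, which is outside this sketch.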
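
    import re

    # pip install opencc-python-reimplemented  # needed for Simplified/Traditional conversion
    from opencc import OpenCC

    cc = OpenCC('t2s')  # Traditional -> Simplified; use 's2t' for Simplified -> Traditional

    def clean_text(text, to_simplified=False):
        if not isinstance(text, str):
            return ""
        # Strip HTML tags
        text = re.sub(r"<[^>]+>", " ", text)
        # Strip URLs; match ASCII only, because Chinese text has no spaces
        # and \S+ would swallow the rest of the sentence
        text = re.sub(r"(?:https?://|www\.)[\x21-\x7e]+", " ", text)
        # Strip email addresses, ASCII only for the same reason
        text = re.sub(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", " ", text)
        # Convert Traditional to Simplified if requested
        if to_simplified:
            text = cc.convert(text)
        # Collapse repeated whitespace
        text = re.sub(r"\s+", " ", text).strip()
        return text
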
4. Annotation workflow and tools

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("sentiment.csv")  # columns: id,text,label

    # Hold out 10% as the test set, stratified by label
    train_val, test = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)
    # 0.1111 of the remaining 90% is about 10% of the total -> 0.8/0.1/0.1
    train, val = train_test_split(train_val, test_size=0.1111, stratify=train_val["label"], random_state=42)

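The `0.1111` in the second split comes from dividing the target validation share by what is left after the test split: 0.1 / 0.9 ≈ 0.1111. The small helper below makes that arithmetic explicit so other ratios can be worked out the same way. `second_split_fraction` is a hypothetical name used only for this sketch.

```python
def second_split_fraction(val_frac, test_frac):
    """Share of the post-test remainder to hold out so validation is val_frac of the total."""
    remaining = 1.0 - test_frac
    if not 0 < val_frac < remaining:
        raise ValueError("val_frac must be positive and smaller than 1 - test_frac")
    return val_frac / remaining


frac = second_split_fraction(val_frac=0.1, test_frac=0.1)
print(round(frac, 4))  # 0.1111
```

Passing the computed value as `test_size` in the second call, instead of a rounded literal, keeps the final proportions as close to the targets as the row count allows.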