Now that we have tokenized yesterday's dataset, we can finally train our own model!
First, we use AutoModelForSequenceClassification to load the pre-trained model. One thing to watch out for: num_labels must be set to match the number of labels in the dataset. It is also recommended to specify id2label and label2id, so the results are easier to read when you run inference later. Finally, remember to add .to(device).
from transformers import AutoModelForSequenceClassification
import torch

num_labels = 4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# model_name is the pre-trained checkpoint used for tokenization
# (defined in yesterday's post).
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name,
                          num_labels=num_labels,
                          # ids are ints, label names are strings
                          id2label={0: "negative",
                                    1: "positive",
                                    2: "no_impact",
                                    3: "mixed"},
                          label2id={"negative": 0,
                                    "positive": 1,
                                    "no_impact": 2,
                                    "mixed": 3})
         .to(device))
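As a quick optional sanity check, the mapping is stored on the model config, so you can confirm it before training:

# Confirm the label mapping was registered on the config.
print(model.config.num_labels)  # 4
print(model.config.id2label)    # {0: 'negative', 1: 'positive', 2: 'no_impact', 3: 'mixed'}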
Next, we use TrainingArguments to set up the training parameters. Setting output_dir is recommended; it creates a folder for you and stores the checkpoints and the final trained model there. TrainingArguments has ninety-some parameters, which makes it extremely complex but also very convenient: almost any feature you can think of is in there, so go read the source code and the [documentation]. Pay special attention to the report_to field. If you use a tool like MLflow, you can set it to mlflow; here we use Azure Machine Learning, so we set it to azure_ml. Setting it explicitly is recommended, because the default is all, which can trigger bugs from having too many integrations enabled.

from transformers import Trainer, TrainingArguments
batch_size = 64
# Log once per epoch's worth of training steps.
logging_steps = len(sentiment_encoded["train"]) // batch_size
# Reuse model_name as the output folder name for this run.
model_name = "poem_model"

training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=40,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  # label_names is left at its default ("labels"),
                                  # which matches the tokenized dataset's column
                                  report_to="azure_ml",
                                  logging_steps=logging_steps)
Next we define a compute_metrics function, so the Trainer reports accuracy and weighted F1 at every evaluation:

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    # pred.label_ids are the ground-truth labels; pred.predictions are logits.
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
trainer = Trainer(model=model,
                  args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=sentiment_encoded["train"],
                  eval_dataset=sentiment_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train()

# Save the final model to output_dir ("poem_model") so it can be loaded
# by name below; training alone only writes intermediate checkpoints.
trainer.save_model()
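Before moving on to inference, it is worth a quick look at the validation metrics; a minimal call on the trainer above:

# Runs evaluation on eval_dataset and returns eval_loss plus the
# accuracy / weighted F1 from compute_metrics.
metrics = trainer.evaluate()
print(metrics)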
Once training is done, let's load the model the pipeline way! The code is as follows.

from transformers import pipeline
classifier = pipeline(task="sentiment-analysis",
                      model="poem_model")
classifier([
    "Only those who will risk going too far can possibly find out how far one can go.",
    "Baby shark, doo doo doo doo doo doo, Baby shark!"
])
You will get something like the following:
[{'label': 'no_impact', 'score': 0.7432655692100525},
{'label': 'no_impact', 'score': 0.9643214344978333}]
Well, after training on this dataset, these two sentences have apparently become not very important. Then again, that may also be caused by the severe data bias in this dataset!
And that's how you fine-tune a Transformer on your own data. Tomorrow, let's go over the different types of Transformer architectures!