# Day 15 - Fine-tune Transformer: Data Processing

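2022 iThome Ironman Contest (鐵人賽)

DAY 15

AI & Data

Series: 變形金剛與抱臉怪---NLP 應用開發之實戰 (Transformers and Hugging Face: Hands-On NLP Application Development), part 15

Tags: 14th Ironman Contest, azure machine learning, hugging face, transformer

Author: 大魔術熊貓工程師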

2022-09-30 23:07:22


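Over the past few days we built a complete transformer-based text classifier, but everything we did called someone else's pre-trained model directly. The training data was theirs, not ours. Today we will use our own dataset to fine-tune a pre-trained model, so that we end up with a model that carries our own domain know-how.

Today's content applies the datasets and tokenizers libraries covered earlier. If you are not familiar with them, go back and review the previous posts.

Let's start by loading the dataset. We will use the poem_sentiment dataset mentioned earlier.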

from datasets import load_dataset
sentiment = load_dataset("poem_sentiment")
sentiment

The dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})

Next, we convert the dataset to pandas format.

import pandas as pd

sentiment.set_format(type="pandas")
df = sentiment["train"][:]
df.head()

You will see the following result.

id	verse_text	label
0	0	with pale blue berries. in these peaceful shad...	1
1	1	it flows so long as falls the rain,	2
2	2	and that is why, the lonesome day,	0
3	3	when i peruse the conquered fame of heroes, an...	3
4	4	of inward strife for truth and liberty.	3

Next, use int2str to see what the labels look like.

def label_int2str(row):
    return sentiment["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

This gives:

	id	verse_text	label	label_name
0	0	with pale blue berries. in these peaceful shad...	1	positive
1	1	it flows so long as falls the rain,	2	no_impact
2	2	and that is why, the lonesome day,	0	negative
3	3	when i peruse the conquered fame of heroes, an...	3	mixed
4	4	of inward strife for truth and liberty.	3	mixed
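
As a rough illustration of what int2str does, here is a minimal plain-Python sketch. It is a stand-in, not the datasets library's implementation; the `LABEL_NAMES` list and the helper functions are hypothetical names, and the label order is assumed from the table above.

```python
# Plain-Python stand-in for the label mapping; not the datasets implementation.
# Label order assumed from the table above: 0=negative, 1=positive, 2=no_impact, 3=mixed.
LABEL_NAMES = ["negative", "positive", "no_impact", "mixed"]


def int2str(label_id: int) -> str:
    """Map an integer class id to its name."""
    return LABEL_NAMES[label_id]


def str2int(label_name: str) -> int:
    """Map a class name back to its integer id."""
    return LABEL_NAMES.index(label_name)


ids = [1, 2, 0, 3, 3]  # the first five labels shown in df.head()
names = [int2str(i) for i in ids]
print(names)
```

The printed names should line up with the label_name column in the table.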

Next, assign the label names to a variable.

labels = sentiment["train"].features["label"].names
print(labels)

Plot the label distribution of this dataset with matplotlib.

import matplotlib.pyplot as plt

df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Number of labels")
plt.show()

You can see that this is a heavily imbalanced dataset.
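
If you want numbers instead of a chart, the same check can be sketched with the standard library. The labels below are hypothetical toy data, not the real poem_sentiment counts, and `imbalance_ratio` is just one rough measure chosen for illustration.

```python
from collections import Counter

# Hypothetical toy labels, NOT the real poem_sentiment counts.
toy_labels = ["no_impact"] * 6 + ["negative"] * 2 + ["positive"] * 1 + ["mixed"] * 1

counts = Counter(toy_labels)
total = sum(counts.values())

# Print classes from least to most frequent, similar to value_counts(ascending=True).
for name, count in sorted(counts.items(), key=lambda item: item[1]):
    print(f"{name:>10}: {count:2d} ({count / total:.0%})")

# Ratio between the largest and the smallest class as a rough imbalance measure.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```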

Remember to reset the dataset format!

sentiment.reset_format()

Next, let's set up the tokenizer.

from transformers import AutoTokenizer

model_name = "distilbert-base-uncased" # the Day 3 default, distilbert-base-uncased-finetuned-sst-2-english, uses this model
tokenizer = AutoTokenizer.from_pretrained(model_name)

Then wrap the tokenizer in a function. As mentioned before, this is the conventional way to make it easy to use with map().

def tokenize(batch):
    return tokenizer(batch["verse_text"], padding=True, truncation=True)

Then use map() to tokenize the dataset.

sentiment_encoded = sentiment.map(tokenize, batched=True, batch_size=None)
next(iter(sentiment_encoded["train"])) # if you forgot why next(iter()) is needed to print a record, see the earlier post on loading very large datasets

The printed result looks like this, which means tokenization is done:

{'id': 0,
 'verse_text': 'with pale blue berries. in these peaceful shades--',
 'label': 1,
 'input_ids': [101, 2007, 5122, 2630, 22681, 1012, 1999, 2122, 9379, 13178, 1011, 1011, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
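
The trailing zeros come from padding. With batched=True and batch_size=None, map() passes each split to tokenize as a single batch, so padding=True pads every sequence to the longest one in that split. The sketch below mimics that behaviour in plain Python; `pad_batch` and `PAD_ID` are hypothetical names, the second sequence is made up, and this is not how the tokenizer is implemented internally.

```python
PAD_ID = 0  # padding id, as seen in the output above


def pad_batch(batch_ids):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    # The 13 real token ids from the example above.
    [101, 2007, 5122, 2630, 22681, 1012, 1999, 2122, 9379, 13178, 1011, 1011, 102],
    # Made-up 28-token sequence standing in for the longest verse in the split.
    [101] + [2007] * 26 + [102],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

The first sequence ends up with 13 real tokens followed by 15 padding positions, the same shape as the record printed above.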

Looking back at the validation split, we find that it contains no examples of class 3. Keep this in mind, because it may cause bugs later during validation.

valid_ds = sentiment["validation"]
valid_ds["label"][:]
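
Rather than scanning the list by eye, a small helper can report which classes never occur in a split. This is a minimal sketch on hypothetical toy labels; `missing_labels` is an illustrative name and not part of the datasets library.

```python
def missing_labels(split_labels, num_classes):
    """Return the class ids that never occur in a split."""
    return sorted(set(range(num_classes)) - set(split_labels))


# Hypothetical toy validation labels that contain no class 3.
toy_valid_labels = [0, 1, 2, 2, 1, 0, 2]
print(missing_labels(toy_valid_labels, num_classes=4))
```

On the real data you could pass the validation labels and len(labels) instead of the toy values.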

That's it for data processing. Pretty simple, right? Tomorrow we will feed this dataset into a transformer for training!
