(This post is split into two parts.)
Fine-tuning BERT
means taking a pre-trained BERT model and training it further so that it fits the needs of a specific task or domain. BERT is a deep learning model pre-trained on a large-scale text corpus, but it is a general-purpose language model, so it usually needs additional training to adapt to a particular natural language processing task; that extra training step is what we call fine-tuning.
Dataset: the Chinese (zh) split of wikiann
Model: bert-base-chinese
Task: the model processes the input sentence and marks the tokens that represent named entities (a sketch of the end result follows below)
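To make the goal concrete, here is a minimal sketch of what inference could look like once fine-tuning is finished, assuming the fine-tuned model is saved to a hypothetical local folder ./bert-ner-zh:
from transformers import pipeline
# Hypothetical checkpoint: the directory where the fine-tuned model would be saved later
ner = pipeline("token-classification", model="./bert-ner-zh", aggregation_strategy="simple")
print(ner("李大華在台北的Google上班"))
# The output is a list of dicts with entity_group, word, score, start, end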
( 一樣打開 Colab )
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
accelerate is used to speed up PyTorch model training; evaluate is used to evaluate model performance.
from datasets import load_dataset
datasets = load_dataset('wikiann', 'zh')
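As a quick illustration of what the evaluate package is for: NER results are typically scored with the seqeval metric. A minimal sketch, assuming the seqeval package is also installed (pip install seqeval):
import evaluate

metric = evaluate.load("seqeval")
print(metric.compute(
    predictions=[["O", "B-PER", "I-PER"]],
    references=[["O", "B-PER", "I-PER"]],
))  # per-entity precision / recall / f1 plus overall scores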
DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})
label_list = datasets["train"].features["ner_tags"].feature.names
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
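To see what these tags look like on real data, here is a small sketch that inspects the first training example and converts its numeric ner_tags back into tag names; id2label and label2id are extra convenience mappings built here, not part of the original code:
example = datasets["train"][0]
print(example["tokens"])
print([label_list[t] for t in example["ner_tags"]])  # numeric tags -> 'B-PER', 'I-LOC', ...

# Convenience mappings between tag ids and tag names
id2label = {i: name for i, name in enumerate(label_list)}
label2id = {name: i for i, name in enumerate(label_list)}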
(Here we sample seven rows.)
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=7):
    # Pick num_examples distinct random indices
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    df = pd.DataFrame(dataset[picks])
    # Convert class-label ids back to their readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])
from transformers import BertTokenizerFast
model_checkpoint = "bert-base-chinese"
tokenizer = BertTokenizerFast.from_pretrained(model_checkpoint)
Here the fast tokenizer matching bert-base-chinese is loaded; the fast version is needed because the word_ids() method used below is only available on fast tokenizers.
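A quick sketch of why label alignment is needed: tokenizing one pre-split example adds the special [CLS]/[SEP] tokens (and may split words into sub-tokens), so the token sequence no longer lines up one-to-one with the original ner_tags:
sample = datasets["train"][0]
encoded = tokenizer(sample["tokens"], is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # starts with [CLS], ends with [SEP]
print(encoded.word_ids())  # None for special tokens, otherwise the index of the source word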
label_all_tokens = True

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get -100 so the loss function ignores them
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first token of a word keeps that word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens: repeat the label, or ignore them if label_all_tokens is False
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
truncation=True means anything longer than the model's maximum length is truncated, and is_split_into_words=True tells the tokenizer that the input text has already been split into words.
word_ids = tokenized_inputs.word_ids(batch_index=i)
For each token this gives the index of the word it came from, e.g. [None, 0, 1, 2, 3, 4, None], where None marks special tokens such as [CLS] and [SEP].
label_ids = []
for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(label[word_idx])
    else:
        label_ids.append(label[word_idx] if label_all_tokens else -100)
    previous_word_idx = word_idx
Using word_ids, each token index is matched with its corresponding label, converting the word-level labels into token-level label_ids, e.g. [-100, 0, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, -100].
tokenized_inputs["labels"] = labels
The labels list is added to the tokenized_inputs dictionary under the key labels, so that the labels stay associated with the tokenized inputs.
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
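As a sanity check, here is a short sketch that takes one processed example and prints each token next to its aligned label:
example = tokenized_datasets["train"][0]
tokens = tokenizer.convert_ids_to_tokens(example["input_ids"])
for tok, lab in zip(tokens, example["labels"]):
    print(tok, "IGNORED" if lab == -100 else label_list[lab])  # -100 is ignored by the loss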