We went through a whole pile of tokenization theory yesterday. If you couldn't digest it all, that's fine: treat it as a bad dream and forget about it! Today we will use the Hugging Face Tokenizer library, and you'll see that tokenization can be a very simple thing, because Hugging Face has wrapped it all up for us. So open up your Azure Machine Learning and let's write some code!
from transformers import AutoTokenizer
string = "Only those who will risk going too far can possibly find out how far one can go."
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # refer to the model by its name directly
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Alternatively, load the same tokenizer through its model-specific class
from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_name)
encoded_str = tokenizer(string, padding=True, truncation=True)
encoded_str
We can see it prints the following:
{'input_ids': [101, 2069, 2216, 2040, 2097, 3891, 2183, 2205, 2521, 2064, 4298, 2424, 2041, 2129, 2521, 2028, 2064, 2175, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
input_ids
This is simply our text after numericalization. You will notice, though, that the count does not match the number of words: an extra 101 has been added at the head and 102 at the tail. These are special token IDs: [CLS] is the classification token and [SEP] is the separator marking the end of the sequence. [UNK] stands for an unknown token, and 0 is [PAD], which fills in the positions where a sequence falls short of the required length.
The special token IDs are listed in the table below:
| Special Token | [PAD] | [UNK] | [CLS] | [SEP] | [MASK] |
| ---- | ---- | ---- | ---- | ---- | ---- |
| Special Token ID | 0 | 100 | 101 | 102 | 103 |
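If you want to check these IDs yourself, the tokenizer exposes them directly. A quick sketch using the same tokenizer we loaded above:

```python
# Look up each special token's id explicitly
for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(token, tokenizer.convert_tokens_to_ids(token))

# The same values are also available as convenience attributes
print(tokenizer.pad_token_id, tokenizer.unk_token_id,
      tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.mask_token_id)
```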
attention_mask
We can roughly understand it as follows: positions marked 1 are real tokens that self-attention should look at, while positions marked 0 are padding and are ignored.
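In practice you will usually ask the tokenizer to return tensors that can be fed straight into a model, and the attention_mask travels together with the input_ids. A minimal sketch, assuming PyTorch is installed (the two example sentences are made up for illustration):

```python
# return_tensors="pt" gives PyTorch tensors instead of plain Python lists
batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence with quite a few more tokens"],
    padding=True, truncation=True, return_tensors="pt",
)
print(batch["input_ids"].shape)   # (2, longest_sequence_in_batch)
print(batch["attention_mask"])    # 1 = real token, 0 = padding
```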
tokens = tokenizer.convert_ids_to_tokens(encoded_str.input_ids)
tokens
We get the following result:
['[CLS]',
'only',
'those',
'who',
'will',
'risk',
'going',
'too',
'far',
'can',
'possibly',
'find',
'out',
'how',
'far',
'one',
'can',
'go',
'.',
'[SEP]']
print(tokenizer.convert_tokens_to_string(tokens))
We get:
[CLS] only those who will risk going too far can possibly find out how far one can go. [SEP]
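There is also a one-step way to go from IDs back to text: tokenizer.decode(). A small sketch, reusing encoded_str from above:

```python
# decode() converts ids straight back to a string;
# skip_special_tokens=True drops [CLS] and [SEP]
print(tokenizer.decode(encoded_str["input_ids"]))
print(tokenizer.decode(encoded_str["input_ids"], skip_special_tokens=True))
```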
string_array = [
string,
"Baby shark, doo doo doo doo doo doo, Baby shark!"
]
encoded_str_arr = tokenizer(string_array, padding=True, truncation=True)
encoded_str_arr
We get:
{'input_ids': [[101, 2069, 2216, 2040, 2097, 3891, 2183, 2205, 2521, 2064, 4298, 2424, 2041, 2129, 2521, 2028, 2064, 2175, 1012, 102], [101, 3336, 11420, 1010, 20160, 20160, 20160, 20160, 20160, 20160, 1010, 3336, 11420, 999, 102, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]}
Notice that the shorter sentence has been padded with zeros up to the length of the longer one.
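Those trailing zeros map back to the [PAD] token, which you can confirm by converting the second sentence's IDs back to tokens. A quick check:

```python
# The padded positions show up as [PAD] tokens at the end
print(tokenizer.convert_ids_to_tokens(encoded_str_arr["input_ids"][1]))
```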
from datasets import load_dataset
sentiment = load_dataset("poem_sentiment")
# Tokenize one batch of examples from the poem_sentiment dataset
def tokenize(batch):
    return tokenizer(batch["verse_text"], padding=True, truncation=True)
print(tokenize(sentiment["train"][:3]))
We get:
{'input_ids': [[101, 2007, 5122, 2630, 22681, 1012, 1999, 2122, 9379, 13178, 1011, 1011, 102], [101, 2009, 6223, 2061, 2146, 2004, 4212, 1996, 4542, 1010, 102, 0, 0], [101, 1998, 2008, 2003, 2339, 1010, 1996, 10459, 14045, 2154, 1010, 102, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]}
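If you want to see what the raw data looks like before tokenizing, you can inspect the DatasetDict and a single row. A small exploration snippet (the column names match the output above):

```python
# Show the available splits and their sizes, then look at one raw example
print(sentiment)
print(sentiment["train"][0])
```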
Map()
Now let's tokenize the entire dataset!
sentiment_encoded = sentiment.map(tokenize, batched=True, batch_size=None)
print(sentiment_encoded["train"][:3])
We get:
{'id': [0, 1, 2], 'verse_text': ['with pale blue berries. in these peaceful shades--', 'it flows so long as falls the rain,', 'and that is why, the lonesome day,'], 'label': [1, 2, 0], 'input_ids': [[101, 2007, 5122, 2630, 22681, 1012, 1999, 2122, 9379, 13178, 1011, 1011, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2009, 6223, 2061, 2146, 2004, 4212, 1996, 4542, 1010, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1998, 2008, 2003, 2339, 1010, 1996, 10459, 14045, 2154, 1010, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
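You can also confirm that map() appended two new columns, input_ids and attention_mask, to the original ones, and then set the output format so those columns come back as tensors ready for training. A short sketch, assuming PyTorch is installed:

```python
# The tokenized columns are appended to the dataset's original columns
print(sentiment_encoded["train"].column_names)
# e.g. ['id', 'verse_text', 'label', 'input_ids', 'attention_mask']

# Optionally return the encoded columns as PyTorch tensors for model training
sentiment_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
```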
That covers most of what you need to know about tokenizers. Starting tomorrow we move on to the Transformer, the core of modern NLP.