iT邦幫忙

2022 iThome Ironman Contest (鐵人賽)

DAY 16

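This is one of the few posts where I use a serious title, because there is no everyday concept that describes the idea clearly. Prefixes and word roots come close, but they are not quite the same thing. You could also simply call it a way of encoding, but that description is too vague. For a more theoretical explanation you can refer to this article; here we go straight to examples to see what subword encoding actually does.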

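Continuing with the IMDB reviews from two days ago: we originally used Tokenizer to build the vocabulary. With the vocabulary size set to 10000, here is how 3 example reviews read back: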

["this was an absolutely terrible movie don't be {OOV} in by christopher walken or michael {OOV} both are great actors but this must simply be their worst role in history even their great acting could not redeem this movie's ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the {OOV} rebels were making their cases for {OOV} maria {OOV} {OOV} appeared phony and her pseudo love affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining actor's like christopher {OOV} good name i could barely sit through it"]
['i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the {OOV} and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own {OOV} without any real concern for anything else i cant recommend this film at all']
['mann photographs the {OOV} rocky mountains in a superb fashion and jimmy stewart and walter brennan give enjoyable performances as they always seem to do br br but come on hollywood a {OOV} telling the people of dawson city {OOV} to {OOV} themselves a {OOV} yes a {OOV} and to {OOV} the law themselves then {OOV} battling it out on the streets for control of the town br br nothing even remotely resembling that happened on the canadian side of the border during the {OOV} gold rush mr mann and company appear to have mistaken dawson city for {OOV} the canadian north for the american wild west br br canadian viewers be prepared for a {OOV} madness type of enjoyable {OOV} with this ludicrous plot or to shake your head in disgust']

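You can see a lot of OOV (out-of-vocabulary) tokens, because the full vocabulary has 88583 words and we only kept 10000 of them.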
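To make the OOV effect concrete, here is a minimal pure-Python sketch of a word-level vocabulary with a size cap. It is not the Keras Tokenizer implementation. The helper names (`build_vocab`, `encode_words`, `decode_words`), the toy corpus, and the tiny cap of 4 entries are all made up for illustration.

```python
from collections import Counter

OOV = "{OOV}"

def build_vocab(sentences, num_words):
    """Keep the OOV marker plus the (num_words - 1) most frequent words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    vocab = {OOV: 1}  # index 1 is reserved for unknown words in this sketch
    for index, (word, _) in enumerate(counts.most_common(num_words - 1), start=2):
        vocab[word] = index
    return vocab

def encode_words(sentence, vocab):
    """Map each word to its index, or to the OOV index if it was not kept."""
    return [vocab.get(word, vocab[OOV]) for word in sentence.lower().split()]

def decode_words(ids, vocab):
    """Map indices back to words; anything encoded as OOV stays lost."""
    index_to_word = {index: word for word, index in vocab.items()}
    return " ".join(index_to_word[i] for i in ids)

# Hypothetical toy corpus: "this" and "was" appear 3 times, "movie" twice,
# every other word once.
corpus = [
    "this movie was great",
    "this movie was terrible",
    "this film was boring",
]
vocab = build_vocab(corpus, num_words=4)
ids = encode_words("this film was terrible", vocab)
print(decode_words(ids, vocab))
```

Once a word falls outside the cap, decoding cannot bring it back, which is exactly what happens to the names and rarer words in the three reviews above.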
The dataset also ships in another configuration, subwords8k. We load it the same way, but this version is already encoded as integer IDs. Here are 2 examples:

import tensorflow_datasets as tfds

# Download the subword encoded pretokenized dataset
imdb_subwords, info_subwords = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)

# Take 2 training examples and print their contents (each is an (encoded_text, label) pair)
for example in imdb_subwords['train'].take(2):
  print(example)

(<tf.Tensor: shape=(163,), dtype=int64, numpy=
array([ 62, 18, 41, 604, 927, 65, 3, 644, 7968, 21, 35,
5096, 36, 11, 43, 2948, 5240, 102, 50, 681, 7862, 1244,
3, 3266, 29, 122, 640, 2, 26, 14, 279, 438, 35,
79, 349, 384, 11, 1991, 3, 492, 79, 122, 188, 117,
33, 4047, 4531, 14, 65, 7968, 8, 1819, 3947, 3, 62,
27, 9, 41, 577, 5044, 2629, 2552, 7193, 7961, 3642, 3,
19, 107, 3903, 225, 85, 198, 72, 1, 1512, 738, 2347,
102, 6245, 8, 85, 308, 79, 6936, 7961, 23, 4981, 8044,
3, 6429, 7961, 1141, 1335, 1848, 4848, 55, 3601, 4217, 8050,
2, 5, 59, 3831, 1484, 8040, 7974, 174, 5773, 22, 5240,
102, 18, 247, 26, 4, 3903, 1612, 3902, 291, 11, 4,
27, 13, 18, 4092, 4008, 7961, 6, 119, 213, 2774, 3,
12, 258, 2306, 13, 91, 29, 171, 52, 229, 2, 1245,
5790, 995, 7968, 8, 52, 2948, 5240, 8039, 7968, 8, 74,
1249, 3, 12, 117, 2438, 1369, 192, 39, 7975])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(142,), dtype=int64, numpy=
array([ 12, 31, 93, 867, 7, 1256, 6585, 7961, 421, 365, 2,
26, 14, 9, 988, 1089, 7, 4, 6728, 6, 276, 5760,
2587, 2, 81, 6118, 8029, 2, 139, 1892, 7961, 5, 5402,
246, 25, 1, 1771, 350, 5, 369, 56, 5397, 102, 4,
2547, 3, 4001, 25, 14, 7822, 209, 12, 3531, 6585, 7961,
99, 1, 32, 18, 4762, 3, 19, 184, 3223, 18, 5855,
1045, 3, 4232, 3337, 64, 1347, 5, 1190, 3, 4459, 8,
614, 7, 3129, 2, 26, 22, 84, 7020, 6, 71, 18,
4924, 1160, 161, 50, 2265, 3, 12, 3983, 2, 12, 264,
31, 2545, 261, 6, 1, 66, 2, 26, 131, 393, 1,
5846, 6, 15, 5, 473, 56, 614, 7, 1470, 6, 116,
285, 4755, 2088, 7961, 273, 119, 213, 3414, 7961, 23, 332,
1019, 3, 12, 7667, 505, 14, 32, 44, 208, 7975])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

The subword table has 7928 entries in total. The first few are:

['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br', 'in_', 'I_', 'that_', 'this_', 'it_', ' /><', ' />', 'was_', 'The_', 'as_', 't_', 'with_', 'for_', '.<', 'on_', 'but_', 'movie_', ' (', 'are_', 'his_', …

Now we use this encoding on the same 3 example reviews as before:

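# Get the subword encoder that ships with the dataset metadata loaded above
tokenizer_subwords = info_subwords.features['text'].encoder

# Encode the first three plaintext sentences using the subword text encoder.
# training_sentences is assumed to be the list of raw review strings from the earlier post.
for i in range(3):
    tokenized_string = tokenizer_subwords.encode(training_sentences[i])
    # Decode the sequence back to text
    original_string = tokenizer_subwords.decode(tokenized_string)
    # Print the original sentence, then the decoded round trip
    print(training_sentences[i])
    print(original_string)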

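Each decoded line matches its original sentence exactly (every review is printed twice below: the original first, then the decoded version). In other words, a smaller set of codes can represent the text more completely: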

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.

Here is a simpler example. Let both tokenizers read "TensorFlow, from basics to mastery". With the 10000-word vocabulary we built, the result is:

['{OOV} from {OOV} to {OOV}']

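That makes sense, because special words like TensorFlow are uncommon. With subwords8k, however, the same sentence reads back intact, and the corresponding encoding is: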

6307 ----> "Ten"
2327 ----> "sor"
4043 ----> "Fl"
2120 ----> "ow"
2 ----> ", "
48 ----> "from "
4249 ----> "basi"
4429 ----> "cs "
7 ----> "to "
2652 ----> "master"
8050 ----> "y"
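The split above can be reproduced with a small greedy longest-match sketch in plain Python. This is a simplification, not the actual TFDS encoder: the vocabulary holds only the ten pieces listed above, and `encode_subwords` / `decode_subwords` are hypothetical helper names. The byte fallback offset of 7929 is my own inference from the single pair 8050 -> "y" (since ord("y") is 121) together with the 7928-entry table, so treat it as an assumption.

```python
# Subword pieces and IDs copied from the mapping listed above.
SUBWORD_IDS = {
    "Ten": 6307, "sor": 2327, "Fl": 4043, "ow": 2120, ", ": 2,
    "from ": 48, "basi": 4249, "cs ": 4429, "to ": 7, "master": 2652,
}

# Assumption: IDs past the subword table are raw bytes, starting at 7929.
# Inferred only from 8050 -> "y": 7929 + ord("y") == 8050.
BYTE_OFFSET = 7929

def encode_subwords(text):
    """Greedily take the longest known piece; fall back to bytes otherwise."""
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (piece for piece in SUBWORD_IDS if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is not None:
            ids.append(SUBWORD_IDS[match])
            pos += len(match)
        else:
            # No piece fits: emit one ID per UTF-8 byte instead of an OOV token.
            ids.extend(BYTE_OFFSET + b for b in text[pos].encode("utf-8"))
            pos += 1
    return ids

def decode_subwords(ids):
    """Invert encode_subwords by concatenating pieces and raw bytes."""
    id_to_piece = {i: piece for piece, i in SUBWORD_IDS.items()}
    out = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            out.append(i - BYTE_OFFSET)
        else:
            out.extend(id_to_piece[i].encode("utf-8"))
    return out.decode("utf-8")

sentence = "TensorFlow, from basics to mastery"
ids = encode_subwords(sentence)
print(ids)
print(decode_subwords(ids))
```

The point of the fallback is that nothing ever turns into OOV: a rare word costs more IDs, but it always decodes back to the original text.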

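The subword data is already encoded as IDs, but the reviews differ in length (163 and 142 IDs in the two examples above). So before feeding the data to the model for training, we use padded_batch to make each batch rectangular: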

BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Get the train and test splits
train_data, test_data = imdb_subwords['train'], imdb_subwords['test']

# Shuffle the training data
train_dataset = train_data.shuffle(BUFFER_SIZE)

# Batch the datasets, padding each batch to the length of its longest sequence
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_data.padded_batch(BATCH_SIZE)
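As a rough picture of what the padding step does, here is a plain-Python sketch that pads every batch with zeros up to its own longest sequence, which is how I understand the default behaviour of padded_batch. The generator `padded_batches` and the short ID lists are made up for illustration; this is not the tf.data implementation.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches where every row is padded to the longest row in that batch."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]

# Hypothetical encoded reviews of different lengths.
reviews = [[62, 18, 41], [12, 31], [5], [7, 8, 9, 10]]
batches = list(padded_batches(reviews, batch_size=2))
for batch in batches:
    print(batch)
```

Note that the width is decided per batch, so different batches can end up with different widths. Only the rows inside one batch have to line up.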

import tensorflow as tf

# Define dimensionality of the embedding
embedding_dim = 64

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer_subwords.vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])


Series: 來創造一個AI角色吧-新手的探尋之路 (Let's Create an AI Character: A Beginner's Journey of Exploration)