These values define the architecture and behavior of the transformer model:
src_vocab_size, tgt_vocab_size: Vocabulary sizes for source and target sequences, both set to 5000.
d_model: Dimensionality of the model's embeddings, set to 512.
num_heads: Number of attention heads in the multi-head attention mechanism, set to 8.
num_layers: Number of layers for both the encoder and the decoder, set to 6.
d_ff: Dimensionality of the inner layer in the feed-forward network, set to 2048.
max_seq_length: Maximum sequence length for positional encoding, set to 100.
dropout: Dropout rate for regularization, set to 0.1.
src_vocab_size = 5000
tgt_vocab_size = 5000
d_model = 512
num_heads = 8
num_layers = 6
d_ff = 2048
max_seq_length = 100
dropout = 0.1
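
As a quick sanity check on these values, multi-head attention implementations commonly split d_model evenly across the heads, so d_model should be divisible by num_heads. The sketch below assumes that convention (the internals of the Transformer class are not shown in this section), and the name d_k is only illustrative.

```python
# Hyperparameters taken from the text above.
d_model = 512
num_heads = 8

# Assumption: each head works on an equal slice of the embedding,
# as in the standard multi-head attention formulation.
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
d_k = d_model // num_heads  # per-head dimensionality (illustrative name)

print(f"Each of the {num_heads} heads attends over {d_k} dimensions")
```

If you change d_model or num_heads, rerunning a check like this catches an incompatible pair before the model is built.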

The following line creates an instance of the Transformer class, initialized with the hyperparameters above.
The instance will have the architecture and behavior defined by these hyperparameters.
transformer = Transformer(src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout)

Generate random sample data.
src_data: Random integers from 1 to src_vocab_size - 1 (the upper bound of torch.randint is exclusive), representing a batch of source sequences with shape (64, max_seq_length).
tgt_data: Random integers from 1 to tgt_vocab_size - 1, representing a batch of target sequences with shape (64, max_seq_length).
These random sequences can be used as inputs to the transformer model, simulating a batch of 64 examples with sequences of length 100. Starting the range at 1 means index 0, which is typically reserved for padding, never appears in the data.
src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

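This code snippet demonstrates how to initialize a Transformer model and generate random source and target sequences that can be fed into it. The chosen hyperparameters determine the specific structure and properties of the transformer. This setup could be part of a larger script in which the model is trained and evaluated on an actual sequence-to-sequence task, such as machine translation or text summarization.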


criterion = nn.CrossEntropyLoss(ignore_index=0): Defines the loss function as cross-entropy loss. The ignore_index argument is set to 0, meaning the loss will not consider targets with an index of 0 (typically reserved for padding tokens).
optimizer = optim.Adam(...): Defines the optimizer as Adam with a learning rate of 0.0001, beta values of (0.9, 0.98), and an epsilon of 1e-9.
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
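
To make the effect of ignore_index concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips padded positions. It illustrates the idea only and is not PyTorch's implementation; the function name cross_entropy_ignore and the toy logits are assumptions made for this example.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index=0):
    """Mean cross-entropy over positions whose target is not ignore_index.

    Assumes at least one target differs from ignore_index.
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding position: contributes nothing to the loss
        # Stable log-sum-exp: subtract the row maximum before exponentiating.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[target]  # negative log-probability of target
        count += 1
    return total / count

# Three positions over a toy vocabulary of size 3; the middle one is padding.
logits = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5], [0.1, 0.2, 3.0]]
targets = [1, 0, 2]

loss = cross_entropy_ignore(logits, targets)
loss_without_padding_row = cross_entropy_ignore([logits[0], logits[2]], [1, 2])
print(loss, loss_without_padding_row)
```

Both printed values are the same: the padded position affects neither the sum nor the count used for the average.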

transformer.train(): Sets the transformer model to training mode, enabling behaviors like dropout that only apply during training.
transformer.train()

The code snippet trains the model for 100 epochs using a typical training loop:

for epoch in range(100): Iterates over 100 training epochs.
optimizer.zero_grad(): Clears the gradients from the previous iteration.
output = transformer(src_data, tgt_data[:, :-1]): Passes the source data and the target data (excluding the last token in each sequence) through the transformer. This is common in sequence-to-sequence tasks where the target is shifted by one token.
loss = criterion(...): Computes the loss between the model's predictions and the target data (excluding the first token in each sequence). The predictions are flattened to shape (N, tgt_vocab_size) and the targets to a one-dimensional tensor of length N, where N is the number of token positions in the batch, before the cross-entropy loss is applied.
loss.backward(): Computes the gradients of the loss with respect to the model's parameters.
optimizer.step(): Updates the model's parameters using the computed gradients.
print(f"Epoch: {epoch+1}, Loss: {loss.item()}"): Prints the current epoch number and the loss value for that epoch.
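for epoch in range(100):
    optimizer.zero_grad()  # clear gradients from the previous iteration
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, tgt_vocab_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()  # compute gradients of the loss w.r.t. the parameters
    optimizer.step()  # update the parameters using those gradients
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")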
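
The one-token shift between the decoder input and the labels can be seen on a toy sequence. The sketch below uses plain Python lists with hypothetical token ids (1 as a start marker and 2 as an end marker are assumptions for illustration). It also works through the shape arithmetic for the batch above, assuming the model returns one row of tgt_vocab_size logits per decoder input position, as the view(-1, tgt_vocab_size) call suggests.

```python
# A toy target sequence: hypothetical token ids, with 1 = start and 2 = end.
tgt = [1, 15, 27, 42, 2]

decoder_input = tgt[:-1]  # what the decoder sees (last token dropped)
labels = tgt[1:]          # what it should predict (first token dropped)

for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: input token {seen} -> predict {expected}")

# Shape bookkeeping for the batch used in the text.
batch_size, max_seq_length, tgt_vocab_size = 64, 100, 5000
steps = max_seq_length - 1      # one position is lost to the shift
flat_rows = batch_size * steps  # rows after view(-1, tgt_vocab_size)
print(f"logits: ({flat_rows}, {tgt_vocab_size}), labels: ({flat_rows},)")
```

At every step the label is the token that follows the current input token, which is why the two slices line up position by position.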

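This code snippet trains the Transformer model on randomly generated source and target sequences for 100 epochs. It uses the Adam optimizer and the cross-entropy loss function. The loss is printed each epoch so that you can monitor training progress. In a real-world scenario, you would replace the random source and target sequences with actual data from your task, such as machine translation.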


transformer.eval(): Puts the transformer model in evaluation mode. This is important because it turns off certain behaviors like dropout that are only used during training.
transformer.eval()
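
To see why the mode matters, here is a minimal pure-Python sketch of inverted dropout, the scheme in which surviving values are scaled by 1 / (1 - p) during training and everything passes through unchanged during evaluation. The dropout function below is an illustration written for this example, not the code PyTorch runs.

```python
import random

def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout on a list of floats (illustrative sketch)."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: values pass through unchanged
    keep = 1.0 - p
    # Zero each value with probability p; scale the survivors by 1 / (1 - p)
    # so the expected value of each element stays the same.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

random.seed(0)
x = [1.0] * 10
print("train:", dropout(x, p=0.1, training=True))
print("eval: ", dropout(x, p=0.1, training=False))
```

In training mode the output varies from call to call, while in evaluation mode it is deterministic, which is what you want when measuring a validation loss.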

Generate random sample validation data.
val_src_data: Random integers from 1 to src_vocab_size - 1 (the upper bound of torch.randint is exclusive), representing a batch of validation source sequences with shape (64, max_seq_length).
val_tgt_data: Random integers from 1 to tgt_vocab_size - 1, representing a batch of validation target sequences with shape (64, max_seq_length).
val_src_data = torch.randint(1, src_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)
val_tgt_data = torch.randint(1, tgt_vocab_size, (64, max_seq_length))  # (batch_size, seq_length)

Validation Loop:

with torch.no_grad(): Disables gradient computation, as we don't need to compute gradients during validation. This can reduce memory consumption and speed up computations.
val_output = transformer(val_src_data, val_tgt_data[:, :-1]): Passes the validation source data and the validation target data (excluding the last token in each sequence) through the transformer.
val_loss = criterion(...): Computes the loss between the model's predictions and the validation target data (excluding the first token in each sequence). The predictions are flattened to shape (N, tgt_vocab_size) and the targets to a one-dimensional tensor of length N before the previously defined cross-entropy loss function is applied.
print(f"Validation Loss: {val_loss.item()}"): Prints the validation loss value.
with torch.no_grad():
    val_output = transformer(val_src_data, val_tgt_data[:, :-1])
    val_loss = criterion(val_output.contiguous().view(-1, tgt_vocab_size), val_tgt_data[:, 1:].contiguous().view(-1))
    print(f"Validation Loss: {val_loss.item()}")

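This code snippet evaluates the Transformer model on a randomly generated validation set, computing and printing the validation loss. In a real-world scenario, the random validation data should be replaced with actual validation data from the task you are working on. The validation loss indicates how well the model performs on unseen data, which is a key measure of its ability to generalize.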
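
When reading the printed validation loss, it helps to have a chance-level reference. Because the validation targets here are uniformly random, a predictor that spreads its probability evenly over the possible labels gives a useful baseline. The sketch below computes it; the names num_labels and chance_loss are illustrative.

```python
import math

tgt_vocab_size = 5000

# Targets are drawn uniformly from 1 .. tgt_vocab_size - 1, because the
# upper bound of torch.randint is exclusive.
num_labels = tgt_vocab_size - 1

# Cross-entropy of a predictor that assigns equal probability to each label.
chance_loss = math.log(num_labels)
print(f"Chance-level loss: {chance_loss:.3f}")
```

This comes to roughly 8.52. On purely random validation data the loss should not be expected to fall meaningfully below this value, however low the training loss gets, since a low training loss on a single random batch mostly reflects memorization of that batch.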

