DAY23 - How to Train a HuBERT Model - iT 邦幫忙

2023 iThome Ironman Contest

DAY 23

AI & Data

Learning Speech Recognition Architectures and Applications with SeamlessM4T, part 23 of the series


AlbertShiu

2023-10-08 10:36:51


This post walks through training your own HuBERT model on your own dataset, as a way to practice the full HuBERT workflow.

Data preparation

Follow the steps in ./simple_kmeans (under examples/hubert in the fairseq repository) to prepare the following data:

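{train,valid}.tsv audio file lists
{train,valid}.km frame-aligned pseudo-label files
dict.km.txt a dummy dictionary of the pseudo-label classes. The label_rate is the same as the frame rate of the features used for clustering: 100 Hz for MFCC features and, by default, 50 Hz for HuBERT features.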
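To make the relationship between these files a little more concrete, the sketch below builds the lines of a dummy dict.km.txt (one line per cluster id with a placeholder count of 1) and computes roughly how many frame labels one utterance should have at a given label rate. The helper names, the cluster count, and the 16 kHz sample rate are assumptions chosen for illustration; they are not fairseq functions.

```python
def dummy_dict_lines(n_clusters):
    """One line per k-means cluster id, each with a placeholder count of 1."""
    return [f"{i} 1" for i in range(n_clusters)]


def expected_num_labels(num_samples, sample_rate=16000, label_rate=100):
    """Approximate number of frame labels for an utterance of num_samples samples."""
    return num_samples * label_rate // sample_rate


# Three clusters only, to keep the printed output short.
print(dummy_dict_lines(3))                         # ['0 1', '1 1', '2 1']
# Two seconds of 16 kHz audio at two example label rates.
print(expected_num_labels(32000, label_rate=100))  # 200
print(expected_num_labels(32000, label_rate=50))   # 100
```

If the number of labels on a line of a .km file is far from this estimate, the label rate passed to training probably does not match the features that were clustered.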

Pre-training a HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, and the label rate is 100 Hz. Run the following command to pre-train a base model (12 Transformer layers):

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/data task.label_dir=/path/to/labels task.labels='["km"]' model.label_rate=100

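Fine-tuning HuBERT with a CTC loss

Suppose {train,valid}.tsv are saved at /path/to/data, and their corresponding character-level transcripts {train,valid}.ltr are saved at /path/to/trans. To fine-tune a pre-trained HuBERT model stored at /path/to/checkpoint, run: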

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/data task.label_dir=/path/to/trans \
  model.w2v_path=/path/to/checkpoint
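For readers unsure what a .ltr file contains: each line is the character-level transcript of the utterance on the matching line of the .tsv file, with characters separated by spaces and `|` marking word boundaries. The helper below is a hypothetical sketch of that conversion for a plain-text transcript; the function name is an assumption, not part of fairseq.

```python
def to_ltr(transcript):
    """Convert a word-level transcript into a letter-level .ltr line."""
    words = transcript.strip().split()
    # Join words with "|", then put a space between every character.
    return " ".join("|".join(words)) + " |"


print(to_ltr("HELLO WORLD"))  # H E L L O | W O R L D |
```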

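* Compared with conventional acoustic model training, training with CTC as the loss function is fully end-to-end: it does not require the data to be aligned in advance, only an input sequence and an output sequence. There is therefore no need to align the audio with the transcript or to label every frame. The per-frame outputs of the network directly define the probability of the predicted label sequence, so apart from merging repeated labels and removing blanks, no separate post-processing stage is needed.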
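As a small illustration of the alignment-free idea, here is a minimal sketch of how per-frame CTC predictions become a label sequence under greedy decoding: take the most likely symbol per frame, merge consecutive repeats, then drop the blank symbol. The blank index of 0 and the function name are assumptions for illustration; fairseq's own decoders are more involved.

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Merge consecutive repeated ids, then remove blanks."""
    out = []
    prev = None
    for i in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank symbol.
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


# Frames: blank, 1, 1, blank, 1, 2, 2, blank
print(ctc_greedy_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that the blank between the two runs of 1 is what lets the same label appear twice in a row in the output.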

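Decoding a HuBERT model

Suppose test.tsv and test.ltr are the audio file list and the transcripts of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint. Three decoding modes are supported: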

Viterbi decoding: greedy decoding without a language model
KenLM decoding: decoding with a KenLM n-gram language model in arpa format
Fairseq-LM decoding: decoding with a Fairseq neural language model

Viterbi decoding

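task.normalize must be set to the same value that was used during fine-tuning (replace [true|false] in the command with that value). Decoding results will be saved at /path/to/experiment/directory/decode/viterbi/test.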

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_viterbi \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test

KenLM / Fairseq-LM decoding

Suppose the pronunciation lexicon and the n-gram language model are saved at /path/to/lexicon and /path/to/arpa respectively. Decoding results will be saved at /path/to/experiment/directory/decode/kenlm/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_kenlm \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test \
  decoding.decoder.lexicon=/path/to/lexicon \
  decoding.decoder.lmpath=/path/to/arpa
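For a letter-based model, the lexicon maps each word to its spelling in the same symbols used by the .ltr transcripts. The sketch below builds such lines from a word list; the tab-separated layout with a trailing `|` is an assumption based on the lexicons commonly used with fairseq's letter models, and the function name is hypothetical, so check it against the lexicon you actually use.

```python
def lexicon_lines(words):
    """Build 'WORD<TAB>W O R D |' lines for a letter-based lexicon (assumed format)."""
    lines = []
    for word in sorted(set(words)):
        spelling = " ".join(word)  # one space between every character
        lines.append(word + "\t" + spelling + " |")
    return lines


for line in lexicon_lines(["HELLO", "A", "HELLO"]):
    print(line)
```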

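The decoding commands above use the default decoding hyperparameters, which can be found in examples/speech_recognition/hydra/decoder.py. These can also be overridden on the command line; for example, to search with a beam size of 500, append decoding.decoder.beam=500 to the command. Important hyperparameters include: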