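SeamlessM4T is trained on the Seamless_align dataset. The release contains the metadata for the data used to train the model, in a format similar to that of NLLB (No Language Left Behind). The metadata is distributed as tab-separated, gzip-compressed files, and each file corresponds to one alignment direction. Each side of a direction is named as follows: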
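- Text is denoted by three letters, e.g. `fra`, `eng`, `tur`
- Speech is denoted by two letters followed by a capital A, e.g. `frA`, `enA`, `trA`

For example, `eng-trA` denotes English text paired with Turkish speech.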
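
The naming convention above can be checked mechanically. The following is a minimal sketch; the helper names `parse_side` and `parse_direction` are hypothetical and not part of any released tooling, and it assumes the two sides are always joined by a single hyphen.

```python
def parse_side(code: str) -> tuple[str, str]:
    """Split one side of a direction into (language_code, modality)."""
    # Speech: two lowercase letters followed by a capital A, e.g. "trA".
    if len(code) == 3 and code.endswith("A") and code[:2].islower():
        return code[:2], "speech"
    # Text: three lowercase letters, e.g. "eng".
    if len(code) == 3 and code.islower():
        return code, "text"
    raise ValueError(f"unrecognised side code: {code!r}")


def parse_direction(direction: str) -> list[tuple[str, str]]:
    """Parse a direction such as 'eng-trA' into its two sides."""
    left, right = direction.split("-")
    return [parse_side(left), parse_side(right)]
```

With this sketch, `parse_direction("eng-trA")` yields an English text side and a Turkish speech side.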
For audio, the columns are:

- `cc_warc`: the WARC file reference containing the public audio URL
- `cc_sha`: not used
- `audio_speeh_segment_url`: space separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: not used
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
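
As a rough illustration of how such a file could be read, here is a sketch using only the Python standard library. It assumes the files have no header row and that the columns appear in exactly the order listed above; `read_metadata` and `AUDIO_COLUMNS` are names made up for this example.

```python
import csv
import gzip

# Column order as listed above for audio rows (assumed to match the files).
AUDIO_COLUMNS = [
    "cc_warc", "cc_sha", "audio_speeh_segment_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


def read_metadata(path, columns=AUDIO_COLUMNS):
    """Yield one dict per line of a tab-separated, gzip-compressed file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != len(columns):
                continue  # skip malformed lines rather than guessing
            yield dict(zip(columns, row))
```

All values are returned as strings; converting `laser_score` or `line_no` to numbers is left to the caller.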
The `audio_speeh_segment_url` field has the form `<url> <start_frame> <end_frame>`, where `start_frame` and `end_frame` delimit the segment of samples to extract from the audio at `<url>`, with the audio resampled to 16000 Hz.
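
The segment reference can be split and applied to a waveform along these lines. This is only a sketch: `AudioSegment`, `parse_segment`, and `cut` are hypothetical names, the waveform is assumed to be already downloaded and resampled to 16000 Hz, and `end_frame` is treated as exclusive, which the format description does not specify.

```python
from typing import NamedTuple, Sequence

SAMPLE_RATE = 16_000  # frames are counted at 16000 Hz, as stated above


class AudioSegment(NamedTuple):
    url: str
    start_frame: int
    end_frame: int

    @property
    def duration_seconds(self) -> float:
        return (self.end_frame - self.start_frame) / SAMPLE_RATE


def parse_segment(field: str) -> AudioSegment:
    """Parse '<url> <start_frame> <end_frame>' into its three parts."""
    # Split from the right so only the last two tokens are read as frames.
    url, start, end = field.rsplit(" ", 2)
    return AudioSegment(url, int(start), int(end))


def cut(samples: Sequence[float], segment: AudioSegment) -> Sequence[float]:
    """Slice a waveform that has ALREADY been resampled to 16000 Hz.

    Assumption: end_frame is exclusive.
    """
    return samples[segment.start_frame:segment.end_frame]
```

For instance, a reference with `start_frame` 16000 and `end_frame` 48000 describes a two-second segment that begins one second into the audio.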
For text taken from Common Crawl, the columns are:

- `cc_warc`: the reference to the Common Crawl WET file
- `cc_sha`: the document sha1 in the WET file
- `cc_document_url`: the URL of the document referenced in the WET file
- `cc_lineno`: the line number in the document referenced in the WET file
- `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
- `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
- `text_lid_score`: language identification score, when available
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number

If the text data comes from another corpus, the fields are as follows:
- `corpus`: corpus name
- `cc_sha`: not used
- `cc_document_url`: not used
- `lineno`: line number in the document
- `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
- `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
- `text_lid_score`: language identification score, when available
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
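- `line_no`: alignment number

In summary, the training data for SeamlessM4T comes from Seamless_align, whose format is close to that of NLLB and which covers more than 270,000 hours of speech. For speech rows, the `audio_speeh_segment_url` field gives the start and end of each segment, so only the audio that is actually needed has to be extracted. The metadata can be downloaded and used as a reference when training your own model.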
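
To get a feel for how much audio a subset of the metadata covers, the segment boundaries can be summed directly without downloading anything. The sketch below assumes the rows are dicts keyed by the column names, that they have already been restricted to the speech side, and that a higher `laser_score` means a better alignment; `total_hours` and its `min_laser_score` threshold are hypothetical, not values from the dataset documentation.

```python
SAMPLE_RATE = 16_000  # segment boundaries are expressed at 16000 Hz


def total_hours(rows, min_laser_score=0.0):
    """Sum segment lengths, in hours, for rows at or above a score threshold."""
    total_frames = 0
    for row in rows:
        if float(row["laser_score"]) < min_laser_score:
            continue  # drop alignments below the (hypothetical) threshold
        _url, start, end = row["audio_speeh_segment_url"].rsplit(" ", 2)
        total_frames += int(end) - int(start)
    return total_frames / SAMPLE_RATE / 3600
```

Raising `min_laser_score` trades quantity for alignment quality, which can be a useful first filter before fetching any audio.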