DAY04 - Seamless_align, the Dataset Used by SeamlessM4T

2023 iThome Ironman Contest

AI & Data

Part 4 of the series "Learning Speech Recognition Architecture and Applications with SeamlessM4T"

2023-09-19 16:12:00

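SeamlessM4T uses the Seamless_align dataset, which is published as metadata describing the data used to train the model. Its format is similar to that of NLLB (No Language Left Behind): the metadata is stored as gzip-compressed, tab-separated files, and each file covers a single alignment direction.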

The Seamless_align dataset

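Dataset files are named as follows:

- Text is denoted by a three-letter code, such as `fra`, `eng`, `tur`
- Speech is denoted by two letters followed by a capital A, such as `frA`, `enA`, `trA`

For example, `eng-trA` denotes English text translated into Turkish speech.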
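
The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the official tooling: `Side`, `parse_side`, and `parse_direction` are hypothetical names introduced here for illustration, and the sketch assumes every side code is exactly three characters long, as in the examples.

```python
from typing import NamedTuple


class Side(NamedTuple):
    code: str       # the code exactly as it appears in the direction name
    modality: str   # "speech" or "text"


def parse_side(code: str) -> Side:
    """Classify one side of a direction, e.g. "eng" (text) or "trA" (speech)."""
    if len(code) != 3:
        raise ValueError(f"expected a three-character code, got {code!r}")
    # Speech sides: two lowercase letters followed by a capital "A".
    if code[:2].islower() and code[2] == "A":
        return Side(code, "speech")
    # Text sides: three lowercase letters.
    if code.islower():
        return Side(code, "text")
    raise ValueError(f"unrecognised side code: {code!r}")


def parse_direction(direction: str) -> tuple[Side, Side]:
    """Split a direction such as "eng-trA" into its two classified sides."""
    first, second = direction.split("-")
    return parse_side(first), parse_side(second)


if __name__ == "__main__":
    print(parse_direction("eng-trA"))
```

Keeping the original code alongside the modality makes it easy to route each side to a text or audio loader later on.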

Each speech row contains 11 fields:

- `cc_warc`: The warc file reference containing the public audio url
- `cc_sha`: not used
- `audio_speeh_segment_url`: space separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: not used
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
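
To make the layout concrete, here is a minimal sketch that reads one metadata file into dictionaries keyed by the field names listed above. It rests on two assumptions that the text does not confirm: the columns appear in the order listed, and the file has no header row. `SPEECH_FIELDS` and `read_speech_rows` are names made up for this example.

```python
import gzip
from typing import Dict, Iterator

# Field names in the order listed above (assumed to match the column order).
SPEECH_FIELDS = [
    "cc_warc", "cc_sha", "audio_speeh_segment_url", "cc_lineno",
    "paragraph_digest", "sentence_digest", "text_lid_score",
    "laser_score", "direction", "side", "line_no",
]


def read_speech_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per speech row of a gzip-compressed, tab-separated file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            columns = line.rstrip("\n").split("\t")
            if len(columns) != len(SPEECH_FIELDS):
                # Not an 11-column row (e.g. a text row); skip it in this sketch.
                continue
            yield dict(zip(SPEECH_FIELDS, columns))
```

All values are left as strings so that unused columns pass through untouched; convert `laser_score` and `line_no` to numbers only where they are needed.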

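The `audio_speeh_segment_url` field has the format `<url> <start_frame> <end_frame>`, where `start_frame` and `end_frame` mark the start and end samples of the segment to extract from the audio file at `<url>`, after that audio has been resampled to 16000 Hz.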
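
As a rough illustration of that format, the sketch below splits the field into its three parts and converts the frame offsets into seconds at 16000 Hz. `AudioSegment` and `parse_audio_reference` are hypothetical helpers written for this article, and the example URL is a placeholder.

```python
from typing import NamedTuple

SAMPLE_RATE_HZ = 16_000  # frame offsets refer to audio resampled to 16 kHz


class AudioSegment(NamedTuple):
    url: str
    start_frame: int
    end_frame: int

    @property
    def start_seconds(self) -> float:
        return self.start_frame / SAMPLE_RATE_HZ

    @property
    def duration_seconds(self) -> float:
        return (self.end_frame - self.start_frame) / SAMPLE_RATE_HZ


def parse_audio_reference(value: str) -> AudioSegment:
    """Parse "<url> <start_frame> <end_frame>" into an AudioSegment."""
    # Split from the right so the last two tokens are always the frame offsets.
    url, start, end = value.rsplit(" ", 2)
    segment = AudioSegment(url, int(start), int(end))
    if not 0 <= segment.start_frame < segment.end_frame:
        raise ValueError(f"invalid frame range in {value!r}")
    return segment


if __name__ == "__main__":
    seg = parse_audio_reference("https://example.com/talk.mp3 16000 48000")
    print(seg.start_seconds, seg.duration_seconds)
```

Once the audio at `url` has been downloaded and resampled to 16 kHz, the segment would be the slice `waveform[start_frame:end_frame]` of the sample array.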

Text rows follow a format similar to NLLB. If the metadata comes from crawled web data, the fields are:

- `cc_warc`: the reference to the Common Crawl WET file
- `cc_sha`: the document sha1 in the WET file
- `cc_document_url`: the url of the document referenced in the WET file
- `cc_lineno`: the line number in the document referenced in the WET file
- `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
- `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
- `text_lid_score`: language identification score, when available
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number

If the text data comes from another corpus, the fields are:

- `corpus`: corpus name
- `cc_sha`: not used
- `cc_document_url`: not used
- `lineno`: line number in the document
- `paragraph_digest`: `xxhash.xxh3_64_intdigest` of the paragraph
- `sentence_digest`: `xxhash.xxh3_64_intdigest` of the sentence
- `text_lid_score`: language identification score, when available
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number
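
All three layouts end with the same four fields: `laser_score`, `direction`, `side`, and `line_no`. One possible way to use them is sketched below: rows are grouped by `direction` and `line_no` so that the two sides of an alignment end up together. This assumes that both sides of one alignment share the same `line_no`, which the field descriptions suggest but do not state outright. `group_alignments` and its `min_score` parameter are hypothetical, and no particular threshold value is recommended here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, str]
AlignmentKey = Tuple[str, int]  # (direction, line_no)


def group_alignments(
    rows: Iterable[Row], min_score: Optional[float] = None
) -> Dict[AlignmentKey, Dict[str, Row]]:
    """Group rows by (direction, line_no), keyed inside each group by side.

    Assumption: both sides of one alignment carry the same line_no.
    Rows whose laser_score is below min_score are dropped when it is given.
    """
    groups: Dict[AlignmentKey, Dict[str, Row]] = defaultdict(dict)
    for row in rows:
        if min_score is not None and float(row["laser_score"]) < min_score:
            continue
        key = (row["direction"], int(row["line_no"]))
        groups[key][row["side"]] = row
    return dict(groups)
```

A group that ends up with only one side indicates either a filtered row or that the pairing assumption does not hold for that file, so it is worth checking group sizes before relying on the result.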

Summary

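SeamlessM4T's training data comes from Seamless_align, whose format closely follows that of NLLB and which covers more than 270,000 hours of speech. In each speech row, the `audio_speeh_segment_url` field gives the start and end of the segment, so only the audio that is actually needed has to be extracted. The metadata can be downloaded and used as a reference if you want to train your own model later.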
