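In this era of exploding digital information, whoever controls information is like the winner of a gold rush who controls the mine: they are the ones who dig valuable data out of the digital mountain.
With voice applications on the rise, being able to turn audio into text greatly widens what can be analyzed and built from that data. In this article, we explore how to use Google Colab and related machine learning models to build a cloud-based Speech-to-Text (STT) system.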
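Speech-to-Text (STT) is a technology that automatically converts speech into text. It is used in many fields, for example subtitle transcription, voice recognition, and virtual assistants.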
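Google Colab is a free, cloud-hosted Jupyter Notebook environment. It offers free GPU and TPU resources, so you can run machine learning and data analysis projects in the cloud through a convenient UI.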
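Because the GPU is what makes Colab attractive for speech models, it can help to confirm that the runtime actually has one attached before going further. The following is a minimal sketch, assuming the `nvidia-smi` tool is on the path whenever an NVIDIA GPU is present; the helper name `gpu_names` is made up for illustration.

```python
import shutil
import subprocess

def gpu_names():
    """Return the names of NVIDIA GPUs visible to this runtime (empty list if none)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

names = gpu_names()
# Fall back to the CPU when no GPU is attached to the runtime.
device = "cuda" if names else "cpu"
print(f"GPUs: {names or 'none'}; using device={device!r}")
```

If the list comes back empty in Colab, changing the runtime type to a GPU runtime and reconnecting usually resolves it.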
Before we start, let's get to know the technology we chose for this project: OpenAI's own Whisper.
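Whisper is an automatic speech recognition (ASR) system developed by OpenAI. OpenAI trained it on 680,000 hours of multilingual (98 languages) and multitask supervised data collected from the web. The training data covers a wide range of accents, background noise, and technical jargon from around the world. OpenAI believes that such a large and diverse dataset improves Whisper's recognition ability.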
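Beyond speech recognition, Whisper can transcribe speech in many languages and translate those languages into English. OpenAI has released Whisper's models and inference code to developers, hoping that Whisper will serve as a foundation for building useful applications and for further research on speech processing.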
Figure: an overview of OpenAI Whisper
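The following table, taken from the Whisper JAX project, compares several of the major improved implementations across three audio lengths.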
| | OpenAI | Transformers | Whisper JAX | Whisper JAX |
|---|---|---|---|---|
| Framework | PyTorch | PyTorch | JAX | JAX |
| Backend | GPU | GPU | GPU | TPU |
| 1 min | 13.8 | 4.54 | 1.72 | 0.45 |
| 10 min | 108.3 | 20.2 | 9.38 | 2.01 |
| 1 hour | 1001.0 | 126.1 | 75.3 | 13.8 |
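The figures in the table are easier to compare as ratios. The sketch below computes each implementation's speedup over the original OpenAI implementation. It also computes a real-time factor, which is only meaningful under the assumption that the table values are transcription times in seconds (the unit is not stated in the table as reproduced here). The dictionary labels are made up for illustration.

```python
# Values copied from the table above (assumed to be transcription times in seconds).
TIMES = {
    "OpenAI (PyTorch, GPU)": {"1 min": 13.8, "10 min": 108.3, "1 hour": 1001.0},
    "Transformers (PyTorch, GPU)": {"1 min": 4.54, "10 min": 20.2, "1 hour": 126.1},
    "Whisper JAX (GPU)": {"1 min": 1.72, "10 min": 9.38, "1 hour": 75.3},
    "Whisper JAX (TPU)": {"1 min": 0.45, "10 min": 2.01, "1 hour": 13.8},
}
AUDIO_SECONDS = {"1 min": 60, "10 min": 600, "1 hour": 3600}
BASELINE = "OpenAI (PyTorch, GPU)"

def speedup(name, length):
    """How many times faster `name` is than the baseline for one audio length."""
    return TIMES[BASELINE][length] / TIMES[name][length]

def real_time_factor(name, length):
    """Seconds of audio processed per second of compute (valid only if times are seconds)."""
    return AUDIO_SECONDS[length] / TIMES[name][length]

for name in TIMES:
    print(
        f"{name:28s} speedup on 1 hour: {speedup(name, '1 hour'):6.1f}x, "
        f"real-time factor: {real_time_factor(name, '1 hour'):6.1f}x"
    )
```

On these numbers, Whisper JAX on TPU handles one hour of audio roughly 72 times faster than the original implementation, and the gap widens as the audio gets longer.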
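In the nearly one year since OpenAI released Whisper, people across the community have worked on improving both its transcription accuracy and its speed. Examples include faster-whisper and whisper.cpp, and even whisper.wasm, which can run in the browser.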
Before building the full pipeline, let's first test whether the transcription step meets our expectations.
First, mount Google Drive on the disk of the VM that the notebook runs on:
from google.colab import drive
drive.mount('/content/drive')
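Once the drive is mounted, its files appear as ordinary paths under `/content/drive`. As a rough sketch of how the audio to transcribe could be located, the helper below lists audio files under a folder. The folder `/content/drive/MyDrive/audio`, the extension list, and the name `find_audio_files` are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# Extensions treated as audio here; extend as needed.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}

def find_audio_files(root):
    """Return audio files under `root`, sorted by path; empty list if root is missing."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )

# Hypothetical folder inside the mounted Drive; adjust to your own layout.
for path in find_audio_files("/content/drive/MyDrive/audio"):
    print(path)
```

Returning an empty list for a missing folder keeps the cell from failing when the drive has not been mounted yet.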
Once whisper is installed, you can read an audio file and transcribe it directly:
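# Install whisper from the official repository
!pip install git+https://github.com/openai/whisper.git
# ffmpeg is needed to decode audio; -y answers the install prompt automatically
!sudo apt update && sudo apt install -y ffmpeg
# Transcribe the English audio file with the large model
!whisper "output.wav" --model large --language en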
Output:
[00:00.000 --> 00:24.800] At the end of the Cold War, the United States made a policy decision that may be one of
[00:24.800 --> 00:27.880] the biggest mistakes of the 20th century.
[00:27.880 --> 00:32.600] It's contributed to chaos and uncertainty in this current day.
[00:32.600 --> 00:37.280] And it's not based on politics, it's based on games.
[00:37.280 --> 00:40.640] In game theory, there are two types of games.
[00:40.640 --> 00:44.600] There are finite games and there are infinite games.
[00:44.600 --> 00:51.320] A finite game is defined as known players, fixed rules, and agreed upon objective.
[00:51.320 --> 00:53.600] Baseball, right?
[00:53.600 --> 00:57.280] An infinite game is defined as known and unknown players.
[00:57.280 --> 01:02.800] The rules are changeable and the objective is to perpetuate the game.
[01:02.800 --> 01:07.160] When you pit a finite player versus a finite player, the system is stable.
[01:07.160 --> 01:08.360] Baseball is stable.
[01:08.360 --> 01:10.800] So is conventional war for that matter.
[01:10.800 --> 01:15.800] When you pit an infinite player versus an infinite player, the system is also stable.
[01:15.800 --> 01:17.520] The Cold War was stable.
The transcription looks very close to the accuracy we expected.
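The same transcription can also be driven from Python rather than the command line, which makes it easier to reuse the segments later. The sketch below uses the `load_model` and `transcribe` calls from the openai-whisper package. The helper names `format_timestamp`, `transcribe_file`, and `print_segments` are made up for illustration, and the timestamp format only approximates the CLI output for audio shorter than one hour.

```python
def format_timestamp(seconds):
    """Format seconds as MM:SS.mmm, similar to the CLI output for short audio."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"

def transcribe_file(path, model_name="large", language="en", task="transcribe"):
    """Transcribe `path` with openai-whisper and return its list of segments.

    Pass task="translate" to get an English translation instead.
    """
    import whisper  # imported lazily so format_timestamp works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(path, language=language, task=task)
    return result["segments"]

def print_segments(segments):
    """Print segments (dicts with 'start', 'end', 'text') one per line."""
    for seg in segments:
        start, end = format_timestamp(seg["start"]), format_timestamp(seg["end"])
        print(f"[{start} --> {end}] {seg['text'].strip()}")

# In Colab, after installing whisper:
# print_segments(transcribe_file("output.wav"))
```

Keeping the segments as a Python list means later steps can post-process them without parsing the CLI text.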
Next, let's try faster-whisper and see how it performs:
!pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
# Convert model for faster-whisper
!ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 --copy_files tokenizer.json --quantization float16
!ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --copy_files tokenizer.json --quantization float16
!ct2-transformers-converter --model openai/whisper-small --output_dir whisper-small-ct2 --copy_files tokenizer.json --quantization float16
from faster_whisper import WhisperModel
import csv
model_path = "whisper-large-v2-ct2/"
# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")
segments, info = model.transcribe(audio="output.wav", beam_size=5, language="en")
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
captions = []
# segments is a generator: transcription actually runs while this loop iterates
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    line = {'start': segment.start, 'end': segment.end, 'text': segment.text}
    captions.append(line)
# Save the captions as CSV, one row per segment
with open("captions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "end", "text"])
    for caption in captions:
        writer.writerow(['%.2f' % caption["start"], '%.2f' % caption["end"], caption["text"]])
Output:
[0.00s -> 16.60s] Thanks very much.
[16.60s -> 24.82s] At the end of the Cold War, the United States made a policy decision that may be one of
[24.82s -> 27.90s] the biggest mistakes of the 20th century.
[27.90s -> 32.58s] It's contributed to chaos and uncertainty in this current day.
[32.58s -> 37.26s] And it's not based on politics, it's based on games.
[37.26s -> 40.66s] In game theory, there are two types of games.
[40.66s -> 44.58s] There are finite games and there are infinite games.
[44.58s -> 51.34s] A finite game is defined as known players, fixed rules, and agreed upon objective.
[51.34s -> 53.58s] Baseball, right?
[53.58s -> 57.26s] An infinite game is defined as known and unknown players.
[57.26s -> 62.78s] The rules are changeable and the objective is to perpetuate the game.
[62.78s -> 67.14s] When you pit a finite player versus a finite player, the system is stable.
[67.14s -> 68.34s] Baseball is stable.
[68.34s -> 70.78s] So is conventional war, for that matter.
[70.78s -> 75.78s] When you pit an infinite player versus an infinite player, the system is also stable.
[75.78s -> 77.50s] The Cold War was stable.
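Since subtitle transcription is one of the use cases mentioned earlier, the collected captions could also be written in a subtitle format. The following is a minimal sketch that converts a list shaped like `captions` (dicts with `start`, `end`, and `text`) into SRT text. SRT output is not part of the original notebook, and the helper names `srt_timestamp` and `captions_to_srt` are made up for illustration.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp form used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def captions_to_srt(captions):
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text' keys."""
    blocks = []
    for index, cap in enumerate(captions, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(cap['start'])} --> {srt_timestamp(cap['end'])}\n"
            f"{cap['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)

# Two segments taken from the output above.
sample = [
    {"start": 0.00, "end": 16.60, "text": " Thanks very much."},
    {"start": 16.60, "end": 24.82,
     "text": " At the end of the Cold War, the United States made a policy decision that may be one of"},
]
print(captions_to_srt(sample))
```

The resulting text can be saved with a `.srt` extension next to `captions.csv` if a video player needs to load it.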
Tomorrow we will wire the whole pipeline together and confirm a workflow and code that work in practice.