#16 Build Your Cloud Machine Learning Compute Platform with Colab (1/2)

2023 iThome Ironman Contest

DAY 17

SideProject30

Laravel Extended Universe: Building a Product Unicorn at 10x Speed, from 1 to 100 (series, part 17)

Tags: 15th Ironman Contest, colab, openai

Bill

Team: 所以隊名要叫什麼

2023-10-02 09:53:02

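In this age of exploding digital information, whoever commands information is like the prospector who strikes the mother lode in a gold rush, digging valuable data out of a mountain of digital noise.

In the current wave of voice applications, the ability to convert audio data into text greatly broadens the scope of data analysis and the applications built on it. In this article, we explore how to use Google Colab and machine learning models to build a cloud-based speech-to-text (STT) system.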

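What Is Speech-to-Text (STT)?

Speech-to-Text (STT) is a technology that automatically converts speech into text. It has dedicated use cases in many fields, such as subtitle transcription, speech recognition, and virtual assistants.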
Figure: STT workflow

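What Is Colab?

Google Colab is a free, cloud-hosted Jupyter Notebook environment. It offers free access to GPU and TPU resources, so you can run machine learning and data analysis projects in the cloud, and it comes with a pleasant UI.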
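Since the rest of this walkthrough relies on a GPU runtime, it can help to confirm that one is attached before loading a model. The following is a minimal sketch using only the standard library; the helper name `detect_gpu` is hypothetical, and it assumes the `nvidia-smi` tool is on the PATH whenever a GPU runtime is active.

```python
import shutil
import subprocess
from typing import Optional


def detect_gpu() -> Optional[str]:
    """Return the first GPU name reported by nvidia-smi, or None if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools are missing, so this is likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    names = result.stdout.strip().splitlines()
    return names[0].strip() if names else None


gpu_name = detect_gpu()
print(gpu_name or "No GPU detected; try Runtime > Change runtime type in Colab.")
```

If the function returns None, the CPU fallback with INT8 shown later in this article is the safer choice.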

Beginning

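Before we start, let's get to know the technology chosen here: Whisper, developed by OpenAI.

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. OpenAI trained Whisper on 680,000 hours of multilingual (98 languages) and multitask supervised data collected from the web. The training data covers a wide range of accents, background noise, and technical jargon from around the world. OpenAI believes that such a large and diverse dataset improves Whisper's recognition ability.

Beyond speech recognition, Whisper can transcribe speech in many languages and translate those languages into English. OpenAI has open-sourced the Whisper models and inference code, hoping that Whisper will serve as a foundation for building useful applications and for further research on speech processing.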
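To make the difference between transcription and translation concrete, here is a small sketch that assembles the arguments for the `whisper` command-line tool. The helper `build_whisper_command` and the file name `interview.wav` are hypothetical; the sketch only builds the argument list and does not run Whisper itself.

```python
import shlex
from typing import List


def build_whisper_command(
    audio_path: str,
    model: str = "large",
    language: str = "en",
    task: str = "transcribe",
) -> List[str]:
    """Build the argument list for the whisper CLI.

    task="transcribe" keeps the spoken language; task="translate" outputs English.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", language,
        "--task", task,
    ]


# Hypothetical example: translate a Japanese recording into English text.
cmd = build_whisper_command("interview.wav", model="medium", language="ja", task="translate")
print(shlex.join(cmd))
```

In a Colab cell, the printed line can be pasted after a leading `!` to run it.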

Figure: overview of OpenAI Whisper (flow diagram from the OpenAI Whisper project)

Benchmarks

The following comparison of several major optimized implementations comes from the Whisper JAX project.

Transcription time in seconds, by audio length:

             OpenAI    Transformers  Whisper JAX  Whisper JAX
Framework    PyTorch   PyTorch       JAX          JAX
Backend      GPU       GPU           GPU          TPU
1 min        13.8      4.54          1.72         0.45
10 min       108.3     20.2          9.38         2.01
1 hour       1001.0    126.1         75.3         13.8
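The benchmark numbers are easier to compare as ratios. The sketch below copies the values from the table and derives two figures: the speedup over the original OpenAI implementation, and how many seconds of audio are processed per second of compute. It assumes the table values are transcription times in seconds; the dictionary keys are labels chosen for this example.

```python
# Audio lengths in seconds for each benchmark row.
AUDIO_SECONDS = {"1 min": 60, "10 min": 600, "1 hour": 3600}

# Values copied from the benchmark table (assumed to be seconds).
TIMES = {
    "OpenAI (PyTorch, GPU)": {"1 min": 13.8, "10 min": 108.3, "1 hour": 1001.0},
    "Transformers (PyTorch, GPU)": {"1 min": 4.54, "10 min": 20.2, "1 hour": 126.1},
    "Whisper JAX (JAX, GPU)": {"1 min": 1.72, "10 min": 9.38, "1 hour": 75.3},
    "Whisper JAX (JAX, TPU)": {"1 min": 0.45, "10 min": 2.01, "1 hour": 13.8},
}

BASELINE = "OpenAI (PyTorch, GPU)"


def speedup(name: str, length: str) -> float:
    """How many times faster `name` is than the baseline for the given audio length."""
    return TIMES[BASELINE][length] / TIMES[name][length]


def audio_per_compute_second(name: str, length: str) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return AUDIO_SECONDS[length] / TIMES[name][length]


for name in TIMES:
    print(
        f"{name}: {speedup(name, '1 hour'):.1f}x vs baseline, "
        f"{audio_per_compute_second(name, '1 hour'):.0f}s of audio per second"
    )
```

The gap widens with longer audio, which matters for the podcast-length files this series targets.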

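In the nearly one year since OpenAI released Whisper, people across the community have been working to improve both its transcription accuracy and its speed. Examples include faster-whisper and whisper.cpp, and even whisper.wasm, which runs in the browser.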

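Hands-On

Before building the full pipeline, let's first test whether the basic process meets our expectations.

First, mount Google Drive onto the disk of the VM that the notebook runs on: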

from google.colab import drive
drive.mount('/content/drive')
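Once the drive is mounted, its contents appear as ordinary files under `/content/drive/MyDrive`. The following sketch lists the audio files in one folder; the folder name `podcasts`, the helper `list_audio_files`, and the set of extensions are assumptions for illustration, so adjust them to your own layout.

```python
from pathlib import Path
from typing import List

# Extensions treated as audio in this sketch (an assumption, extend as needed).
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a"}


def list_audio_files(folder: Path) -> List[Path]:
    """Return audio files directly inside `folder`, sorted by path."""
    if not folder.is_dir():
        return []  # the drive is not mounted or the folder does not exist
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


# Hypothetical folder inside the mounted drive.
audio_dir = Path("/content/drive/MyDrive/podcasts")
for path in list_audio_files(audio_dir):
    print(path.name)
```

Outside Colab, or before mounting, the function simply returns an empty list instead of raising.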

Once Whisper is installed, it can read and transcribe a file directly:

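# Install Whisper from source, plus ffmpeg for audio decoding
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install -y ffmpeg
# Transcribe output.wav as English with the large model
!whisper "output.wav" --model large --language en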

Output:

[00:00.000 --> 00:24.800]  At the end of the Cold War, the United States made a policy decision that may be one of
[00:24.800 --> 00:27.880]  the biggest mistakes of the 20th century.
[00:27.880 --> 00:32.600]  It's contributed to chaos and uncertainty in this current day.
[00:32.600 --> 00:37.280]  And it's not based on politics, it's based on games.
[00:37.280 --> 00:40.640]  In game theory, there are two types of games.
[00:40.640 --> 00:44.600]  There are finite games and there are infinite games.
[00:44.600 --> 00:51.320]  A finite game is defined as known players, fixed rules, and agreed upon objective.
[00:51.320 --> 00:53.600]  Baseball, right?
[00:53.600 --> 00:57.280]  An infinite game is defined as known and unknown players.
[00:57.280 --> 01:02.800]  The rules are changeable and the objective is to perpetuate the game.
[01:02.800 --> 01:07.160]  When you pit a finite player versus a finite player, the system is stable.
[01:07.160 --> 01:08.360]  Baseball is stable.
[01:08.360 --> 01:10.800]  So is conventional war for that matter.
[01:10.800 --> 01:15.800]  When you pit an infinite player versus an infinite player, the system is also stable.
[01:15.800 --> 01:17.520]  The Cold War was stable.

The transcription quality looks very close to the accuracy we expected.

Next, let's try faster-whisper and see how it does:
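If you capture the CLI output shown above as text, each line can be turned back into structured data. This is a minimal sketch with hypothetical helper names (`to_seconds`, `parse_transcript`); it assumes lines of the form `[MM:SS.mmm --> MM:SS.mmm]  text` and also accepts an optional leading hours field.

```python
import re
from typing import List, Tuple

# Matches lines such as "[00:51.320 --> 00:53.600]  Baseball, right?"
LINE_RE = re.compile(r"^\[(?P<start>[\d:.]+) --> (?P<end>[\d:.]+)\]\s*(?P<text>.*)$")


def to_seconds(stamp: str) -> float:
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_transcript(raw: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) tuples for every timestamped line in `raw`."""
    segments = []
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            segments.append(
                (to_seconds(match["start"]), to_seconds(match["end"]), match["text"].strip())
            )
    return segments


sample = """\
[00:40.640 --> 00:44.600]  There are finite games and there are infinite games.
[00:51.320 --> 00:53.600]  Baseball, right?
[01:02.800 --> 01:07.160]  When you pit a finite player versus a finite player, the system is stable.
"""

for start, end, text in parse_transcript(sample):
    print(f"{start:7.2f} {end:7.2f} {text}")
```

Lines that do not match the pattern, such as progress messages, are skipped rather than treated as errors.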

!pip install "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"

# Convert model for faster-whisper
!ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 --copy_files tokenizer.json --quantization float16
!ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --copy_files tokenizer.json --quantization float16
!ct2-transformers-converter --model openai/whisper-small --output_dir whisper-small-ct2 --copy_files tokenizer.json --quantization float16

from faster_whisper import WhisperModel
import csv

model_path = "whisper-large-v2-ct2/"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")

segments, info = model.transcribe(audio="output.wav", beam_size=5,  language="en")
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

captions = []
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    line = {'start':segment.start, 'end':segment.end, 'text': segment.text}
    captions.append(line)

with open("captions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "end", "text"])
    for caption in captions:
        writer.writerow(['%.2f' % caption["start"], '%.2f' % caption["end"], caption["text"]])

Output:

[0.00s -> 16.60s]  Thanks very much.
[16.60s -> 24.82s]  At the end of the Cold War, the United States made a policy decision that may be one of
[24.82s -> 27.90s]  the biggest mistakes of the 20th century.
[27.90s -> 32.58s]  It's contributed to chaos and uncertainty in this current day.
[32.58s -> 37.26s]  And it's not based on politics, it's based on games.
[37.26s -> 40.66s]  In game theory, there are two types of games.
[40.66s -> 44.58s]  There are finite games and there are infinite games.
[44.58s -> 51.34s]  A finite game is defined as known players, fixed rules, and agreed upon objective.
[51.34s -> 53.58s]  Baseball, right?
[53.58s -> 57.26s]  An infinite game is defined as known and unknown players.
[57.26s -> 62.78s]  The rules are changeable and the objective is to perpetuate the game.
[62.78s -> 67.14s]  When you pit a finite player versus a finite player, the system is stable.
[67.14s -> 68.34s]  Baseball is stable.
[68.34s -> 70.78s]  So is conventional war, for that matter.
[70.78s -> 75.78s]  When you pit an infinite player versus an infinite player, the system is also stable.
[75.78s -> 77.50s]  The Cold War was stable.
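The `captions.csv` file written above uses plain seconds for its timestamps. If you need subtitles instead, the same rows can be converted to the SRT format. The sketch below is one way to do it under that CSV layout (`start`, `end`, `text` columns); the helper names `srt_timestamp` and `csv_to_srt` are hypothetical.

```python
import csv
import io


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def csv_to_srt(csv_text: str) -> str:
    """Convert CSV text with start,end,text columns into SRT subtitle text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for index, row in enumerate(reader, start=1):
        start = srt_timestamp(float(row["start"]))
        end = srt_timestamp(float(row["end"]))
        blocks.append(f"{index}\n{start} --> {end}\n{row['text'].strip()}\n")
    return "\n".join(blocks)


# Two rows in the same shape as captions.csv.
sample_csv = (
    "start,end,text\n"
    "0.00,16.60, Thanks very much.\n"
    "75.78,77.50, The Cold War was stable.\n"
)
print(csv_to_srt(sample_csv))
```

To convert the real file, read it with `open("captions.csv", newline="").read()` and pass the text to `csv_to_srt`.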

Tomorrow we will wire the whole pipeline together and confirm a workflow and code that actually work.

Planned Workflow

Crawler

Get the pending list of podcast episodes and download the audio files.
Save the files to Google Drive.

Colab

Connect Google Drive to Colab as a folder.
Install and set up OpenAI Whisper.
Loop over the list and transcribe each audio file.
Save the captions to Google Drive.
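The Colab half of this plan can be sketched as a loop that skips episodes which already have captions. This is only an outline under stated assumptions: the function `transcribe_pending`, the `.wav` filter, and the one-CSV-per-episode layout are hypothetical, and the actual speech model is passed in as a callable so the sketch runs without Whisper installed.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def transcribe_pending(
    audio_dir: Path,
    captions_dir: Path,
    transcribe: Callable[[Path], Iterable[Segment]],
) -> List[Path]:
    """Transcribe every .wav file that has no caption CSV yet; return the new CSV paths."""
    captions_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for audio in sorted(audio_dir.glob("*.wav")):
        target = captions_dir / f"{audio.stem}.csv"
        if target.exists():
            continue  # already transcribed in an earlier run
        # Collect segments first so a failed transcription leaves no partial file.
        segments = list(transcribe(audio))
        with target.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["start", "end", "text"])
            for start, end, text in segments:
                writer.writerow([f"{start:.2f}", f"{end:.2f}", text])
        written.append(target)
    return written


def fake_transcribe(audio: Path) -> List[Segment]:
    """Stand-in for a real model call, used only to demonstrate the loop."""
    return [(0.0, 1.5, f"placeholder text for {audio.name}")]


with tempfile.TemporaryDirectory() as tmp:
    demo_audio = Path(tmp) / "audio"
    demo_audio.mkdir()
    (demo_audio / "episode-001.wav").write_bytes(b"")
    done = transcribe_pending(demo_audio, Path(tmp) / "captions", fake_transcribe)
    print([path.name for path in done])
```

In the notebook, `fake_transcribe` would be replaced by a function that wraps the model call and yields `(segment.start, segment.end, segment.text)` for each segment, with both directories pointing into the mounted drive.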

References

Ggerganov. (n.d.). GitHub - ggerganov/whisper.cpp: Port of OpenAI’s Whisper model in C/C++. GitHub. https://github.com/ggerganov/whisper.cpp
Openai. (n.d.). GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision. GitHub. https://github.com/openai/whisper
Softcatala. (n.d.). GitHub - Softcatala/whisper-ctranslate2: Whisper command line client compatible with original OpenAI client based on CTranslate2. GitHub. https://github.com/Softcatala/whisper-ctranslate2
Whisper – open source speech recognition by OpenAI | Hacker News. (n.d.). https://news.ycombinator.com/item?id=32927360