#17 Build Your Cloud Machine Learning Computing Platform with Colab (2/2)

2023 iThome Ironman Contest

DAY 18

SideProject30

Laravel Extended Universe: Building a Product Unicorn at 10x Speed from 1 to 100, part 18 of the series

Tags: 15th Ironman Contest, colab, openai

Bill

Team: 所以隊名要叫什麼

2023-10-03 11:35:25

Yesterday we finished experimenting with OpenAI's Whisper on Colab. Today we set up the complete pipeline.

Flow

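Our pipeline has two parts:

Crawler part: fetch the list of podcast audio files that do not have subtitles yet from the ReadCast API, and download them to Google Drive.
Colab part: use Whisper to transcribe the audio into text, and upload the subtitles back to the API.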

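Getting the source material

First we download the podcast audio files, so that later steps can be rerun or retried after a failure. This avoids wasting bandwidth on repeated downloads and also speeds up the pipeline.

For now we have exposed an API that lists the episodes that do not have subtitles yet: https://readcast.app/api/pending-list.
It returns a list in which each entry has an id and a url.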

Example:

{
    "data": [
        {
            "id": "xxxxxxx",
            "url": "https://aaa.bbb.com/nnn/m.mp3"
        }
        // ...
    ]
}
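
To make the shape of this response concrete, here is a minimal Python sketch that pulls (id, url) pairs out of it. The function name `parse_pending_list` and the sample payload are illustrative assumptions; only the `data`, `id`, and `url` fields come from the example above.

```python
import json


def parse_pending_list(payload: str) -> list[tuple[str, str]]:
    """Return (id, url) pairs from a pending-list response body."""
    body = json.loads(payload)
    pairs = []
    for episode in body.get("data", []):
        # Skip entries that lack either field instead of failing the whole run.
        if "id" in episode and "url" in episode:
            pairs.append((episode["id"], episode["url"]))
    return pairs


# Hypothetical sample in the same shape as the example response.
sample = '{"data": [{"id": "xxxxxxx", "url": "https://aaa.bbb.com/nnn/m.mp3"}]}'
print(parse_pending_list(sample))
```

Skipping incomplete entries is a design choice for this sketch; a stricter version could raise an error instead.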

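We use a Bash script for this. The script first creates a folder on Google Drive to hold the audio files, then uses curl to fetch the list from the API. Each entry in the list contains an audio file's ID and URL. The script loops over the list and saves each audio file to Google Drive for later processing.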

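#!/bin/bash

# set folder for audio files
# (with the default Colab mount, Drive files usually live under
#  /content/drive/MyDrive, so adjust this path if needed)
output_dir="/content/drive/podcast_audio"
mkdir -p "$output_dir"

# get list
response=$(curl -s https://readcast.app/api/pending-list)

# download list: emit one "id<TAB>url" line per episode and read it field by field
echo "$response" | jq -r '.data[] | [.id, .url] | @tsv' | while IFS=$'\t' read -r id url; do
    output_path="$output_dir/$id.mp3"

    # skip files that were already downloaded in an earlier run
    if [[ -f "$output_path" ]]; then
        echo "Already downloaded: $output_path"
        continue
    fi

    # download file; remove the partial file if the download fails
    echo "Downloading file from: $url"
    curl -fL -o "$output_path" "$url" || rm -f "$output_path"
done

echo "Download finished"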
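
The point of downloading first is that a rerun should not fetch the same file twice. As a rough Python sketch of that idea (the names `pending_downloads` and `download_all`, and the `.part` suffix, are hypothetical and not part of the original script):

```python
import os
import urllib.request


def pending_downloads(episodes, output_dir):
    """Return (url, path) pairs for episodes whose mp3 is not on disk yet."""
    todo = []
    for episode_id, url in episodes:
        path = os.path.join(output_dir, f"{episode_id}.mp3")
        if not os.path.exists(path):
            todo.append((url, path))
    return todo


def download_all(todo, fetch=urllib.request.urlretrieve):
    """Download each (url, path) pair; `fetch` can be swapped out for testing."""
    for url, path in todo:
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a finished file on the next run.
        partial = path + ".part"
        fetch(url, partial)
        os.replace(partial, path)
```

Calling `download_all(pending_downloads(episodes, output_dir))` twice in a row would download nothing the second time, assuming the first run completed.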

Colab

The Colab part runs in a notebook. First, use Python to mount Google Drive on the disk of the VM that backs the notebook.

from google.colab import drive
drive.mount('/content/drive')

Then install Whisper with pip and ffmpeg with apt.

!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install -y ffmpeg
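
Before starting a long transcription run, it can help to confirm that both commands are actually on the PATH. A small sketch, where `missing_tools` is a hypothetical helper name:

```python
import shutil


def missing_tools(names):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(["whisper", "ffmpeg"])
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("whisper and ffmpeg are both available")
```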

Next, the code loops over the audio files in Google Drive. For each file, it uses Whisper to transcribe the audio into text. Finally, it uses curl to upload the subtitles back to the API.

#!/bin/bash

# set workspace
output_dir="/content/drive/podcast_audio"

# loop over all downloaded audio files
for output_path in "${output_dir}"/*.mp3; do
  [[ -e "${output_path}" ]] || continue  # no mp3 files found
  id=$(basename "${output_path}" .mp3)

  # transcribe audio; Whisper writes ${id}.srt into output_dir
  whisper "${output_path}" --model large --language en --output_format srt --output_dir "${output_dir}"
  srt_path="${output_dir}/${id}.srt"

  # upload srt: send the file contents, and use --data-binary
  # because -d would strip the line breaks
  url="https://readcast.app/api/epsoides/${id}/caption"
  headers="Content-Type: text/plain"
  response=$(curl -s -X POST -H "${headers}" --data-binary "@${srt_path}" "${url}")

  # show current progress
  sleep 0.5
  echo "${id}: ${response}"
done
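
For readers who prefer to do the upload step from a Python cell, the sketch below builds the same POST request from an .srt file without sending it. The helper name `build_caption_request` and the `API_BASE` constant are assumptions for illustration; the endpoint path and the `Content-Type` header are copied from the script above.

```python
import os
import urllib.request

API_BASE = "https://readcast.app/api/epsoides"  # path copied from the script above


def build_caption_request(srt_path):
    """Build (without sending) the POST request for one .srt file."""
    # The episode id is the file name without its extension.
    episode_id = os.path.splitext(os.path.basename(srt_path))[0]
    with open(srt_path, "rb") as f:
        body = f.read()  # raw bytes, so line breaks in the subtitles are preserved
    return urllib.request.Request(
        url=f"{API_BASE}/{episode_id}/caption",
        data=body,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build and send steps separate makes it easy to inspect a request before anything goes over the network.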