Full-Stack LLM Application Development, Day 08: Getting Started with Hugging Face and poetry

15th Ironman Contest (鐵人賽)

大魔術熊貓工程師

2023-09-23 02:41:55


Environment setup

First, use poetry new to create a project.

poetry new huggingface_intro
cd huggingface_intro

Install transformers with the command poetry add transformers

When it finishes, poetry lists the dependencies it installed.

Downloading the model to your local machine

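There is one pitfall here: without PyTorch installed, the code will not run. Why didn't poetry install PyTorch along with transformers? Because the transformers package does not itself depend on PyTorch, even though much of what you do with it, such as running AI models, usually needs PyTorch. So let's install PyTorch as well, using the command below: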
poetry add torch torchvision
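If you want to confirm that both packages actually landed in the poetry environment before going further, a small check like the one below can help. This is only a sketch: missing_packages is a helper name made up for this post, and it relies on nothing beyond the standard library's importlib. Run it with poetry run python so it inspects the project's virtual environment.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the install commands in this post.
    missing = missing_packages(["transformers", "torch"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("transformers and torch are both importable.")
```

If torch shows up as missing, rerun the poetry add command above.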
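Now let's start writing code. Create intro.py inside the huggingface_intro package directory, then copy and paste the following code into it: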

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


def get_sentiments(model_name, string_arr):
    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Tokenize input strings
    inputs = tokenizer(string_arr, padding=True,
                       truncation=True, return_tensors="pt")

    # Initialize model
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Make predictions; no_grad() skips gradient tracking, which inference does not need
    with torch.no_grad():
        outputs = model(**inputs)

    # Softmax to convert logits to probabilities
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    return predictions


if __name__ == "__main__":
    model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"

    string_arr = [
        "我會披星戴月的想你，我會奮不顧身的前進，遠方煙火越來越唏噓，凝視前方身後的距離",
        "鯊魚寶寶 doo doo doo doo doo doo, 鯊魚寶寶"
    ]

    predictions = get_sentiments(model_name, string_arr)
    print(predictions)

Run the script through poetry: poetry run python huggingface_intro/intro.py

Now let's walk through the code.

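AutoTokenizer automatically selects and loads the appropriate tokenizer, while AutoModelForSequenceClassification automatically selects and loads the appropriate sequence-classification model. Note that both are automatic: we only supply a model name, here model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student", and Hugging Face finds the model and its matching tokenizer for us.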
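The tokenizer call in intro.py passes padding=True and truncation=True, which is easy to gloss over. As a rough mental model, here is a toy sketch in plain Python of what those two options do to a batch. Everything in it is an assumption for illustration: pad_and_truncate is a made-up helper, the token IDs are invented, and pad_id=0 is just a placeholder. The real tokenizer also handles special tokens more carefully when it truncates, so treat this as a picture of the idea rather than the library's implementation.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of padding=True / truncation=True on lists of token IDs."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the longest one in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs, made up for illustration only.
batch = [[101, 7, 8, 9, 102], [101, 5, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Padding is what lets two sentences of different lengths be stacked into one rectangular tensor, and the attention mask tells the model which positions to ignore.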

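The flow goes like this:
1. Use AutoTokenizer to tokenize the sentences, then hand the tokenized sentences to AutoModelForSequenceClassification for sentiment classification.
2. Pass the model's output (logits) through the Softmax function to turn it into probabilities, and print those probabilities to inspect the model's predictions.
3. That tells us whether each of the two input sentences is "positive, neutral, or negative".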
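Steps 2 and 3 can be sketched without any model at all. The snippet below uses plain Python to turn one row of logits into probabilities and pick the most likely label. The logits are invented numbers, not real model output, and the label order in LABELS is an assumption; on the loaded model you would read the actual mapping from model.config.id2label instead of hard-coding it.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed label order; check model.config.id2label on the loaded model.
LABELS = ["positive", "neutral", "negative"]


def top_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical logits for one sentence, not real model output.
label, prob = top_label([2.0, 0.5, -1.0])
print(label, round(prob, 3))
```

The tensor printed by intro.py has one such row of probabilities per input sentence.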

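Pretty simple, isn't it? Seeing logits and Softmax may feel intimidating, though. They are Machine Learning fundamentals, but these days, if you don't understand them yet, you can set them aside for now and catch up later.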

Tomorrow we'll use another, even simpler Hugging Face approach and wrap it in a web API!

About the model used today

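Today's model is a sentiment classification model that labels a sentence as "positive, neutral, or negative". It supports as many as 12 languages, including but not limited to English, Malay, and Japanese. The model is released under the Apache-2.0 license and is built on the Transformers and PyTorch frameworks.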

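It was also built with a technique called Knowledge Distillation. Put simply, knowledge distillation has a larger, more complex "teacher model" transfer its knowledge to a smaller, faster "student model". This speeds up inference, though accuracy may drop. In this case the teacher model is MoritzLaurer/mDeBERTa-v3-base-mnli-xnli and the student model is distilbert-base-multilingual-cased.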
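To make "transferring knowledge" a little more concrete, here is a generic sketch of a distillation loss in plain Python: the student is penalized by how far its softened probability distribution is from the teacher's, measured with KL divergence. This is a simplified textbook form, not necessarily the recipe used to train this particular model. The function names, the temperature value, and the logits are all assumptions made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)  # teacher probabilities (the target)
    q = softmax(student_logits, temperature)  # student probabilities
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one sentence, invented for this example.
teacher = [3.0, 0.5, -2.0]
close_student = [2.8, 0.6, -1.9]
far_student = [-2.0, 0.5, 3.0]
print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student that agrees with the teacher gets a loss near zero, while one that disagrees gets a larger loss, and training pushes the student toward the former.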

For more information, see the model page on the Hugging Face website.

https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student