Day 15: A Small but Capable VLM, SmolVLM2

2025 iThome Ironman Contest

DAY 15

Generative AI

VLM series, part 15

17th Ironman Contest

皮二仔

2025-09-29 23:23:03


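In recent years, generative AI has usually been made smarter by scaling up parameter counts. A newer trend goes the other way: shrinking larger models through distillation and similar methods, which lowers compute cost, simplifies deployment, makes local execution possible, and improves data privacy.
A Hugging Face article notes that a "small vision-language model" usually means a model with fewer than 2B parameters that can run on a consumer GPU.

Over the next few days I will introduce and test several SOTA small-but-capable VLMs.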

SmolVLM2

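SmolVLM2 is a small vision-language model (VLM) released by the Hugging Face research team in February 2025. It extends and improves on the first-generation SmolVLM. The goal is to keep multimodal understanding usable where memory and compute are limited, such as edge devices, phones, and embedded systems, shifting away from massive models that need large amounts of compute toward efficient models that can run anywhere.

SmolVLM2 comes in three sizes: the 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models are described as the smallest video language models released to date.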
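To get a feel for what these sizes mean on a memory-limited device, here is a rough back-of-the-envelope sketch. It assumes the nominal parameter counts above and bfloat16 weights (2 bytes per parameter), and it counts weights only; activations, the KV cache, and framework overhead come on top. The function name `weight_memory_gb` is made up for this illustration.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 2 bytes


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes taken from the model names; real checkpoints may differ slightly.
sizes = {"2.2B": 2_200_000_000, "500M": 500_000_000, "256M": 256_000_000}

for name, num_params in sizes.items():
    print(f"{name}: about {weight_memory_gb(num_params):.2f} GB of weights in bfloat16")
```

By this estimate even the largest variant needs only a few GB for its weights, which is why it fits comfortably on a 16 GB T4.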

SmolVLM2 is "MLX ready", meaning the models can run directly on MLX, Apple's machine learning framework optimized for Apple Silicon (M1/M2/M3 chips). This lets developers run a VLM locally on macOS or iOS devices without relying on cloud GPUs.

Hands-on with SmolVLM2

Following the Hugging Face example, I hit a few errors when running on a Colab T4. Below is the modified code that runs.

!pip install num2words

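FlashAttention2 speeds up Transformer attention, reduces GPU memory use, and supports longer sequences. It is one of the key techniques that lets large models (LLMs / VLMs) run faster with less memory.
SmolVLM2 inputs can combine image patch tokens, text tokens, and video frame tokens, so sequences get long. Without FlashAttention2, inference or training can easily run out of GPU memory or slow down, which is why the official Hugging Face example code sets _attn_implementation="flash_attention_2" to get the speedup where the environment supports it.
The T4 GPU does not support FlashAttention2, however, so the code below has to be modified before it runs there.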
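Rather than editing the line by hand for each GPU, the choice can be made at runtime. The sketch below is an illustration, not part of the official example: `pick_attn_implementation` is a hypothetical helper, and it assumes FlashAttention-2 needs an Ampere-or-newer GPU (compute capability 8.0 and up, whereas the T4 is 7.5) plus an installed `flash_attn` package. It falls back to `"sdpa"`, assuming the model class supports it; `"eager"` is the other standard option.

```python
import importlib.util


def pick_attn_implementation(compute_capability, flash_attn_installed=None):
    """Choose an attention backend name from the GPU's compute capability.

    compute_capability is a (major, minor) tuple, e.g. (7, 5) for a T4.
    """
    if flash_attn_installed is None:
        # Check for the flash_attn package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    major, _minor = compute_capability
    if major >= 8 and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"  # assumed fallback; use "eager" if SDPA is not supported


print(pick_attn_implementation((7, 5), flash_attn_installed=True))  # T4
print(pick_attn_implementation((8, 0), flash_attn_installed=True))  # A100
```

In a notebook, the tuple would come from `torch.cuda.get_device_capability()`, and the returned string would be passed to `from_pretrained` through its `attn_implementation` argument.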

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2"   # not supported on T4; remove this line when running on a T4
).to("cuda")

Example 1: Image description
The test image is the one loaded from the URL in the code below.

Run inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

(Output)
User:
Can you describe this image?
Assistant: The image depicts a close-up view of a vibrant pink flower, which appears to be a cosmos flower, with a bee on it. The bee is positioned in the center of the flower, and it seems to be collecting nectar from the flower. The flower has multiple petals, which are a bright pink color, and a yellow center. The bee is small and has a fuzzy body, with black and yellow stripes on its wings.
Surrounding the main flower are other flowers and plants
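Note that the decoded text includes the prompt as well as the answer, and that the answer is cut off at the 64-token limit. If only the answer is needed, one simple option is to split on the "Assistant:" marker seen in the output above. This is a minimal sketch that assumes the marker appears exactly as shown; `extract_assistant_reply` is a hypothetical helper name.

```python
def extract_assistant_reply(decoded_text, marker="Assistant:"):
    """Return the text after the first 'Assistant:' marker, or the whole text if absent."""
    _before, found, after = decoded_text.partition(marker)
    if not found:
        return decoded_text.strip()
    return after.strip()


sample = "User:\nCan you describe this image?\nAssistant: The image depicts a close-up view."
print(extract_assistant_reply(sample))
```

An alternative is to slice off the prompt tokens before decoding, for example `generated_ids[:, inputs["input_ids"].shape[1]:]`, which avoids depending on the template text.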

Example 2: Object counting

The test image is the one loaded from the URL in the code below.

url = "https://lh7-rt.googleusercontent.com/docsz/AD_4nXcbYir46zJ7NlIx-p9u2DdYlW95oUY6r3M4lQwm8sd-jgB7cTjbL9Sc_3zCiFRCMAHUUOFPQIEkuKudCBFJLwsuOBaI3gt2SKtCI60kngILMMfI4OX1hEQ4QeNeY-h0msEeyAN96w"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "How many coins do I have?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

(Output)

User:How many coins do I have?
Assistant: 4
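Both examples build the same message structure and differ only in the image URL and the question. A small helper can remove that repetition; `build_messages` is a name chosen for this sketch and is not part of the Hugging Face API.

```python
def build_messages(image_url, question):
    """Build a single-turn chat containing one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        },
    ]


messages = build_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    "Can you describe this image?",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` exactly as in the two examples above.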

References: