The following examples can be run in a Colab T4 environment.
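Before starting, make sure a recent Transformers release is available; LLaVA support landed in newer versions, so upgrading is a safe first step on a fresh Colab runtime (a sketch; torch, Pillow, matplotlib, and requests are typically preinstalled):
# Upgrade transformers so the LLaVA model classes are available
!pip install -q -U transformers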
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import matplotlib.pyplot as plt
import requests
# Load the LLaVA processor and model
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to(0)  # move the model to GPU 0
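A side note on memory: the fp16 checkpoint takes roughly 14 GB, which only just fits in the T4's 16 GB of VRAM. If you hit out-of-memory errors, one option is a 4-bit quantized load; the sketch below assumes the bitsandbytes and accelerate packages are installed, and is illustrative rather than required:
# Optional: 4-bit quantized load for tighter memory budgets
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the weights; no manual .to(0) needed
)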
Example 1:
# Load the image from a URL
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
# Display the image
plt.imshow(raw_image)
plt.show()
# Prepare the conversation template
prompt = "USER: <image>\nWhat is this?\nASSISTANT:"
# Process the inputs
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0)
# Generate a response
output = model.generate(**inputs, max_new_tokens=200)
# Decode and print the result
print(processor.decode(output[0], skip_special_tokens=True))
(Output)
USER:
What is this?
ASSISTANT: The image features a vintage, light blue Volkswagen Beetle parked on a brick road. The car is positioned in front of a yellow building, and there is a doorway visible in the background. The scene appears to be set in a foreign country, giving the impression of a unique and interesting location.
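As the transcript shows, decoding the full output echoes the prompt, because generate returns the input tokens followed by the new ones. A small sketch that prints only the model's reply, using the same slicing trick Example 3 applies later:
# Keep only the newly generated tokens by skipping past the prompt length
reply = processor.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(reply)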
When I rewrite the instruction in Chinese, the reply comes back in Chinese, but it contains some errors. In other cases I tried, the model still replied in English.
# Prepare the conversation template, this time asking in Chinese
prompt = "USER: <image>\n這是什麼?\nASSISTANT:"
# Process the inputs
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0)
# Generate a response
output = model.generate(**inputs, max_new_tokens=200)
# Decode and print the result
print(processor.decode(output[0], skip_special_tokens=True))
(Output)
USER:
這是什麼?
ASSISTANT: 這是一輛藍色的舊式跑車,它的車牌是“VW”,這是德國汽車製造商Volkswagen的代表。這輛車是在一條灰色的街道上停在一個角落,這個角落有一扇大門和一扇小門。
(Roughly: "This is a blue old-style sports car; its license plate is 'VW', the mark of the German carmaker Volkswagen. The car is parked at a corner on a gray street, where there is a large door and a small door." Note the errors: the Beetle is not a sports car, and 'VW' is a badge, not a license plate.)
Example 2: asking a question that requires reasoning
# Prepare another image
image_url_2 = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image_2 = Image.open(requests.get(image_url_2, stream=True).raw).convert('RGB')
# Display the image
plt.imshow(raw_image_2)
plt.show()
# Ask a question that requires reasoning
prompt_2 = "USER: <image>\n請描述這張圖片?\nASSISTANT:"
# Process the inputs
inputs_2 = processor(text=prompt_2, images=raw_image_2, return_tensors="pt").to(0)
# Generate a response
output_2 = model.generate(**inputs_2, max_new_tokens=200)
# Decode and print the result
print(processor.decode(output_2[0], skip_special_tokens=True))
(Output)
USER:
請描述這張圖片?
ASSISTANT: The image features two cats lying on a pink couch, both of them sleeping. One cat is located on the left side of the couch, while the other cat is on the right side. The couch is covered with a pink blanket, providing a cozy and comfortable environment for the cats.
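A quick aside: generate defaults to greedy decoding, so rerunning the cell reproduces the same description every time. If you want varied wording, sampling is one knob to try (the parameter values below are illustrative, not tuned):
# Enable sampling for more varied (but less deterministic) descriptions
output_sampled = model.generate(
    **inputs_2,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(processor.decode(output_sampled[0], skip_special_tokens=True))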
# Ask the same question, this time explicitly requesting Traditional Chinese
prompt_2 = "USER: <image>\n請用繁體中文,描述這張圖片?\nASSISTANT:"
# Process the inputs
inputs_2 = processor(text=prompt_2, images=raw_image_2, return_tensors="pt").to(0)
# Generate a response
output_2 = model.generate(**inputs_2, max_new_tokens=200)
# Decode and print the result
print(processor.decode(output_2[0], skip_special_tokens=True))
(Output)
USER:
請用繁體中文,描述這張圖片?
ASSISTANT: 這張圖片展示了兩隻貓在一個橘色的沙發上休息。它們都在沙發上,並且有一個遙控器掛在沙發上,可能是用來操作電視。這兩隻貓看起來都很舒服,享受著溫暖的沙發。
(Roughly: "This picture shows two cats resting on an orange sofa. They are both on the sofa, and there is a remote control on the sofa, probably for operating a TV. Both cats look very comfortable, enjoying the warm sofa." Note that the couch the English reply called pink is now described as orange.)
Example 3: more complex reasoning
# Load a sample image (containing multiple objects)
image_url_3 = "http://images.cocodataset.org/val2017/000000000139.jpg"
raw_image_3 = Image.open(requests.get(image_url_3, stream=True).raw).convert('RGB')
# Display the image
plt.imshow(raw_image_3)
plt.show()
# Build a complex conversation prompt (a LLaVA strength: multi-part questions that mimic human reasoning)
conversation = [
    {
        "role": "user",
        "content": [
            # The question asks: which main objects appear, their colors and spatial
            # relations (e.g., A is next to B), what activity the scene might represent,
            # and the type and condition of any food.
            {"type": "text", "text": "這張圖片中主要有哪些物體?請描述它們的顏色、位置關係(例如 A 在 B 旁邊),以及這個場景可能代表什麼活動?如果有食物,請指出它們的種類和狀態。"},
            {"type": "image"},
        ],
    },
]
# Note: the "USER:"/"ASSISTANT:" markers are added by the chat template itself, so
# they should not be written into the message text, and no empty assistant turn is
# needed when add_generation_prompt=True.
# Apply the chat template
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Process the inputs
inputs = processor(images=raw_image_3, text=prompt, return_tensors="pt").to("cuda", torch.float16)
# Generate a response (larger max_new_tokens to allow a detailed answer)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=300, do_sample=False)
response = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("LLaVA 的複雜 VQA 回答:\n", response)
(Output)
LLaVA's complex VQA answer:
這張圖片中主要有一個餐廳和一個廚房。餐廳中有一個長桌,四個植物(包括花)和三個植物盆,這些植物都是綠色的。廚房中有一個冰箱和一個桌子。這個場景可能代表一個家庭或餐廳的晚餐活動,人們可能在這裡坐下享用美食和聊天。
(Roughly: "The picture mainly shows a dining room and a kitchen. The dining room has a long table, four plants (including flowers) and three plant pots, all green. The kitchen has a refrigerator and a table. The scene may represent a dinner at a home or a restaurant, where people sit down to enjoy food and chat.")
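When apply_chat_template is involved, it can help to inspect the rendered prompt string before tokenization; for llava-1.5 it should show the USER:/ASSISTANT: turns and the <image> placeholder that the manual prompts in Examples 1 and 2 wrote by hand:
# Sanity check: print the exact prompt string the chat template produces
print(processor.apply_chat_template(conversation, add_generation_prompt=True))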
Additional notes:
When I first tested Example 3, the results were poor, so I switched the input to processor.apply_chat_template, and the replies improved noticeably.
Later I rewrote the instruction in English using the simple prompt format from Examples 1 and 2, and the reply was also good. So English instructions still seem to work better, presumably because the model was trained mostly on English data.
Also, because reasoning questions produce long answers, max_new_tokens has to be set higher, otherwise the output is cut off partway through.
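One way to confirm whether a reply was cut off is to check whether the final generated token is the tokenizer's end-of-sequence token; a small sketch using Example 3's output tensor:
# If the last token is not EOS, generation stopped at max_new_tokens mid-answer
eos_id = processor.tokenizer.eos_token_id
if output[0][-1].item() != eos_id:
    print("Warning: the reply hit max_new_tokens and is likely truncated.")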
# Ask a question that requires reasoning, this time in English
prompt_3 = "USER: <image>\nWhat are the main objects in this picture? Describe their colors, their relationships (e.g., A is next to B), and what activities this scene might represent? If there are food items, indicate their type and condition?\nASSISTANT:"
# Process the inputs
inputs_3 = processor(text=prompt_3, images=raw_image_3, return_tensors="pt").to(0)
# Generate a response
output_3 = model.generate(**inputs_3, max_new_tokens=300)
# Decode and print the result
print(processor.decode(output_3[0], skip_special_tokens=True))
(Output)
USER:
What are the main objects in this picture? Describe their colors, their relationships (e.g., A is next to B), and what activities this scene might represent? If there are food items, indicate their type and condition?
ASSISTANT: The main objects in this picture are a dining table, chairs, a refrigerator, a television, and a potted plant. The dining table is surrounded by chairs, and there is a refrigerator nearby. The television is placed on the left side of the room. The potted plant is located in the middle of the room, adding a touch of greenery to the space.
The colors in the scene are predominantly yellow, which gives the room a warm and inviting atmosphere. The chairs are red, adding a pop of color to the room. The dining table is white, which contrasts with the red chairs and the yellow walls.
This scene might represent a family gathering or a casual meal shared among friends. The presence of the refrigerator and the potted plant suggests that the room is a part of a home, and the dining table and chairs indicate that it is a space for eating and socializing. The television could be used for entertainment during meals or while waiting for others to join the gathering.