Day 14: Hands-on with Florence-2

2025 iThome Ironman Contest

DAY 14

Generative AI

VLM series, part 14

17th Ironman Contest

皮二仔

2025-09-28 23:53:17


Florence-2 is available on Hugging Face as two pretrained models, "microsoft/Florence-2-base" and "microsoft/Florence-2-large". Both run on a Colab T4.

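At present, however, loading the model fails with AttributeError: 'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'. The cause is a version incompatibility with the Hugging Face Transformers library. Newer Transformers releases (for example 4.52.1 and later, and especially 4.54.0+) check for SDPA (Scaled Dot-Product Attention) support when setting up a model's attention implementation, but the Florence-2 model code does not yet define the _supports_sdpa attribute. Transformers looks for the attribute during model initialization, and the lookup fails.
The problem shows up in Google Colab (T4 GPU) because Colab's preinstalled Transformers may already have been updated to an incompatible version.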

In my tests, downgrading Transformers to 4.51.3 fixes the error; that version has been confirmed to work reliably with Florence-2.

!pip show transformers

!pip uninstall -y transformers
!pip install transformers==4.51.3
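
Before downgrading, it can help to check whether the installed version is actually newer than the one that worked. The snippet below is a minimal sketch, not part of the original notebook: `parse_version` and `needs_downgrade` are hypothetical helpers, and the 4.51.3 threshold is simply the version reported as working above. If Transformers was already imported in the session, the runtime probably needs a restart after the reinstall for the new version to take effect.

```python
import re
from importlib import metadata

# Assumption: 4.51.3 is the newest version this article reports as working.
KNOWN_GOOD = (4, 51, 3)


def parse_version(version):
    """Turn '4.51.3' or '4.52.0.dev0' into a (major, minor, patch) tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(group or 0) for group in match.groups())


def needs_downgrade(version):
    """Return True when the version is newer than the known-good release."""
    return parse_version(version) > KNOWN_GOOD


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
elif needs_downgrade(installed):
    print(f"transformers {installed} is newer than 4.51.3; consider downgrading")
else:
    print(f"transformers {installed} should not need a downgrade")
```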

Load the model

import torch
from transformers import AutoProcessor, AutoModelForCausalLM 

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

Wrap the inference steps in a function so they are easy to rerun later

# Define the inference function
# Note: it reads the global `image`, which is loaded in the next step
def run_example(task_prompt, text_input=None):
    if text_input:
        prompt = task_prompt + text_input
    else:
        prompt = task_prompt

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=4096,
        num_beams=3,
        do_sample=False
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

Load a test image

from PIL import Image
import requests
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Display the image
plt.imshow(image)
plt.axis('off')
plt.show()

Tasks are selected with a unified prompt format: each task is named by a special token such as <CAPTION> or <OD>
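
To make the prompt format concrete, here is a small sketch of how a prompt string is assembled, mirroring the concatenation in run_example. `build_prompt` and the two task sets are hypothetical names introduced for illustration. The token spellings are assumed to follow the Florence-2 model card, and `<CAPTION_TO_PHRASE_GROUNDING>` is included only as an example of a task that takes extra text; it is not used elsewhere in this article.

```python
# Task tokens used in this article (no extra text needed).
TASKS_WITHOUT_TEXT = {
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OD>",
}

# Assumption: an example of a task that expects extra text after the token.
TASKS_WITH_TEXT = {"<CAPTION_TO_PHRASE_GROUNDING>"}


def build_prompt(task_prompt, text_input=None):
    """Concatenate the task token and optional text, with basic validation."""
    if task_prompt in TASKS_WITH_TEXT and not text_input:
        raise ValueError(f"{task_prompt} needs a text input")
    if task_prompt in TASKS_WITHOUT_TEXT and text_input:
        raise ValueError(f"{task_prompt} does not take a text input")
    return task_prompt + (text_input or "")


print(build_prompt("<OD>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"))
```

The validation is optional; its only purpose is to catch a mistyped task or a missing text input before the model is called.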

Image captioning

caption = run_example("<CAPTION>")
print("Caption:", caption["<CAPTION>"])

detailed_caption = run_example("<DETAILED_CAPTION>")
print("Detailed caption:", detailed_caption["<DETAILED_CAPTION>"])

more_detailed = run_example("<MORE_DETAILED_CAPTION>")
print("More detailed caption:", more_detailed["<MORE_DETAILED_CAPTION>"])

(Output)

Caption: a green volkswagen beetle parked in front of a yellow building
Detailed caption: The image shows a green Volkswagen Beetle parked in front of a yellow building with two brown doors, surrounded by trees and a clear blue sky.
More detailed caption: The image shows a vintage Volkswagen Beetle car parked on a cobblestone street in front of a yellow building with two wooden doors. The car is painted in a bright turquoise color and has a white stripe running along the side. The doors are made of wood and have a rustic, weathered look. The building behind the car has a small window and a door handle. The sky is blue and there are trees in the background. The overall atmosphere of the image is peaceful and serene.

Object detection

od_result = run_example("<OD>")
print("Detections:", od_result["<OD>"])

(Output)
Detections: {'bboxes': [[33.599998474121094, 160.55999755859375, 596.7999877929688, 371.7599792480469], [271.67999267578125, 242.1599884033203, 302.3999938964844, 246.95999145507812]], 'labels': ['car', 'door handle']}
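
Each box in this output is a pair of corners, [x1, y1, x2, y2], not a corner plus a size. Read the other way, the door handle would be about 247 pixels tall. Plotting APIs such as matplotlib's Rectangle expect (x, y), width, height, so a conversion is needed first. The sketch below uses a hypothetical helper, `xyxy_to_xywh`, applied to the values printed above (rounded for readability).

```python
def xyxy_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] corners to (x, y, width, height)."""
    x1, y1, x2, y2 = bbox
    return (x1, y1, x2 - x1, y2 - y1)


# Values copied from the detection output above, rounded to two decimals.
sample = {
    "bboxes": [
        [33.60, 160.56, 596.80, 371.76],
        [271.68, 242.16, 302.40, 246.96],
    ],
    "labels": ["car", "door handle"],
}

for label, bbox in zip(sample["labels"], sample["bboxes"]):
    x, y, w, h = xyxy_to_xywh(bbox)
    print(f"{label}: x={x:.2f}, y={y:.2f}, width={w:.2f}, height={h:.2f}")
```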

Visualize the bounding boxes

bboxes = od_result["<OD>"]["bboxes"]
labels = od_result["<OD>"]["labels"]

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.imshow(image)
for bbox, label in zip(bboxes, labels):
    # Boxes are [x1, y1, x2, y2]; Rectangle expects (x, y), width, height
    x1, y1, x2, y2 = bbox
    rect = Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    ax.text(x1, y1 - 10, label, color='r', fontsize=10)
ax.axis('off')
plt.show()