iT邦幫忙

2025 iThome 鐵人賽 (iThome Ironman Contest 2025)

Generative AI

VLM series, part 14

Day 14: Hands-on with Florence-2


Florence-2 is available on Hugging Face as two pretrained models, "microsoft/Florence-2-base" and "microsoft/Florence-2-large". Both run on a Colab T4.

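At the moment, however, loading the model fails with AttributeError: 'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'. The cause is a version incompatibility with the Hugging Face Transformers library. Newer Transformers releases (4.52.1 and later, and 4.54.0+ in particular) introduced a support check for SDPA (Scaled Dot-Product Attention) in the attention implementation, but the Florence-2 model code does not yet define the _supports_sdpa attribute. Transformers looks up that attribute during model initialization, and the lookup fails.
The problem shows up on Google Colab (T4 GPU) because Colab's default Transformers version may already have been updated to an incompatible release.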

In my tests, downgrading Transformers to 4.51.3 fixed the error, and Florence-2 ran stably on that version.

!pip show transformers
!pip uninstall -y transformers
!pip install transformers==4.51.3
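
As a quick sanity check after the downgrade, the installed version can be compared against 4.51.3. The following is a minimal sketch using only the standard library; parse_version, installed_transformers_version, and check_transformers are hypothetical helper names made up for this illustration, and the 4.51.3 threshold is simply the version reported as working above, not an officially documented boundary.

```python
from importlib.metadata import PackageNotFoundError, version

# Version reported as working in this article (an assumption, not an official bound).
KNOWN_GOOD = (4, 51, 3)


def parse_version(text):
    """Turn '4.51.3' or '4.54.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is missing."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


def check_transformers():
    """Describe whether the installed version is at or below the known-good one."""
    found = installed_transformers_version()
    if found is None:
        return "transformers is not installed"
    if parse_version(found) > KNOWN_GOOD:
        return f"transformers {found} is newer than 4.51.3; Florence-2 may fail to load"
    return f"transformers {found} is at or below 4.51.3"


print(check_transformers())
```

If transformers was already imported in the Colab session before the downgrade, restarting the runtime is usually needed so that the newly installed version is the one actually loaded.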
  • Load the model
import torch
from transformers import AutoProcessor, AutoModelForCausalLM 

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
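  • Wrap the inference steps in a function for later reuse
# Inference helper. It reads the global `image`, which is loaded in the next step.
def run_example(task_prompt, text_input=None):
    # Tasks that take extra text append it directly after the task token.
    if text_input:
        prompt = task_prompt + text_input
    else:
        prompt = task_prompt

    # Preprocess text and image, then move the tensors to the model's device and dtype.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=4096,
        num_beams=3,
        do_sample=False
    )
    # Keep special tokens: post-processing parses them from the raw text.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Convert the raw text into a dict keyed by the task token, scaled to the image size.
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer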
  • Load a test image
from PIL import Image
import requests
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Display the image
plt.imshow(image)
plt.axis('off')
plt.show()

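Task instructions use a unified prompt format: each task is selected by a special token such as <CAPTION> or <OD>.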
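
To make the prompt format concrete, here is a small sketch of how a task token and optional text combine into one prompt string, mirroring the concatenation done in run_example. The two task sets are a partial list taken from the Florence-2 model card as I recall it, so treat them as an assumption; build_prompt is a hypothetical helper, not part of the Transformers API.

```python
# Partial list of Florence-2 task tokens (assumed from the model card, not exhaustive).
TASKS_WITHOUT_TEXT = {
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OD>",
    "<DENSE_REGION_CAPTION>",
    "<REGION_PROPOSAL>",
    "<OCR>",
    "<OCR_WITH_REGION>",
}
TASKS_WITH_TEXT = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}


def build_prompt(task_prompt, text_input=None):
    """Concatenate the task token and optional text, with basic validation."""
    if task_prompt in TASKS_WITH_TEXT and not text_input:
        raise ValueError(f"{task_prompt} needs a text input")
    if task_prompt in TASKS_WITHOUT_TEXT and text_input:
        raise ValueError(f"{task_prompt} does not take a text input")
    # Same rule as run_example: the text follows the token with no separator.
    return task_prompt + text_input if text_input else task_prompt


print(build_prompt("<OD>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"))
```

With the model loaded, the second prompt would correspond to a call like run_example("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"), which asks the model to locate the phrase in the image.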

  • Image captioning
caption = run_example("<CAPTION>")
print("Caption:", caption["<CAPTION>"])

detailed_caption = run_example("<DETAILED_CAPTION>")
print("Detailed caption:", detailed_caption["<DETAILED_CAPTION>"])

more_detailed = run_example("<MORE_DETAILED_CAPTION>")
print("More detailed caption:", more_detailed["<MORE_DETAILED_CAPTION>"])

(Output)

Caption: a green volkswagen beetle parked in front of a yellow building
Detailed caption: The image shows a green Volkswagen Beetle parked in front of a yellow building with two brown doors, surrounded by trees and a clear blue sky.
More detailed caption: The image shows a vintage Volkswagen Beetle car parked on a cobblestone street in front of a yellow building with two wooden doors. The car is painted in a bright turquoise color and has a white stripe running along the side. The doors are made of wood and have a rustic, weathered look. The building behind the car has a small window and a door handle. The sky is blue and there are trees in the background. The overall atmosphere of the image is peaceful and serene.

  • Object detection
od_result = run_example("<OD>")
print("Detections:", od_result["<OD>"])

(Output)
Detections: {'bboxes': [[33.599998474121094, 160.55999755859375, 596.7999877929688, 371.7599792480469], [271.67999267578125, 242.1599884033203, 302.3999938964844, 246.95999145507812]], 'labels': ['car', 'door handle']}
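
The detection result pairs each label with a box given as [x1, y1, x2, y2] pixel corners (top-left and bottom-right). The sketch below converts those corners into the corner-plus-width-and-height form that many drawing APIs expect. xyxy_to_xywh and summarize_detections are hypothetical helpers written for this illustration, and the sample values are copied from the output above.

```python
def xyxy_to_xywh(bbox):
    """Convert [x1, y1, x2, y2] corners to (x, y, width, height)."""
    x1, y1, x2, y2 = bbox
    return x1, y1, x2 - x1, y2 - y1


def summarize_detections(result):
    """Pair each label with its box in (x, y, width, height) form."""
    rows = []
    for label, bbox in zip(result["labels"], result["bboxes"]):
        x, y, w, h = xyxy_to_xywh(bbox)
        rows.append({"label": label, "x": x, "y": y, "width": w, "height": h})
    return rows


# Values copied from the <OD> output shown in this article.
sample = {
    "bboxes": [
        [33.599998474121094, 160.55999755859375, 596.7999877929688, 371.7599792480469],
        [271.67999267578125, 242.1599884033203, 302.3999938964844, 246.95999145507812],
    ],
    "labels": ["car", "door handle"],
}

for row in summarize_detections(sample):
    print(f"{row['label']}: x={row['x']:.1f}, y={row['y']:.1f}, "
          f"w={row['width']:.1f}, h={row['height']:.1f}")
```

Reading the sizes this way makes it easy to see whether a box is plausible before plotting it: the car box is a few hundred pixels on each side, while the door handle box is only a few pixels tall.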

Visualizing the bounding boxes

bboxes = od_result["<OD>"]["bboxes"]
labels = od_result["<OD>"]["labels"]

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.imshow(image)
for bbox, label in zip(bboxes, labels):
    # Each box is [x1, y1, x2, y2]; Rectangle expects a corner plus width and height.
    x1, y1, x2, y2 = bbox
    rect = Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    ax.text(x1, y1 - 10, label, color='r', fontsize=10)
ax.axis('off')
plt.show()

https://ithelp.ithome.com.tw/upload/images/20250928/20178920ijJIGdiKh1.png

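In the screenshot above, the drawn bounding boxes look far from accurate. Keep in mind that Florence-2 returns each box as [x1, y1, x2, y2] corner coordinates, while matplotlib's Rectangle takes a corner plus a width and a height. If x2 and y2 are passed as the width and height, the boxes are drawn too large, so part of the mismatch may come from the plotting step rather than from the model.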

