My AI Learning Journey: Day 15 of Gemma and Gemini - Gemma 3 Images - iT 邦幫忙

2025 iThome Ironman Contest

DAY 15

Generative AI

My AI Learning Journey: 30 Days of Gemma and Gemini, part 15 of the series


17th Ironman Contest

kevin_chiu

Team: AI 航海王

2025-09-16 22:42:36


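Gemma 3 - Multimodal

Image processing is central to Gemma 3's multimodal design. The model can interpret the content, objects, and context of an image and respond to text instructions grounded in that visual information, which makes it suitable for a wide range of vision-related tasks.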

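Core image-processing capabilities of Gemma 3
Image understanding and description (image captioning):

Given an image, the model can generate a detailed text description of the objects, people, and places it contains, along with the relationships between them and any actions taking place.

Example: given a photo of someone walking a dog in a park, Gemma 3 might describe it as: "A woman is walking a dog on a leash in a park, with green trees and grass in the background."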
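A captioning request like the one described above amounts to sending one image and one text instruction in the same user turn. The sketch below builds that message structure in plain Python, mirroring the format used in this article's helper code. The function name `build_caption_messages` and the placeholder image value are hypothetical and only for illustration; the article's code passes a PIL image in that slot.

```python
def build_caption_messages(image: object, prompt: str = "Describe the image.") -> list[dict]:
    """Return a single-turn chat request pairing one image with a text instruction."""
    return [
        {
            "role": "user",
            "content": [
                # The image part comes first, followed by the instruction text.
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# "park.jpg" is a stand-in value; the article's code passes a PIL image here.
messages = build_caption_messages("park.jpg")
for part in messages[0]["content"]:
    print(part["type"])
```

Keeping the image and the instruction inside the same `content` list is what lets the processor's chat template place the image alongside the question it belongs to.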

Gemma 3 image processing

Example code

# Install Transformers
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

# Import libraries and dependencies
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
from IPython.display import Markdown, display
import torch

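# Choose the Gemma 3 model variant.
from google.colab import userdata

# Read the Hugging Face access token from Colab secrets.
# Assumption: the secret is stored under the name "HF_TOKEN".
hf_token = userdata.get("HF_TOKEN")

# Note: gemma-3-1b-it is text-only; image input needs the 4b variant or larger.
model_name = 'gemma-3-4b-it' # @param ['gemma-3-1b-it', 'gemma-3-4b-it', 'gemma-3-12b-it', 'gemma-3-27b-it']
model_id = f"google/{model_name}"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, token=hf_token
).eval()

processor = AutoProcessor.from_pretrained(model_id, token=hf_token)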

# Define helper functions
def resize_image(image_path, max_size=(640, 640)):
    """Open an image and shrink it to fit within max_size, keeping its aspect ratio."""
    img = Image.open(image_path)

    # thumbnail() resizes in place and only ever shrinks;
    # images already smaller than max_size are left unchanged.
    img.thumbnail(max_size)

    return img


def get_model_response(img: Image.Image, prompt: str, model, processor) -> str:
    # Prepare the messages for the model.
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant. Reply only with the answer to the question asked, and avoid using additional text in your response like 'here's the answer'."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": prompt}
            ]
        }
    ]

    # Tokenize inputs and prepare for the model.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)

    input_len = inputs["input_ids"].shape[-1]

    # Generate response from the model.
    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        generation = generation[0][input_len:]

    # Decode the response.
    response = processor.decode(generation, skip_special_tokens=True)
    return response
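The `resize_image` helper above relies on `Image.thumbnail`, which shrinks an image to fit inside a bounding box while keeping its aspect ratio. As a rough illustration of the arithmetic involved, the sketch below computes the resulting size in plain Python. The function `fit_within` is hypothetical and is not part of Pillow; Pillow's own rounding may differ by a pixel in edge cases.

```python
def fit_within(width: int, height: int, max_width: int = 640, max_height: int = 640) -> tuple[int, int]:
    """Approximate the size an image would have after fitting it into a bounding box."""
    # Images that already fit are left unchanged (no upscaling).
    if width <= max_width and height <= max_height:
        return width, height

    # Use the tighter of the two scale factors so both sides fit.
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# A landscape photo, a small image, and a portrait photo.
for size in [(1920, 1080), (300, 200), (1000, 2000)]:
    print(size, fit_within(*size))
```

Capping the longer side at 640 pixels keeps the input small for preprocessing without distorting the picture.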

Describe an image

image_file = 'image_5.jpg' # @param {type: 'string'}
prompt = "Describe the image." # @param {type: 'string'}


img = resize_image(image_file)
display(img)
response = get_model_response(img, prompt, model, processor)
display(Markdown(response))

Output

/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:634: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.95` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/generation/configuration_utils.py:651: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `64` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
DevFest Taipei 2025, December 6th
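The two `UserWarning` messages in the output above appear because the call uses `do_sample=False` (greedy decoding) while the model's default generation settings still carry the sampling parameters `top_p=0.95` and `top_k=64`. They do not affect the result. One commonly suggested way to quiet them is to pass `top_p=None` and `top_k=None` explicitly alongside `do_sample=False`. The sketch below only builds that set of keyword arguments; the helper name `greedy_generate_kwargs` is hypothetical, and whether the warnings disappear depends on the installed Transformers version, so treat this as an assumption rather than a tested fix.

```python
def greedy_generate_kwargs(max_new_tokens: int = 100) -> dict:
    """Keyword arguments for deterministic (greedy) decoding with sampling flags unset."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        # Assumption: unsetting these stops the "only used in sample-based
        # generation modes" warnings in the installed Transformers version.
        "top_p": None,
        "top_k": None,
    }


print(greedy_generate_kwargs())

# Intended use inside get_model_response (model and inputs come from the article's code):
# generation = model.generate(**inputs, **greedy_generate_kwargs())
```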