Day 5: Hands-on with CLIP for Image-Text Similarity and Zero-Shot Classification

2025 iThome Ironman Contest (17th)

Generative AI

VLM series, part 5

皮二仔

2025-09-19 23:58:59

Image-Text Similarity

1. Load the CLIP model

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
import matplotlib.pyplot as plt

# Load the model (ViT-B/32)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

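This code loads the CLIP model and its preprocessor so that you can then:

use processor to convert text and images into inputs the model accepts
use model to compute how similar they are

torch: the PyTorch deep learning framework, used here for tensor operations and inference.
PIL.Image: the Python imaging library, used to open and handle images.
CLIPProcessor and CLIPModel from transformers: the CLIP model and preprocessing tools provided by Hugging Face.

CLIPModel.from_pretrained(...): downloads (on first use, then reads from the local cache) and loads the CLIP architecture and weights. This post uses OpenAI's CLIP ViT-B/32 (Vision Transformer Base, patch size 32).
CLIPProcessor.from_pretrained(...): downloads and loads the preprocessor, which bundles a tokenizer for text and an image processor for images.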
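
To make the "patch size 32" in ViT-B/32 more concrete, here is a small arithmetic sketch. It assumes the 224x224 input resolution that this checkpoint's image processor resizes to, plus one extra class token. `num_vit_tokens` is a hypothetical helper written for illustration, not part of transformers.

```python
def num_vit_tokens(image_size: int = 224, patch_size: int = 32) -> int:
    """Number of tokens a ViT sees: one per image patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# ViT-B/32 at 224x224: 7 x 7 patches + 1 class token
print(num_vit_tokens())          # 50
# For comparison, a patch size of 16 gives 14 x 14 patches + 1 class token
print(num_vit_tokens(224, 16))   # 197
```

A larger patch size means fewer tokens per image, which is why the /32 variant is cheaper to run than a /16 variant at the same resolution.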

Load and display the test image

# Load the cat image
image = Image.open('/content/cat.jpg')
plt.imshow(image)
plt.axis("off")
plt.show()

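Image-text similarity (Image ↔ Text)
Here we define three candidate text descriptions (prompts). CLIP compares the image against each of them and scores which one is semantically closest. For example, if the image shows a cat, the model should give "a photo of a cat" the highest score.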
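
Candidate texts like these are usually built from plain class names with a prompt template. A minimal sketch of that step follows; `build_prompts` and the default template string are illustrative assumptions, not part of the CLIP or transformers API.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Wrap each class name in a prompt template."""
    return [template.format(name) for name in class_names]


class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)
print(prompts)
# ['a photo of a cat', 'a photo of a dog', 'a photo of a car']

# Keep a mapping so the winning prompt can be turned back into its class name
prompt_to_class = dict(zip(prompts, class_names))
print(prompt_to_class["a photo of a dog"])  # dog
```

The resulting list can be passed as the text argument of processor in place of a hand-written list.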

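# Candidate texts
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Preprocessing: the processor tokenizes the texts and resizes/normalizes the image,
# returning PyTorch tensors; padding=True pads the texts to a common length
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Inference (no gradients needed)
with torch.no_grad():
    outputs = model(**inputs)
    # Image-text scores, shape (num_images, num_texts)
    logits_per_image = outputs.logits_per_image
    # Softmax over the candidate texts: each row sums to 1
    probs = logits_per_image.softmax(dim=1)

# Show the results
print("=== Similarity (Image vs Text) ===")
for text, p in zip(texts, probs[0]):
    print(f"{text:20s} : {p.item():.4f}")

(Output)
=== Similarity (Image vs Text) ===
a photo of a cat : 0.9894
a photo of a dog : 0.0105
a photo of a car : 0.0000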

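The model performs the actual forward pass:

CLIP's image encoder (a ResNet or a ViT depending on the variant; a ViT here) turns the input image into a semantic vector (image embedding).
CLIP's text encoder (a Transformer) turns the input text into a semantic vector (text embedding).
CLIP projects both vectors into a shared embedding space and compares them with cosine similarity. logits_per_image holds these cosine similarities multiplied by a learned scale factor (logit_scale), and the softmax turns them into probabilities over the candidate texts. The printed numbers are therefore relative to the candidate list, not raw cosine similarities.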
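
The comparison step can be sketched in plain Python without loading the model. The three-dimensional embeddings below are made-up toy values (real CLIP ViT-B/32 embeddings have 512 dimensions), and the scale of 100 is an assumption about the value the released checkpoint converged to; in real code it would be read from the model rather than hard-coded.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (hypothetical values, for illustration only)
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.7, 0.6, 0.1],
    "a photo of a car": [0.0, 0.2, 1.0],
}
logit_scale = 100.0  # assumption: stands in for the model's learned scale

sims = [cosine_similarity(image_emb, t) for t in text_embs.values()]
logits = [logit_scale * s for s in sims]   # plays the role of logits_per_image
probs = softmax(logits)                    # probabilities over the candidates

for text, s, p in zip(text_embs, sims, probs):
    print(f"{text:20s} cos={s:.4f} prob={p:.4f}")
```

Because the similarities are multiplied by a large scale before the softmax, a small gap in cosine similarity becomes a large gap in probability, which is consistent with the very peaked outputs in this post.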

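Zero-Shot Classification

Load and display the test image
Compared with the previous example, this one adds one more class description, for a total of four candidate prompts. The predicted class is simply the prompt with the highest probability. No task-specific training is involved, which is what makes this zero-shot.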

# Load the person image
image = Image.open('/content/person.jpg')
plt.imshow(image)
plt.axis("off")
plt.show()

# Add one more class
candidate_labels = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a car",
    "a photo of a person"
]

# Preprocessing: the processor tokenizes the labels and resizes/normalizes the image,
# returning PyTorch tensors; padding=True pads the labels to a common length
inputs = processor(text=candidate_labels, images=image, return_tensors="pt", padding=True)

# Inference (no gradients needed)
with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

# Show the results
print("\n=== Zero-Shot Classification Results ===")
for label, p in zip(candidate_labels, probs[0]):
    print(f"{label:20s} : {p.item():.4f}")

# .item() converts the 0-dim index tensor to a plain Python int
pred = candidate_labels[probs[0].argmax().item()]
print(f"\nPredicted label: {pred}")

(Output)
=== Zero-Shot Classification Results ===
a photo of a cat : 0.0015
a photo of a dog : 0.0041
a photo of a car : 0.0008
a photo of a person : 0.9936

Predicted label: a photo of a person
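
When the candidate list grows, it can help to rank the labels and flag low-confidence predictions instead of only taking the argmax. The following is a sketch in plain Python: `top_k_labels` and the 0.5 threshold are hypothetical choices made for illustration, and the probabilities are copied from the output above.

```python
def top_k_labels(labels, probs, k=2):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


labels = [
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a car",
    "a photo of a person",
]
probs = [0.0015, 0.0041, 0.0008, 0.9936]  # values from the output above

top = top_k_labels(labels, probs, k=2)
best_label, best_prob = top[0]

threshold = 0.5  # hypothetical cut-off, tune for your own data
is_confident = best_prob >= threshold

print(top)
print(f"{best_label} (confident: {is_confident})")
```

With the tensors from the code above, probs[0].tolist() gives the plain list of floats this helper expects. Keep in mind that the softmax always sums to 1 over the candidates, so a high value only means "best among these labels"; a threshold like this is a rough filter, not a reliable "none of the above" detector.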

You can copy the code into Colab and run it. Test images (cat, dog, person) can be downloaded from the web.