Day 26：VLM QLoRA Fine-Tuning （3/4）- LLaVA微調

2025 iThome 鐵人賽

DAY 26

生成式 AI

VLM系列第 26 篇

17th鐵人賽

皮二仔

2025-10-10 23:51:55

102 瀏覽

分享至

LLaVA 在推理（inference）或對話生成時使用的提示模板為
"USER: \n{question}\nASSISTANT: {answer}"
它簡化為純文字，結構簡單，模擬真實聊天場景，方便即時互動，圖像由推理框架直接處理（例如傳入 PIL 物件或 URL）。
但是當模型微調訓練時的格式設計目標是結構化儲存，便於批次處理和模型微調。

LLaVA 微調訓練格式

LLaVA 的微調訓練格式（用於官方 LLaVA 儲存庫或 Hugging Face TRL/Transformers）是 JSONL（每行一筆 JSON），每筆樣本包含單一圖像路徑與多輪對話，。

id（選用）：樣本唯一識別碼。
image：圖像檔案路徑（字串，例如 "path/to/image.jpg"）。
conversations：對話列表（list of dict）。
- from：字串，"human"（使用者）或 "gpt"（助理）。
- value：字串，合併文字與 token 的完整提示（例如：\n誰寫了這本書？）。

載入資料集後，轉換成微調訓練格式的流程：

處理圖像：為每筆樣本的 images[0] 儲存為獨立 JPG/PNG 檔案（例如，在 images/ 資料夾下命名為 img_{index}.jpg），並記錄路徑。
轉換對話：將資料集 messages 轉換為 conversations ，對於每個訊息：

映射 role 至 from。
合併 content：如果有 type: "image"，在該位置插入 \n；否則連接所有 text。
若為多輪，確保圖像 token 只在第一個 human 訊息出現。

生成 JSONL：將轉換後的 dict 寫入檔案，每行一筆。

from tqdm import tqdm
img_dir = "./images/train"
os.makedirs(img_dir, exist_ok=True)

with open("llava_finetune.jsonl", "w", encoding="utf-8") as f:
    for idx, sample in tqdm(enumerate(small_train_dataset)):
        # 儲存圖像
        img = sample["images"][0]  # PIL.Image 物件
        # Convert RGBA to RGB if necessary
        if img.mode == 'RGBA':
            img = img.convert('RGB')
        img_path = f"{img_dir}/img_{idx}.jpg"
        img.save(img_path, "JPEG")  # 儲存為 JPG

        # 轉換對話
        conversations = []
        has_image = False
        for msg in sample["messages"]:
            from_role = "human" if msg["role"] == "user" else "gpt"
            value_parts = []
            for part in msg["content"]:
                if part["type"] == "image" and not has_image:
                    value_parts.append("<image>\n")
                    has_image = True
                elif part["text"]:
                    value_parts.append(part["text"])
            value = "".join(value_parts).strip()
            if value:
                conversations.append({"from": from_role, "value": value})

        # 構建 JSONL 樣本
        output = {
            "id": f"sample_{idx}",
            "image": img_path,
            "conversations": conversations
        }
        f.write(json.dumps(output, ensure_ascii=False) + "\n")

print("轉換完成！輸出檔案：llava_finetune.jsonl")