2025 iThome 鐵人賽

DAY 17

生成式 AI

阿，又是一個RAG系列第 17 篇

Day16: Pydantic 與 Structured Output

17th鐵人賽

poyuanchih

2025-10-01 18:26:48

152 瀏覽

分享至

Situation

今天是個全新的篇章，我們來探索一下新的主題: llm 的 Structured Output
這個系列預計會包含：
- 本篇 txt2json(I)：基礎知識、以及快速的測試一下 llama-index 的 Structured Output
- txt2json(II): 用 llm 完整的把一整份考卷 pdf 檔轉成 json
- txt2json(III): 用 llm 驗證 StructuredOutput 的結果
- txt2json(IV): 由於這個問題是有 ground-truth 的，我們可以驗證我們的驗證結果
- txt2json(V): 結果不行的話，我們再架個 workflow 來叫他反思

Task

在 day6 我們掌握了把 pdf 轉 txt 的方法，今天我們要進一步把這個 txt 轉成 json
- 釐清: 這邊完全是可以用 re 做的，但我們系列文的主要目的是產生一些不那麼 toy 但是又很 toy 的 dataset，來讓我們可以在自己熟悉的問題上評估 llm 的能力
然後我們要先來理解一下 llama-index 世界裡的 Structured Output
今天的測試 pdf 檔在這裡
- 是 114 年中醫師的考試，科目是: 中醫臨床醫學（四）(包括針灸科學)
- 如果連結壞了還是可以在考選部的網站上找到
- 我的猜想是因為 dataset 的關係，直接拿中醫相關的問題問 llm 有可能其實會做不好
  - 然後這時候就有我們 rag 的用武之地
最後今天會用到 ollama ，主要是這個任務我感覺滿簡單的不想花錢 call chatgpt
- 配置 ollama 的環境可以參考官方 Github
- 然後釐清一下怎麼 pull model: 我們用 llama-3.1:8b 還有 gemma3:12b
- 至於 llama-index 的 ollama: pip install llama-index-llms-ollama

Action:

1. 讓 llm 回我們 json 的三種 call 法

我們直接上圖：
- 圖片來源: llama_index_structured_otuputs

首先是最上面，我們直接 prompt 它，要他出 json
- 這包含了你可以把 json mode 開起來，那他就會被強制回給你 json
- 然後你也可以把 json schema 寫在 prompt 裡面
然後是中間，generic completion API
- 就是一般套件比如 llama-index 會幫我們做的事
- 他會有一個 OutputParser 給我們客製，做兩件事:
  - 在 prompt 裡面插入 json schema
  - llm 回傳的時候自動幫我們檢查是不是我們要的
- 有的時候這個 OutputParser 又會被拆出 ChatFormatter 來處理 prompt 的部分
- 我們在 Day15 實作的 ReActAgent 就是這種
- 由於在 ML 的世界裡: "只要有可能會出錯的事情，他就會出錯"，所以一般文檔會好像有點嫌棄這種方法
  - 比如像這邊他說：
    - This is notably less reliable, but supported by all text-based LLMs.
- 這個的額外好處是：他隨意的 model 都支持
最後就是支持 function calling API 的 model

我們就是在呼叫 llm 的時候把 tool 給他，API 會幫我們處理 json 的事情
我們在 Day14 實作的 FunctionAgent 就是這種
- 我們呼叫的時候是直接呼叫 chat_with_tools 然後把 tool 給他
釐清: 在本系列文中，tool calling 與 function_calling 我們視為同義
- 主要是 ollama 叫這個功能 tool calling
- 然後 OpenAI 叫 function calling
釐清: function/tool calling 與 JSON mode 的差異
- 比如就是有 model 可以開 json_mode 但是不支持 tool_calling
  - 我說的就是 gemma3:12b
那是不是全用 function calling 就好了？
- openai 說明過 function calling 預設就是會把 JSON 開啟: 這裡
- 另一方面，社群上也有人說 json mode 給了 model 太多限制，這會導致 model 的 performance 下降
- 綜合以上這兩個資訊的話，若是追求準確度的情況，到底要用哪個其實還是沒有定論
我們的目標：
- 在同一任務用不同方法來做，觀察準確率、穩定性、與開發便利性的差異，而不是預設其中一種一定更好
- In God we trust. All others must bring data.

2. 秒懂 Pydantic model

前面說到 prompt 裡面要放 json schema 來讓 llm 真的知道你要的 structured output 是什麼
那 json schema 怎麼來呢? 答案就是 Pydantic model

上 code

import pydantic  # pip show pydantic
print(f"our pydantic version: {pydantic.VERSION}")
from pprint import pprint
from typing import List, Optional, Tuple
from pydantic import BaseModel
from pydantic import Field

class Options(BaseModel):
    """單選題的選項物件，包含 A, B, C, D 四個選項"""
    A: str = Field(..., description='選項A')
    B: str = Field(..., description='選項B')
    C: str = Field(..., description='選項C')
    D: str = Field(..., description='選項D')

class MCQ(BaseModel):
    """單選題結構，包含題號、題幹、選項與答案"""
    qid: int = Field(..., description='題號')
    question: str = Field(..., description='題幹')
    options: Options = Field(..., description="本題的四個選項")
    ans: Optional[str] = Field(default=None, description='答案')

class Meta(BaseModel):
    """試題原始資訊，包含 年分、科目、第幾次考試"""
    year: Optional[int] = Field(default=None, description='第?年')
    subject: Optional[str] = Field(default=None, description='科目名稱')
    times: Optional[int] = Field(default=None, description='第?次考試')

class ExtractExam(BaseModel):
    """
    提取整份考卷

    - qset: 單選題考題集合
    - subject: 科目名稱
    - year: 考試年分
    - times: 第幾次考試
    """
    qset: List[MCQ] = Field(..., description='單選題考題')
    metadata: Meta = Field(..., description='考題資訊')

schema = MCQ.model_json_schema()
pprint(schema)

output
說明

3.0. 我們的 pydantic version 是 2.11.9（提醒：v1 和 v2 在 API 上有些差異）
3.1. 創建 Pydantic class ，就直接繼承 BaseModel: MCQ(BaseModel)
3.2. 一般型別就直接標註(int, str): qid: int
3.3. 嵌套的使用我們的 models: qset: List[MCQ]
3.4. 使用 docstring(""" """): 讓 llm 更容易理解結構
3.5. 使用 Field(..., description=): 一個是預設值，一個是描述
3.6. 用.model_json_schema() 轉成 json schema: 這就是你要塞進 prompt 的規格文件。
3.7. 釐清以上之後: 以後我們就請 chat-gpt 幫你寫就好，這個他真的很會

如果你想要看完整的說明可以參考: llama-index 的 Introduction
今天就聊到這裡，底下我們開始實驗

3. 準備資料

code

from llama_index.readers.file import PDFReader
from pathlib import Path
import time

file_path = Path("./data/114_針灸科學.pdf")
FULL_DOCUMENT=False

pdf_reader = PDFReader(return_full_document=FULL_DOCUMENT)
documents = pdf_reader.load_data(file=file_path)
print(f"len of documents: {len(documents)}")
text = documents[0].text
print(f"text len: {len(text)}")
print('---')
print(text)

結果
說明

我們這邊直接用 llama-index 的 PDFReader 來讀 pdf
- 裡面是用 pypdf source_code
我們假設 data/114_針灸科學.pdf 路徑下有載好的 pdf 考題
我們把 return_full_document 設 False 這樣就會分頁讀成 document
我們今天只會拿第一頁來測試

4. test1: calling tools directly

code

from llama_index.core.program.function_program import get_function_tool

exam_tool = get_function_tool(ExtractExam)
print(f"# tool info: ")
print(f"# name: {exam_tool.metadata.name}\n\n# description: {exam_tool.metadata.description}")
print('---')

# pip install llama-index-llms-ollama
from llama_index.llms.ollama import Ollama
llama = Ollama(
    model="llama3.1:8b",
    request_timeout=120.0,
    context_window=8000,
    temperature=0.0,
)

start = time.time()
resp = llama.chat_with_tools(
    [exam_tool],
    user_msg="請從下列文本中提取考試: " + text,
    tool_required=True,  # can optionally force the tool call
)
end = time.time()
print(f'dur: {end-start:.2f} sec')
tool_calls = llama.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)
print(f"type: {type(tool_calls)}, len: {len(tool_calls)}, dtype: {type(tool_calls[0])}")
print('---')
pprint(tool_calls[0].tool_kwargs)

result
說明

首先我們把我們前面定義的 pydantic model 包成 tool 給他讓他呼叫
這個邏輯上有點像是說:
- 有一個工具叫 ExtractExam，你被強迫使用它
- 它的 argument 規範我已經給你了
- 請呼叫這個工具吧，藉此來達成結構化資料的提取
一般其實是定 Exam 就好了，這邊定 ExtractExam 就是比較好想像為什麼 structured output 就是 function calling
我們用的是 llama3.1:latest 就是 8B 的那隻
用 chat_with_tools 呼叫，這個我們之前也有做過
我的顯卡是筆電(天選6 pro)的 5070 8G，這樣一個問題要 13.52 sec
- ~~這個 gpt-oss:20b 沒辦法跑~~
可以看到它確實就是很好的提取了我們要的資訊
連跨頁的第 5 題也就是沒有去填沒出現的選項
阿這個是我試出來的結果，它也不是每次都這麼聽話的
- 簡直替我們之後的 reflection workflow 鋪路

5. test2: allow multiple tool calls

接著我們測試一下一次性的呼叫一個工具多次，所以我們把工具改為 MCQ
我的想像是這種一次性呼叫多個工具的 data 一定是比較少，所以效果應該會變差

code

mcq_tool = get_function_tool(MCQ)
print(f"# name: {mcq_tool.metadata.name}\n\n# description: {mcq_tool.metadata.description}")

start = time.time()
resp = llama.chat_with_tools(
    [mcq_tool],
    user_msg="你是一個無情的考題提取機器，負責從文本中盡可能多的提取 MCQ，以下是文本資訊：" + text,
    tool_required=True,  # can optionally force the tool call
    allow_parallel_tool_calls=True,
)
end = time.time()
print(f"dur: {end - start:.2f} sec")
tool_calls = llama.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)
print(f'len of tool_call: {len(tool_calls)}')
print('---')
for tool_call in tool_calls:
    pprint(tool_call.tool_kwargs)

result

這邊我試了幾次都還是這樣，考慮到 llama-3.1 其實已經是上古時期的 model 了，我們就不要太為難他
- 實際上來說效果已經在預期之上

6. test3: gpt-5-mini with allow multiple tool calls

我們換 gpt-5-mini 來看一下 allow multiple tool calls 是不是壞了

code

print(f"# name: {mcq_tool.metadata.name}\n\n# description: {mcq_tool.metadata.description}")
import os
from dotenv import find_dotenv, load_dotenv
_ = load_dotenv(find_dotenv())
from llama_index.llms.openai import OpenAI
mini = OpenAI(model="gpt-5-mini")

start = time.time()
resp = mini.chat_with_tools(
    [mcq_tool],
    user_msg="你是一個無情的考題提取機器，負責從文本中盡可能多的提取 MCQ，以下是文本資訊：" + text,
    tool_required=True,  # can optionally force the tool call
    allow_parallel_tool_calls=True,
)
end = time.time()
print(f"dur: {end - start:.2f} sec")
tool_calls = mini.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)
print(f'len of tool_call: {len(tool_calls)}')
print('---')
for tool_call in tool_calls:
    pprint(tool_call.tool_kwargs)

result

果然，你大爺還是你大爺，這種問題對 mini 來說是小菜一疊
- ~~果然不是我的 prompt 不夠好~~

7. test4: gemma3: 12B without json mode

前面用的都是第一段講的第三種方法，我們現在要回到第 1 種就是直接 prompt 他，因為我們要改用 gemma3
- btw: 用 gemma3 直接執行上面的 code 的話是會直接報錯的，就是因為他不支持 tool calling
complete

import json
schema = MCQ.model_json_schema()
prompt = "Here is a JSON schema for an Exam: " + json.dumps(
    schema, indent=2, ensure_ascii=False
)

gemma = Ollama(
    model="gemma3:12b",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
    json_mode=False,
    temperature=0.0,
)

prompt += (
    """
  Extract an Exam from the following text.
  Format your output as a JSON object according to the schema above.
  Do not include any other text than the JSON object.
  Omit any markdown formatting. Do not include any preamble or explanation.
  請盡可能多的提取考題
"""
    + text
)

response = gemma.complete(prompt)

extract

import re

raw = response.text.strip()

# 把 ```json ... ``` 和 ``` 拿掉
if raw.startswith("```"):
    raw = re.sub(r"^```(?:json)?", "", raw)
    raw = re.sub(r"```$", "", raw)
    raw = raw.strip()

data = json.loads(raw)
pprint(data)

result
基本上是 5 題都跑出來了
但就是他回的是 markdown 的 json 所以還要再去掉就是了
- 這種量大的時候大概會變麻煩

8. test5: json_gemma

我們來把 json mode 開起來然後再跑一次
code:

json_gemma = Ollama(
    model="gemma3:12b",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
    json_mode=True,
    temperature=0.0,
)
response = json_gemma.complete(prompt)
json.loads(response.text)

結果:

{'qid': 1,
 'question': '常見針灸配穴法中，所指的是「四關穴」，為下列何穴位之組合？',
 'options': {'A': '上星、日月', 'B': '合谷、太衝', 'C': '內關、外關', 'D': '上關、下關'},
 'ans': None}

這次只有出了一題，難道我們驗證了江湖傳聞開 json mode 效果會變差嗎?!

9. test6. json_gemma with exam schema

這個問題是我偶然有一次發現的
實際情況應該是: 如果你要 llm 回 list of dictionary，他通常沒辦法做到，而且似乎不是能力問題
我們把 schema 改回要他提取整份考卷，而不是單個 MCQ
code

schema = ExtractExam.model_json_schema()
prompt = "Here is a JSON schema for an Exam: " + json.dumps(
    schema, indent=2, ensure_ascii=False
)

json_gemma = Ollama(
    model="gemma3:12b",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
    json_mode=True,
    temperature=0.0,
)


prompt += (
    """
  Extract an Exam from the following text.
  Format your output as a JSON object according to the schema above.
  Do not include any other text than the JSON object.
  Omit any markdown formatting. Do not include any preamble or explanation.
  請盡可能多的提取考題
"""
    + text
)

response = json_gemma.complete(prompt)
json.loads(response.text)

結果:
所以其實還是做的很好，5題都確實提取了
~~還以為我們找到絕佳範例了~~

Summary

我們今天學了讓 llm 回我們 json 的三種 call 法
還有一分鐘搞懂 Pydantic
我本來都只會開 json mode 然後瘋狂改 prompt
- ~~叫 chatgpt 改~~
簡易的測試了 6 種小情況
- llama 使用 toolcalling
- llama 使用多工具呼叫
- mini 使用多工具呼叫
- gemma 不開 json
- gemma 開 json
- json mode 的小問題
這邊都只是簡易的測試，我們明天來把量帶上去做真正的 benchmark

Others

都已經走到 day16 了，還沒有放棄，我覺得自己很棒
- ~~就不說當初規劃的主題只完成了(1/3)啦~~

Reference:

Day15: 用 llama-index 的 workflow 來把 ReActAgent 兜出來

Day17: exam_and_structured_output_dataset

系列文

阿，又是一個RAG 共 30 篇

RSS系列文訂閱系列文

1 人訂閱

完整目錄

熱門推薦

{{ item.channelVendor }} | {{ item.webinarstarted }} |

直播中

尚未有邦友留言

立即登入留言

參賽組數

902 組

團體組數

37 組

累計文章數

19838 篇

完賽人數

529 人

15th鐵人賽 16th鐵人賽 13th鐵人賽 14th鐵人賽 17th鐵人賽 12th鐵人賽 11th鐵人賽鐵人賽 2019鐵人賽 javascript 2018鐵人賽 python 2017鐵人賽 windows php c# linux windows server css react

IT邦幫忙

阿，又是一個RAG系列 第 17 篇