Day 26｜Agent Design - Agent Service Evaluation - LLM as a Judge

2025 iThome Ironman Challenge

DAY 26

Self-Challenge Track

Part 26 of the series "From Reading Notes to Deployable AI: 30 Lessons on LangChain, LangSmith, and Agent Tools"

17th Ironman Challenge

Vivianou

2025-10-10 23:57:45

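The goal, stated up front:
Once an AI system has been built, the article "Creating a LLM-as-a-Judge That Drives Business Results" draws on experience working with 30+ companies to show how to build a Judge Agent that automatically evaluates interaction data at scale.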

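Approach

Find the principal domain expert
Someone who can define what pass/fail means. Requirements: domain expertise and the ability to represent the target users.
Build a representative dataset
- Start by brainstorming test cases along three dimensions: feature × usage scenario × user persona.
- For each test case, decide which input data must be collected from real usage, or generate the input data with an LLM.
- Example of a system prompt for having an LLM generate the input: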

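Feature	Scenario	Persona	System prompt	Implicit assumption
Order tracking	Invalid data provided	Frustrated customer	In short, impatient, hurried sentences, ask for the status of order #1234567890 and hint at a previous bad experience.	This order number does not exist in the system.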

Output of this step: test scenarios and a test input dataset.
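The feature × scenario × persona brainstorm can be sketched as a simple cross product. This is only an illustrative sketch: the dimension values, `build_test_cases`, and `generation_prompt` are hypothetical names made up here, not something from the original article, and in practice you would prune combinations that make no sense for your product.

```python
from itertools import product

# Hypothetical dimension values; replace them with ones from your own product.
features = ["order tracking", "returns"]
scenarios = ["invalid data provided", "multiple matches found"]
personas = ["frustrated customer", "first-time user"]


def build_test_cases(features, scenarios, personas):
    """Return one dict per (feature, scenario, persona) combination."""
    return [
        {"feature": f, "scenario": s, "persona": p}
        for f, s, p in product(features, scenarios, personas)
    ]


def generation_prompt(case):
    """Draft a prompt asking an LLM to write a user message for this case."""
    return (
        f"Write a user message for the feature '{case['feature']}' "
        f"in the scenario '{case['scenario']}', "
        f"spoken as a {case['persona']}."
    )


test_cases = build_test_cases(features, scenarios, personas)
print(len(test_cases))
print(generation_prompt(test_cases[0]))
```

Each generated prompt would then be sent to whichever LLM you use to produce the actual test input; that call is left out here because it depends on your stack.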

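Expert Pass/Fail + short critique
- Focus on one core question: "Did the AI achieve the intended outcome?"
- Run the service against the test input dataset and let the expert review the results in whatever tool is most convenient (e.g., a Google Sheets table).
- The expert only needs to record Pass/Fail plus a short critique, for example: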

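User–AI interaction	Judgment	Critique
User: "I need to cancel my flight booking for next week." AI: "I've cancelled your flight for next week, and a confirmation email has been sent."	Pass	The AI accomplished the main goal (cancelling and sending a confirmation). However, it should have asked for confirmation before acting to avoid an accidental cancellation; adding a confirmation step would improve user safety.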

Output of this step: an expert-labeled test dataset.
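One way to hand interactions to the expert and read the labels back is a plain CSV that opens in any spreadsheet tool. The sketch below assumes a three-column layout (`interaction`, `judgment`, `critique`) mirroring the table above; the column names and the helpers `to_review_csv` and `load_labels` are hypothetical, not part of the original workflow.

```python
import csv
import io

FIELDS = ["interaction", "judgment", "critique"]
VALID_JUDGMENTS = {"pass", "fail"}


def to_review_csv(interactions):
    """Create CSV text with empty judgment/critique columns for the expert."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for text in interactions:
        writer.writerow({"interaction": text, "judgment": "", "critique": ""})
    return buffer.getvalue()


def load_labels(csv_text):
    """Parse the filled-in sheet; reject rows lacking a verdict or a critique."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        judgment = row["judgment"].strip().lower()
        critique = row["critique"].strip()
        if judgment not in VALID_JUDGMENTS or not critique:
            raise ValueError(f"Incomplete label for: {row['interaction']!r}")
        rows.append(
            {"interaction": row["interaction"], "judgment": judgment, "critique": critique}
        )
    return rows


# Blank sheet to send to the expert.
sheet = to_review_csv(["User: cancel my flight. AI: Cancelled."])

# A sheet as it might come back after review (made-up content).
filled = (
    "interaction,judgment,critique\n"
    "User: cancel. AI: Cancelled.,Pass,Goal met; should confirm first\n"
)
labels = load_labels(filled)
print(labels)
```

Requiring a critique on every row, not just a verdict, keeps the dataset usable later as few-shot material for the judge.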

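Fix errors first
- Use the critiques to fix prompt templates, retrieval, or tools. After fixing, return to step 3 (expert Pass/Fail review) and re-review the same batch of inputs, so that faulty behavior is not later fed to the judge as a positive example.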

Output of this step: a system that works correctly and passes the test scenarios.

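Build and calibrate the LLM Judge (a few high-quality few-shot examples + guidelines)
- Use 3–8 high-quality examples plus {{guidelines}} (condense the rules into grading notes so the LLM judge knows which key features to look for).
- Hold out the remaining samples for regression testing and monitoring; optionally add dynamic example retrieval that pulls only 2–5 examples into the prompt per call.
- Structure the prompt into sections with XML tags (<nlq> / <query> / <critique>).
- The team tracks the agreement between the judge agent's output and the domain expert's judgments.
- Judge Agent system prompt example: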

   You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not. You understand the nuances of the Honeycomb query language, including what is likely to be most useful from an analytics perspective.

Here is information about the Honeycomb query language: {{query_language_info}}

Here are some guidelines for evaluating queries: {{guidelines}}

Example evaluations:
<examples>
  <example-1>
    <nlq>show me traces where ip is 10.0.2.90</nlq>
    <query>
      {
        "breakdowns": ["trace.trace_id"],
        "calculations": [{"op": "COUNT"}],
        "filters": [{"column": "net.host.ip", "op": "=", "value": "10.0.2.90"}]
      }
    </query>
    <critique>
      {
        "critique": "The query correctly filters for traces with an IP address of 10.0.2.90 and counts the occurrences of those traces, grouped by trace.trace_id. The response is good as it meets the requirement of showing traces from a specific IP address without additional complexities.",
        "outcome": "good"
      }
    </critique>
  </example-1>

  <example-2>
    <nlq>show me slowest trace</nlq>
    <query>
      {
        "calculations": [{"column": "duration_ms", "op": "MAX"}],
        "orders": [{"column": "duration_ms", "op": "MAX", "order": "descending"}],
        "limit": 1
      }
    </query>
    <critique>
      {
        "critique": "While the query attempts to find the slowest trace using MAX(duration_ms) and ordering correctly, it fails to group by trace.trace_id. Without this grouping, the query only shows the MAX(duration_ms) measurement over time, not the actual slowest trace.",
        "outcome": "bad"
      }
    </critique>
  </example-2>

  <example-3>
    <nlq>count window-hash where window-hash exists per hour</nlq>
    <query>
      {
        "breakdowns": ["window-hash"],
        "calculations": [{"op": "COUNT"}],
        "filters": [{"column": "window-hash", "op": "exists"}],
        "time_range": 3600
      }
    </query>
    <critique>
      {
        "critique": "While the query correctly counts window-hash occurrences, the time_range of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say 'per hour', we need a time_range of at least 36000 seconds to show meaningful hourly patterns.",
        "outcome": "bad"
      }
    </critique>
  </example-3>
</examples>

For the following query, first write a detailed critique explaining your reasoning, then provide a pass/fail judgment in the same format as above.
<nlq>{{user_input}}</nlq>
<query>
  {{generated_query}}
</query>
<critique>
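
The prompt structure above and the agreement tracking mentioned earlier can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names (`render_examples`, `parse_verdict`, `agreement_rate`) are hypothetical, the judge is assumed to reply with the same JSON shape as the examples (possibly followed by a closing `</critique>` tag), and the replies below are made-up stand-ins for real model output.

```python
import json


def render_examples(examples):
    """Wrap each labeled example in the <nlq>/<query>/<critique> tags."""
    parts = ["<examples>"]
    for i, ex in enumerate(examples, start=1):
        verdict = json.dumps({"critique": ex["critique"], "outcome": ex["outcome"]})
        parts.append(
            f"  <example-{i}>\n"
            f"    <nlq>{ex['nlq']}</nlq>\n"
            f"    <query>{ex['query']}</query>\n"
            f"    <critique>{verdict}</critique>\n"
            f"  </example-{i}>"
        )
    parts.append("</examples>")
    return "\n".join(parts)


def parse_verdict(raw):
    """Read the judge's JSON reply and return its outcome, 'good' or 'bad'."""
    # Assumption: the reply is JSON, optionally followed by </critique>.
    body = raw.split("</critique>")[0].strip()
    outcome = json.loads(body)["outcome"].strip().lower()
    if outcome not in {"good", "bad"}:
        raise ValueError(f"Unexpected outcome: {outcome!r}")
    return outcome


def agreement_rate(expert, judge):
    """Fraction of samples where the judge matches the expert's outcome."""
    if not expert or len(expert) != len(judge):
        raise ValueError("Need two non-empty label lists of equal length")
    return sum(e == j for e, j in zip(expert, judge)) / len(expert)


examples = [
    {
        "nlq": "show me slowest trace",
        "query": '{"limit": 1}',
        "critique": "Missing group by trace.trace_id.",
        "outcome": "bad",
    }
]
print(render_examples(examples))

expert = ["good", "bad", "bad", "good"]
replies = [
    '{"critique": "ok", "outcome": "good"}',
    '{"critique": "no grouping", "outcome": "bad"}</critique>',
    '{"critique": "fine", "outcome": "good"}',
    '{"critique": "ok", "outcome": "good"}',
]
judge = [parse_verdict(r) for r in replies]
print(agreement_rate(expert, judge))  # 0.75
```

A raw agreement rate is only a starting point; when failures are rare, it is worth also looking at how many expert-labeled "bad" cases the judge catches.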