Day 16: The agent-brain test dataset (1) - ToolHop

2025 iThome Ironman Contest

DAY 16

Generative AI

agent-brain: Building a Python package from scratch, part 16 of the series

17th Ironman Contest

aquila_w

2025-09-30 22:53:08

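With 15 days left, I plan to swap out memory and net and compare the results, so we need a way to compare them objectively.

Evaluation calls for a dataset!

I thought about which properties a dataset for testing agent-brain should have: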

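Built-in tools (functions): ideally they can be executed locally, with no dependency on an extra LLM server. MCP servers are out of scope for now and left to future work.
No LLM as the evaluator: many papers at top conferences (ACL, ICLR) use LLM-as-a-judge, but
- LLMs themselves are unstable
- Scoring is slow and expensive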
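
The second requirement, scoring without an LLM judge, can be met with a deterministic comparison. Below is a minimal sketch, assuming each dataset item provides a gold answer as a string. The helpers `normalize`, `exact_match`, and `accuracy`, as well as the sample data, are hypothetical illustrations and not part of agent-brain or any dataset:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical agent outputs and gold answers, for illustration only.
predictions = ["Thursday.", " paris ", "42"]
golds = ["thursday", "Paris", "43"]
score = accuracy(predictions, golds)
```

A rule-based check like this gives the same score on every run and needs no model call, which is the point of avoiding LLM-as-a-judge. The trade-off is that it cannot credit answers that are correct but phrased differently.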

After surveying several papers today, two datasets look like the best fit.

ToolHop

Background: accepted to the ACL 2025 Main Conference; from bytedance-research (ByteDance)
paper: https://arxiv.org/abs/2501.02506
That should make it reasonably credible...
I also found the dataset on Hugging Face without any trouble.

What I mainly want to test is this kind of problem: one that can only be solved by using tools multiple times.

The core idea of ToolHop is that some questions can only be answered through multiple tool calls.
For example:
Which day of the week was the date of birth of the English inventor that developed the Richard Hornsby & Sons oil engine?
So for this question, there are three steps to carry out: