Day 16 - The Truth Behind the 504 Timeout: Breaking Down the Agent Response Latency Problem Step by Step

17th iThome Ironman Contest

uncured7036

2025-08-30 20:13:19

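In yesterday's test we were able to send 20 requests to the API Server at the same time, but the Agent's responses still came back one at a time. Today let's look at how to fix that.

Analyzing the Log

Let's start with a closer look at the log:

The 504 at the top is the API Server timing out: it received the request but got no response within 300 seconds. So the request did reach the API Server, but the call to the Agent inside it took too long. Further down there are many /api/stream_reasoning_engine entries. These are the requests that ask the Agent to generate an itinerary. Each request/response pair takes about 20 seconds, and the next one only starts after the previous one has finished. The problem is most likely in the code that calls the Agent, so let's test that part directly.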
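The numbers above already hint at why some requests end in a 504. As a rough worked example, assume all 20 requests arrive at the same moment and the Agent serves them strictly one at a time at exactly 20 seconds each (both are simplifications of what the log shows). The helper names below are made up for this illustration.

```python
def completion_times(num_requests: int, seconds_each: float) -> list[float]:
    """Completion time of each request when they are served one after another."""
    return [seconds_each * (i + 1) for i in range(num_requests)]


def timed_out(num_requests: int, seconds_each: float, timeout: float) -> list[int]:
    """1-based positions of the requests that finish after the timeout."""
    times = completion_times(num_requests, seconds_each)
    return [i + 1 for i, t in enumerate(times) if t > timeout]


if __name__ == "__main__":
    # Figures from the test: 20 requests, roughly 20 s each, 300 s timeout.
    print(timed_out(num_requests=20, seconds_each=20, timeout=300))
    # -> [16, 17, 18, 19, 20]
```

Under these assumptions the first 15 requests finish within 300 seconds and the last five do not, so serialized Agent calls alone are enough to produce 504s.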

Testing the Python Code

I first extracted the logic that calls the Agent into a small Python program:

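import asyncio
from datetime import datetime

# Assumed import: agent_engines from the Vertex AI SDK.
from vertexai import agent_engines

# QueryPayload (the API Server's request model) and AGENT_ID (the deployed
# Agent's resource ID) are assumed to be defined elsewhere in the project.


async def run_query(uid):
    print(f'{datetime.now()} - run query for {uid}')
    payload = QueryPayload(
        locations=['Tokyo'],
        startDate='2025-09-10',
        days=2,
        language='Chinese Traditional',
    )
    prompt = (
        f'Please plan a {payload.days}-days trip starting from '
        f'{payload.startDate} in {", ".join(payload.locations)}. '
        f'Please give a title of this trip. '
        f'Use {payload.language} for value of title, location, note, and name. '
        f'All remaining values should be in English. '
    )
    app = agent_engines.get(AGENT_ID)
    full_text = ""
    async for event in app.async_stream_query(
        user_id=uid,
        message=prompt,
    ):
        for resp in event['content']['parts']:
            if 'text' in resp:
                full_text = resp['text']
                # Trim the markdown code fence around the JSON.
                # find() returns -1 when there is no brace, instead of raising.
                first_brace = full_text.find('{')
                if first_brace > 0:
                    full_text = full_text[first_brace:-3]
                break
    print(f'{datetime.now()} - {uid} completed.')


async def run():
    # Start three queries at once and wait for all of them.
    await asyncio.gather(
        run_query('u1'),
        run_query('u2'),
        run_query('u3'),
    )


if __name__ == '__main__':
    asyncio.run(run())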
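When a program like this still finishes one query at a time even though it uses asyncio.gather, one common cause is a call that blocks the event loop. The sketch below is only a simulation of that effect with time.sleep and asyncio.sleep; it says nothing about what the SDK actually does internally, and every function name in it is invented for the illustration.

```python
import asyncio
import time


async def blocking_call(delay: float) -> None:
    # time.sleep never yields to the event loop, so no other coroutine can run.
    time.sleep(delay)


async def cooperative_call(delay: float) -> None:
    # asyncio.sleep yields control, so the other coroutines progress meanwhile.
    await asyncio.sleep(delay)


async def offloaded_call(delay: float) -> None:
    # asyncio.to_thread (Python 3.9+) runs the blocking work in a worker thread.
    await asyncio.to_thread(time.sleep, delay)


async def elapsed(factory, n: int = 3, delay: float = 0.2) -> float:
    """Run n calls through asyncio.gather and return the wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(factory(delay) for _ in range(n)))
    return time.perf_counter() - start


async def main() -> None:
    for factory in (blocking_call, cooperative_call, offloaded_call):
        print(f"{factory.__name__}: {await elapsed(factory):.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

With three calls of 0.2 seconds each, the blocking variant should take about 0.6 seconds in total, while the other two should finish in roughly 0.2 seconds. Timing a real snippet the same way is a quick check of whether its calls overlap.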

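Switching to curl for Testing

After some cross-testing, I found that the main bottlenecks are the two functions agent_engines.get and async_stream_query. Their internal implementation may contain some kind of waiting mechanism that prevents multiple requests from being sent at the same time. So let's set the Python API aside for now and test the service directly with curl: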

#!/bin/bash

PROJECT_ID=...
LOCATION=asia-northeast1
RESOURCE_ID=... # deployed

# Create a session for the given user ID and print "<user_id> <session_id>".
create_session() {
  sid=$(curl \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://$LOCATION-aiplatform.googleapis.com/v1/projects/$PROJECT_ID/locations/$LOCATION/reasoningEngines/$RESOURCE_ID:query" \
    -d '{"class_method": "async_create_session", "input": {"user_id": "'"$1"'"}}' 2>/dev/null | jq -r .output.id)
  echo "$1 $sid"
}

# Send one streaming query for the given user ID ($1) and session ID ($2).
run_query() {
  curl \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://$LOCATION-aiplatform.googleapis.com/v1/projects/$PROJECT_ID/locations/$LOCATION/reasoningEngines/$RESOURCE_ID:streamQuery?alt=sse" -d '{
      "class_method": "async_stream_query",
      "input": {
        "user_id": "'"$1"'",
        "session_id": "'"$2"'",
        "message": "Please plan a 2-days trip starting from 2025-09-10 in Tokyo. Please give a title of this trip. Use Chinese Traditional for value of title, location, note, and name. All remaining values should be in English."
      }
  }' 2>/dev/null
}

# Step 1: uncomment to create the sessions, then note the printed session IDs.
#create_session u1 &
#create_session u2 &
#create_session u3 &

# Step 2: replace xxxxx with the session IDs from step 1.
run_query u1 xxxxx &
run_query u2 xxxxx &
run_query u3 xxxxx &

# Wait for all background jobs to finish.
wait

First create the sessions with create_session, then call run_query to send the requests, and finally use wait to wait for all the background jobs to finish.
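The request body in the script is assembled by splicing shell variables into a JSON string, which is easy to get subtly wrong (quoting, trailing commas). As a sketch of the same request built in Python, the helper below returns the streamQuery URL and a JSON body produced by json.dumps. The function name and the sample values are hypothetical; the URL pattern and field names are copied from the curl script above.

```python
import json


def stream_query_request(project_id, location, resource_id,
                         user_id, session_id, message):
    """Return (url, json_body) matching the run_query call in the curl script."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/reasoningEngines/{resource_id}:streamQuery?alt=sse"
    )
    body = {
        "class_method": "async_stream_query",
        "input": {
            "user_id": user_id,
            "session_id": session_id,
            "message": message,
        },
    }
    # json.dumps handles quoting and escaping, and never emits trailing commas.
    return url, json.dumps(body)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    url, body = stream_query_request(
        "my-project", "asia-northeast1", "1234567890",
        "u1", "s1", "Please plan a 2-days trip starting from 2025-09-10 in Tokyo.",
    )
    print(url)
    print(body)
```

The resulting URL and body can be sent with any HTTP client, together with the same Authorization and Content-Type headers the script uses.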

Verifying the Result in Logs Explorer

In Logs Explorer, the requests no longer arrive one at a time, and they also trigger auto-scaling, which creates more Agent Engine service instances:

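Testing the Effect of Different Session IDs

I ran a few more tests. If I add more run_query calls that share the same User ID but use different Session IDs, the requests are still handled concurrently. But if both the User ID and the Session ID are the same, the responses take longer: it looks like each question waits until the previous one has been answered. This is consistent with what the documentation says, that a Session can be used to remember the conversation history.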
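One way to picture this observation is a lock per (User ID, Session ID) pair: requests in the same session queue up behind each other, while requests in different sessions run side by side. The sketch below is only a toy model of the behavior seen in the test, not how Agent Engine is implemented, and all names in it are invented for the illustration.

```python
import asyncio
import time
from collections import defaultdict


async def fake_query(locks, user_id: str, session_id: str, delay: float) -> None:
    # Assumed model: one lock per session, held while the "Agent" is answering.
    async with locks[(user_id, session_id)]:
        await asyncio.sleep(delay)  # stands in for the Agent's response time


async def run_batch(requests, delay: float = 0.2) -> float:
    """Send all (user_id, session_id) requests at once; return the elapsed time."""
    locks = defaultdict(asyncio.Lock)  # created lazily, one per session
    start = time.perf_counter()
    await asyncio.gather(*(fake_query(locks, u, s, delay) for u, s in requests))
    return time.perf_counter() - start


if __name__ == "__main__":
    same = [("u1", "s1")] * 3
    different = [("u1", "s1"), ("u1", "s2"), ("u1", "s3")]
    print(f"same session:       {asyncio.run(run_batch(same)):.2f}s")
    print(f"different sessions: {asyncio.run(run_batch(different)):.2f}s")
```

In this model, three requests in one session take about three times as long as three requests spread over three sessions, which mirrors the slower responses seen when the User ID and Session ID are both reused.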