Agent Evaluation - ihower's Notes

> [[Agent 開發知識庫]]

* Examples of agent evaluation metrics: Metrics for Evaluating AI Agents https://www.galileo.ai/blog/metrics-for-evaluating-ai-agents
    * https://v2docs.galileo.ai/concepts/metrics/agentic/agentic-overview
- https://docs.smith.langchain.com/evaluation/tutorials/agents
- LangChain AgentEvals https://github.com/langchain-ai/agentevals
    - https://x.com/LangChainAI/status/1906759734516228514 (2025/4/1)
    - Evaluators for assessing agent trajectories
* AI agent testing framework https://github.com/plurai-ai/intellagent
    * https://x.com/NirDiamantAI/status/1882080786024628279
* Arize course [[Arize Evaluating AI Agents]]
    * https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
* The AI Agent Evaluation Blueprint (2025/5/8)
    * https://galileo.ai/blog/ai-agent-evaluation-blueprint-part-1
* How to evaluate AI agents with Braintrust (2025/6/11)
    * https://www.youtube.com/watch?v=tKInkwOwk8M
* paper: Survey on Evaluation of LLM-based Agents
    * https://arxiv.org/abs/2503.16416
    * https://x.com/omarsar0/status/1939691782477902313 (2025/6/30)
* How to Setup Evals For Agents w/ Harrison Chase (2025/7/10)
    - https://maven.com/p/a58f3f/how-to-setup-evals-for-agents
    - https://claude.ai/public/artifacts/e6360a6e-1288-4d9d-83d1-d5b35c4a049d
* AI Agent Evaluation | Pratik Bhavsar, Galileo (2025/7/23)
    * https://www.youtube.com/watch?v=c5wyHzPU4yE
    * https://x.com/omarsar0/status/1947738722755027266
    * Transcript: https://claude.ai/public/artifacts/40a13373-855f-4b8b-af17-e1b5d0b3a32e
* Anthropic: Writing effective tools for agents — with agents (2025/9/11)
    * https://www.anthropic.com/engineering/writing-tools-for-agents
* LangChain & LangSmith posts (2026/1)
    * https://blog.langchain.com/evaluating-deep-agents-our-learnings/
    * https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
    * https://blog.aihao.tw/2026/02/17/langsmith-insights-agent-deep-dive/
    * https://www.langchain.com/conceptual-guides/agent-observability-powers-agent-evaluation
    * Summary: https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
* Anthropic: Demystifying evals for AI agents (2026/1/9)
    * https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
    * Summary: https://blog.aihao.tw/2026/02/17/demystifying-evals-for-ai-agents/
    * A pretty good overview
* Building agents is easy. Knowing if they work is hard. Here are 5 tips for evaluating agents
    * https://x.com/_philschmid/status/2028528775873400919 (2026/3/3)
* How we build evals for Deep Agents
    * https://x.com/Vtrivedy10/status/2037203679997018362 (2026/3/27)
    * https://x.com/Vtrivedy10/status/2037204138774167657
    * https://x.com/Vtrivedy10/status/2037367147421200693
* Agent Evaluation Readiness Checklist
    * https://www.langchain.com/blog/agent-evaluation-readiness-checklist (2026/3/27)
    * Single-step vs. Full-turn vs. Multi-turn evals
* **Evals Skills for Coding Agents**
    * https://hamel.dev/blog/posts/evals-skills/ (2026/3/2)
    * https://x.com/HamelHusain/status/2028894099483578872
    * Hamel Husain's Evals Skills for coding agents are excellent. They include:
        * eval-audit: checks what is wrong with a project's eval setup
        * error-analysis: how to do error analysis
        * generate-synthetic-data: how to synthesize test questions
        * write-judge-prompt: how to write a binary judge
        * validate-evaluator: how to assess whether a binary judge is aligned
        * evaluate-rag: how to evaluate RAG
        * build-review-interface: how to build a human annotation interface
* **The Revenge of the Data Scientist: The Harness is Data Science**
    * https://hamel.dev/blog/posts/revenge/ (2026/3/26)
    * https://x.com/HamelHusain/status/2037184894540054974
    * This post by Hamel Husain is excellent
* Pro tips: https://x.com/palashshah/status/2055716804006170925 (2026/5/17)
    * When building evals, many people want to mock every tool call, every API request, and every file the agent touches, trying to replicate the real environment perfectly. Palash considers this over-engineering. He suggests finding the "top 3 key external requests that affect the agent's trajectory" and mocking only those. Everything else sits in the context but most likely does not really change the agent's behavior path.
    * The number of external APIs that actually affect the outcome is usually much smaller than you think
    * The best evals are usually boring by design. If you need a huge harness to make a failure show up, you probably have not yet isolated the key variable
* Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments
    * https://towardsdatascience.com/building-an-evaluation-harness-for-production-ai-agents-a-12-metric-framework-from-100-deployments/
    * A healthcare evaluation case study
- Langfuse: The AI Engineering Loop
    - https://langfuse.com/academy/ai-engineering-loop#the-ai-engineering-loop
    - https://x.com/lotte_verheyden/status/2056754091817361670 (2026/5/19)
* Agent Evaluation: A Detailed Guide (Cameron R. Wolfe, Ph.D.)
    * https://cameronrwolfe.substack.com/p/agent-evals (2026/5/18)
    * https://x.com/cwolferesearch/status/2056399847553409301
    * An overview
* why we use Evals to measure agents before & after shipping to prod
    * https://x.com/Vtrivedy10/status/2057175860910964967 (2026/5/21)
* Shreya Shankar: Exploring Agent-Assisted Qualitative Analysis
    * https://www.sh-reya.com/blog/ai-qual-analysis/ (2026/5/21)
    * https://x.com/HamelHusain/status/2057875320011882923
    * Explores the difficulties of using agents for qualitative analysis (e.g. error analysis)
    * Summary: https://claude.ai/chat/cf820978-2957-43f5-91ec-da95d43ca30c
    * When the agent improvises axial codes while it reads, it keeps "paraphrasing" instead of "generalizing"
    * "The agent wanders for a while and then declares the analysis done by itself", so leave the orchestration of the loop to code, not to the agent's judgment
    * For trace analysis, the test is simple: **can each failure mode be written as a clear inclusion / exclusion criterion and turned into an automated check (LLM-as-judge or an assertion)?** If not, it is too abstract, only good for reassuring yourself, and cannot close the loop
    * Treat the taxonomy as a versioned artifact; when it changes, go back and rerun
    * Put the human in the loop at the taxonomy level, not at the per-item labeling level
        * The agent proposes the taxonomy and you make the **judgment calls** (merge, delete, adjust categories, reweight) instead of coding items one by one. The review interface should work at the cluster level, attach a few example trace links to each cluster for spot checks, and show the diff between rounds (which modes were added, which codes moved).
* **Ben Hylak: How to evaluate AI agents (2026/5)**
    * https://www.howtoeval.com/
    * Summary: https://claude.ai/chat/2fa596b6-6468-4738-9a42-a48c3a1a1668
    * **This one is quite good and has a real point of view!**
    * Raising the floor is essentially Error Analysis
    * Learn from production, tiered by traffic volume
    * Adding eval cases has diminishing returns; prune ruthlessly
    * I cite this in https://blog.aihao.tw/2026/06/02/agent-trace-analysis/
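
The trace-analysis test in the Shreya Shankar notes above (each failure mode written as an inclusion / exclusion criterion plus an automated check, inside a versioned taxonomy) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the trace shape (a list of step dicts), the `FailureMode` class, the `repeated_tool_call` mode, and the `label_trace` helper do not come from any of the linked posts, and an LLM-as-judge call could stand in for the plain assertion-style check.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trace shape: a list of step dicts such as
# {"type": "tool_call", "name": "search", "args": {"q": "a"}}.
Trace = list[dict]


@dataclass(frozen=True)
class FailureMode:
    name: str
    inclusion: str  # when a trace counts as this failure
    exclusion: str  # look-alike cases that do not count
    check: Callable[[Trace], bool]  # returns True if the trace shows the failure


def has_repeated_tool_call(trace: Trace) -> bool:
    """Assertion-style check: two consecutive tool calls with the same name and args."""
    calls = [(s["name"], s.get("args")) for s in trace if s["type"] == "tool_call"]
    return any(a == b for a, b in zip(calls, calls[1:]))


# Bump the version whenever a mode is added, merged, or redefined, then rerun all traces.
TAXONOMY_VERSION = "v1"
TAXONOMY = [
    FailureMode(
        name="repeated_tool_call",
        inclusion="The same tool is called twice in a row with identical arguments.",
        exclusion="Retries with changed arguments, or calls separated by another tool call.",
        check=has_repeated_tool_call,
    ),
]


def label_trace(trace: Trace) -> list[str]:
    """Return the names of all failure modes whose check fires on this trace."""
    return [mode.name for mode in TAXONOMY if mode.check(trace)]


if __name__ == "__main__":
    looping = [
        {"type": "tool_call", "name": "search", "args": {"q": "a"}},
        {"type": "tool_call", "name": "search", "args": {"q": "a"}},
    ]
    print(TAXONOMY_VERSION, label_trace(looping))
```

A failure mode that cannot be given a `check` like this is a signal that it is still too abstract to close the loop.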