> [[Agent 開發知識庫]]

* Examples of agent evaluation metrics: Metrics for Evaluating AI Agents https://www.galileo.ai/blog/metrics-for-evaluating-ai-agents
    * https://v2docs.galileo.ai/concepts/metrics/agentic/agentic-overview
    - https://docs.smith.langchain.com/evaluation/tutorials/agents
    - LangChain AgentEvals https://github.com/langchain-ai/agentevals
        - https://x.com/LangChainAI/status/1906759734516228514 (2025/4/1)
        - Evaluators for assessing agent trajectories
* AI agent testing framework https://github.com/plurai-ai/intellagent
    * https://x.com/NirDiamantAI/status/1882080786024628279
* Arize course [[Arize Evaluating AI Agents]]
    * https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
* The AI Agent Evaluation Blueprint (2025/5/8)
    * https://galileo.ai/blog/ai-agent-evaluation-blueprint-part-1
* How to evaluate AI agents with Braintrust (2025/6/11)
    * https://www.youtube.com/watch?v=tKInkwOwk8M
* paper: Survey on Evaluation of LLM-based Agents
    * https://arxiv.org/abs/2503.16416
    * https://x.com/omarsar0/status/1939691782477902313 (2025/6/30)
* How to Setup Evals For Agents w/ Harrison Chase (2025/7/10)
    - https://maven.com/p/a58f3f/how-to-setup-evals-for-agents
    - https://claude.ai/public/artifacts/e6360a6e-1288-4d9d-83d1-d5b35c4a049d
* AI Agent Evaluation | Pratik Bhavsar, Galileo (2025/7/23)
    * https://www.youtube.com/watch?v=c5wyHzPU4yE
    * https://x.com/omarsar0/status/1947738722755027266
    * Transcript: https://claude.ai/public/artifacts/40a13373-855f-4b8b-af17-e1b5d0b3a32e
* Anthropic: Writing effective tools for agents — with agents (2025/9/11)
    * https://www.anthropic.com/engineering/writing-tools-for-agents
* LangChain & LangSmith posts (2026/1)
    * https://blog.langchain.com/evaluating-deep-agents-our-learnings/
    * https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
    * https://blog.aihao.tw/2026/02/17/langsmith-insights-agent-deep-dive/
    * https://www.langchain.com/conceptual-guides/agent-observability-powers-agent-evaluation
        * Summary: https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
* Anthropic: Demystifying evals for AI agents (2026/1/9)
    * https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
    * Summary: https://blog.aihao.tw/2026/02/17/demystifying-evals-for-ai-agents/
    * A pretty good overview
* Building agents is easy. Knowing if they work is hard. Here are 5 tips for evaluating agents
    * https://x.com/_philschmid/status/2028528775873400919 (2026/3/3)
* How we build evals for Deep Agents
    * https://x.com/Vtrivedy10/status/2037203679997018362 (2026/3/27)
    * https://x.com/Vtrivedy10/status/2037204138774167657
    * https://x.com/Vtrivedy10/status/2037367147421200693
* Agent Evaluation Readiness Checklist
    * https://www.langchain.com/blog/agent-evaluation-readiness-checklist (2026/3/27)
    * Single-step vs. Full-turn vs. Multi-turn evals
* Evals Skills for Coding Agents
    * https://hamel.dev/blog/posts/evals-skills/ (2026/3/2)
    * https://x.com/HamelHusain/status/2028894099483578872
    * Evals skills for coding agents, released by Hamel Husain
* The Revenge of the Data Scientist: The Harness is Data Science
    * https://hamel.dev/blog/posts/revenge/ (2026/3/26)
    * https://x.com/HamelHusain/status/2037184894540054974
    * This post by Hamel Husain is excellent
* Pro tips: https://x.com/palashshah/status/2055716804006170925 (2026/5/17)
    * When building evals, many people try to mock every tool call, every API request, and every file the agent touches, in an attempt to replicate the real environment perfectly. Palash considers this over-engineering. He suggests identifying the top 3 external requests that shape the agent's trajectory and mocking only those; everything else sits in the context but most likely does not change the path the agent takes.
    * The number of external APIs that truly affect the outcome is usually much smaller than you expect
    * The best evals are usually boring by design. If you need a huge harness to make a failure show up, you probably have not isolated the key variable yet
* Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments
    * https://towardsdatascience.com/building-an-evaluation-harness-for-production-ai-agents-a-12-metric-framework-from-100-deployments/
    * A healthcare evaluation case study
- Langfuse: The AI Engineering Loop
    - https://langfuse.com/academy/ai-engineering-loop#the-ai-engineering-loop
    - https://x.com/lotte_verheyden/status/2056754091817361670 (2026/5/19)
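
Palash's tip about mocking only the few external requests that steer the agent can be sketched roughly as below. This is a minimal illustration, not code from the linked thread: `SelectiveMockEnv`, `toy_agent`, `run_case`, and the tool names are all hypothetical, and the "agent" is a hard-coded stand-in for a real LLM loop. The idea is that only the trajectory-shaping call (`check_inventory` here) gets a scenario-specific mock, while every other tool call receives a generic placeholder response.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

ToolFn = Callable[..., Any]


@dataclass
class SelectiveMockEnv:
    """Hypothetical tool environment: mocks only the key calls, stubs the rest."""

    key_mocks: Dict[str, ToolFn]          # the few calls assumed to steer the agent
    default_response: Any = "ok"          # generic answer for every other tool
    calls: List[Tuple[str, dict]] = field(default_factory=list)

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        # Record every call so the trajectory can be inspected afterwards.
        self.calls.append((tool_name, kwargs))
        mock = self.key_mocks.get(tool_name)
        if mock is not None:
            return mock(**kwargs)
        return self.default_response


def toy_agent(env: SelectiveMockEnv) -> str:
    """Stand-in for a real agent: its path depends only on the inventory lookup."""
    env.call("log_event", message="start")       # does not affect the path
    stock = env.call("check_inventory", sku="A-1")
    if stock > 0:
        env.call("create_order", sku="A-1")
        return "ordered"
    env.call("notify_user", text="out of stock")
    return "declined"


def run_case(stock: int) -> Tuple[str, List[str]]:
    """Run one eval case, varying only the single key external response."""
    env = SelectiveMockEnv(key_mocks={"check_inventory": lambda sku: stock})
    outcome = toy_agent(env)
    return outcome, [name for name, _ in env.calls]
```

Each case stays deliberately small: `run_case(0)` and `run_case(3)` differ only in the one mocked response, so any change in outcome or trajectory can be attributed to that variable.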