> [[Agent 開發知識庫]]
* Examples of agent evaluation metrics: Metrics for Evaluating AI Agents https://www.galileo.ai/blog/metrics-for-evaluating-ai-agents
* https://v2docs.galileo.ai/concepts/metrics/agentic/agentic-overview
- https://docs.smith.langchain.com/evaluation/tutorials/agents
- LangChain AgentEvals https://github.com/langchain-ai/agentevals
- https://x.com/LangChainAI/status/1906759734516228514 (2025/4/1)
- Evaluators for assessing agent trajectories
* AI agent testing framework https://github.com/plurai-ai/intellagent
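
As a rough illustration of what a trajectory evaluator checks, here is a minimal sketch in plain Python. It does not use the agentevals API; `ToolCall`, `trajectory_match`, and the three mode names are hypothetical names chosen for this note, and only tool names (not arguments) are compared.

```python
from typing import TypedDict


class ToolCall(TypedDict):
    """One step of an agent trajectory (hypothetical shape for this sketch)."""
    name: str
    args: dict


def trajectory_match(
    actual: list[ToolCall],
    expected: list[ToolCall],
    mode: str = "strict",
) -> bool:
    """Compare the tool names of two trajectories.

    strict:    same tool names in the same order
    unordered: same tool names, any order
    superset:  actual contains every expected tool name (extras allowed)
    """
    actual_names = [call["name"] for call in actual]
    expected_names = [call["name"] for call in expected]

    if mode == "strict":
        return actual_names == expected_names
    if mode == "unordered":
        return sorted(actual_names) == sorted(expected_names)
    if mode == "superset":
        remaining = list(actual_names)
        for name in expected_names:
            if name not in remaining:
                return False
            remaining.remove(name)  # each expected call consumes one actual call
        return True
    raise ValueError(f"unknown mode: {mode}")


expected = [
    {"name": "search", "args": {"q": "weather"}},
    {"name": "summarize", "args": {}},
]
actual = [
    {"name": "search", "args": {"q": "weather"}},
    {"name": "fetch_page", "args": {}},
    {"name": "summarize", "args": {}},
]

print(trajectory_match(actual, expected, mode="strict"))
print(trajectory_match(actual, expected, mode="superset"))
```

A strict match is the most brittle choice, since any harmless extra step fails the run; a superset check tolerates detours as long as the required calls happen.
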
* https://x.com/NirDiamantAI/status/1882080786024628279
* Arize course [[Arize Evaluating AI Agents]]
* https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
* The AI Agent Evaluation Blueprint (2025/5/8)
* https://galileo.ai/blog/ai-agent-evaluation-blueprint-part-1
* How to evaluate AI agents with Braintrust (2025/6/11)
* https://www.youtube.com/watch?v=tKInkwOwk8M
* paper: Survey on Evaluation of LLM-based Agents
* https://arxiv.org/abs/2503.16416
* https://x.com/omarsar0/status/1939691782477902313 (2025/6/30)
* How to Setup Evals For Agents w/ Harrison Chase (2025/7/10)
- https://maven.com/p/a58f3f/how-to-setup-evals-for-agents
- https://claude.ai/public/artifacts/e6360a6e-1288-4d9d-83d1-d5b35c4a049d
* AI Agent Evaluation | Pratik Bhavsar, Galileo (2025/7/23)
* https://www.youtube.com/watch?v=c5wyHzPU4yE
* https://x.com/omarsar0/status/1947738722755027266
* Transcript: https://claude.ai/public/artifacts/40a13373-855f-4b8b-af17-e1b5d0b3a32e
* Anthropic: Writing effective tools for agents — with agents (2025/9/11)
* https://www.anthropic.com/engineering/writing-tools-for-agents
* LangChain & LangSmith write-ups (2026/1)
* https://blog.langchain.com/evaluating-deep-agents-our-learnings/
* https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
* https://blog.aihao.tw/2026/02/17/langsmith-insights-agent-deep-dive/
* https://www.langchain.com/conceptual-guides/agent-observability-powers-agent-evaluation
* Summary: https://blog.aihao.tw/2026/02/17/traces-new-source-of-truth/
* Anthropic: Demystifying evals for AI agents (2026/1/9)
* https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
* Summary: https://blog.aihao.tw/2026/02/17/demystifying-evals-for-ai-agents/
* A pretty good overview
* Building agents is easy. Knowing if they work is hard. Here are 5 tips for evaluating agents
* https://x.com/_philschmid/status/2028528775873400919 (2026/3/3)
* How we build evals for Deep Agents
* https://x.com/Vtrivedy10/status/2037203679997018362 (2026/3/27)
* https://x.com/Vtrivedy10/status/2037204138774167657
* https://x.com/Vtrivedy10/status/2037367147421200693
* Agent Evaluation Readiness Checklist
* https://www.langchain.com/blog/agent-evaluation-readiness-checklist (2026/3/27)
* Single-step vs. Full-turn vs. Multi-turn evals
* Evals Skills for Coding Agents
* https://hamel.dev/blog/posts/evals-skills/ (2026/3/2)
* https://x.com/HamelHusain/status/2028894099483578872
* Hamel Husain's evals skills for coding agents
* The Revenge of the Data Scientist: The Harness is Data Science
* https://hamel.dev/blog/posts/revenge/ (2026/3/26)
* https://x.com/HamelHusain/status/2037184894540054974
* This piece by Hamel Husain is excellent
* Pro tips: https://x.com/palashshah/status/2055716804006170925 (2026/5/17)
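* Many people building evals try to mock every tool call, every API request, and every file the agent touches, in an attempt to replicate the real environment perfectly. Palash considers this over-engineering. He suggests identifying the top 3 key external requests that affect the agent's trajectory and mocking only those; everything else sits in the context but most likely won't change the agent's path.
* The number of external APIs that actually affect the outcome is usually far smaller than you would expect
* The best evals are usually boring by design. If you need a huge harness to make a failure show up, you probably haven't isolated the key variable yet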
* Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments
* https://towardsdatascience.com/building-an-evaluation-harness-for-production-ai-agents-a-12-metric-framework-from-100-deployments/
* A healthcare evaluation case study
- Langfuse: The AI Engineering Loop
- https://langfuse.com/academy/ai-engineering-loop#the-ai-engineering-loop
- https://x.com/lotte_verheyden/status/2056754091817361670 (2026/5/19)
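
Palash's tip above (mock only the few external requests that steer the agent's trajectory) could look roughly like the sketch below. Everything here is an assumption made for illustration: the tool names, the canned responses, and the `make_tool_router` helper are hypothetical and do not come from the linked thread.

```python
from typing import Callable

# Hypothetical: the three external requests judged to steer the agent's path.
# Each gets a hand-written, realistic response.
KEY_MOCKS: dict[str, Callable[[dict], dict]] = {
    "search_orders": lambda args: {"orders": [{"id": "A1", "status": "delayed"}]},
    "get_refund_policy": lambda args: {"refund_window_days": 30},
    "issue_refund": lambda args: {"ok": True, "order_id": args["order_id"]},
}

# Every other tool gets the same boring stub instead of a faithful mock.
GENERIC_STUB = {"status": "ok"}


def make_tool_router(mocks: dict[str, Callable[[dict], dict]]):
    """Return a tool dispatcher plus a log of the tool names it was asked for."""
    calls: list[str] = []

    def route(name: str, args: dict) -> dict:
        calls.append(name)  # record the trajectory for later assertions
        handler = mocks.get(name)
        if handler is not None:
            return handler(args)
        return dict(GENERIC_STUB)

    return route, calls


route, calls = make_tool_router(KEY_MOCKS)
print(route("issue_refund", {"order_id": "A1"}))  # carefully mocked
print(route("log_event", {"msg": "refund issued"}))  # generic stub
print(calls)
```

The recorded `calls` list doubles as the trajectory to assert on, so the harness stays small: three realistic mocks, one stub, and a log.
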