* This note focuses on evaluation at the prompt and application level
* Related notes
* [[LLM Evaluation]]
* [[RAG Evaluation]]
* [[Aligning Evaluation with Human Preferences]]
* Evaluation Metrics for LLM Applications In Production — introduces a variety of approaches
* https://docs.parea.ai/blog/eval-metrics-for-llm-apps-in-prod
* RAG tasks
* Agent tasks, e.g. goal completion and information content
* Summarization tasks
* Evaluation of LLMs (2024/2)
* https://blog.premai.io/evaluation-of-llms-part-1/ — various metrics
* https://blog.premai.io/evaluation-of-llms-part-2/ — scoring with an LLM
* LLM as a Judge: Numeric Score Evals are Broken!!!
* https://twitter.com/aparnadhinak/status/1748368364395721128
* Recommends hand-writing several grading rubric items and having the model score each item 0/1 (or 0/0.5/1), outputting its reason before giving each score (see the sketch after this item)
* https://twitter.com/9hills/status/1787439509665403252 2024/5/6
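* A minimal sketch of that rubric style, assuming a hypothetical `call_llm(prompt) -> str` helper (swap in your own OpenAI/Anthropic client) and assuming the judge returns valid JSON; the point is the per-item 0/1 scores with the reason emitted before each score, not the exact prompt wording:

```python
# Sketch of a rubric-based LLM-as-judge: each hand-written criterion is scored
# 0/1, and the judge is asked for its reason *before* the score.
# `call_llm(prompt) -> str` is a hypothetical helper standing in for your own
# OpenAI / Anthropic / ... client; the prompt wording is illustrative only.
import json

RUBRIC = [
    "The answer directly addresses the user's question.",
    "The answer makes no claim that contradicts the provided context.",
    "The answer is written in the same language as the question.",
]

JUDGE_TEMPLATE = """You are grading an assistant's answer against a rubric.

Question: {question}
Answer: {answer}

For each rubric item, first explain your reasoning, then give a score of 0 or 1.
Respond with a JSON list of objects: {{"item": <index>, "reason": "...", "score": 0 or 1}}

Rubric:
{rubric}"""


def judge(question: str, answer: str, call_llm) -> float:
    rubric_text = "\n".join(f"{i}. {item}" for i, item in enumerate(RUBRIC))
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer, rubric=rubric_text)
    raw = call_llm(prompt)          # hypothetical LLM call
    results = json.loads(raw)       # assumes the judge returned valid JSON
    # Final score = fraction of rubric items passed
    return sum(r["score"] for r in results) / len(RUBRIC)
```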
* Curated collection of eval / llm-as-judge resources https://x.com/eugeneyan/status/1803224103782064170 2024/6/19
* Anthropic's "Create strong empirical evaluations" documentation for Claude
* https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
* LLM Evaluation doesn't need to be complicated (2024/7/11)
* https://www.philschmid.de/llm-evaluation
* LLM as a Judge: using a language model to evaluate output quality (2024/8/25)
* https://ywctech.net/ml-ai/paper-llm-as-a-judge/
* G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (2023/3)
* https://arxiv.org/abs/2303.16634
* Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (2024/4)
* https://arxiv.org/abs/2404.18796
* https://cohere.com/research/papers/replacing-judges-with-juries-evaluating-llm-generations-with-a-panel-of-diverse-models-2024-04-29
* Uses a panel of several smaller models rather than a single large model as the evaluator (see the pooling sketch after this item)
* PoLL = Command-R + GPT-3.5 + Claude-Haiku
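* A minimal pooling sketch in the spirit of PoLL, assuming a hypothetical `ask_judge(model, prompt) -> str` helper; the panel names simply mirror the models listed above and the prompt wording is illustrative:

```python
# Sketch of a Panel of LLM evaluators (PoLL): several smaller judge models each
# give a binary verdict and the verdicts are pooled (here by simple averaging).
# `ask_judge(model, prompt) -> str` is a hypothetical helper for your own client
# code; the panel names just mirror the models mentioned above.
from statistics import mean

PANEL = ["command-r", "gpt-3.5-turbo", "claude-3-haiku"]

JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Is the candidate answer correct? Reply with a single digit: 1 for correct, 0 for incorrect."""


def poll_score(question, reference, candidate, ask_judge) -> float:
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    votes = []
    for model in PANEL:
        raw = ask_judge(model, prompt)  # one call per panel member (hypothetical helper)
        votes.append(1 if raw.strip().startswith("1") else 0)
    return mean(votes)                  # pooled verdict, e.g. 0.67 if 2 of 3 judges agree
```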
* Tip: evaluation also needs a CoT reasoning step; the judge should not just output the final score directly
* In the [[Waves in AI 2024]] talk on using large language models to automatically grade student assignments, Prof. Hung-yi Lee (李宏毅) reached the same conclusion from his experiments!!
* Later published as a paper: A Closer Look into Using Large Language Models for Automatic Evaluation
* https://aclanthology.org/2023.findings-emnlp.599/
* Enhancing LLM-as-a-Judge with Grading Notes (2024/7/22)
* https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes
* The evaluation prompt includes Grading Notes, i.e. short task-specific grading guidelines (see the sketch after this item)
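* A minimal sketch of judging against per-example Grading Notes, again assuming a hypothetical `call_llm(prompt) -> str` helper; the idea is that each eval example carries its own short guideline and the judge explains itself before returning PASS or FAIL:

```python
# Sketch of judging against per-example Grading Notes: each eval example carries
# its own short, free-form grading guideline, and the judge explains itself
# before answering PASS or FAIL. `call_llm(prompt) -> str` is again a
# hypothetical helper; the prompt wording is illustrative only.
GRADING_NOTES_TEMPLATE = """You are judging whether a response satisfies the grading notes.

Grading notes:
{notes}

Response to grade:
{response}

First explain how the response does or does not satisfy each point in the notes,
then answer on the last line with exactly PASS or FAIL."""


def grade_with_notes(response: str, notes: str, call_llm) -> bool:
    raw = call_llm(GRADING_NOTES_TEMPLATE.format(notes=notes, response=response))
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    return bool(lines) and lines[-1].strip().upper() == "PASS"
```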
* Using LLMs for Evaluation (2024/7/22)
* https://cameronrwolfe.substack.com/p/llm-as-a-judge
* https://x.com/cwolferesearch/status/1815405425866518846
* Anthropic's "Prompt evaluations" course for Claude
* https://github.com/anthropics/courses/tree/master/prompt_evaluations
* Recommends the https://www.promptfoo.dev/ tool
* Task-Specific LLM Evals that Do & Don't Work (2024/3)
* Task-specific evals for classification, summarization, translation, copyright, and toxicity
* https://eugeneyan.com/writing/evals/
* Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge) (2024/8)
* https://eugeneyan.com/writing/llm-evaluators/
* Creating a LLM-as-a-Judge That Drives Business Results (2024/10/29)
* https://hamel.dev/blog/posts/llm-judge/
## App-related
* Evaluating LLM Applications (2024/2/6)
* https://humanloop.com/blog/evaluating-llm-apps
* A fairly high-level overview of evaluating LLM apps
* Your AI Product Needs Evals (2024/3/29)
* https://hamel.dev/blog/posts/evals/
* Hands-on, practical guide to running evals
* Data Flywheels for LLM Applications (2024/7/1)
* https://www.sh-reya.com/blog/ai-engineering-flywheel/
* The Definitive Guide to Testing (2024/8/13)
* A short ebook published by LangChain
* https://www.langchain.com/testing-guide-ebook
* https://x.com/LangChainAI/status/1823388812975956100
* Large Language Model System Evals in the Wild (2024/9), an ebook
* https://forestfriends.gumroad.com/l/001_llm_system_evals_in_the_wild?layout=profile
## Prompt
* from https://blog.langchain.dev/the-prompt-landscape/
* https://smith.langchain.com/hub/simonp/model-evaluator?ref=blog.langchain.dev
* https://smith.langchain.com/hub/wfh/automated-feedback-example?ref=blog.langchain.dev
* https://smith.langchain.com/hub/smithing-gold/assumption-checker?ref=blog.langchain.dev