* This entry focuses on evaluation at the prompt and application level
* Related entries
  * [[LLM Evaluation]]
  * [[RAG Evaluation]]
  * [[Aligning Evaluation with Human Preferences]]
* Evaluation Metrics for LLM Applications In Production: introduces various evaluation approaches
  * https://docs.parea.ai/blog/eval-metrics-for-llm-apps-in-prod
  * RAG tasks
  * Agent tasks, e.g. goal attainment and informativeness
  * Summarization tasks
* Evaluation of LLMs (2024/2)
  * https://blog.premai.io/evaluation-of-llms-part-1/ (various metrics)
  * https://blog.premai.io/evaluation-of-llms-part-2/ (scoring with an LLM)
* LLM as a Judge: Numeric Score Evals are Broken!!!
  * https://twitter.com/aparnadhinak/status/1748368364395721128
  * Recommends hand-writing multiple grading criteria and having the model score each criterion 0/1 (or 0/0.5/1), outputting its reason before giving the score; see the first sketch under "Code sketches" below
    * https://twitter.com/9hills/status/1787439509665403252 2024/5/6
* eval/llm-as-judge resource roundup https://x.com/eugeneyan/status/1803224103782064170 2024/6/19
* Claude's "Create strong empirical evaluations" documentation
  * https://docs.anthropic.com/en/docs/build-with-claude/develop-tests
* LLM Evaluation doesn't need to be complicated (2024/7/11)
  * https://www.philschmid.de/llm-evaluation
* LLM as a Judge: using a language model to evaluate output quality (2024/8/25)
  * https://ywctech.net/ml-ai/paper-llm-as-a-judge/
* G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (2023/3)
  * https://arxiv.org/abs/2303.16634
* Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (2024/4)
  * https://arxiv.org/abs/2404.18796
  * https://cohere.com/research/papers/replacing-judges-with-juries-evaluating-llm-generations-with-a-panel-of-diverse-models-2024-04-29
  * Uses several small models rather than one large model for evaluation; see the panel sketch below
    * PoLL = Command-R + GPT-3.5 + Claude-Haiku
* Tip: evaluation also needs a CoT reasoning process; the judge should not output only the final score
  * The experiments by Prof. Hung-yi Lee (李宏毅) on using large language models to automatically grade student assignments, presented at [[Waves in AI 2024]], reached the same conclusion!
  * This later became a paper: A Closer Look into Using Large Language Models for Automatic Evaluation
    * https://aclanthology.org/2023.findings-emnlp.599/
* Enhancing LLM-as-a-Judge with Grading Notes (2024/7/22)
  * https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes
  * Attaches Grading Notes, a grading guide, inside the evaluation prompt; see the grading-notes sketch below
* Using LLMs for Evaluation (2024/7/22)
  * https://cameronrwolfe.substack.com/p/llm-as-a-judge
  * https://x.com/cwolferesearch/status/1815405425866518846
* Claude's "Prompt evaluations" course
  * https://github.com/anthropics/courses/tree/master/prompt_evaluations
  * Recommends the https://www.promptfoo.dev/ tool
* Task-Specific LLM Evals that Do & Don't Work (2024/3)
  * Task-specific evaluations for classification, summarization, translation, copyright, and toxicity
  * https://eugeneyan.com/writing/evals/
* Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge) (2024/8)
  * https://eugeneyan.com/writing/llm-evaluators/
* Creating a LLM-as-a-Judge That Drives Business Results (2024/10/29)
  * https://hamel.dev/blog/posts/llm-judge/

## App-related

* Evaluating LLM Applications (2024/2/6)
  * https://humanloop.com/blog/evaluating-llm-apps
  * A fairly high-level overview of LLM app evaluation
* Your AI Product Needs Evals (2024/3/29)
  * https://hamel.dev/blog/posts/evals/
  * Hands-on evaluation in practice
* Data Flywheels for LLM Applications (2024/7/1)
  * https://www.sh-reya.com/blog/ai-engineering-flywheel/
* The Definitive Guide to Testing (2024/8/13)
  * A short ebook published by LangChain
  * https://www.langchain.com/testing-guide-ebook
  * https://x.com/LangChainAI/status/1823388812975956100
* Large Language Model System Evals in the Wild (2024/9) ebook
  * https://forestfriends.gumroad.com/l/001_llm_system_evals_in_the_wild?layout=profile

## Prompt

* from https://blog.langchain.dev/the-prompt-landscape/
  * https://smith.langchain.com/hub/simonp/model-evaluator?ref=blog.langchain.dev
  * https://smith.langchain.com/hub/wfh/automated-feedback-example?ref=blog.langchain.dev
  * https://smith.langchain.com/hub/smithing-gold/assumption-checker?ref=blog.langchain.dev
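
## Code sketches

A few of the entries above describe judging techniques concretely enough to sketch. First, the rubric advice from the "Numeric Score Evals are Broken" thread: break the judgment into hand-written criteria, score each one 0/1 instead of asking for a single 1-10 number, and make the judge state its reason before the score (which is also the CoT tip above). A minimal sketch, assuming an OpenAI-style client; the judge model and rubric wording are illustrative, not from the thread:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hand-written criteria (illustrative; replace with rubric items for your task)
RUBRIC = [
    "The answer directly addresses the user's question.",
    "Every factual claim is supported by the provided context.",
    "The answer contains no irrelevant padding.",
]

# The reason comes BEFORE the score, so the judgment benefits from the
# model's own chain of thought
JUDGE_TEMPLATE = """You are grading one criterion of an answer.
Criterion: {criterion}
Question: {question}
Answer: {answer}

First explain your reasoning in 1-2 sentences, then give the score.
Reply as JSON: {{"reason": "...", "score": 0 or 1}}"""

def judge(question: str, answer: str) -> float:
    """Score each rubric criterion 0/1 and return the fraction passed."""
    scores = []
    for criterion in RUBRIC:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model, not from the thread
            temperature=0,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
                criterion=criterion, question=question, answer=answer)}],
        )
        scores.append(json.loads(resp.choices[0].message.content)["score"])
    return sum(scores) / len(scores)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```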
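
Next, the PoLL idea from "Replacing Judges with Juries": pool verdicts from a panel of diverse, cheaper judge models instead of trusting one large judge. The paper's panel is Command-R + GPT-3.5 + Claude Haiku, each behind its own vendor SDK; to stay self-contained this sketch builds the panel from one assumed vendor, and the yes/no judge prompt is an illustration, not the paper's:

```python
from collections import Counter
from typing import Callable

from openai import OpenAI

# A panel member takes (question, reference, answer) and returns "yes"/"no"
Judge = Callable[[str, str, str], str]

def make_openai_judge(model: str) -> Judge:
    """One concrete panel member; Cohere / Anthropic members would look the
    same with their own SDKs."""
    client = OpenAI()
    def judge(question: str, reference: str, answer: str) -> str:
        prompt = (
            f"Question: {question}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Is the candidate answer correct? Reply with exactly 'yes' or 'no'."
        )
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip().lower()
    return judge

def poll_verdict(panel: list[Judge], question: str, reference: str,
                 answer: str) -> str:
    """Majority vote across the panel; an odd-sized panel of three diverse
    models (as in the paper) avoids ties."""
    votes = Counter(j(question, reference, answer) for j in panel)
    return votes.most_common(1)[0][0]

panel = [make_openai_judge(m) for m in ("gpt-4o-mini", "gpt-3.5-turbo")]
print(poll_verdict(panel, "What is 2 + 2?", "4", "The answer is 4."))
```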
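
Finally, the Grading Notes idea from the Databricks post: rather than a generic rubric, each eval question carries short, question-specific grading guidance that is pasted into the judge prompt. The note wording and the pass/fail verdict format here are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def grade_with_notes(question: str, answer: str, grading_notes: str) -> str:
    """Judge an answer against question-specific grading notes."""
    prompt = f"""Grade the answer using the grading notes.

Question: {question}
Answer: {answer}
Grading notes: {grading_notes}

Briefly explain whether the answer satisfies the notes, then end with a
single line reading exactly 'VERDICT: pass' or 'VERDICT: fail'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = resp.choices[0].message.content.strip()
    return "pass" if reply.endswith("VERDICT: pass") else "fail"

# The grading notes are written per question by a domain expert
print(grade_with_notes(
    "How do I rotate an API key?",
    "Open Settings > API Keys and click Rotate.",
    "Must mention the Settings page and warn that the old key stops working.",
))
```

In all three sketches the judge explains itself before delivering the verdict, matching the CoT tip above.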