* This entry focuses on evaluation at the LLM model level
* Explanation of LLM Model Evals vs. LLM Task Evals (see the sketch after this list)
* https://x.com/aparnadhinak/status/1752763354320404488 (2024/2/1)
* https://arize.com/blog-course/large-language-model-evaluations-vs-llm-task-evaluations-in-llm-application-development/
* Related entries
* [[Prompt Evaluation]]
* [[RAG Evaluation]]
* [[Aligning Evaluation with Human Preferences]]
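A minimal sketch of the distinction, under my own assumptions (the `generate` helper, benchmark items, and task checks below are placeholders, not taken from the linked posts): a model eval scores the model against a generic benchmark with fixed answers, while a task eval scores your application's outputs against task-specific criteria.

```python
# Sketch: model eval vs. task eval (illustrative only).

def generate(prompt: str) -> str:
    """Placeholder for whatever calls your model or API."""
    raise NotImplementedError

# Model eval: generic benchmark items with one correct answer (MMLU-style multiple choice).
benchmark = [
    {"question": "2 + 2 = ? (A) 3 (B) 4", "answer": "B"},
]

def model_eval(items):
    correct = sum(generate(it["question"]).strip().startswith(it["answer"]) for it in items)
    return correct / len(items)

# Task eval: your own application's inputs, scored by checks specific to your task
# (here, a toy check that a summary mentions a required keyword and stays short).
task_inputs = [
    {"doc": "...", "must_mention": "refund"},
]

def task_eval(items):
    ok = 0
    for it in items:
        out = generate(f"Summarize in one sentence: {it['doc']}")
        ok += (it["must_mention"] in out.lower()) and (len(out) < 200)
    return ok / len(items)
```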
* Evaluate LLMs and RAG a practical example using Langchain and Hugging Face
* https://www.philschmid.de/evaluate-llm
* https://twitter.com/virattt/status/1765362096793931779 (2024/3/6)
* paper: Leveraging Large Language Models for NLG Evaluation: A Survey (2024/1)
* https://arxiv.org/abs/2401.07103
* OpenAI simple-eval
* https://github.com/openai/simple-evals
* Successful language model evals (2024/5/24)
* https://www.jasonwei.net/blog/evals
* MixEval
* https://mixeval.github.io/
* https://www.philschmid.de/evaluate-llm-mixeval
* Open LLM Leaderboard v2
* https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
* https://www.latent.space/p/benchmarks-201 2024/7/12
* The LLM Evaluation guidebook
* https://github.com/huggingface/evaluation-guidebook
* https://x.com/clefourrier/status/1844323838517252172 (2024/10/10)
* A guidebook from Hugging Face
* The Ultimate LLM Benchmark list
* https://x.com/scaling01/status/1919092778648408363 (2025/5/5)
* https://simple-bench.com/index.html
* https://aidanbench.com/
* Understanding the 4 Main Approaches to LLM Evaluation (From Scratch) 2025/10/5
* https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches
## 九原客 (@9hills)
https://twitter.com/9hills/status/1781841760538493360 (2024/4/21)
> As more and more models approach GPT-4-level performance, the main evaluation methods can no longer tell them apart effectively:
> 1. MMLU: everyone scores 80+, so it has lost its discriminative power.
> 2. MT-Bench: the judge is GPT-4, which is not capable enough to tell apart the abilities of these models.
> 3. Arena Elo: the battles are mostly ordinary conversational tasks, so the Elo score is heavily influenced by how well a model is aligned with human preferences, and the questions are not hard enough to separate models at this level.
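To make the Arena Elo point concrete, here is a toy sketch of the classic Elo update used for pairwise battles (the real leaderboard computation is more elaborate, e.g. a Bradley-Terry-style fit, so this is only illustrative): when two similar-strength models keep trading wins on easy prompts, their ratings never separate.

```python
# Toy Elo update for pairwise LLM battles (illustrative only).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the updated ratings after one battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two near-GPT-4-level models trading wins on easy prompts: ratings barely separate.
ra, rb = 1200.0, 1200.0
for i in range(100):
    ra, rb = update(ra, rb, a_won=(i % 2 == 0))
print(ra, rb)  # both stay close to 1200
```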
https://x.com/9hills/status/1801496139902161266 (2024/6/14) recommends some evaluation frameworks
## Paper
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
https://twitter.com/cwolferesearch/status/1782453549223321660
https://arxiv.org/abs/2306.05685
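A minimal sketch of the single-answer-grading flavor of LLM-as-a-judge (the judge prompt, the `gpt-4o` model name, and the score parsing are my own simplifications, not the paper's exact MT-Bench templates), using the OpenAI Python client:

```python
# Ask a strong model to grade an answer on a 1-10 scale and parse the rating.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the question "
        "on a scale of 1 to 10. Reply with the rating in the form: Rating: [[N]]\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"\[\[(\d+)\]\]", text)
    return int(match.group(1)) if match else -1  # -1 means the judge reply was unparseable
```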
Leveraging Large Language Models for NLG Evaluation: A Survey
https://twitter.com/omarsar0/status/1748016227090305167
https://twitter.com/helloiamleonie/status/1785293595521511867 2024/4/30
https://twitter.com/op7418/status/1785653861757452458
Multiple LLM judges
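For setups with multiple LLM judges (panel / jury style), the simplest aggregations are a mean score or a majority vote across judges; a sketch under the assumption that each judge is already a callable like the one above:

```python
# Aggregate verdicts from several LLM judges (sketch; `judges` would be
# judge callables, possibly backed by different models).
from collections import Counter
from statistics import mean

def panel_score(question, answer, judges):
    """Mean of 1-10 scores from a list of judge callables."""
    return mean(j(question, answer) for j in judges)

def panel_verdict(votes):
    """Majority vote over pass/fail (or A/B) verdicts from several judges."""
    return Counter(votes).most_common(1)[0][0]

# e.g. panel_verdict(["A", "B", "A"]) -> "A"
```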
---
Evals with reference answers (ground truth)
Evals without reference answers (see the sketch below)
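A rough illustration of the split (my own example, not from a specific source): with a reference answer you can score deterministically; without one you fall back to rubric-based judging by a human or an LLM judge.

```python
# Sketch of the two scoring regimes.

def score_with_reference(output: str, reference: str) -> bool:
    """Deterministic check when a gold answer exists (exact match after normalization)."""
    return output.strip().lower() == reference.strip().lower()

def score_without_reference(output: str, rubric: str, judge) -> int:
    """No gold answer: delegate to a rubric-based judge (a human, or an LLM-as-judge callable)."""
    return judge(f"Rubric: {rubric}\n\nOutput to grade: {output}")
```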
Also, eval tools that target the LLM itself are not really what we care about here; for example, OpenAI's evals targets the LLM, not your prompt.
* Evals techniques
- [x] Study Evals https://github.com/openai/evals
- Q: Is this the same thing as https://autoevaluator.langchain.com/? A: No, they are different
- https://github.com/langchain-ai/auto-evaluator/tree/main
- https://github.com/rlancemartin/auto-evaluator
- https://huggingface.co/spaces/rlancemartin/auto-evaluator
- Take a closer look at LangChain Plus later
* Should mention the tools from my PE talk that treat evals as unit testing (see the pytest sketch below)
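The "evals as unit tests" idea, as a minimal pytest-style sketch (the `run_prompt` helper, test cases, and assertions are hypothetical placeholders for whatever your application does):

```python
# test_prompt.py -- treat prompt evals like unit tests and run them with pytest.
import pytest

def run_prompt(ticket_text: str) -> str:
    """Placeholder for your actual prompt + model call."""
    raise NotImplementedError

CASES = [
    ("I want my money back for order #123", "refund"),
    ("The app crashes when I log in", "bug"),
]

@pytest.mark.parametrize("ticket,expected_label", CASES)
def test_ticket_classification(ticket, expected_label):
    out = run_prompt(ticket)
    assert expected_label in out.lower()
```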
### Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com/