* This entry focuses on evaluation at the LLM model level
* Explanation of LLM Model Evals vs. LLM Task Evals (see the sketch after this list)
* https://x.com/aparnadhinak/status/1752763354320404488 (2024/2/1)
* https://arize.com/blog-course/large-language-model-evaluations-vs-llm-task-evaluations-in-llm-application-development/
* Related entries
* [[Prompt Evaluation]]
* [[RAG Evaluation]]
* [[Aligning Evaluation with Human Preferences]]
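A minimal sketch of the distinction, under my own assumptions (the `generate` helper, benchmark items, and task checks below are placeholders, not taken from the linked posts): a model eval scores the model against a generic benchmark with fixed answers, while a task eval scores your application's outputs against task-specific criteria.

```python
# Sketch: model eval vs. task eval (illustrative only).

def generate(prompt: str) -> str:
    """Placeholder for whatever calls your model or API."""
    raise NotImplementedError

# Model eval: generic benchmark items with one correct answer (MMLU-style multiple choice).
benchmark = [
    {"question": "2 + 2 = ? (A) 3 (B) 4", "answer": "B"},
]

def model_eval(items):
    correct = sum(generate(it["question"]).strip().startswith(it["answer"]) for it in items)
    return correct / len(items)

# Task eval: your own application's inputs, scored by checks specific to your task
# (here, a toy check that a summary mentions a required keyword and stays short).
task_inputs = [
    {"doc": "...", "must_mention": "refund"},
]

def task_eval(items):
    ok = 0
    for it in items:
        out = generate(f"Summarize in one sentence: {it['doc']}")
        ok += (it["must_mention"] in out.lower()) and (len(out) < 200)
    return ok / len(items)
```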
* Evaluate LLMs and RAG a practical example using Langchain and Hugging Face
* https://www.philschmid.de/evaluate-llm
* https://twitter.com/virattt/status/1765362096793931779 (2024/3/6)
* paper: Leveraging Large Language Models for NLG Evaluation: A Survey (2024/1)
* https://arxiv.org/abs/2401.07103
* OpenAI simple-eval
* https://github.com/openai/simple-evals
* Successful language model evals (2024/5/24)
* https://www.jasonwei.net/blog/evals
* MixEval
* https://mixeval.github.io/
* https://www.philschmid.de/evaluate-llm-mixeval
* Open LLM Leaderboard v2
* https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
* https://www.latent.space/p/benchmarks-201 2024/7/12
* The LLM Evaluation guidebook
* https://github.com/huggingface/evaluation-guidebook
* https://x.com/clefourrier/status/1844323838517252172 (2024/10/10)
* A guidebook from Hugging Face
* The Ultimate LLM Benchmark list
* https://x.com/scaling01/status/1919092778648408363 (2025/5/5)
* https://simple-bench.com/index.html
* https://aidanbench.com/
* Understanding the 4 Main Approaches to LLM Evaluation (From Scratch) 2025/10/5
* https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches
## 九原客 (@9hills)
https://twitter.com/9hills/status/1781841760538493360 (2024/4/21)
> As more and more models approach GPT-4-level performance, the main evaluation methods can no longer tell them apart effectively:
> 1. MMLU: everyone scores 80+, so it has lost its discriminative power.
> 2. MT-Bench: the judge is GPT-4, which is not capable enough to tell apart the abilities of these models.
> 3. Arena Elo: the battles are mostly ordinary conversational tasks, so the Elo score is heavily influenced by how well a model is aligned with human preferences, and the questions are not hard enough to separate models at this level.
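To make the Arena Elo point concrete, here is a toy sketch of the classic Elo update used for pairwise battles (the real leaderboard computation is more elaborate, e.g. a Bradley-Terry-style fit, so this is only illustrative): when two similar-strength models keep trading wins on easy prompts, their ratings never separate.

```python
# Toy Elo update for pairwise LLM battles (illustrative only).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the updated ratings after one battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two near-GPT-4-level models trading wins on easy prompts: ratings barely separate.
ra, rb = 1200.0, 1200.0
for i in range(100):
    ra, rb = update(ra, rb, a_won=(i % 2 == 0))
print(ra, rb)  # both stay close to 1200
```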
https://x.com/9hills/status/1801496139902161266 (2024/6/14) recommends some evaluation frameworks
## Paper
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
https://twitter.com/cwolferesearch/status/1782453549223321660
https://arxiv.org/abs/2306.05685
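A minimal sketch of the single-answer-grading flavor of LLM-as-a-judge (the judge prompt, the `gpt-4o` model name, and the score parsing are my own simplifications, not the paper's exact MT-Bench templates), using the OpenAI Python client:

```python
# Ask a strong model to grade an answer on a 1-10 scale and parse the rating.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    prompt = (
        "You are an impartial judge. Rate the assistant's answer to the question "
        "on a scale of 1 to 10. Reply with the rating in the form: Rating: [[N]]\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    match = re.search(r"\[\[(\d+)\]\]", text)
    return int(match.group(1)) if match else -1  # -1 means the judge reply was unparseable
```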
Leveraging Large Language Models for NLG Evaluation: A Survey
https://twitter.com/omarsar0/status/1748016227090305167
https://twitter.com/helloiamleonie/status/1785293595521511867 2024/4/30
https://twitter.com/op7418/status/1785653861757452458
Multiple LLM judges
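For setups with multiple LLM judges (panel / jury style), the simplest aggregations are a mean score or a majority vote across judges; a sketch under the assumption that each judge is already a callable like the one above:

```python
# Aggregate verdicts from several LLM judges (sketch; `judges` would be
# judge callables, possibly backed by different models).
from collections import Counter
from statistics import mean

def panel_score(question, answer, judges):
    """Mean of 1-10 scores from a list of judge callables."""
    return mean(j(question, answer) for j in judges)

def panel_verdict(votes):
    """Majority vote over pass/fail (or A/B) verdicts from several judges."""
    return Counter(votes).most_common(1)[0][0]

# e.g. panel_verdict(["A", "B", "A"]) -> "A"
```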
---
Evals with reference answers (ground truth)
Evals without reference answers (see the sketch below)
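A rough illustration of the split (my own example, not from a specific source): with a reference answer you can score deterministically; without one you fall back to rubric-based judging by a human or an LLM judge.

```python
# Sketch of the two scoring regimes.

def score_with_reference(output: str, reference: str) -> bool:
    """Deterministic check when a gold answer exists (exact match after normalization)."""
    return output.strip().lower() == reference.strip().lower()

def score_without_reference(output: str, rubric: str, judge) -> int:
    """No gold answer: delegate to a rubric-based judge (a human, or an LLM-as-judge callable)."""
    return judge(f"Rubric: {rubric}\n\nOutput to grade: {output}")
```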
Also, eval tools that target the LLM itself are not really what we care about here; for example, OpenAI's evals targets the LLM, not your prompt.
* Evals techniques
- [x] Study Evals https://github.com/openai/evals
- Q: Is this the same thing as https://autoevaluator.langchain.com/? A: No, they are different
- https://github.com/langchain-ai/auto-evaluator/tree/main
- https://github.com/rlancemartin/auto-evaluator
- https://huggingface.co/spaces/rlancemartin/auto-evaluator
- Take a closer look at LangChain Plus later
* Should mention the tools from my PE talk that treat evals as unit testing (see the pytest sketch below)
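The "evals as unit tests" idea, as a minimal pytest-style sketch (the `run_prompt` helper, test cases, and assertions are hypothetical placeholders for whatever your application does):

```python
# test_prompt.py -- treat prompt evals like unit tests and run them with pytest.
import pytest

def run_prompt(ticket_text: str) -> str:
    """Placeholder for your actual prompt + model call."""
    raise NotImplementedError

CASES = [
    ("I want my money back for order #123", "refund"),
    ("The app crashes when I log in", "bug"),
]

@pytest.mark.parametrize("ticket,expected_label", CASES)
def test_ticket_classification(ticket, expected_label):
    out = run_prompt(ticket)
    assert expected_label in out.lower()
```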
### Can Language Models Resolve Real-World GitHub Issues?
https://www.swebench.com/