LLMOps tools that support evaluation

## LangWatch

https://langwatch.ai/

- The server side is open source
- Emphasizes DSPy support

## Langfuse

https://langfuse.com/

- The server side is open source

## Opik

https://www.comet.com/site/products/opik/

- Open source
- The UI is simple and attractive
- Covers all the expected features, though each is fairly basic; all of them appear to be accessible through an API
- Datasets seem to lack an editing feature (not verified)

## LangSmith

https://docs.smith.langchain.com/old/cookbook/testing-examples/rag_eval

## TruLens

https://www.trulens.org/

## Ragas

https://github.com/explodinggradients/ragas

- Notebook: https://colab.research.google.com/github/explodinggradients/ragas/blob/main/docs/quickstart.ipynb
- Blog: https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/
- Blog: https://cobusgreyling.medium.com/combining-ragas-rag-assessment-tool-with-langsmith-e46078001f95
- Florian's introductory article: https://ai.plainenglish.io/advanced-rag-03-using-ragas-llamaindex-for-rag-evaluation-84756b82dca7

## Deepeval

https://github.com/confident-ai/deepeval

## continuous-eval

- https://github.com/relari-ai/
- https://www.relari.ai/

## braintrust

Recommended by Jason Liu.

https://www.braintrust.dev/

## UpTrain

- https://www.llamaindex.ai/blog/supercharge-your-llamaindex-rag-pipeline-with-uptrain-evaluations (2024/3/19)
- https://uptrain.ai/
- Ships with many built-in metrics and also provides a UI
- A YC-backed company

## Parea.ai

https://www.parea.ai/

## Azure's evaluation features

* https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-approach-gen-ai
* https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in
    * Covers a range of metrics, including ones for RAG
    * Includes prompts that can be used as a reference

## Arize Phoenix

- https://github.com/Arize-ai/phoenix
- https://app.phoenix.arize.com/

## Comet

## Weights & Biases

## Others

- https://github.com/lmnr-ai/lmnr
- https://github.com/helicone/helicone
- https://github.com/Scale3-Labs/langtrace
- https://www.confident-ai.com/

## Traditional evaluation tools

* https://github.com/cvangysel/pytrec_eval
* https://ir-measur.es/en/latest/
* https://x.com/jobergum/status/1794996654958854219
* https://pyterrier.readthedocs.io/en/latest/installation.html
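
The traditional evaluation tools listed above score a ranked list of retrieved documents against relevance judgments. As a rough illustration of what that kind of metric measures, here is a minimal pure-Python sketch of reciprocal rank and nDCG@k. It does not use the API of pytrec_eval or any other tool listed here. The function names, the linear-gain and log2-discount formulation, and the sample data are all assumptions made for illustration.

```python
import math


def reciprocal_rank(ranked_doc_ids, relevance):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, relevance, k):
    """DCG of the top-k results divided by the DCG of an ideal top-k ordering."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical relevance judgments (doc id -> graded relevance) and one ranked run.
sample_relevance = {"d1": 2, "d2": 0, "d3": 1}
sample_run = ["d2", "d3", "d1"]

if __name__ == "__main__":
    print("RR:", reciprocal_rank(sample_run, sample_relevance))
    print("nDCG@3:", round(ndcg_at_k(sample_run, sample_relevance, 3), 4))
```

In a RAG setting the same idea applies to the retrieval step: the ranked list is the set of retrieved chunks, and the relevance judgments come from a labeled dataset. For real evaluations, the dedicated libraries are the better choice, since they handle many queries at once and match established metric definitions.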