* https://github.com/argilla-io/distilabel
* https://github.com/bespokelabsai/curator
* https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data

## QA for RAG

* Related note: [[RAG Benchmark]]
* Evaluating Chunking Strategies for Retrieval https://research.trychroma.com/evaluating-chunking
  * Synthesizes questions and reference excerpts, but no synthetic answers
* MultiHop-RAG, also listed in [[RAG Benchmark]]: https://github.com/yixuantt/MultiHop-RAG
  * Uses synthetic questions, but there doesn't seem to be a detailed prompt or generation process? The GitHub repo just ships the dataset
  * Answers are short phrases; accuracy is computed by exact matching (see the exact-match sketch at the end of this note)
* Bulk Generation of Synthetic Data
  * Generated with Instructor (see the Instructor sketch at the end of this note)
  * https://python.useinstructor.com/examples/batch_job_oai/
* [ ] ragas ships a built-in synthetic testset generator; looks promising, worth a closer look! (see the ragas sketch at the end of this note)
  * TestsetGenerator
* [ ] Check whether other RAG evaluation frameworks ship their own question-synthesis methods
* RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework (2024/8)
  * https://x.com/omarsar0/status/1820507831491239978
  * https://arxiv.org/abs/2408.01262v1
* The first lecture of the [[Systematically Improving RAG Applications]] course has an example
* DSPy Synthetic data
  * https://twitter.com/ndzfs/status/1764249868347072845 (2024/3/3)
  * https://medium.com/thoughts-on-machine-learning/pure-dspy-based-synthetic-prompt-optimization-e11520c61382 (2024/3/2)
* Giskard has a Testset Generation feature (see the Giskard sketch at the end of this note)
  * Generates diverse questions from any document, not just simple ones: they range from easy to hard, and from precise to high-level to conversational
  * https://x.com/llama_index/status/1809625112477593654 2024/7/7
  * https://x.com/llama_index/status/1831370329777959273 2024/9/5
  * https://docs.giskard.ai/en/stable/open_source/testset_generation/testset_generation/index.html
* paper: Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems (2024/11)
  * Retrieval quality varies a lot across question types, so an unbalanced dataset composition can lead to completely wrong conclusions when evaluating a RAG system
  * https://arxiv.org/abs/2411.19710v1

## For [[Fine-Tune]]

* from https://blog.langchain.dev/the-prompt-landscape/
  * QA pair https://smith.langchain.com/hub/homanp/question-answer-pair?ref=blog.langchain.dev
  * Generation of Q/A Pair Training Data with AI Personality Injection https://smith.langchain.com/hub/gitmaxd/synthetic-training-data?ref=blog.langchain.dev
* How to Generate and Use Synthetic Data for Finetuning
  * https://eugeneyan.com/writing/synthetic/

## For LLMs

* paper: Best Practices and Lessons Learned on Synthetic Data for Language Models
  * https://arxiv.org/abs/2404.07503
  * https://twitter.com/omarsar0/status/1778804848038683066 2024/4/12
  * Seems to be mainly about synthetic data for model training

## Others

* Can LLMs Design Good Questions Based on Context? (2025/1)
  * https://arxiv.org/abs/2501.03491
  * https://x.com/omarsar0/status/1877008618207560049 (2025/1/8)
  * Without constraining the question types, observes what kinds of questions LLMs tend to produce
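
## Code sketches

The MultiHop-RAG item above scores short-phrase answers by exact matching. A minimal sketch of that kind of scoring; the normalization rules (lowercasing, stripping punctuation) are my own assumption, not taken from the MultiHop-RAG repo:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def phrase_match(prediction: str, gold_phrase: str) -> bool:
    """True if the normalized gold phrase appears in the normalized prediction."""
    return normalize(gold_phrase) in normalize(prediction)


def accuracy(predictions: list[str], gold_phrases: list[str]) -> float:
    """Fraction of predictions that contain their gold phrase."""
    hits = sum(phrase_match(p, g) for p, g in zip(predictions, gold_phrases))
    return hits / len(predictions) if predictions else 0.0


if __name__ == "__main__":
    preds = ["The company was founded in 2015.", "Paris"]
    golds = ["2015", "London"]
    print(accuracy(preds, golds))  # 0.5
```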
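
For the "Bulk Generation of Synthetic Data" item, a minimal sketch of generating QA pairs with Instructor. It only loosely follows the linked Instructor example (that page covers OpenAI batch jobs); the model name, prompt, and Pydantic schema below are my own assumptions:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel


class QAPair(BaseModel):
    question: str
    answer: str


class QASet(BaseModel):
    pairs: list[QAPair]


# Patch the OpenAI client so responses are parsed into Pydantic models.
client = instructor.from_openai(OpenAI())


def generate_qa_pairs(chunk: str, n: int = 3) -> list[QAPair]:
    """Ask the LLM for n QA pairs grounded in a single document chunk."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any structured-output-capable model works
        response_model=QASet,
        messages=[
            {
                "role": "system",
                "content": "Write question-answer pairs that are answerable strictly from the given passage.",
            },
            {
                "role": "user",
                "content": f"Passage:\n{chunk}\n\nWrite {n} question-answer pairs.",
            },
        ],
    )
    return result.pairs


if __name__ == "__main__":
    chunk = "Chroma's chunking study compares recall across several chunking strategies."
    for qa in generate_qa_pairs(chunk):
        print(qa.question, "->", qa.answer)
```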
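
For the ragas to-do, a sketch of its TestsetGenerator. The ragas API has shifted between releases, so treat the wrapper classes and parameter names (`llm`, `embedding_model`, `testset_size`) as assumptions to check against the installed version; the document path is a placeholder:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

# Load the corpus the synthetic questions should be grounded in.
docs = DirectoryLoader("docs/", glob="**/*.md", loader_cls=TextLoader).load()

# Wrap LangChain models so ragas can drive generation and embedding.
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Synthesize a small testset of question / reference-context / answer rows.
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
print(testset.to_pandas().head())
```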
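
For Giskard's Testset Generation, a sketch following the docs linked above; the DataFrame contents, column name, question count, and `agent_description` text are placeholders, and an LLM/embedding client is assumed to be configured (e.g. via `OPENAI_API_KEY`):

```python
import pandas as pd

from giskard.rag import KnowledgeBase, generate_testset

# One row per document chunk; the column name is up to you.
df = pd.DataFrame(
    {
        "text": [
            "Chunk 1 of the product documentation ...",
            "Chunk 2 of the product documentation ...",
        ]
    }
)

knowledge_base = KnowledgeBase.from_pandas(df, columns=["text"])

# Giskard mixes simple, complex, and conversational question types.
testset = generate_testset(
    knowledge_base,
    num_questions=30,
    agent_description="A chatbot answering questions about the product documentation",
)

testset.save("rag_testset.jsonl")
print(testset.to_pandas().head())
```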