* https://github.com/argilla-io/distilabel
* https://github.com/bespokelabsai/curator
* https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data
## QA for RAG
* Related note: [[RAG Benchmark]]
* In Evaluating Chunking Strategies for Retrieval https://research.trychroma.com/evaluating-chunking
* This one includes synthetic questions and reference excerpts, but no synthetic answers
* MultiHop-RAG from [[RAG Benchmark]] https://github.com/yixuantt/MultiHop-RAG
* Uses synthetic questions, but the prompts and generation process don't seem to be documented in detail? The GitHub repo just provides the dataset
* Answers are short phrases; accuracy is computed by exact-match comparison
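The phrase-level exact-match scoring mentioned above can be sketched roughly as follows. This is a minimal sketch; the normalization rules (lowercasing, whitespace collapsing) are my assumption, not necessarily what MultiHop-RAG actually does:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences don't count as mismatches (assumed normalization)."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions: list[str], gold_answers: list[str]) -> float:
    """Fraction of predictions that exactly match the gold
    short-phrase answer after normalization."""
    assert len(predictions) == len(gold_answers)
    hits = sum(normalize(p) == normalize(g)
               for p, g in zip(predictions, gold_answers))
    return hits / len(predictions)
```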
* Bulk Generation of Synthetic Data
* Generated with Instructor
* https://python.useinstructor.com/examples/batch_job_oai/
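The linked Instructor example targets the OpenAI Batch API. A minimal sketch of assembling the batch-request JSONL for bulk QA generation; the prompt wording and model name here are placeholders of my own, not Instructor's defaults:

```python
import json

def build_batch_requests(chunks: list[str], model: str = "gpt-4o-mini") -> str:
    """Build an OpenAI Batch API JSONL payload: one QA-generation
    request per source chunk, each tagged with a custom_id so
    results can be matched back to their chunk."""
    lines = []
    for i, chunk in enumerate(chunks):
        lines.append(json.dumps({
            "custom_id": f"qa-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system",
                     "content": "Generate one question-answer pair grounded in the given passage."},
                    {"role": "user", "content": chunk},
                ],
            },
        }))
    return "\n".join(lines)
```

The resulting string would be uploaded as the batch input file.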
* [ ] ragas ships a built-in synthetic dataset generator; worth a closer look, seems promising!
* TestsetGenerator
* [ ] Look into other RAG evaluation frameworks: do any ship built-in question-synthesis methods?
* RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework (2024/8)
* https://x.com/omarsar0/status/1820507831491239978
* https://arxiv.org/abs/2408.01262v1
* The first lecture of the [[Systematically Improving RAG Applications]] course has an example
* DSPy Synthetic data
* https://twitter.com/ndzfs/status/1764249868347072845 (2024/3/3)
* https://medium.com/thoughts-on-machine-learning/pure-dspy-based-synthetic-prompt-optimization-e11520c61382 (2024/3/2)
* Giskard has a Testset Generation feature
* Generates diverse questions for any document, not just simple ones; questions range from easy to hard, and from precise to high-level to conversational
* https://x.com/llama_index/status/1809625112477593654 2024/7/7
* https://x.com/llama_index/status/1831370329777959273 2024/9/5
* https://docs.giskard.ai/en/stable/open_source/testset_generation/testset_generation/index.html
* paper: Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems (2024/11)
* Retrieval performance varies a lot across question types, so if your evaluation dataset has an unbalanced composition, you can draw completely wrong conclusions about your RAG system
* https://arxiv.org/abs/2411.19710v1
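One practical takeaway from the point above: report scores per question type instead of a single aggregate, so an unbalanced test set can't hide weak performance on one category. A minimal sketch with a result schema I made up for illustration:

```python
from collections import defaultdict

def per_type_accuracy(results: list[dict]) -> dict[str, float]:
    """Break an aggregate score down by question type.
    Each result looks like {"type": "multi_hop", "correct": True}
    (hypothetical schema, not from the paper)."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r["type"]].append(r["correct"])
    # Mean accuracy within each question-type bucket
    return {t: sum(v) / len(v) for t, v in buckets.items()}
```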
## For [[Fine-Tune]]
* from https://blog.langchain.dev/the-prompt-landscape/
* QA pair https://smith.langchain.com/hub/homanp/question-answer-pair?ref=blog.langchain.dev
* Generation of Q/A Pair Training Data with AI Personality Injection https://smith.langchain.com/hub/gitmaxd/synthetic-training-data?ref=blog.langchain.dev
* How to Generate and Use Synthetic Data for Finetuning
* https://eugeneyan.com/writing/synthetic/
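Once synthetic QA pairs exist, using them for fine-tuning is mostly a serialization step. A sketch of writing pairs into the chat-format JSONL commonly used for supervised fine-tuning (one `{"messages": [...]}` object per line); the exact format your fine-tuning stack expects may differ:

```python
import json

def qa_pairs_to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize synthetic (question, answer) pairs into
    chat-format JSONL training data, one example per line."""
    lines = []
    for question, answer in pairs:
        lines.append(json.dumps({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }))
    return "\n".join(lines)
```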
## For LLMs
* paper: Best Practices and Lessons Learned on Synthetic Data for Language Models
* https://arxiv.org/abs/2404.07503
* https://twitter.com/omarsar0/status/1778804848038683066 2024/4/12
* Seems to be mainly about synthetic data for model training
## Other
* Can LLMs Design Good Questions Based on Context? (2025/1)
* https://arxiv.org/abs/2501.03491
* https://x.com/omarsar0/status/1877008618207560049 (2025/1/8)
* Observes what types of questions LLMs tend to produce when the question type is not explicitly constrained
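A crude stand-in for that kind of analysis: tally generated questions by their leading interrogative word. This heuristic is my own and much simpler than the paper's categorization:

```python
from collections import Counter

def tally_question_types(questions: list[str]) -> Counter:
    """Rough tally of questions by leading wh-word; anything
    else (yes/no questions etc.) falls into "other"."""
    wh_words = {"what", "why", "how", "when", "where", "who", "which"}
    counts: Counter = Counter()
    for q in questions:
        words = q.strip().lower().split()
        first = words[0] if words else ""
        counts[first if first in wh_words else "other"] += 1
    return counts
```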