* The raw chunk is not necessarily the best text to embed for retrieval. We can store several alternative representations of the same document, for example:
    * Smaller chunks (i.e. [[Parent Document Retriever]])
    * Summary of the document (document summary index)
    * Hypothetical questions
        * Especially useful for FAQ-style question-answering systems: matching a "question" against the "most similar question" improves accuracy more than matching a "question" against the "most similar answer"
        * How: for each document or passage, ask the AI to generate a list of questions that it can answer
    * Manually specified text snippets

https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector?ref=blog.langchain.dev

* Multi-Vector Retriever for RAG on tables, text, and images
    * https://blog.langchain.dev/semi-structured-multi-modal-rag/
* Hierarchical Indices
    * https://medium.com/@nirdiamant21/hierarchical-indices-enhancing-rag-systems-43c06330c085
    * https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/hierarchical_indices.ipynb
    * https://github.com/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques_runnable_scripts/hierarchical_indices.py

## LlamaIndex's Document Summary Index

* Document Summary Index
    * https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html
    * https://medium.com/llamaindex-blog/a-new-document-summary-index-for-llm-powered-qa-systems-9a32ece2f9ec
    * Generate a summary for each document and embed the summary
    * Retrieval stage: search over the summaries, then return the original full document to the LLM
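
The pattern running through these notes (embed a derived view such as a summary or a hypothetical question, but hand the original document to the LLM) can be sketched in plain Python. This is a minimal illustration, not the LangChain or LlamaIndex API: `MultiVectorIndex`, `embed`, and `cosine` are hypothetical names, the sample documents are made up, and the bag-of-words `embed` is only a stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MultiVectorIndex:
    """Hypothetical helper: several searchable views per document, one stored original."""

    def __init__(self):
        self.docstore = {}  # doc_id -> full original text (what the LLM receives)
        self.views = []     # (vector, doc_id) pairs; several per document

    def add(self, doc_id, full_text, views):
        self.docstore[doc_id] = full_text
        for view in views:
            self.views.append((embed(view), doc_id))

    def retrieve(self, query, top_k=1):
        q = embed(query)
        # Rank the views (summaries, questions), not the raw documents.
        ranked = sorted(self.views, key=lambda pair: cosine(q, pair[0]), reverse=True)
        seen = []
        for _, doc_id in ranked:
            if doc_id not in seen:  # several views can point at the same parent
                seen.append(doc_id)
            if len(seen) == top_k:
                break
        # Return the full parent documents, not the matched views.
        return [self.docstore[doc_id] for doc_id in seen]


index = MultiVectorIndex()
index.add(
    "auth",
    "Account security guide. Open Settings, choose Security, then follow the recovery steps.",
    views=[
        "Summary: login, password reset and two-factor authentication",
        "How do I reset my password?",
        "How do I enable two-factor authentication?",
    ],
)
index.add(
    "billing",
    "Billing guide. Charges are collected monthly and receipts are sent by email.",
    views=[
        "Summary: invoices, payment methods and refunds",
        "How do I get a refund?",
        "When is my invoice issued?",
    ],
)

results = index.retrieve("I forgot my password, how can I reset it?")
print(results)
```

The full text of the `auth` document never mentions "password", so searching the raw text with this toy embedding would not favor it; the hypothetical-question view is what makes the match. A document summary index is the special case where each document has exactly one view, its summary.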