QA over complex documents and images

* https://medium.aiplanet.com/multimodal-rag-using-langchain-expression-language-and-gpt4-vision-8a94c8b02d21 (2023/12)
    * Three approaches:
        * image embeddings (OpenCLIP)
        * summarize images into text; use the text summary at generation time
        * summarize images into text; use the original image at generation time
    * uses Unstructured to extract text and images from PDFs
* https://levelup.gitconnected.com/multimodal-rag-pipeline-three-ways-to-build-it-169782cffaca (2024/11)
* paper: Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications
    * https://arxiv.org/abs/2410.21943
    * https://x.com/omarsar0/status/1851479149690642456 (2024/10/30)
* https://huggingface.co/blog/Kseniase/html-multimodal-agentic-rag

## langchain

https://twitter.com/RLanceMartin/status/1713638963255366091
https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb
https://www.analyticsvidhya.com/blog/2024/09/guide-to-building-multimodal-rag-systems/ (2024/9/30)

## Slide deck experiments

* 2023/12/8
    * https://twitter.com/rlancemartin/status/1732862263026135197
    * https://blog.langchain.dev/multi-modal-rag-template/
    * image OpenCLIP embeddings
    * GPT-4V image summaries ([[Multi-Vector Retriever]]); recommended as the best-performing option, though generating the summaries up front adds cost

## llamaindex

- [ ] https://docs.llamaindex.ai/en/latest/module_guides/models/multi_modal.html# (some resources compiled at the bottom)

![image](https://pbs.twimg.com/media/F-mW2HnbQAA2LPz?format=jpg&name=4096x4096)
https://twitter.com/jerryjliu0/status/1723076174698672417

* blog https://blog.llamaindex.ai/multi-modal-rag-621de7525fea
* tutorial tweet https://twitter.com/clusteredbytes/status/1727741878970314964 (2023/11/24)
    * ![image](https://pbs.twimg.com/media/F_osFE7WQAAvlof?format=jpg&name=4096x4096)
* a MultimodalQueryEngine example https://x.com/llama_index/status/1835813824005583313 (2024/9/17)
* https://levelup.gitconnected.com/exploring-multimodal-rag-with-llamaindex-and-gpt-4-or-the-new-anthropic-sonnet-model-96705c877dbb
    * LlamaParse handles converting images into Markdown text
    * at the generation stage, text + images are given to the model together
* now offered as a LlamaCloud service: https://www.llamaindex.ai/blog/multimodal-rag-in-llamacloud

## ColPali

see the [[Colpali]] entry
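The Multi-Vector Retriever pattern noted above (index text summaries of images for search, but return the original image at generation time) can be sketched in a few lines. This is a toy in-memory version under stated assumptions: all names (`MultiVectorStore`, `add`, `retrieve`) are illustrative, not any library's API, and the word-overlap scorer stands in for the embedding similarity (e.g. OpenCLIP) a real system would use.

```python
from dataclasses import dataclass, field

@dataclass
class MultiVectorStore:
    # doc_id -> searchable summary text (what gets "embedded" and matched)
    summaries: dict = field(default_factory=dict)
    # doc_id -> original content (e.g. an image path or raw bytes)
    originals: dict = field(default_factory=dict)

    def add(self, doc_id, summary, original):
        self.summaries[doc_id] = summary
        self.originals[doc_id] = original

    def retrieve(self, query, k=1):
        # Toy relevance score: count of overlapping lowercase words.
        # A real implementation compares embedding vectors instead.
        q = set(query.lower().split())
        ranked = sorted(
            self.summaries,
            key=lambda d: len(q & set(self.summaries[d].lower().split())),
            reverse=True,
        )
        # Key point of the pattern: return the ORIGINALS, not the summaries.
        return [self.originals[d] for d in ranked[:k]]

store = MultiVectorStore()
store.add("slide-3", "bar chart of quarterly revenue by region", "slide-3.png")
store.add("slide-7", "architecture diagram of the RAG pipeline", "slide-7.png")
print(store.retrieve("revenue chart"))  # the image reference, not its summary
```

Search quality then depends only on the text summaries, which is why the slide-deck experiment found this variant works best despite the up-front summarization cost.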
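Several of the linked pipelines pass text and images to the model together at the generation stage. A minimal sketch of packaging retrieved text plus a retrieved image into one OpenAI-style multimodal chat message follows; the helper name and the placeholder inputs are assumptions for illustration, and no API call is made here.

```python
import base64

def build_multimodal_message(question, context_text, image_bytes):
    # Encode the retrieved image as a base64 data URL and combine it with
    # the retrieved text context in a single user message, following the
    # OpenAI-style list-of-content-parts chat format.
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Context:\n{context_text}\n\nQuestion: {question}"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

msg = build_multimodal_message(
    "What does the chart show?",
    "Slide 3: quarterly revenue by region.",
    b"fake png bytes",  # placeholder; a real pipeline reads the image file
)
```

The same message dict can then be sent as one element of the `messages` list in a vision-capable chat completion request.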