Long Context Window - ihower's Notes

* https://www.facebook.com/ihower/posts/10160974050143971
* https://twitter.com/ihower/status/1763746152846163972
* Please Stop Saying Long Context Windows Will Replace RAG (2024/3/18)
    * https://cobusgreyling.medium.com/please-stop-saying-long-context-windows-will-replace-rag-3cd111cfb247
* https://blog.orangesai.com/p/technological-ripple-effect-rag-and-long-context-cognitive-conflict
* https://twitter.com/RLanceMartin/status/1770559065955205302 (2024/3/21)
    * Need to balance system complexity vs latency & token usage
* Ultra-long context in large models is a "dimensionality-reduction strike" on RAG (大模型超长上下文对 RAG 是降维打击) (2024/2/29)
    * https://twitter.com/iheycc/status/1763194137531298300
    * https://heycc.notion.site/RAG-e0c30da6c2904c3599b582b978c31de1
* The Ripple Effect of Technology: The Cognitive Conflict Between RAG and Long Context (技术的涟漪效应：RAG与Long Context的认知冲突) (2024/4/7)
    * https://quail.ink/orange/p/technological-ripple-effect-rag-and-long-context-cognitive-conflict
* https://twitter.com/bindureddy/status/1788684124863303709 (2024/5/10)
    * Infinite context in LLMs is close to largely useless.
* LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs (2024/6/30)
    * https://arxiv.org/abs/2406.15319
    * https://x.com/llama_index/status/1818802688274100578
    * https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-longrag/examples/longrag.ipynb
    * Raising the chunk size to 4k and lowering top-k also works fairly well.
    * The embedding model cannot take inputs that long, though, so each large chunk still has to be split into small chunks. The highest similarity score among the small chunks is then used as the score of the whole large chunk.
* Long Context RAG Performance of LLMs (2024/8/12)
    * https://www.databricks.com/blog/long-context-rag-performance-llms
    * For most models there is a saturation point beyond which performance declines.
* The "Why RAG" argument (2024/10/30)
    * https://unstructured.io/blog/rag-vs-long-context-models-do-we-still-need-rag

## Papers

* Benchmarking Large Language Models in Retrieval-Augmented Generation (2023/12)
    * Probe (needle) tests are only the baseline. For RAG applications the LLM itself should also have the following abilities:
        * Resist noise, integrate information, answer "I don't know", and recognize incorrect facts.
    * https://arxiv.org/abs/2309.01431v2
* Long-context LLMs Struggle with Long In-context Learning https://arxiv.org/abs/2404.02060 (2024/4/2)
    * The LongICLBench benchmark
    * https://twitter.com/omarsar0/status/1775638933377786076 (2024/4/4)
    * Apart from GPT-4, every other model's performance drops sharply beyond 20k.
        * ihower: On a closer read, the paper's model set includes Gemini-1.0-Pro, Gemma-7B, Llama-2-7B, Mistral-7B, and a batch of open-source models, so the competition feels too weak.
    * Models also tend to predict the labels that appear at the end of the sequence.
* RULER: What's the Real Context Size of Your Long-Context Language Models? (2024/4/9)
    * https://arxiv.org/abs/2404.06654
    * RULER is Nvidia's new synthetic benchmark. It covers four task categories: retrieval, multi-hop tracing, aggregation, and question answering. Every task can be configured for different lengths and complexity.
    * https://twitter.com/GregKamradt/status/1778427541461852439
    * https://github.com/hsiehjackson/RULER has the latest scores and conclusions.
* Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach (2024/7)
    * https://arxiv.org/abs/2407.16833
    * https://x.com/omarsar0/status/1816495687984709940
    * https://x.com/bindureddy/status/1816912246146318806
    * Long-context performs better, but RAG costs significantly less.
    * The idea behind the hybrid Self-Route method is a good one and worth learning from:
        * Always start with retrieval to find the most relevant chunks.
        * Use a prompt such as the following to judge whether the chunks can answer the user's question:

            > You are given some text chunks retrieved based on a query. Based only on these chunks:
            >
            > 1. Determine if the query can be answered using the provided information.
            > 2. If answerable, provide a concise answer, preferably in a single phrase or sentence.
            > 3. If not answerable based solely on the given chunks, write "unanswerable".
            >
            > Retrieved text chunks: {retrieved_chunks}
            >
            > Query: {user_query}
            >
            > Answer:

        * If the result is "unanswerable", fall back to long-context and pass in the full text. If the question can be answered, use only the chunks.
        * With Gemini-1.5-Pro, about 82% of queries are resolved this way and only 18% need the long-context fallback. This greatly reduces the number of tokens to process and lowers cost.
* Long Context RAG Performance of LLMs
    * https://www.databricks.com/blog/long-context-rag-performance-llms
    * https://ihey.cc/rag/long-context-rag-performance-llms/
    * For most models there is a saturation point beyond which performance declines, for example 16k for gpt-4-turbo and claude-3-sonnet, 4k for mixtral-instruct, and 8k for dbrx-instruct.
    * Recent models such as gpt-4o, claude-3.5-sonnet, and gpt-4o-mini have improved long-context behavior and show almost no degradation as context length grows.
* In Defense of RAG in the Era of Long-Context Language Models (2024/9)
    * https://arxiv.org/abs/2409.01666
    * https://ihey.cc/rag/in-defense-of-rag-in-the-era-of-long-context-language-models/
    * Uses Llama3.1-8B & Llama3.1-70B
* Inference Scaling for Long-Context Retrieval Augmented Generation (2024/10)
    * Uses Gemini 1.5 Flash
    * Notes are collected in [[RAG 開發知識庫]]
* Long Context vs. RAG for LLMs: An Evaluation and Revisits (2024/12)
    * https://arxiv.org/abs/2501.01880
    * Surveys the papers on long context
    * Recommends [[RAPTOR]]
* BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack (2024/11)
    * https://arxiv.org/abs/2406.10149
    * https://x.com/cwolferesearch/status/1869409312239415696 (2024/12/18)
    * LLMs effectively use only 10-20% of the context, and performance drops sharply as reasoning complexity increases.
    * Most LLMs struggle to answer factual questions about texts longer than 10,000 tokens.
* Fiction.LiveBench https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

## Needle tests

* Arize's The Needle In a Haystack Test https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/
* Two-needle test https://twitter.com/mosh_levy/status/1762027624434401314 (2024/2/26)
    * paper: Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2024/2/19) https://arxiv.org/abs/2402.14848
* https://twitter.com/LangChainAI/status/1769072214878650377
* Multi Needle in a Haystack https://blog.langchain.dev/multi-needle-in-a-haystack/ (2024/3/13)
* https://twitter.com/RLanceMartin/status/1769439202008060098 (2024/3/18)
* paper: https://arxiv.org/abs/2310.01427 (2023/9/28)
* https://twitter.com/aparnadhinak/status/1757073620612923785 (2024/2/23)
* https://twitter.com/aparnadhinak/status/1766161976529711298 (2024/3/9)
    * Another test, but I did not grasp its main point.
* Counting-stars test https://twitter.com/9hills/status/1775353958472794391
* https://twitter.com/LouisKnightWebb/status/1778105204941988244 (2024/4/11)

## Other evaluations

* LOFT: Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? (2024/6)
    * https://arxiv.org/abs/2406.13121v1
* Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries (2024/9)
    * https://arxiv.org/abs/2409.12640
    * https://x.com/oran_ge/status/1837691213597888774
    * This evaluation was done by Google, and Gemini ranks first on it.
* https://x.com/_philschmid/status/1845388446354792813 (2024/10/13)
    * ![image](https://pbs.twimg.com/media/GZwi9psXsA0SOFo?format=jpg&name=medium)
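
The Self-Route steps noted under Papers (retrieve first, ask the model whether the chunks suffice, fall back to the full text only on "unanswerable") can be sketched roughly as below. This is a minimal illustration, not the paper's code: `retrieve` and `llm` are hypothetical callables supplied by the caller, `LONG_CONTEXT_PROMPT` is an assumed fallback prompt, and the substring check for "unanswerable" is a simplification.

```python
from typing import Callable, Sequence, Tuple

# Router prompt as quoted in the notes above.
ROUTER_PROMPT = """You are given some text chunks retrieved based on a query. Based only on these chunks:
1. Determine if the query can be answered using the provided information.
2. If answerable, provide a concise answer, preferably in a single phrase or sentence.
3. If not answerable based solely on the given chunks, write "unanswerable".

Retrieved text chunks: {retrieved_chunks}
Query: {user_query}
Answer:"""

# Assumption: the fallback prompt wording is made up for this sketch.
LONG_CONTEXT_PROMPT = """Answer the query using the full document below.

Document: {document}
Query: {user_query}
Answer:"""


def self_route(
    user_query: str,
    document: str,
    retrieve: Callable[[str], Sequence[str]],  # hypothetical retriever: query -> chunks
    llm: Callable[[str], str],                 # hypothetical LLM call: prompt -> text
) -> Tuple[str, str]:
    """Return (answer, route), where route is "rag" or "long-context"."""
    # Step 1: always retrieve the most relevant chunks first.
    chunks = retrieve(user_query)

    # Step 2: let the model answer from the chunks or decline.
    prompt = ROUTER_PROMPT.format(
        retrieved_chunks="\n\n".join(chunks), user_query=user_query
    )
    answer = llm(prompt).strip()

    # Step 3: keep the cheap RAG answer unless the model declined.
    # Simplification: a real answer containing this word would also fall back.
    if "unanswerable" not in answer.lower():
        return answer, "rag"

    # Step 4: fall back to long-context with the full text.
    full_prompt = LONG_CONTEXT_PROMPT.format(document=document, user_query=user_query)
    return llm(full_prompt).strip(), "long-context"
```

Passing the retriever and the model in as plain callables keeps the routing logic independent of any particular vector store or LLM SDK. Logging the returned route makes it possible to measure how often the fallback fires on a given workload.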
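
The LongRAG note above (split each large chunk into small chunks for the embedding model, then score the large chunk by its best-matching small chunk) can be illustrated with the following sketch. It is not the LongRAG or LlamaIndex implementation: `embed` is a hypothetical embedding callable, and chunk size is measured in characters here as a stand-in for tokens.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_small(text: str, size: int) -> List[str]:
    """Cut a large chunk into fixed-size pieces (characters, as a token stand-in)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def score_large_chunks(
    query_vec: Vector,
    large_chunks: Sequence[str],
    embed: Callable[[str], Vector],  # hypothetical embedding function
    small_size: int = 512,           # assumed limit of what the embedder handles well
) -> List[float]:
    """Score each large chunk by the maximum similarity over its small pieces."""
    scores = []
    for chunk in large_chunks:
        pieces = split_small(chunk, small_size)
        best = max((cosine(query_vec, embed(p)) for p in pieces), default=0.0)
        scores.append(best)
    return scores


def top_k_large_chunks(
    query_vec: Vector,
    large_chunks: Sequence[str],
    embed: Callable[[str], Vector],
    k: int = 2,
    small_size: int = 512,
) -> List[str]:
    """Return the k large chunks with the highest max-piece score."""
    scores = score_large_chunks(query_vec, large_chunks, embed, small_size)
    ranked = sorted(zip(scores, large_chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Taking the maximum rather than the mean lets one highly relevant passage surface its whole large chunk even when the rest of that chunk is unrelated to the query. The full large chunk is what gets passed to the long-context model.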