## ModernBERT (2024/12)
* https://huggingface.co/blog/modernbert
* ModernBERT’s training data is primarily English and code, so performance may be lower for other languages.
> Mini-review: since it is relatively niche, initial tests suggest its performance on Chinese is not great (2024/3)
## ColBERT
https://github.com/stanford-futuredata/colbert
https://colbert.aiserv.cloud/ (in-browser demo)
Compared with an embedding model, this is a model specialized for retrieval: the input is a query and the output is the retrieval results.
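The scoring behind this is late interaction: every query token embedding is matched against its most similar document token embedding (MaxSim), and the per-token maxima are summed. A minimal numpy sketch, with illustrative names that are not from the ColBERT codebase:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the maximum
    cosine similarity against any document token, then sum over query tokens."""
    # L2-normalize token embeddings so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum
```

Unlike a single-vector embedding model, query and document are compared token-by-token at query time, which is why ColBERT behaves like a retriever rather than a plain nearest-neighbor lookup.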
https://chat.openai.com/share/18a8d373-3304-4930-80e2-d246c228bea4
https://twitter.com/lateinteraction/status/1736804963760976092
why https://twitter.com/jobergum/status/1750592502418280469
https://twitter.com/LangChainAI/status/1774117175089144215 (2024/3/31) appeared in episode 14 of RAG From Scratch
- https://til.simonwillison.net/llms/colbert-ragatouille
- ColBERT can be used as a Retriever, though the initial indexing is somewhat time-consuming
- It can also be used as a Reranker, which is even more practical
* https://github.com/bclavie/RAGatouille/blob/main/examples/04-reranking.ipynb
> Why not just use rerank() over the whole collection if it is so good? Well, you could, but it would not be very efficient. ColBERT is a very fast querier, but it needs a prebuilt index to do so. When you use ColBERT to rerank documents, it runs index-free, meaning it must encode all the documents and the query and compare them on the fly. This is acceptable for a small number of documents on CPU, or a few hundred on GPU, but as you add more documents the processing time quickly adds up!
* ColBERT-XM
* https://twitter.com/antoinelouis_/status/1762886806792511655
* https://huggingface.co/antoinelouis/colbert-xm
* Alibaba ModernBERT-base
* https://x.com/tomaarsen/status/1882053727437406375 (2025/1/22)
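The index-vs-rerank tradeoff quoted above can be made concrete with a toy cost model. `CountingEncoder` is a hypothetical stand-in that only counts forward passes, not real ColBERT code; the point is that a prebuilt index encodes each document once, while index-free reranking re-encodes every candidate for every query:

```python
class CountingEncoder:
    """Hypothetical stand-in for a ColBERT encoder that counts forward passes."""
    def __init__(self) -> None:
        self.calls = 0

    def encode(self, text: str) -> str:
        self.calls += 1
        return text  # placeholder for the real token embeddings

def indexed_search(encoder: CountingEncoder, documents: list[str], queries: list[str]) -> None:
    # One-time indexing cost: each document is encoded exactly once
    index = [encoder.encode(d) for d in documents]
    for q in queries:
        encoder.encode(q)  # per query, only the query itself is encoded

def index_free_rerank(encoder: CountingEncoder, documents: list[str], queries: list[str]) -> None:
    # Index-free reranking: every candidate is re-encoded for every query
    for q in queries:
        encoder.encode(q)
        for d in documents:
            encoder.encode(d)

# 1,000 documents, 10 queries:
a, b = CountingEncoder(), CountingEncoder()
indexed_search(a, ["doc"] * 1000, ["q"] * 10)      # 1000 + 10 = 1010 encodes
index_free_rerank(b, ["doc"] * 1000, ["q"] * 10)   # 10 * 1001 = 10010 encodes
```

This is why rerank() stays practical for a few dozen to a few hundred candidates, as the RAGatouille notebook suggests, but not for a whole collection.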
## Jina ColBERT
The only multilingual ColBERT-style model
https://jina.ai/news/jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking/
https://x.com/nanwang_t/status/1829530983097659451
License: CC BY-NC 4.0
Deep Research (2025/2/25): https://chatgpt.com/c/67af7594-b1a0-8008-85a7-96d0e5391d23
## Reranker and Cross-Encoder
https://twitter.com/hwchase17/status/1745171912358060124
https://twitter.com/virattt/status/1749166976033861832
https://twitter.com/lateinteraction/status/1749254639948829175
ColBERT is basically the only “reranker” that can directly search a billion passages.
## Theory
https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search
https://twitter.com/cwolferesearch/status/1747689404062126246
## RAGatouille
https://github.com/bclavie/RAGatouille is a higher-level wrapper, but its chunking uses llama_index_sentence_splitter
Usage tweet: https://twitter.com/lateinteraction/status/1745156404883960034
langchain: https://twitter.com/hwchase17/status/1743029845297401882
llamaindex:
https://twitter.com/llama_index/status/1743076579302105338
https://twitter.com/jerryjliu0/status/1743077679258320925
Example: https://github.com/aigeek0x0/rag-with-langchain-colbert-and-ragatouille
## mxbai-colbert-large-v1
https://www.mixedbread.ai/blog/mxbai-colbert-large-v1 2024/3/19
## Vespa Long-Context ColBERT
https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/ 2024/3/1
## Jina-ColBERT
https://mp.weixin.qq.com/s?__biz=MzkyODIxMjczMA==&mid=2247500814&idx=1&sn=6bd26ba155ce766548fa3af0731706a9&chksm=c21eb79bf5693e8d3e3a3c913dfe78e6f281521bc67e2a967216fa12d747afab49287c33a06c#rd
The core improvement is adopting jina-bert-v2-base-en as the base model...
https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/ 2024/2/20
## nanoColBERT
https://twitter.com/lateinteraction/status/1749634715886440463 2024/1/23
https://github.com/Hannibal046/nanoColBERT