* My Traditional Chinese benchmark: https://ihower.tw/blog/archives/12167
* [[繁體中文 Embedding 和 Reranker 模型評測]]
* Introductory overview: https://towardsdatascience.com/text-embeddings-comprehensive-guide-afd97fce8fb5
* Step-by-Step Guide to Choosing the Best Embedding Model for Your Application (2024/6/4)
  * https://weaviate.io/blog/how-to-choose-an-embedding-model
* SentenceTransformers embedding
  * [[Advanced Retrieval for AI with Chroma]]
* FlagEmbedding (BGE): https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md
* Voyage: https://blog.voyageai.com/2023/10/29/voyage-embeddings/ (2023/10/29)
  * https://docs.voyageai.com/embeddings/
* Cohere
  * embed v3: https://txt.cohere.com/introducing-embed-v3/ (2023/11/2)
  * Not dimensionality reduction, but storing embeddings as int8: https://txt.cohere.com/int8-binary-embeddings/ (2024/3/8)
  * Compass: https://txt.cohere.com/compass-beta/
    * Seems to just let you store extra metadata for easier filtering? Requires their Compass SDK
* Jina: https://jina.ai/embeddings/
* Vertex AI
  * https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings
    * Requires passing a task_type parameter
  * https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-multimodal-embeddings (multimodal)
    * Image embeddings and text embeddings live in the same semantic space and have the same dimensionality, so they can be used interchangeably, e.g. searching images by text or searching videos by image.
* Nomic Embed
  * https://blog.nomic.ai/posts/nomic-embed-text-v1
  * https://blog.nomic.ai/posts/nomic-embed-matryoshka
  * Open source
  * The blog claims it beats text-embedding-3-small
  * Nomic Embed v1.5 supports any embedding dimension between 64 and 768
  * https://blog.nomic.ai/posts/nomic-embed-vision supports vision (multimodal)
    * https://x.com/nomic_ai/status/1798368463292973361
  * v2: https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe
* Mastering RAG: How to Select an Embedding Model: https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model
* Google
  * Gecko: https://twitter.com/leejnhk/status/1775864728742768867 (2024/4/4)
  * https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#latest_models
    * `text-embedding-preview-0409`
    * `text-multilingual-embedding-preview-0409`
* Best Embedding Model 🌟 — OpenAI / Cohere / Google / E5 / BGE (2024/4/8)
  * https://medium.com/@lars.chr.wiik/best-embedding-model-openai-cohere-google-e5-bge-931bfa1962dc
* Scaling Test-Time Compute For Embedding Models
  * https://jina.ai/news/scaling-test-time-compute-for-embedding-models/ (2024/12/13)
  * Breaks the query into many detailed sub-questions and computes similarity against each
* Nvidia NV-Embed v2
  * https://huggingface.co/nvidia/NV-Embed-v2
  * 7.85B parameters, CC-BY-NC license
  * Too big to run on my local machine
* Chuxin-Embedding: https://huggingface.co/chuxin-llm/Chuxin-Embedding
* model2vec: https://github.com/MinishLab/model2vec
  * M2V_multilingual_output: https://huggingface.co/minishlab/M2V_multilingual_output

## Fine-tuning embedding models

* https://modal.com/blog/fine-tuning-embeddings (2024/4/26)
* https://www.philschmid.de/fine-tune-embedding-model-for-rag (2024/6/4)
* https://x.com/9hills/status/1850139724201304216 (2024/10/26)
  * https://github.com/ninehills/blog/issues/118

## Long context

Together released a 32k-context embedding model: https://www.together.ai/blog/embeddings-endpoint-release

https://twitter.com/llama_index/status/1748878917195485358 mentions a hybrid strategy: compute chunk similarity and whole-document similarity (the latter via this long-context embedding model), then mix the two into a single relevance score (see the sketch below).

> https://chat.openai.com/share/e/76da589c-2f40-4c9d-976f-bb33dc22782e
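The tweet does not spell out how the two similarities are combined, so the following is only a minimal sketch assuming a simple weighted sum over cosine similarities; the weight `alpha` and the random placeholder vectors are my own, not from the source.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def hybrid_score(query_emb, chunk_emb, doc_emb, alpha=0.7):
    """Blend chunk-level and whole-document similarity into one score.

    `alpha` is a hypothetical weight: the tweet only says the two
    similarities are mixed; a weighted sum is one plausible way to do it.
    """
    return alpha * cosine(query_emb, chunk_emb) + (1 - alpha) * cosine(query_emb, doc_emb)


# Toy usage: random vectors stand in for real embeddings. In practice,
# chunk_emb comes from a regular embedding model over one chunk, and
# doc_emb from a long-context model (e.g. a 32k model) over the full document.
rng = np.random.default_rng(0)
query_emb, chunk_emb, doc_emb = rng.normal(size=(3, 768))
print(hybrid_score(query_emb, chunk_emb, doc_emb))
```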
Long-Context Retrieval Models with Monarch Mixer: https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval

Still Need Chunking When Long-Context Models Can Do It All? (2024/12/5): https://jina.ai/news/still-need-chunking-when-long-context-models-can-do-it-all/

## Multimodal

* Google Vertex multimodal embeddings
  * https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings
* Jina CLIP (freely licensed)
  * v1
    * https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image/
    * https://huggingface.co/jinaai/jina-clip-v1
    * https://jina.ai/news/beyond-clip-how-jina-clip-advances-multimodal-search/
  * v2 (multilingual)
    * https://x.com/JinaAI_/status/1859659764281782420 (2024/11/22)
    * https://jina.ai/news/jina-clip-v2-multilingual-multimodal-embeddings-for-text-and-images/
* Cohere Embed 3
  * https://cohere.com/blog/multimodal-embed-3
* voyage-multimodal-3
  * https://blog.voyageai.com/2024/11/12/voyage-multimodal-3
  * https://colab.research.google.com/drive/12aFvstG8YFAWXyw-Bx5IXtaOqOzliGt9?_kx=t6RuXq9Bd9j9So4MpeUUWRyi4FMAXW29m3WdZDgNipI.VU3S4W
  * This one can embed text and images together in a single input; the others only embed text or images separately
* vdr-2b-multi-v1
  * https://x.com/jerryjliu0/status/1878491439291842618
  * https://huggingface.co/blog/vdr-2b-multilingual
  * https://huggingface.co/llamaindex/vdr-2b-multi-v1
* Nvidia
  * https://huggingface.co/nvidia/MM-Embed
* nomic-embed-vision-v1.5
  * https://www.nomic.ai/blog/posts/nomic-embed-vision
  * https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5

## Instructor Large Dense Embedding

For domain-specific use cases, you supply an instruction at embedding time to steer the embedding, getting a fine-tuning-like effect without actually fine-tuning.

* https://instructor-embedding.github.io/
* https://github.com/xlang-ai/instructor-embedding
* Example: https://towardsdatascience.com/the-untold-side-of-rag-addressing-its-challenges-in-domain-specific-searches-808956e3ecc8

## Retriever visualization

https://github.com/gabrielchua/RAGxplorer

## Evaluation

llamaindex's approach: https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83 (see the hit rate / MRR sketch at the end of this section)

MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard

paper: https://arxiv.org/abs/2210.07316
code: https://github.com/embeddings-benchmark/mteb/tree/main
Leonie's overview (4/21): https://twitter.com/helloiamleonie/status/1782061776961507679

BEIR: https://github.com/beir-cellar/beir

* Related datasets
  * https://huggingface.co/blog/mteb
  * https://huggingface.co/C-MTEB (Simplified Chinese)
  * https://huggingface.co/datasets/MediaTek-Research/TCEval-v2 looks usable too
* Process
  * https://towardsdatascience.com/how-to-find-the-best-multilingual-embedding-model-for-your-rag-40325c308ebb
  * https://srk.ai/blog/004-ai-llm-retrieval-eval-llamaindex
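The llamaindex post above scores retrievers with hit rate and MRR. Below is a minimal, self-contained sketch of those two metrics, assuming one gold document per query and L2-normalized embeddings; the random arrays are placeholders for a real model's output.

```python
import numpy as np


def evaluate_retriever(query_embs, doc_embs, gold_doc_ids, k=5):
    """Hit rate@k and MRR, assuming one relevant (gold) document per query
    and L2-normalized embeddings so the dot product is cosine similarity."""
    sims = query_embs @ doc_embs.T        # (n_queries, n_docs) similarity matrix
    ranking = np.argsort(-sims, axis=1)   # document indices, best-first, per query
    hits, reciprocal_ranks = 0, 0.0
    for i, gold in enumerate(gold_doc_ids):
        rank = int(np.where(ranking[i] == gold)[0][0])  # 0-based rank of the gold doc
        hits += rank < k
        reciprocal_ranks += 1.0 / (rank + 1)
    n = len(gold_doc_ids)
    return hits / n, reciprocal_ranks / n


# Toy usage: random, normalized vectors stand in for a real embedding model's output.
rng = np.random.default_rng(0)
queries = rng.normal(size=(10, 64))
docs = rng.normal(size=(200, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
hit_rate, mrr = evaluate_retriever(queries, docs, gold_doc_ids=list(range(10)), k=5)
print(f"hit_rate@5={hit_rate:.2f}  MRR={mrr:.3f}")
```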