https://www.deeplearning.ai/short-courses/advanced-retrieval-for-ai/

Key topics:

* Query Expansion
    * expansion with generated answers
    * expansion with multiple queries
* Re-ranker
* Embedding adaptors

## Overview of embeddings-based retrieval

![[Pasted image 20240114151232.png]]

* Use RecursiveCharacterTextSplitter to split the document into chunks
* Then use SentenceTransformersTokenTextSplitter, because the SentenceTransformers embedding model used later accepts at most 256 tokens (tokens, not characters)

![[Pasted image 20240114162458.png]]

## Pitfalls of retrieval - when simple vector search fails

* UMAP is used to reduce the embeddings to two dimensions so they can be plotted

![[Pasted image 20240114171201.png]]
![[Pasted image 20240114171217.png]]

* The user query is the red X
* The green circles are the chunks that were actually retrieved

![[Pasted image 20240114172811.png]]
![[Pasted image 20240114172949.png]]

Here there are more distractors: the results are more scattered and only weakly relevant, which pulls the model's attention away from what matters.

![[Pasted image 20240114173021.png]]

This question has nothing to do with the document, yet by design the retriever still returns results; even though relevance is low, everything returned is a distractor.

![[Pasted image 20240114173032.png]]

## Query Expansion

### Expansion with generated answers

![[Pasted image 20240114190523.png]]

* https://arxiv.org/abs/2305.03653
* Prompt: You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report.
* Essentially we ask the model to hallucinate an answer.... XD
* Search with joint_query = f"{original_query} {hypothetical_answer}"
* This feels a bit odd; it is not quite how the original [[HyDE Retriever]] works

In the plot, the original query is red and the query + hypothetical answer is orange.

![[Pasted image 20240114193856.png]]

### Expansion with multiple queries

![[Pasted image 20240114204913.png]]

Generate up to five additional related questions with this prompt:

> You are a helpful expert financial research assistant. Your users are asking questions about an annual report. Suggest up to five additional related questions to help them find the information they need, for the provided question. Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. Make sure they are complete questions, and that they are related to the original question. Output one question per line. Do not number the questions.

Chroma's query_texts parameter accepts multiple query texts at once and returns one sub-list of results per query, so the combined results still need re-ranking afterwards.

![[Pasted image 20240114205058.png]]

> Consolidated into [[Query Transformation 策略]]

## Cross-encoder re-ranking

Retrieve more results than needed, e.g. 10, then re-rank them and keep only the top 5 (see the sketch at the end of this section).

![[Pasted image 20240114222322.png]]
![[Pasted image 20240114223726.png]]
![[Pasted image 20240114223804.png]]
![[Pasted image 20240114223956.png]]

Continuing from the previous section:

![[Pasted image 20240114224241.png]]
![[Pasted image 20240114224255.png]]
![[Pasted image 20240114224318.png]]
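A minimal sketch of this retrieve-then-re-rank flow, assuming the chroma_collection built by the load_chroma helper (see the Notes section below) and a sentence-transformers cross-encoder such as cross-encoder/ms-marco-MiniLM-L-6-v2; the query string is only an example:

```python
import numpy as np
from sentence_transformers import CrossEncoder

query = "What was the total revenue?"  # example query against an annual report

# First stage: over-retrieve, e.g. 10 candidates instead of the 5 we actually want.
results = chroma_collection.query(query_texts=[query], n_results=10)
retrieved_documents = results["documents"][0]

# Second stage: score every (query, document) pair with a cross-encoder.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([[query, doc] for doc in retrieved_documents])

# Keep only the top 5 documents by cross-encoder score.
top_indices = np.argsort(scores)[::-1][:5]
top_documents = [retrieved_documents[i] for i in top_indices]
```

The cross-encoder reads each query-document pair jointly, which is slower than bi-encoder retrieval but more accurate, hence the two-stage over-retrieve-then-re-rank design.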
## Embedding adaptors

Use feedback to automatically improve the embeddings.

![[Pasted image 20240114230358.png]]

Generate a dataset of queries with this prompt:

> You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. Suggest 10 to 15 short questions that are important to ask when analyzing an annual report. Do not output any compound questions (questions with multiple sentences or conjunctions). Output each question on a separate line divided by a newline.

Then evaluate relevance with an LLM (real user feedback would be even better):

> You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. For the given query, evaluate whether the following statement is relevant. Output only 'yes' or 'no'.

"yes" is converted to 1 and "no" to -1. The query embeddings, the retrieved embeddings, and the evaluation scores (1 or -1) are then used to train the adaptor model.

![[Pasted image 20240114231107.png]]
![[Pasted image 20240114231302.png]]

Once the best_matrix is found, subsequent query embeddings are adjusted with np.matmul before use. The new (green) embeddings are noticeably more concentrated.

## Other Techniques

![[Pasted image 20240114232130.png]]

* Fine-tune the embedding model directly
* Fine-tune the LLM for retrieval
    * https://arxiv.org/abs/2310.01352
    * https://arxiv.org/abs/2310.07713
* More complex embedding adaptors, backed by a full neural network
* More complex relevance models (re-rankers)
* More complex chunking, using a model to split the text intelligently

## Notes

helper_utils.py:

```python
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import numpy as np
from pypdf import PdfReader
from tqdm import tqdm


def _read_pdf(filename):
    # Extract the text of each PDF page and drop empty pages
    reader = PdfReader(filename)
    pdf_texts = [p.extract_text().strip() for p in reader.pages]
    return [text for text in pdf_texts if text]


def _chunk_texts(texts):
    # Split by characters first, then re-split into 256-token chunks
    # to match the embedding model's maximum sequence length
    character_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=1000,
        chunk_overlap=0
    )
    character_split_texts = character_splitter.split_text('\n\n'.join(texts))

    token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)
    token_split_texts = []
    for text in character_split_texts:
        token_split_texts += token_splitter.split_text(text)

    return token_split_texts


def load_chroma(filename, collection_name, embedding_function):
    # Read a PDF, chunk it, and load the chunks into a new Chroma collection
    texts = _read_pdf(filename)
    chunks = _chunk_texts(texts)

    chroma_client = chromadb.Client()
    chroma_collection = chroma_client.create_collection(name=collection_name, embedding_function=embedding_function)

    ids = [str(i) for i in range(len(chunks))]
    chroma_collection.add(ids=ids, documents=chunks)

    return chroma_collection


def word_wrap(string, n_chars=72):
    # Wrap a string at the last space before n_chars, recursively
    if len(string) < n_chars:
        return string
    else:
        return string[:n_chars].rsplit(' ', 1)[0] + '\n' + word_wrap(string[len(string[:n_chars].rsplit(' ', 1)[0])+1:], n_chars)


def project_embeddings(embeddings, umap_transform):
    # Project embeddings to 2D with an already-fitted UMAP transform
    umap_embeddings = np.empty((len(embeddings), 2))
    for i, embedding in enumerate(tqdm(embeddings)):
        umap_embeddings[i] = umap_transform.transform([embedding])
    return umap_embeddings
```
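A minimal usage sketch for these helpers; the PDF filename, collection name, and query below are placeholders (the course works against an annual report):

```python
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

from helper_utils import load_chroma, word_wrap

# Default SentenceTransformers model ("all-MiniLM-L6-v2"), which accepts at most
# 256 tokens per input - the reason the chunks above are capped at 256 tokens.
embedding_function = SentenceTransformerEmbeddingFunction()

# Placeholder filename; any text-based PDF works.
chroma_collection = load_chroma(
    filename="annual_report.pdf",
    collection_name="annual_report",
    embedding_function=embedding_function,
)

results = chroma_collection.query(query_texts=["What was the total revenue?"], n_results=5)
for doc in results["documents"][0]:
    print(word_wrap(doc))
    print()
```

word_wrap is only used to print the retrieved chunks in a readable width.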