Dense retrieval using vectorized propositions

* https://twitter.com/jerryjliu0/status/1735103631391948909 2023/12/14
* https://twitter.com/LangChainAI/status/1735708004618764470 2023/12/16
* https://templates.langchain.com/new?integration_name=propositional-retrieval
* https://twitter.com/clusteredbytes/status/1750600846985892008 2024/1/26

> The implementation looks a lot like [[Multi-Vector Retriever]]: you just store one more version of each document. The main pain point is that the preprocessing is expensive. Sure enough, the langchain implementation is also built on multi-vector.

langchain template prompt: https://github.com/langchain-ai/langchain/blob/master/templates/propositional-retrieval/propositional_retrieval/proposal_chain.py

```
Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Example:
```

## Paper: Dense x Retrieval: What Retrieval Granularity Should We Use?

https://arxiv.org/abs/2312.06648

Author's explanation: https://twitter.com/tomchen0/status/1736232084514209994

1. A new retrieval unit: the "proposition".
2. Each proposition should correspond to a distinct piece of meaning in the text, and the propositions taken together should represent the semantics of the whole text.
3. A proposition should be minimal, i.e. it cannot be split further into separate propositions.
4. A proposition should be contextualized and self-contained (Choi et al., 2021). It should include all the context from the text (e.g. coreference) needed to interpret its meaning.
5. To segment Wikipedia pages into propositions, we fine-tune a text generation model, which we call the "Propositionizer". The Propositionizer takes a passage as input and generates the list of propositions in that passage. Following Chen et al. (2023b), we train the Propositionizer with a two-step distillation process. First, we prompt GPT-4 (OpenAI, 2023) with an instruction containing the proposition definition and a one-shot demonstration. The details of the prompt are given in Figure 8.
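
A minimal sketch of the multi-vector-style setup, in plain Python so it runs without any dependencies. It shows the two moving parts: parsing the JSON list of strings that the prompt asks for, and indexing each proposition as its own vector while keeping a pointer back to the parent passage. Everything named here (`parse_propositions`, `embed`, `PropositionIndex`) is hypothetical and is not the langchain API. The bag-of-words vector is only a stand-in for a real dense encoder, and the sample passages and propositions are hand-written rather than actual LLM output.

```python
import json
import math
import re
from collections import Counter


def parse_propositions(llm_output: str) -> list[str]:
    """Parse the JSON list of strings that the decomposition prompt asks for."""
    props = json.loads(llm_output)
    if not isinstance(props, list) or not all(isinstance(p, str) for p in props):
        raise ValueError("expected a JSON list of strings")
    return props


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for a dense encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PropositionIndex:
    """Search over propositions, return the parent passage (multi-vector style)."""

    def __init__(self) -> None:
        self.docstore: dict[str, str] = {}  # doc_id -> full passage
        self.entries: list[tuple[Counter, str, str]] = []  # (vector, proposition, doc_id)

    def add(self, doc_id: str, passage: str, propositions: list[str]) -> None:
        self.docstore[doc_id] = passage
        for prop in propositions:
            self.entries.append((embed(prop), prop, doc_id))

    def search(self, query: str, k: int = 1) -> list[tuple[str, str, str]]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        seen: set[str] = set()
        results = []
        for _, prop, doc_id in ranked:
            if doc_id in seen:
                continue  # keep only the best-matching proposition per passage
            seen.add(doc_id)
            results.append((doc_id, prop, self.docstore[doc_id]))
            if len(results) == k:
                break
        return results


# Hand-written sample data; in practice the JSON strings would come from the LLM.
index = PropositionIndex()
index.add(
    "ada",
    "Ada Lovelace was a mathematician. She wrote notes on the Analytical Engine.",
    parse_propositions(
        '["Ada Lovelace was a mathematician.",'
        ' "Ada Lovelace wrote notes on the Analytical Engine."]'
    ),
)
index.add(
    "python",
    "Python is a programming language. It was created by Guido van Rossum.",
    parse_propositions(
        '["Python is a programming language.",'
        ' "Python was created by Guido van Rossum."]'
    ),
)

hits = index.search("Who created Python?", k=1)
print(hits)
```

The query is matched against the decontextualized proposition ("Python was created by ..."), not against the original sentence with the pronoun "It", which is the point of the preprocessing. The cost noted above shows up in `add`: every passage needs one LLM call to produce its propositions before anything can be indexed.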