Note: this page was generated automatically and may contain errors; please check against the original video. It was produced by taking a screenshot of the video every 5 seconds and removing duplicate frames, using the whisper model for speech-to-text captions, gpt-4o-mini for Chinese translation, and Claude for the summary.
Dan Becker: Awesome. You want to get started?
Jason: Great. So yeah, I think today's little topic is systematically improving RAG applications. In particular, the interesting thing here is that this is really a system, right? It's a system that you should be able to apply regardless of the application we're trying to build, because the idea is that we're going to use it to do a data analysis, come up with hypotheses, test them, and eventually get an understanding of where we need to improve the systems that we are building.
Jason: Before we get started, I just want to roughly introduce myself. I imagine most people here might be a little bit familiar with me from Twitter, but my actual background is that I'm currently doing a lot of AI consulting for companies that work in the LLM space and the RAG space. You might notice some of these companies here, for example Trunk Tools, Narrow, and Limitless; they're all RAG tools. And on top of that, we also have a bunch of AI hardware companies that again really care about improving things like memory.
Jason: But the issue is, once you deploy this application, when things are failing, when you are losing users, maybe because there's not enough confidence in the responses, or maybe because we're unable to answer certain questions, we have to go figure out what kinds of questions are actually causing the issues. We can't just blanket it and say, "I want to improve RAG." Instead, we need to be able to say things like: questions that retrieve documents by file name are 40% of the questions being asked, but they only have 30% user satisfaction. Now we can actually go in and identify exactly which segments of the population have questions that we cannot serve well, and then come up with ideas to make some of these improvements. So I'll talk about that process in a bit.
Jason: If you maybe have to leave early or anything like that: all the stuff that I talk about today I've actually written down before, so if you're ever interested or have some time to spare, you can always take a look at some of my existing writing on the topic.

Jason: Cool. This is basically the table of contents for today's conversation. The idea is, I want to cover the RAG playbook that I run for many companies; whether it's a small or a big company, the playbook really is the same. I'll cover how we should think about feedback mechanisms, understanding that feedback isn't something that every single query will have, and so we also need to measure things like the cosine and reranker scores,
Jason: how we can cluster these things, and then use that information to really figure out what the issue is. And then, near the end, I'll talk about the low-hanging fruit that every RAG app may as well include in order to get the most performance out of it.

Jason: But the first part really is the fact that we need to have a feedback mechanism. We need to actually have an objective that we care about. When a company comes to me and says, "Hey Jason, we want to improve our RAG application," we actually don't know what that means. Are you losing users because they don't have faith in the system? Are we unable to give good citations? Is it that the things we're searching for aren't available in the vector databases? Or is it that people search by things like file names, but because we don't use full-text search and only use vectors, we can't actually match against that? The first thing we have to ask is: what is the actual objective?

Jason: The easiest way to capture that feedback is having a thumbs-up and thumbs-down score. But even there, the copy is actually very important. I was working with a client recently where the original copy was "How did we do?" with thumbs up and thumbs down. What this actually resulted in was that we did not know what we were trying to capture. It turns out that if the answer was correct but too long, people would thumbs-down. If the answer was correct and concise, but it took 3 or 4 attempts to get the right answer, people would thumbs-down. That information isn't actually helpful in understanding what kind of business outcome we're trying to drive. What they really cared about was whether the answer was correct. And just by changing that copy, by saying "Did we answer your question today?" instead of "How did we do?", we were able to 5x the volume of ratings and improve the quality of the ratings data we had. What this means is that a couple of weeks from now we'll be able to look at the things that people thumbs-up and thumbs-down and have a heuristic for understanding exactly what people are happy with, and what we are doing poorly.

Jason: But even if only 5% of the population actually uses these mechanisms, we still want relevancy scores across all your search queries. So not only do you want a feedback mechanism that measures something like satisfaction, you also want a relevancy measurement that is easy to compute. You could use things like having an LLM judge relevance,
Jason: but in practice it's much more practical and much more cost-effective to use something like the cosine distance of your embedding score and, if you're using a reranker, the reranking scores that might come out of something like Cohere.

Jason: So now your logging data looks something like this: there's a query, there's a cosine score, and there's a feedback score, and that feedback score might be null for 95% of cases. But by changing the copy we can improve that quality, and the signal that we have is something we're trying to do better on.

Jason: So now you have all this data, basically a piece of text and a couple of numbers, and your goal is to figure out how to prioritize your team to systematically improve the RAG application. The most practical thing you can do initially is some form of unsupervised learning. If you use something like BERTopic or LDA, we can take the query strings and generate topics of questions, clusters that we can use to focus on the segments that we care about. If you run this type of modeling (we'll cover that later if you're interested), you basically get a topic, the number of questions within that topic, and then both a mean cosine score and a mean feedback (user satisfaction) score.

Jason: Just by looking at this, we're going to give ourselves a lot of ammunition for figuring out how exactly we're failing the user, and whether or not it's even relevant to improve. For example, maybe we find a topic with a very low count, very low relevancy, and very low satisfaction. We can at least recognize that this topic is something we might not want to prioritize in the near term, and we might even realize that these questions are so irrelevant that we want to put it into the prompt of our language model to say: "If you're going to ask about this topic, this is not something I'm going to try to answer today." Whereas you might be in another domain where you find topics with very high counts, very high cosine similarity, but very low feedback. Maybe what we can say is: this is the top cluster we've discovered, our search engine says that we're finding relevant data, but the users are extremely unhappy with how we're answering the question. Now we know we need to be thinking about what kind of answer we're generating, whether that's the question prompt, or whether there's something else going on for the users that we're not understanding, and we can go figure that out. Other times we might get topics with high counts, low relevancy, and high satisfaction. And again, by looking at this grid of options (high or low count, high or low relevancy, high or low feedback), we can pretty much identify the regions that we can dive into and brainstorm on how to improve these systems.
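A minimal sketch of that clustering-and-aggregation step, assuming the query logs are already in a pandas DataFrame; the column names and the BERTopic settings here are illustrative, not the exact pipeline from the talk.

```python
# Cluster logged queries with BERTopic and build the count / relevancy /
# satisfaction grid per topic. Column names are assumptions about your logs.
import pandas as pd
from bertopic import BERTopic


def summarize_topics(logs: pd.DataFrame) -> pd.DataFrame:
    """logs is assumed to have columns: query (str), mean_cosine (float),
    thumbs_up (1, 0, or None when the user left no feedback)."""
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(logs["query"].tolist())
    logs = logs.assign(topic=topics)

    # The grid described above: per topic, how many questions, how relevant the
    # retrieval looked (cosine), and how satisfied users were (thumbs).
    return (
        logs.groupby("topic")
        .agg(
            count=("query", "size"),
            mean_cosine=("mean_cosine", "mean"),
            satisfaction=("thumbs_up", "mean"),  # missing feedback is ignored
        )
        .sort_values("count", ascending=False)
    )


# Usage (assuming you export your request logs somewhere readable):
# summary = summarize_topics(pd.read_parquet("query_logs.parquet"))
# print(summary)
```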
Jason: And when you build these topics, there are going to be really two kinds of topics. There's the topic where the content you're searching is the interesting piece; I'll just call that a content topic. The other kind is a capability topic.

Jason: So imagine I'm doing a marketing RAG app. A topic cluster could be questions regarding pricing versus questions regarding different kinds of case studies, right? If those kinds of questions have low relevancy and low feedback, usually the solution is to improve the inventory that you have. It might be the case that our customers are coming in and asking about pricing, and we're not doing well because we don't have enough pricing documents. That would be a content topic.

Jason: The other kind we talked about was a capability topic. Here, instead of asking a question like "what is the pricing that we have?", it could have been "who was the last person who updated the pricing document?" That capability topic is really about things like modified dates. If that information isn't available to the language model, if it isn't available in a text chunk, you'll never be able to answer that question. But if that topic does show up, we can look at the counts, the cosine distance, and the feedback to identify whether it's worth fixing today or tomorrow, based on how you prioritize things. Other simple examples of a capability might be comparing and contrasting across multiple documents; if that's the case, you might want parallel search. You might encounter questions that ask about recency a lot, words like "recent" or "latest"; then we might want date ranges, and if we have date ranges, we can recognize that we need the current date available in the prompt for the language model to prepare a search query like that. A really funny example we noticed recently was that in different industries, FY24 and FY25 don't end in January, and so you might discover that the financial year is being referenced a lot, and because of that we're matching against the wrong year token. These are all capabilities of the system that we need to improve by changing the search system, compared to the content topic clusters we talked about before, where the fix is to identify inventory issues and make sure the documents we need are actually available. And if we have issues there, we might build out rules that say: if fewer than 10 documents match this question, it might be useful to report back to the user, or to the content management team, and say, "Hey, lots of people are asking about pricing, and this is not available in the search tool." So there are different ways of solving the RAG problem, basically by looking at this simple topic, count, score, and satisfaction.

Jason: Once you develop these topics, you then have to do a little bit of reading the tea leaves. Once you identify these topics, you're going to have an insight as to what kind of content clusters and capability clusters actually make sense for your use case. And today we ran this process offline, so when something changes in the future, we might not be able to detect that unless we rerun it. So once you identify the topics and the capabilities, one thing that is really useful to do is build classifiers that can classify these questions in real time. We have these question-type labels,
Jason: whether it's a type like "is this a privacy question," or a capability like compare-and-contrast, date-range filtering, or ownership. We can do this classification task asynchronously, as data is being processed, and then send it to something like Amplitude or Datadog, where we can monitor over time how these things change. You might notice that today we have some seasonality, where maybe summary-type questions are being asked Monday morning at the beginning of a work week. But, more importantly, when you get new users or new kinds of inventory, it's really valuable to understand how the distribution of these questions, and how you prioritize improving them, changes over time. A simple example: you onboard a new enterprise client, and you discover that after the onboarding process the number of questions we were ignoring skyrocketed, because it turns out this company really cares about compare-and-contrast questions, whereas your previous user base did not. Having this real-time monitoring means being able to say: okay, 20% of our questions are document search, and all of a sudden it went to 100%.
Jason: Maybe we just got a really big company that is using the application differently than it was being used before, and there would be no way to detect that unless you had this real-time visibility telling you something was going on. Otherwise you might just run this clustering job a month from now, two months from now, and then realize that this new customer was very unsatisfied because we deprioritized the type of questions they really, really care about.
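A minimal sketch of that kind of real-time question-type classifier, using instructor for structured output; the label set, model name, and emit_metric stub are illustrative assumptions rather than a prescribed setup.

```python
# Classify each incoming question asynchronously and emit the label as a metric
# so distribution shifts show up in dashboards (Amplitude, Datadog, etc.).
from enum import Enum

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class QuestionType(str, Enum):
    CONTENT_PRICING = "pricing"
    CONTENT_CASE_STUDY = "case_study"
    CAPABILITY_COMPARE = "compare_and_contrast"
    CAPABILITY_DATE_RANGE = "date_range_filtering"
    CAPABILITY_OWNERSHIP = "document_ownership"
    OTHER = "other"


class QuestionLabel(BaseModel):
    question_type: QuestionType


def classify(question: str) -> QuestionType:
    label = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=QuestionLabel,
        messages=[
            {"role": "system", "content": "Label the user's question with one question type."},
            {"role": "user", "content": question},
        ],
    )
    return label.question_type


def emit_metric(name: str, tags: dict) -> None:
    # Placeholder: in practice this would go to your analytics/monitoring tool.
    print(name, tags)


emit_metric("rag.question_type", {"type": classify("Who owns the FY24 pricing doc?").value})
```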
Jason: And generally what happens when you actually build up these clusters is that there are usually 3 or 4 things that stand out. Instead of having a goal like "improve the RAG application," we can set very simple goals, like improving date-time filtering, or improving the amount of metadata we include in text chunks. Now you go from this very ambiguous "improve the AI system" problem to basically figuring out how much time your team has and running these very small experiments. On top of that, because you have these specific topics, when you want to evaluate new datasets you can use things like synthetic data generation to generate more questions for the topics that you care about, and then very quickly test whether your experimental changes have improved the system.

Jason: And so this is something we're going to talk about right now, around some of the lower-hanging fruit. The general idea is that once you have this topic information, you know what kind of questions people are asking, and now you can prompt a language model to generate more questions like that, to help you build better baselines for evaluating your search tools. A simple way you can do this is by saying: I'm going to randomly sample a text chunk,
Jason: I'm going to use AI to generate a question from that text chunk, and my evaluation metric is simply: if I search this question in my RAG app, that text chunk must show up in the top 3, top 10, or top 25 results of the search system. And I know whether that's relevant, because I know these questions come from topics that I care about rather than topics that I don't. If you include better metadata or add better date-time filters, you'll immediately be able to recognize whether this improves your baseline. So instead of trying to guess whether something is going to improve your system, instead of trying to guess if date-time filters, or new embedding models, or new text-chunking strategies are going to help, you can just test them with synthetic data, even before you have a lot of user feedback data on something like user satisfaction.
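A minimal sketch of that synthetic-question recall check; the `search` callable and `chunks` list stand in for your own retrieval endpoint and chunked corpus, and the model name is just an example.

```python
# Sample chunks, ask an LLM to write a question each chunk answers, and measure
# how often the source chunk comes back in the top-k search results.
import random
from openai import OpenAI

client = OpenAI()


def generate_question(chunk: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write one question that this passage answers."},
            {"role": "user", "content": chunk},
        ],
    )
    return resp.choices[0].message.content


def recall_at_k(chunks: list[str], search, k: int = 10, n_samples: int = 50) -> float:
    """Fraction of sampled chunks that come back in the top-k results for a
    question generated from that same chunk."""
    sampled = random.sample(chunks, min(n_samples, len(chunks)))
    hits = 0
    for chunk in sampled:
        question = generate_question(chunk)
        results = search(question, limit=k)  # your RAG app's retrieval call
        hits += chunk in results
    return hits / len(sampled)

# Re-run after each experiment (new chunking, new embeddings, date filters, ...)
# and compare recall@k before and after, per topic if you have topic labels.
```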
Jason: And that's basically it, right? You want to make sure you do things like having full-text search involved. Usually, when you use these reranker models, we've always found improvements in our baselines. Because people ask questions like: who edited the last document? What was the title of that document? When was this file last updated? Show me the latest version of the pricing doc we showed to Fortune 500 companies. That query isn't going to be just an embedding; it's going to be a date filter with a category filter with a document-type filter. And by looking at these topics, figuring out that these are the things that are important, and then building specific, nuanced search tools, you'll be able to actually identify and report back, to your executive, your manager, even yourself, and say: hey, for this topic, where 40% of the questions fall, we've been able to meaningfully improve cosine distance or user satisfaction. And this gives you, to me at least, the data and the evidence you need to invest more time in improving the system, knowing that improvements in the system actually lead to a better outcome. You can imagine we have things like the count, the mean cosine distance, the mean feedback; we might also have the average character length or the average latency, and we can now go try to correlate these variables with other business outcomes. Is my RAG app actually able to convert users better? Are those who have high satisfaction or low satisfaction using my RAG or my chatbot once a week, twice a week, three times a week?
Jason: Right? Those are the things that you can do once you actually have these metrics, and once you have these clusters to identify what exactly is going on. And obviously, on top of metadata, you want to do things like adding type filters, document filters, and date filters. I see we have 8 minutes left, so I'm going to jump to the next slide as well. The goal of this really isn't necessarily to show you exactly how to do these things; the goal is to show you that one of the simple ways of going about this is to do this kind of data analysis, to identify and prioritize where you want to improve your systems.
Jason: And the main goal for me really is also to hear some of the questions you have about how you're running into these kinds of issues. After this course is done, Dan, I will be spinning up a course on really breaking down this process and understanding how we can improve these applications. So if any of you are interested, I'll be available on Discord to answer any kind of questions. And if you're interested in learning more about the course, or in telling us what you'd want from the course, there's a QR code you can go check out, and there's a Typeform survey that should take about 2 minutes. It'll give us a good understanding of the concerns and questions your teams have around improving these systems, and what kind of outcomes you're trying to drive.
Jason: Dan, do you have anything else?

Dan Becker: Yeah, we'll drop a link to the survey, just a direct link in Discord. And then we've got some Q&A, so we should use this time and go through some of the questions.

Jason: Yeah. Do you want to pick the questions?

Dan Becker: Yeah, I'll take it. I've ranked them from most votes to least. Top question is from Simon Willison: how many real user queries do you need to have before these techniques start to be worthwhile? Presumably, if you have only a few hundred logged queries, you're better off eyeballing the data directly.

Jason: I would say so. I think it's still useful to find these clusters, because even if you just have a hundred questions, and the difference was between regular questions and questions that involve time, an embedding model should at least be able to discern that; "latest," "recent," those are things that show up. It's really just useful to have some kind of mechanism you can group by and then focus on those things independently, because this topic clustering is really just summarization. So if you copy-pasted all 100 questions into Opus, you'd be doing roughly the same thing. But once you have these discrete variables, you can group by them and say: are the cosine distances the same? Are the thumbs-up and thumbs-down the same? Oh, it turns out every question in topic one was asked by this one customer of ours; let's go reach out and ask, "Hey, you had really low satisfaction, what were you looking for? What kind of outcomes were you looking for?"

Dan Becker: Let me pull up the next question. I'm going to rephrase it as I talk, but someone's asking how to have a standardized data-ingestion pipeline when you've got complex data like PDFs, PowerPoint, flow charts, visual charts, and so on. Which is, yeah, I think one of the challenges I've seen most frequently, especially charts. Do you have any quick reactions to this, Jason?

Jason: Yeah. So if we have an understanding of what kind of questions people are asking, what I usually prefer to do is commit much more compute upfront to pre-answer these questions. If I have a bunch of images in a PDF, is the goal to really pull questions out of them? Do I just want to be finding images? Or is it usually the case that the paragraphs above the images also reference the images themselves, and I want some kind of question-answering system? For example, one case was pictures of clothing. If I run bounding boxes over those images, maybe what I actually want to do is just find all images that contain at least one shirt. So understanding what questions people are asking is usually the first step. The more general thing would be generating summary indices, where for all images I can generate a text summary of what the image shows, and for charts I can generate text summaries over those charts. And again, the goal here really is to create synthetic text that I can use to look up alongside the other multimodal data.
Jason: And if I know what kind of questions people are asking, I can always say: given this chart, and given the questions people ask about charts, what are some questions that would retrieve this chart? And again, build a baseline of what exactly is going on.

Dan Becker: I would just add to that. I've worked on PDFs that had many tables, and there are Python libraries that will help you extract those tables into CSV. They're really finicky, so if your tables have, for instance, a common problem that we saw is nested headers, so you could have, like,
Dan Becker: in the header you'd have "revenue," and then the sub-headers would be individual columns, revenue broken down by country. We never got that to work very well, but if your tables are pretty standard and simple, we were able to extract them to CSV, pass the CSV to the LLM, and we've had, not fantastic, but reasonable results dealing with tables like that. Visual charts, I imagine you could use models that have a vision component; again, something I haven't worked on a lot myself.

Jason: One specific thing I'd like to call out in this tables example: understanding what kind of questions are being asked about these tables is a huge lift. Sometimes when I have a table, I might be trying to do aggregate statistics over that table, in which case I might want something like: given a CSV file, write some kind of text-to-SQL engine to answer questions about aggregates. That's not always the kind of data access we want, right? Other times, we might be processing specifications, large tables where we're trying to find certain rows. If we knew ahead of time whether we want summary statistics versus a scan of the table, we could build separate systems for each, and that really comes from, again, looking at the questions and making sure that this is what you're being asked for. I think before, we were processing tables with like 20,000 rows, and my solution was, oh, let's put this into SQLite. But it turns out they just wanted to figure out when we ordered these kinds of hex screws for this kind of construction site, and we just needed to chunk the table, like 20 lines each, and we'd be able to find the answer, because we only cared about figuring out which row something existed in.

Dan Becker: Here's a question, I really like this one, from Sid G: any suggestions on capturing feedback when the RAG element is hidden away from the user? For instance, I'm using agents to build a report, and my RAG is a tool the agent uses to build elements of the report. Since the user does not directly use or query the RAG, how do you suggest I capture feedback?

Jason: Yeah, in the report context you still have fields of inputs. So one piece of direct feedback is: do we ever ask the user to correct the inputs? If there was a correction, that's a piece of feedback. Now we know that fields of a certain type require a lot of user corrections; what fields are those? It turns out when the fields are enums of high cardinality, we tend to fail. That could be a discovery. So at the field level, there are edits we can use. Even without that feedback, we still have the idea of the reranker and the cosine distance. Even if we don't touch anything, for every internal RAG API call that we make, we are getting N text chunks, and for those N text chunks we get average cosine distances, the max reranker relevancy, and the mean reranker relevancy, right?
Jason: And whether it's 0.7 or 0.8, that's not really going to tell us much. But if certain fields are getting 0.3 and other fields are getting 0.8, it still guides us to where to look, to come up with a hypothesis, and then come up with an experiment.

Hamel: If you have a report that you're generating with RAG and you're not able to edit it, that seems like really bad
Hamel: UX, really bad product design, I think.

Jason: And it really depends. If your report is generated as markdown output, it's really hard to assign back what is going on. Whereas if you have a report that's a JSON object, with keys and values, you can figure out: okay, this key is getting lots of edits, and we can fix that. So there's a little bit of structured reasoning in how we even design the report
Jason: experience. Dan?

Dan Becker: I was gonna say, this also seems to me to be the case, and it's probably the one I have the most experience with; it might actually be the easiest case. If you're extracting data for a report, there might just be a ground truth, or, in the case that I've dealt with, there is a ground truth. Now we can literally just write a set of assertions, and instead of relying on the feedback being thumbs up or thumbs down, we say: we need to extract a certain fact, and we're going to write a test. The benefit of this compared to user feedback is that with user feedback you actually have to deploy a system to figure out whether it's an improvement or not. But if we want to experiment with changing the model or changing something about how we do chunking to improve this system, and I can write a bunch of tests, because I know I'm just pulling out facts to feed into a report, now I can iterate very quickly, because I can test in seconds rather than the days you would typically need with a deployed system.

Dan Becker: The next most upvoted one, I think, is a comment slash question. Someone says cosine usually isn't an exact measure of relevancy; it's relative and to some degree arbitrary. Why would you recommend including it as a measure anyway?

Jason: Yeah, the quick answer is, it is at least a number to look at. What happens is, if I wanted to build RAG to write emails for a sales project, the best metric would be something like sales conversion. If I could capture that data, I would, and I would track it very hard. But usually what's going to happen is that I'll have maybe hundreds of emails before I get a sale, so I don't have enough signal at low volume to make anything actionable. If I added a thumbs-up button, maybe with 1,000 emails I'd get 50 interactions, and so I might need tens of thousands of emails to even figure out satisfaction. Whereas cosine distance is at least a metric I have access to for every single request. And the idea is, as I increase my volume, I should be able to rely less and less on it. So in this case, thumbs up and thumbs down is the best we have, and cosine is available for everything. Really, we're just trying to have some direction to go, and we have to recognize that not all metrics are useful, but they can be useful at a particular moment in time.

Dan Becker: Also, the way we're using it, the fact that it's relative is fine, because we're actually using it in a relative way.
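A minimal sketch of the assertion-style check Dan described a moment ago for report generation, where a ground truth exists for the extracted fields; the module, function, and field names here are hypothetical.

```python
# When the report fields have a known ground truth, plain tests replace
# thumbs-up/down feedback and run in seconds, so you can swap embedding models
# or chunking strategies and re-check immediately.
from my_report_pipeline import generate_report_fields  # hypothetical: your RAG-backed extractor


def test_q3_filing_extraction():
    fields = generate_report_fields("acme_q3_filing.pdf")  # hypothetical input document
    assert fields["fiscal_year"] == "FY24"
    assert fields["total_revenue"] == "4.2M"
    assert "Acme" in fields["company_name"]
```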
Dan Becker: Yep. Alright, how about this one? I'm going to reorder by current upvotes. Can you speak to the use of graph embeddings in RAG, and how to evaluate that?

Jason: I mean, I don't really work too much with knowledge graphs or graphs in general. Usually, if there was a graph, and I'm going to be a little more technical here than maybe I should be, I don't know how technical the audience is, but if there was a graph and I was factorizing the adjacency matrix, all I'm really saying is that I want to fine-tune my embedding model beyond something like a sentence transformer or an OpenAI call, in order to get some outcome. And the question I would really be asking is: if I switched out the OpenAI embeddings for graph embeddings in the search tool, how much would my relevancy metrics move? The answer depends on the question type. If the question was about last-modified date, maybe the graph won't be able to help, whereas the graph embeddings might improve the content-topic clusters but not the capability clusters. These kinds of questions are really hard to evaluate unless we have the segments and scores that we care about, to help us prioritize.
Dan Becker: Got a question from Peter, one that quite a few people hypothesize about: how much of what you're talking about today, RAG, will still be relevant when a hypothetical GPT-5 is released?

Jason: I think something like this will be even easier when a GPT-5 is released, because part of it is just understanding how to write queries, and recognizing that the embedding isn't the only thing that matters. If we embed a text chunk and the question was "can you compare and contrast the pros and cons of these 3 services?", GPT-5 will just be able to write, say, 6 queries, the pros of this service, the cons of this service, the pros of that service, and then synthesize that information. So I don't think we ever get to a place where this fails, especially if you recognize that even as long context grows larger, it does not mean that search becomes less important. Because if you're actually a business, it's not just accuracy or user satisfaction; you also have to trade off things like latency against business outcomes. If GPT-5 is really good and you can put 10 million tokens in it, and I'm building an application that sells products, then if the response time is 5 seconds, someone might click the back button and not actually buy the product. So there are always going to be trade-offs we have to make. That said, understanding what your customer is actually asking, that is a skill you develop as the data scientist and as the business, and those things will be here forever. If anything, I would just love to be able to copy-paste all the questions into GPT-5 and ask it for the clusters and the prioritizations. But building that thumbs-up, thumbs-down button is something that's separate from the analysis GPT-5 could do.

Dan Becker: Alright. I've got another one, which I think is again about knowledge graphs; I think we've covered that. Can you talk a bit more about date and doc-type filters? Do you embed something about that in the LLM? I think someone is basically asking: can you talk a little bit about metadata filtering, what the idea of it is, and how it works?
Jason: Yeah. If you see my screen, do you see the slides, or do you see a blog post?
Dan Becker: I see a blog post.

Jason: Okay, perfect.
Jason: One simple way of doing this is actually using something like instructor to generate structured outputs.
Jason: Right? So in this example, I have an output model, a Metaphor-style query, that has a rewritten query, a published date range, and an allowed list of domains,
Jason: and my prompt effectively says: given a request that comes in, I want you to structure it in a way that I can use in a subsequent search system. And so now,
Jason: for a question like "what are some recent developments in AI?", I actually get a rewritten query that looks like this. It says: okay, the query should be something like "novel developments, advancements, AI, article," and it creates date ranges, because it's trying to capture what recency looks like, and in this specific instance it also recognizes that there are good domains that I want to search against. So in this case the category or type is: only return documents created within this time range, from arXiv,
Jason: right? And this is something that you kind of have to customize, because you are the one who has to build out
Jason: the start and end date range.
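A minimal sketch of that kind of structured query model with instructor and Pydantic; the field names, model choice, and system prompt are illustrative, not the exact code from the slides.

```python
# Convert a user's question into a structured search query: a rewritten query
# string, an optional published-date range, and an allowed list of domains.
from datetime import date
from typing import Optional

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())


class SearchQuery(BaseModel):
    rewritten_query: str = Field(description="Query rewritten for the search backend")
    published_after: Optional[date] = Field(None, description="Start of the date range, if recency matters")
    published_before: Optional[date] = Field(None, description="End of the date range")
    allowed_domains: list[str] = Field(default_factory=list, description="Domains worth restricting to")


def build_query(question: str) -> SearchQuery:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SearchQuery,
        messages=[
            # Supplying today's date is the trick mentioned earlier for handling "recent" / "latest".
            {"role": "system", "content": f"Today is {date.today()}. Rewrite the request for a search system."},
            {"role": "user", "content": question},
        ],
    )


# e.g. build_query("What are some recent developments in AI?") might come back
# with a rewritten query, a recent date range, and domains like arxiv.org.
```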
Jason: But because you understand that this is something that's important, it is something that you can go out and build, and deterministically know that it will improve how your search application runs. And so if you have different document types, maybe it's something like
Jason: the content in your CRM that is tagged by your marketer. So you know there is a pricing tag, and if you knew that ahead of time, you can just make sure you only search documents with certain tags.
Jason: So, that's about pricing.
Dan Becker: Actually, let me ask a follow-up when you do this, Jason. So we're getting a query, and you're using instructor to map it to the right metadata to query on. When you are putting documents into the vector database, do you also use instructor to create metadata? A lot of metadata just is what it is, you know, when the document was created. But historically, have you used instructor to create metadata that you add for each document
Dan Becker: in the vector database?

Jason: Yep. So one of the things I do, and again this is based on the use case: one really funny example was that in different industries, the fiscal year does not end within the calendar year. So we just had some instructions that said, okay, if you're in mining, FY24 is actually FY23. We had a small prompt that says, for any financial document, convert it to the actual date rather than the stated fiscal year; a really small prompt, a really small extraction.

Jason: A more interesting example was around pulling complex diagrams from complex tables. In this situation we knew that the table could just be shoved into the prompt to generate a response, but what we did was say: given this table, generate for me 6 or 7 questions that would retrieve this table. And then, when we actually wrote that table into the database, we embedded the questions, with the table as metadata, rather than embedding the table itself. These are very specific decisions we made because we knew that, given what people were looking for, we needed to pre-compute an FAQ, right? That was a decision we made because we realized that what we really needed was FAQs across all documents, and we couldn't just include the tables directly.
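A minimal sketch of that pre-computed FAQ idea: embed synthetic questions and carry the table along as the payload that gets returned. The embed helper and the in-memory index are stand-ins for whatever embedding model and vector store you actually use.

```python
# Generate questions a table answers, embed the questions, and keep the raw
# table text as metadata so retrieval matches on the questions instead.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def questions_for_table(table_markdown: str, n: int = 6) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Write {n} questions a user might ask that this table answers, one per line."},
            {"role": "user", "content": table_markdown},
        ],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]


def index_table(table_markdown: str, index: list[dict]) -> None:
    # Embed each synthetic question; the table itself rides along as the payload.
    for question in questions_for_table(table_markdown):
        index.append({"vector": embed(question), "question": question, "payload": table_markdown})


def search(query: str, index: list[dict], k: int = 3) -> list[str]:
    q = embed(query)
    # OpenAI embeddings are unit-normalized, so dot product acts as cosine similarity.
    scored = sorted(index, key=lambda row: -float(np.dot(q, row["vector"])))
    return [row["payload"] for row in scored[:k]]
```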
Dan Becker: We got a question here: what's your favorite platform for building RAG systems? LangChain, LlamaIndex, something else?

Jason: I think for the most part I've kept things very simple. Most of the applications we've built use something like LanceDB, and the reason we choose LanceDB is primarily because full-text search, SQL, and vector search can all live in a single database. Whereas, historically, if you used early versions of Pinecone and OpenSearch, you would have to have an OpenSearch index, a Pinecone index, and a Postgres table in order to do something like a WHERE clause and a HAVING clause along with full-text search and a vector search; you'd have to get the indices, intersect them, and then do something else crazy. Things like LanceDB make that very easy.

Jason: In terms of query building, basically what I end up doing is: I have a LanceDB instance, I have a processor that converts the user's question into a Pydantic object, and then I give that Pydantic object a method called search. That makes the call very explicit, because I have all this knowledge about what my data looks like: I know this is a CRM, I know it's going to have tags, I know it's going to have a company tag, and I want to embed all that knowledge into the query engine. I think something like LangChain or LlamaIndex is very easy when we just have very general, deterministic data, but as data becomes much more specific, it's just simpler to have the control you need to turn those knobs.

Dan Becker: How many more questions do you want to handle?

Jason: I've got time, cool.

Dan Becker: Let's keep going then. How do you develop an intuition for what questions you'll get at the start of a project? How does this change if you're building the RAG from scratch? Do you wait for feedback to come in before you add metrics?
Jason: Yes and no. One of the things I do for every project, which at this point in my opinion is a must-have, is: once you've chunked the data, the first thing you should do is use something like a language model to generate 2 or 3 questions per text chunk. And you might say, well, how do you know the LLM is going to ask the questions my users will ask? We don't, but it's going to be something. And oftentimes you'll still be surprised at how hard the search problem is. For example, I had been using a Paul Graham essay as a demo for a long time, so I ran this synthetic data generation process, and we found that our search relevancy was about 97%. Then we did BM25, and it was also 97%. It turns out for Paul Graham essays you don't need semantic search, you just need text search. Then I thought, okay, this is probably too easy; let's move this to GitHub issues. We do the same thing for GitHub issues: generate a fake query for me. This time we got 60% for full-text search, 61 or 62% for semantic search, and something like 65% for semantic search plus full-text search plus a reranker. Oh wow, even when we cheat, we can't do well on this. And now you go and begin the process of understanding what the heck is going on. It turns out for GitHub, a question might be "how can I get started?", and if we don't condition on the repository, you'll never do well. Okay, that means I need to build UI to condition on those kinds of variables, because without those filters I'll never be able to do well.
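A minimal sketch of that kind of head-to-head, BM25 versus embedding retrieval over synthetic chunk-question pairs; rank_bm25 and sentence-transformers are one possible choice of tooling here, not necessarily what was used in the talk.

```python
# Compare lexical and vector recall@k on (source_chunk, synthetic_question) pairs.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


def recall_at_k(pairs: list[tuple[str, str]], chunks: list[str], k: int = 10) -> dict:
    """pairs: (source_chunk, synthetic_question) built by an LLM, as described above."""
    bm25 = BM25Okapi([c.split() for c in chunks])
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)

    hits = {"bm25": 0, "vector": 0}
    for chunk, question in pairs:
        # BM25: lexical match on whitespace tokens.
        bm25_top = np.argsort(bm25.get_scores(question.split()))[::-1][:k]
        hits["bm25"] += chunk in [chunks[i] for i in bm25_top]

        # Embeddings: cosine similarity (vectors are normalized, so dot product works).
        q = model.encode(question, normalize_embeddings=True)
        vec_top = np.argsort(chunk_vecs @ q)[::-1][:k]
        hits["vector"] += chunk in [chunks[i] for i in vec_top]

    return {name: count / len(pairs) for name, count in hits.items()}
```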
Jason: Right. It turns out that on GitHub a question might be, "How can I get started?" And it turns out that if we don't condition on the repository, we'll never do well. Okay, that means I need to build UI to condition on those kinds of variables, because if I don't have these filters I'll never be able to do well. That's just what you learn as you're spelunking across the data sets and across the kinds of questions. And the idea is: today you have a thousand synthetic questions and no users; tomorrow you'll have a thousand questions and a thousand users, and eventually you can wean yourself off the synthetic data. But that's the transition as you deploy to a production system.

Hamel: What's the term for that? I've been kind of making up a term when I try to explain it to my clients. There's query expansion; I call this target expansion. I just made it up: turning each document into questions, basically making it more likely that recall happens.

Jason: Yep.

Hamel: What is the right word for that? I feel like I made it up.

Jason: Yeah. I think in the research it's called building a synthetic text chunk. But I just see it as eating additional compute cost at insert time in order to improve my retrieval. It's like inserting a row into a database with an integer primary key: behind the scenes it's building some kind of tree, and the insert is expensive because I've specified primary key, bigint, unique, or something. The insert is slower, but I get a better read. That's how I think about it.

Hamel: We have to make up new terminology. I'm just curious if there is one.

Jason: Yeah, I think the general consensus is just "synthetic text chunks." It would be HyDE, yeah.

Dan Becker: Coming back to this question: how do you develop an intuition for what questions will be asked at the start of this type of project? I think that if you don't have at least a theory for that, I'd ask why you are doing this project. Most projects have some goal, and you say, oh, we want to help users answer their questions about product reviews. I can imagine a hobby project where you just want to build a RAG system, but if you've got a business problem you're trying to solve, you should have an intuition about what questions people will come in with.

Jason: I'll take the BM25 one. When you use synthetic data to generate your question-answer pairs to evaluate, you'll usually find that BM25 and vector search are not that far off in practice; it might be 67 versus 68, or 92 versus 95. But there are specific situations where BM25 is much, much better. The first is when the author of the document is also the searcher: we're much more likely to get exact word matches on how people search for things, because I wrote the thing.
Jason: For example, I had 15 hours of interviews and wanted to search across my own transcripts, and I know how I talk. So that's one thing. The second one is very specific: oftentimes, if you're building an enterprise RAG application, I would not be surprised if 30% of the users use it as document search. They don't really care about the answer; they just need to figure out which document it was. And it turns out that if you are the author of those documents, you know the file name, and BM25 will match the file name and outperform semantic search by multiples. If I know the file name is something like "company-name 24 report," I want to be able to type "company name 24 report" and find that document. Matching on the file name alone gets me far better results than vector search. These are just small examples of why I think it's effectively a default to include BM25 in conjunction with vectors.

Dan Becker: Oh, we got a question earlier, but I think we had to wait for the right moment. Someone says: wait, can we deep-dive on that idea? What's a good UX for report editing? I think this relates to a comment Hamel made, that if you're generating reports they should support editing. I don't know how topical that is to where we are right now. Is there anything else you want to say about that, Hamel?

Hamel: The only thing, I guess, is that you have to make sure your product is good.
Hamel: It's not just the AI. People get really focused on the AI, but you have to think about it holistically. You can focus on AI all you want, but you have to think about the product too. It's kind of like being a data scientist; this is the case in ML as well: a lot of the time, as you become more senior, you have to think about the product as well.

Hamel: Yeah, people can look at my article; it helps.

Jason: Alright, so my Wi-Fi got knocked out. Where are we?

Dan Becker: Right about now. Yeah, cool, I've got it. Go ahead.
Jason: Yeah. So the question was, how do we improve the UX for report editing? The thing I would really love to avoid is having the report generated as a Markdown file. If you generate a Markdown file, you might say, okay, I have headers and action-item bullets, it's structured output, and it's going to be easy to edit. But what's going to be hard is attributing edits to specific regions of the document. So if I can be much more opinionated and use structured outputs, where I have keys and values and input fields, then I can much more easily edit specific parts of the output. Instead of having action items and a summary as Markdown, I say: action items is a list of strings, summary is a text field. Then I can delete an action item with a delete action, or instead of three items I can just add a new one. Those actions are a little more straightforward to measure: if the model creates three action items and the user adds another one, that means you have bad recall; if they delete one, it means you had bad precision. You can at least start measuring that.
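A sketch of that structured-report idea: the report is keys and values rather than one Markdown blob, so UI edit actions map cleanly onto precision and recall signals. The model and field names here are assumptions for illustration.

```python
from pydantic import BaseModel


class ActionItem(BaseModel):
    text: str
    assignee: str | None = None


class MeetingReport(BaseModel):
    summary: str                     # editable text field
    action_items: list[ActionItem]   # each item can be added or deleted individually


def log_edit_signals(generated: MeetingReport, edited: MeetingReport) -> dict:
    """Turn the user's edits into retrieval-style quality signals."""
    gen = {a.text for a in generated.action_items}
    fin = {a.text for a in edited.action_items}
    return {
        "added": len(fin - gen),    # user had to add items -> bad recall
        "deleted": len(gen - fin),  # user removed items    -> bad precision
        "kept": len(gen & fin),
    }
```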
Hamel: Have it as...

Jason: Yeah, and sort of come up with a better measure.

Hamel: And you can store it as structured data either way; that's just implementation behind the scenes. So, yeah.

Dan Becker: Okay. How do you do RAG where we want the LLM to pinpoint which documents it got the answer from after retrieving the top-k documents? For example, after it gets 10 documents from retrieval and then answers from those 10 documents. Basically, how do you do citations?
Jason: Yeah. The easiest way to cite is to cite entire text chunks. If you have 10 text chunks, you can format the prompt so that it says: here are 10 text chunks, each one has an ID, and as you generate your answer, include a citation. Maybe the answer just contains a square-bracketed text-chunk ID. Usually that works fine. If these text chunks are small, you can still build a UI that says: when I mouse over the [2], render text chunk number 2. That's a very straightforward way of doing it. A much more complex way could be citing by spans and so on. But generally, because most of these language models are trained to output Markdown, if you make the citation look like a Markdown URL it works pretty well. So not only does every text chunk have an ID, you also make the citation look like a URL, and it actually performs much better in practice.
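A sketch of the chunk-ID citation pattern: number the chunks in the prompt, ask for bracketed IDs, then parse them back out so the UI can render the cited chunk on hover. The prompt wording and regex are illustrative assumptions.

```python
import re


def build_prompt(question: str, chunks: list[str]) -> str:
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the text chunks below. "
        "After every claim, cite the chunk it came from as [id].\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def extract_citations(answer: str) -> list[int]:
    """Pull the cited chunk ids back out of the generated answer."""
    return sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
```

The Markdown-URL variant mentioned above just changes the convention to something like [1](chunk://1), which models trained to emit Markdown tend to follow even more reliably.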
Dan Becker: What do you think about high-dimension embedding models as rerankers versus cross-encoder models?
Jason: Yeah, I almost always want to use both. Again, it's mostly a latency trade-off. Vector databases are pretty fast at search relative to cross-encoders. So the idea is: I might have a million documents, vector search gives me a hundred documents, and the reranker gets me down to, say, 20 documents. Really, I'm just making trade-offs between precision, recall, and the latency of these systems. It's going to be very hard for a vector database to decide whether "I love coffee" and "I hate coffee" are embedded to be similar or different, but that is something your cross-encoder is going to be much more comfortable handling.
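A sketch of that two-stage trade-off: cheap vector search for a wide candidate set, then a cross-encoder reranker for the final few. This assumes sentence-transformers; the model names are common defaults, not a specific recommendation.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def retrieve_then_rerank(query: str, docs: list[str], k_vec: int = 100, k_final: int = 20):
    # Stage 1: fast and approximate, vector search over the whole corpus.
    doc_emb = embedder.encode(docs, convert_to_tensor=True)
    q_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k_vec)[0]
    candidates = [docs[h["corpus_id"]] for h in hits]

    # Stage 2: slow and precise, the cross-encoder scores each (query, doc) pair.
    scores = reranker.predict([(query, d) for d in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k_final]
```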
Dan Becker: The next question is the same thing about adding citations; we talked about using Markdown for that. Can you comment on hierarchical retrievers from LlamaIndex, and on fine-tuning embeddings to improve RAG? How effective is this? It's actually two questions: can you comment on hierarchical retrievers, and separately, can you comment on fine-tuning your embedding model to get better measures of relevance for RAG?

Jason: Yeah. So, on the hierarchical retrieval stuff.
Jason: Funny enough, I've talked with Jerry about this kind of thing. I think that, generally, as language models get better, we'll need to reason about this kind of hierarchy less and less, because hierarchical changes are probably going to be small-percentage-point improvements. Longer context just means we don't have to be as clever to solve these kinds of problems. The reason hierarchy might matter is if we had four text chunks in a row and chunks 1, 2, and 3 were retrieved when chunk 4 should have been retrieved too, so there's missing context. I think that can be solved just by having bigger chunks and better retrieval.

Jason: But here is an opinion I hold strongly, mostly because I've been fine-tuning embedding models for search since around 2015: I think any company that is making money is leaving money on the table by not fine-tuning embedding models. Think about large e-commerce websites. Netflix trains an embedding model because they know what you watch and what you don't watch, and that information can be used to build a better recommendation system. Amazon has its own custom embedding models because it has checkout data, so it knows that a given user embedding and product embedding go together very, very well.

Jason: And it's just unclear whether OpenAI has question-answer data for your domain. Chances are OpenAI does not have that data available, and you do. On top of that, because of how capable these embedding models already are in general: if you search for "Modal embedding fine-tuning," I actually worked on a project with Modal where we found that even with 2,000 question-answer pairs we would outperform OpenAI and Cohere. In those relevancy tests we got maybe 84% from OpenAI by default, but if you fine-tune a 200 MB BERT-style model you get 86% with 2,000 examples, and around 89% with 20,000 examples. And if you're a company with real revenue and real users, you're going to collect 20,000 questions within weeks. Then you can do much more interesting things and recognize that, hey, before, I had to synthetically generate questions from text chunks, or text chunks from questions, because I knew the question and the text chunk don't look the same and I was hoping the embedding model could capture that. If you fine-tune your own embedding model, it does exactly what you want it to do.
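A minimal sketch of fine-tuning a small open embedding model on your own question-chunk pairs with sentence-transformers. The base model, loss, and hyperparameters are illustrative assumptions, and load_question_chunk_pairs is a stand-in for however you collect the pairs; the point is only that a few thousand domain pairs are enough to start.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Your production (or synthetic) data, e.g. [("how do late fees work?", "<chunk text>"), ...]
pairs = load_question_chunk_pairs()  # hypothetical loader, ~2k+ examples

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any small open model
train_examples = [InputExample(texts=[q, chunk]) for q, chunk in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: each question should rank its own chunk above the others in the batch.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("./finetuned-embedder")
```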
Dan Becker: Coming back to a topic from earlier: someone is having a really tough time extracting data from tables and PDFs in RAG pipelines. Any tips to improve that?

Jason: A blanket answer I feel pretty comfortable giving now is to look at things like LlamaParse. They've put a lot of good work into blending language models with existing text-extraction tools to generate better tables. A simple answer in terms of prompting, actually, is to just switch from CSVs to Markdown tables. I've found in practice that asking language models to output Markdown tables performs better than generating CSVs; I have some recall metrics along the lines of "can you find the right row in a table around 600 lines long," and Markdown does look like the better tool. So, for example, if you have a PDF with a table that's hard to parse, it might be worth giving that image to GPT-4 or to Opus and asking it to generate the Markdown table back out. It will be able to handle things like multiple headers or multi-level rows and indices, because it can manage the whitespace.
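A sketch of that tip: send the table image to a vision-capable model and ask for a Markdown table back. This assumes the OpenAI chat API with image input; the model name and prompt are illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()


def table_image_to_markdown(image_path: str) -> str:
    """Ask a vision model to transcribe a table screenshot as a Markdown table."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this table as a Markdown table. "
                         "Preserve multi-level headers and row indices."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```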
Dan Becker: Alright, thanks for the informative session. Do you have any specific guidelines on metrics and telemetry hooks where knowledge extraction is involved? Do you want to read this one? I can also read it.

Jason: I can try. Basically, the question is: are there any guidelines on metrics and telemetry hooks, so we can figure out where to instrument our system and understand how it's working? Luckily, I spent about three years instrumenting recommendation systems. If you go back through my writing, there's a blog post called "Levels of Complexity of RAG Applications," and it outlines where I like to log and what kind of data I try to log. It basically parallels a recommendation system: a user signs up, I show them 10 products, they scroll, look at the products, click a product, and buy the product. A RAG request is very much the same thing: a question came in and we used a bunch of text chunks. One simple thing we can do is this: if I give the model 10 text chunks and it generates an answer that cites 2 of them, you can imagine those 2 are more relevant than the other 8. That is still feedback you can feed into embedding fine-tuning. You can say: for any N text chunks I present to the language model, I want to fine-tune a model that ranks the ones that were cited higher than the ones that were not cited. So you can use these little micro-games in your infrastructure to generate the data you need to fine-tune a model in the future. But this is all covered in the levels-of-RAG post, and as I build more I'll have better information.
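A sketch of that micro-game: log which of the presented chunks the answer actually cited, and turn each request into ranking examples where cited chunks should score above uncited ones. The record layout is an assumption.

```python
def log_ranking_examples(question: str, presented: list[dict], cited_ids: set) -> list[dict]:
    """Convert one RAG request into (query, positive, negative) training rows.

    presented: the N chunks shown to the LLM, each like {"id": ..., "text": ...}
    cited_ids: the ids that the generated answer cited.
    """
    positives = [c for c in presented if c["id"] in cited_ids]
    negatives = [c for c in presented if c["id"] not in cited_ids]
    return [
        {"query": question, "positive": p["text"], "negative": n["text"]}
        for p in positives
        for n in negatives
    ]
```

Rows like these can later feed a triplet-style loss in the same embedding fine-tuning setup sketched earlier.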
Dan Becker: What's your recommended chunking strategy? How big should the chunks be, and should you have overlap between chunks?
Jason: Yep. This is a recommendation I'm basically borrowing from OpenAI and Anthropic, and interestingly, as they've increased their context lengths, the recommended chunk sizes have also increased. Back in the GPT-4 32K world, I think everyone was recommending something like 500 tokens with 50% overlap. Now that the main models have 100K-plus context, the recommendations are larger, on the order of 800 tokens with overlap. What I've really found is that when you actually generate these synthetic data sets of question-answer pairs and play around with the chunk size, I have not seen much of a performance change. So what I basically do is just follow their advice on chunk size and overlap. And if I want to improve my system, I usually improve it by generating synthetic text chunks from whole documents, rather than trying to augment individual text chunks, if that makes sense.
Hamel: So is the answer to have evals and at least test a few things?

Jason: Yeah. But I also find that I've never had an eval where going from 500 to 800 tokens made something amazing happen. It's very data-specific: certain kinds of data have longer-range references across tokens. It really depends what kind of data you have, but that is exactly the kind of thing you can find out if you generate a synthetic data set. For example, it might turn out that in legal documents, page 7 references a date that is itself a reference to the signing date on page 28. That might not be detectable with 500-token chunks: the chunk just says "this is due 35 days after signing," but the signing date lives somewhere else, maybe in a different document. That's very specific, and you'll still be able to uncover it with synthetic data.
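A sketch of token-based chunking with overlap using tiktoken. The 800-token and 50% values are just the kind of defaults discussed above, and they are worth checking against your own synthetic eval rather than treating as fixed.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, chunk_tokens: int = 800, overlap: float = 0.5) -> list[str]:
    """Split text into fixed-size token windows with fractional overlap."""
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        if not window:
            break
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```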
Jason: Let me refresh this.

Dan Becker: Yeah, refresh. Looks good. There's a question.
Dan Becker: Yeah.

Jason: What kind of model do I use? Because I mostly work with enterprises, the model of choice ends up being Opus, Haiku, or GPT-4o, primarily because we can afford to be less worried about context-length limitations. Just by using one of those models, we know it's going to do a good job. And because we have that context, we get to be a little less clever with how we do our search, and we can give much higher-resolution instructions on how to generate outputs, so we can render citations and other interesting UXes. For example, at Limitless we don't just generate action items, we generate action items with assignees, and Opus is able to generate a URL that we can parse and render as a profile image or something like that. Those are the things that matter. And for large-scale applications, it's really hard to get around the rate limits on some of the smaller systems.
Jason: Another question: any tips on picking embedding models? Ultimately, again, this is about having those evals. If you have that synthetic set of a thousand questions against 10,000 text chunks, and your eval is simply "when I search this, the top 3 must contain the original text chunk," then you just try them all and figure out which one works. And luckily, once you're in a world where you have a couple of thousand questions and 10,000 relevant text chunks, any model you fine-tune will outperform whatever base model you have. So you need the eval to make a decision in the beginning, but once you have production-level data, even 3,000 question-answer pairs, you'll just fine-tune, at which point you're working with a different set of constraints: how big is the model, how much throughput do I get, does it fit on an A10 or an A100? Those end up being the decisions you have to make.
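A sketch of that eval: for each synthetic question, check whether the source chunk shows up in the top k results, then run the same harness against every retriever you're considering. The retrieve interface and the candidate names in the comment are assumptions.

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 3) -> float:
    """eval_set rows look like {"question": ..., "chunk_id": ...};
    retrieve(question, k) returns a list of chunk ids."""
    hits = sum(row["chunk_id"] in retrieve(row["question"], k) for row in eval_set)
    return hits / len(eval_set)


# Compare candidates on the exact same data, e.g.:
# for name, retriever in {"bm25": bm25_search, "openai": openai_search, "finetuned": ft_search}.items():
#     print(name, recall_at_k(eval_set, retriever, k=3))
```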
Jason: Let me think. I want to take this one because it's interesting: how much do you look at and inspect the data up front to understand the questions, versus adapting over time? In a bigger company, maybe not a 10-person company but a 50-person one, I imagine there's going to be someone constantly looking at this data, maybe once a week. One small thing I do, kind of like a stand-up process: when we start an initiative, it's "give me a CSV file of 40,000 question-answer pairs." I run some clustering algorithm, look at the clusters, get some ideas, rerun it, get more ideas. Then I create very explicit labels, say: marketing, pricing, capabilities, compare-and-contrast, needle-in-a-haystack, and so on. And I log that. Basically, I make sure I also have an "other" label, and I monitor the percentage of questions that land in "other" in real time. I just check once a week, and hopefully "other" doesn't randomly go from 10% to 50%. If it does, I can come back and say: show me the 10,000 questions that were "other," rerun the clustering, and what did I learn? So it's going to be a very iterative process, and it stays iterative for the rest of your life, as long as you're getting new clients, new customers, new users, new inventory to search against, and new types of questions. We do that kind of thing during stand-up: hey, 10% of the questions were "other" for some reason, or questions around categories and date-times have 30% satisfaction, let's go double-click and figure out what the heck is going on.
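A sketch of that clustering and monitoring loop: embed the questions, cluster them so a human can name the segments, then track the share labelled "other" week over week. This assumes sentence-transformers and scikit-learn; the label set and cluster count are illustrative.

```python
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def cluster_questions(questions: list[str], n_clusters: int = 20) -> list[int]:
    """First pass: group questions so a human can look at each cluster and name it."""
    X = embedder.encode(questions)
    return KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(X).tolist()


def other_rate(labels: list[str]) -> float:
    """Weekly check: what fraction of labelled questions fell into 'other'?"""
    counts = Counter(labels)
    return counts.get("other", 0) / max(1, len(labels))
```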
Hamel: How do you teach your clients to look at data? I feel like it can be hard to.

Jason: Oh man, if I knew how to do that well, I'd really have a job. No, it's really hard. It really is data science, because it's basically about teaching engineers the scientific method. I've been thinking about it more as: what is a report you can create? What is the data you can collect? And then, what are some hypotheses?
Jason: And then, once you have those hypotheses. For example: I think we're not doing well because it looks like all these questions need recency filters and we don't have recency filters. That's the hypothesis. The experiment is to add recency filters, and the observable is that the satisfaction score should go up. A simple example: people had feedback that the regenerated transcript summaries felt off, and it seemed like the issue was just that they were too short. A 30-minute meeting and a 2-hour meeting got summaries of the same length. So we had that hypothesis, and I plotted transcript length against summary length, colored by satisfaction, and, oh, it looks like they're correlated. Then what we did in practice was just add a line to the prompt: make sure the summary length is proportional to the length of the transcript. We ran an eval over 50 transcripts: without that line, summaries averaged around 1,100 characters; with the line, about 2,000. Okay, that's my experiment. I deployed it to see if satisfaction went up, and it went up.
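A sketch of that experiment: run the same transcripts through the summarizer with and without the extra prompt line and compare average summary lengths before deploying. Here summarize is a stand-in for your generation call; its signature is hypothetical.

```python
from statistics import mean

EXTRA_LINE = "Make sure the summary length is proportional to the length of the transcript."


def length_experiment(transcripts: list[str], summarize) -> dict:
    """summarize(transcript, extra_instruction) -> summary text (hypothetical signature)."""
    baseline = [len(summarize(t, extra_instruction="")) for t in transcripts]
    treated = [len(summarize(t, extra_instruction=EXTRA_LINE)) for t in transcripts]
    return {"baseline_avg_chars": mean(baseline), "treated_avg_chars": mean(treated)}
```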
Jason: Right. It's hard to pin down exactly what the system is. But really the idea is: hypothesis, data, intervention, experiment; hypothesis, data, intervention, experiment.

Hamel: Seems like, yeah, seems like being a data scientist.

Jason: Yeah. The most popular question right now is vector databases versus cross-encoders. Oh, I think we answered that. It just comes back down to your evals and your latency constraints, as a function of your evals and your business outcomes. In recommendation systems, 100 ms of latency can be worth 1% of revenue. So it might just be the difference between a top-k of 1,000 with embeddings followed by a top 100 from the reranker, versus a top-k of 10,000 followed by a top 100 reranked. The second option might be slower, and even if it's 1% better on relevance, it might make less money in the long run.
Jason: Let me do this one. I had a client that wanted to ask questions of lease contracts, such as: what is every red flag in terms of late fees? A statement like that might appear 50 times in a document. How would you search for this? If I knew that up front, I would just write a synthetic text chunk, something like "what are the red flags around late fees," and have that in the index explicitly. If it turns out that 20% of the questions are about late fees, I might just have Opus look at a single PDF and say: give me all the statements about late fees in a single chunk, and then use that to augment my data set. You are creating the inventory you need in order to serve your customer. If you were an e-commerce website and everyone wanted to search for shorts, but you didn't stock shorts, you'd just have to go get some shorts. I think this is the same kind of example: I would take Opus or Haiku and run that extraction on every document. In practice you might do a first pass where, for every document that comes in, you predict whether or not it is a lease contract. If it is a lease contract, I know my customers care about red flags around late fees, so for every document I predict to be a lease contract, I extract the clauses around late fees. And then you put that back into your database.
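A sketch of that two-pass pipeline: first predict whether an incoming document is a lease contract, then extract late-fee clauses from the ones that are, and write those extractions back as synthetic chunks. This assumes the OpenAI chat API; the prompts and model name are illustrative, and index.add is a stand-in for however you write chunks to your search store.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"


def is_lease_contract(doc_text: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"Is the following document a lease contract? Answer yes or no.\n\n{doc_text[:4000]}"}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def extract_late_fee_clauses(doc_text: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   f"List every clause in this lease that relates to late fees, verbatim.\n\n{doc_text}"}],
    )
    return resp.choices[0].message.content


def ingest(doc_id: str, doc_text: str, index) -> None:
    if is_lease_contract(doc_text):
        # Store the extraction as its own synthetic chunk, tagged by topic.
        index.add(doc_id=doc_id, text=extract_late_fee_clauses(doc_text), tag="late_fees")
```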
Jason: Yeah, let's do about two more and then wrap this up. Any recommendation for vector stores, and how do you decide metadata for different cases? Again, it's very much about looking at the data; I think the answer is to look at the data and have real examples to work with. The only thing I would call out around vector stores is that you don't really need a vector store, you need a search engine, and that includes more than just vectors. You have BM25, you have SQL. If I want to do an aggregate, a vector database is not going to do an aggregate for me. So a vector store only addresses one part of the problem. Some companies might want DynamoDB, OpenSearch, and Pinecone in separate places; some might just use S3 files. Those are implementation details of your business.

Jason: I think the rest of these questions all have one-star ratings, so...
Jason: Okay, I'll do this one as a closing statement: how did you find your first customers in need of RAG consulting? My background is in recommendation systems, and one of my core beliefs is that recommendation systems and RAG systems are pretty much identical. The business I worked at before was Stitch Fix, where a customer sends an email to a stylist. That email is about a paragraph long, and the stylist uses AI to turn that paragraph into a bunch of search queries. We'd show the stylist around 100 pieces of clothing, the stylist picks the clothing, and then writes a note for the customer. It turns out that process is the same with an LLM instead of a stylist: a human sends a paragraph, we produce a hundred text chunks instead of a hundred pieces of clothing, and then the LLM generates the response where the stylist used to. So I just had a lot of experience instrumenting systems that look like that, monitoring and measuring systems that look like that, and improving systems that look like that, and I've been converting all of that RecSys experience into improving RAG applications. And yeah, that's all I've got.
Dan Becker: Alright, this was awesome. Thanks, Jason.

Hamel: Yeah, thank you for going so far over as well, and for taking the questions.

Jason: No worries. I think a lot of the remaining ones just have no thumbs-up, so I should be good. I'll save these and put them somewhere else. Alright, take care.