Back to Basics for RAG w/Jo Bergum


Dan Becker: Everyone.
Dan Becker: the Joe
Dan Becker: to him.
Hamel Husain: Hello!
Jo: Hello!
Hamel Husain: Really excited about this talk is also the link.
Jo: Yeah.
Hamel Husain: Last one of the conference.
Hamel Husain: So it's very special. So.
Jo: Thank you so much for inviting me.
Hamel Husain: Yes.
Jo: Fantastic. Yeah, it's amazing to see all the interest in your course, and the lineup of speakers. So I was really honored when you asked me to join. That's amazing.
Hamel Husain: No, I'm really honored to have you join.
Hamel Husain: So yeah, it's great.
Hamel Husain: I think your perspectives on RAG are very good,

Hamel Husain: so I'm excited about you sharing it more widely.
Jo: Thank you.
Dan Becker: We usually start about 5 after the hour, so we've got another 5 min.
Dan Becker: We have 23 people who are watching or listening to us now.
Dan Becker: It's hard to say with this being the last one, but I'm guessing we'll end up probably around a hundred
Dan Becker: map.
Jo: Sounds, great.
Hamel Husain: Trying to think. Is there a Discord channel for this talk?
Dan Becker: Yeah, I just made one.
Dan Becker: So it is
Dan Becker: posted a link in bergamag.
Dan Becker: and I just posted a link to it in general.
Hamel Husain: They're not sorted alphabetically. They're sorted by some kind of...
Hamel Husain: Okay, see you here.
Dan Becker: Could be chronologically based on when it was created. I was thinking about the same thing.

Dan Becker: That's not bad. Well, who even knows?
Dan Becker: Early on, Jo, I asked what people want to hear us talk about in these 5 min while some people are waiting.
Dan Becker: And the popular response was war stories. So either war stories, or something in your coding that has not worked or has gone wrong in the last week.
Jo: In the last week.
Jo: Hmm.
Jo: No, I don't have a lot of war stories for this week. But I've been trying out some new techniques for evaluating search results, so I'll share some of those results in this talk. You make some interesting findings, and then you also make some mistakes. I use Copilot a lot, and its autocompletions are basically from a couple of years ago, while the OpenAI APIs are changing. So, yeah, not that interesting though.
Dan Becker: Yeah, you know, with Copilot... With RAG, you do a lot of metadata filtering so that you try and get more recent results. And it feels to me that with large language models more broadly, it'd be nice to do something so that they try to autocomplete with newer results rather than older ones. You could imagine that when you calculate loss functions, there's a weight involved, and that weight is a function of when the training data is from. It'd be nice if there was something like that.
Dan Becker: But
Jo: Yeah, it's also interesting that for existing technologies like SQL databases, the completions are pretty good, both from ChatGPT and from general language models, because there's a lot of that data in their training data. But if you have a new product with some new APIs, the code completions don't work that well.
Jo: So that's why we at Vespa also try to build our own RAG solution on search.vespa.ai to help people use Vespa. One of the things that's been frustrating with these language models is that they are quite familiar with Elasticsearch, because Elasticsearch has been around for quite some time, but Vespa is newer in the public domain. So people are getting better completions for Elasticsearch than for Vespa. We have to do something about that.
Dan Becker: Damp.
Jo: Yeah, I see some great questions already, so that's fantastic. I'm not sure how much time we have, because there were quite a few invites, but I'm hoping to spend half an hour talking, and then we could have an open session. So drop your questions, that's awesome, so we can get the discussion going.
Jo: There are a lot of things to be excited about in search, and I'll cover some of them, especially around evaluations. The major bulk of this talk will be about setting up your own evaluations so that you can actually make changes, iterate on search, and measure the impact of that. It doesn't need to be very fancy to have something that you can actually iterate on. And thankfully, large language models can also help us there, thanks to recent advances. So I think that's interesting.



Jo: So I'll try to share my presentation and see if everything is working well.
Dan Becker: We can see it.
Jo: You see it. Okay.
Jo: Zoom is so much better than Meet.
Hamel Husain: Yeah, I agree with that.
Dan Becker: Yeah.
Dan Becker: it's I guess Google has fortunately bridge at 1 point. Yeah, they have, like 10 different solutions for.
Jo: Yeah, yeah.
Dan Becker: I think they they've probably consolidated them, but
Dan Becker: they haven't used that to make them dramatically better.
Jo: Yeah, because in Meet, when you present, everything just disappears. Here, I have the full view.
Jo: Yeah.
Jo: that's an improvement.
Dan Becker: Alright! You know, we're 5 after. We got about 100 people.
Dan Becker: do you? Wanna wait another minute or 2. That's great. But otherwise I think you can start anytime.
Jo: Yeah, sure, I can just get started. So thank you for having me. I'll talk about back to basics.
Jo: I'm Jo Kristian Bergum. About me: I'm a distinguished engineer, and I work at Vespa.ai. I've been at Vespa.ai for 18 years, actually, so I've been working in the search and recommendation space for about 20 years. Vespa.ai is basically a serving platform that was recently spun out of Yahoo. We've been open source since 2017. And in my spare time I spend some time on Twitter posting memes.
Jo: In this talk I'll talk about stuffing text into the language model prompt. I'll talk about information retrieval, the R in RAG. And most of the talk will be about evaluation of these information retrieval systems.

Jo: You can stuff things into the prompt that are not necessarily related to question answering or search. For example, if we are building a labeler or a classifier, we can also use retrieval to retrieve relevant examples out of our training data sets, right? It's not that often discussed that you can use retrieval this way. So let's say you have 1 billion annotated training examples. You can use retrieval to retrieve relevant examples, and then have the large language model reason around those and predict a label.
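As a rough illustration of retrieving labeled examples for a prompt, here is a minimal sketch. The bag-of-words similarity, the tiny `labeled` set, and the prompt wording are all assumptions made for this example; a real system would use a proper retrieval engine and then send the prompt to a language model.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (a stand-in for a real retriever's representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(text, labeled, k=2):
    """Return the k labeled examples most similar to the input text."""
    q = bow(text)
    return sorted(labeled, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)[:k]

def build_prompt(text, examples):
    """Assemble a few-shot labeling prompt from the retrieved examples."""
    parts = ["Label the final text, using the examples as guidance."]
    for ex_text, label in examples:
        parts.append(f"Text: {ex_text}\nLabel: {label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)

# Hypothetical annotated training examples.
labeled = [
    ("refund has not arrived yet", "billing"),
    ("app crashes when I log in", "bug"),
    ("how do I change my password", "account"),
    ("charged twice for one order", "billing"),
]
text = "I was charged twice and want a refund"
examples = retrieve_examples(text, labeled, k=2)
prompt = build_prompt(text, examples)
print(prompt)
```

The retrieved examples end up in the prompt right before the text to be labeled, so the model reasons over the most similar annotated cases rather than a fixed, hand-picked set.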

Jo: But most are thinking about RAG in the context of building the kind of question answering model that you see at Google, and all these chatbots and similar, where for an open-ended question you retrieve some hopefully relevant context, and you then stuff that into the prompt. And then, hopefully, the language model will generate a grounded response. It might not be hallucination-free, but some say that it improves the accuracy of the generation step.

Jo: So that's kind of demystifying it. And working with these systems, there's a kind of reference architecture. There's some orchestration component, there's some input, there's some output. Hopefully, you have some evaluation of that output. You're prompting different language models. And then you have state, which can be files, search engines, vector databases, regular databases, or even NumPy. So there's a lot of things going on there.

Jo: And there's a lot of hype around RAG, and also different methods for doing RAG. I'm on Twitter a lot, and I see all these Twitter threads: there's a new model, there are a lot of components in this machinery, lots of new tricks, "check out this". So it's a lot of hype.
Jo: I like to try to cut through that. What's behind this? How does this work on your data? Is this actually a model that has some backing from research? Have you actually evaluated it on some data set?
Jo: And I think if you're coming into this space, and you're new to retrieval, new to search, and new to language models, and you want to build something, there's a lot of confusing information going around.

Jo: I just saw this Twitter thread about RAG, about people losing faith in it: "we removed AI-powered search." I think there's been this idea that RAG is only about taking a language model from, for example, OpenAI, using their embeddings, and then you have a magical search experience, and that's all you need. I think it's naive to think that you can build a great product or a great RAG solution that way, just by using vector embeddings and language models.
Jo: The retrieval stack in this pipeline, the process of obtaining relevant information based on some query, has been around for decades, as Benjamin covered in his talk. A lot of the brightest minds have spent a lot of time on retrieval and search, right? Because it's so relevant across many multi-billion-dollar companies: recommendation services, search like Google and Bing, and whatnot. So this has always been a very hot and interesting topic.

Jo: And it's much deeper than encoding your text into one vector representation, and then that's it.
Jo: But I'll talk about how we can evaluate these information retrieval systems. Basically, you can treat the system as more or less a black box where you have put some data into it, and you have your retrieval system. You're asking that retrieval system a question, and you're getting back a ranked list of documents. Then you can evaluate the quality of these documents with regard to how relevant they are to the query.
Jo: And this is independent of what kind of retrieval method you're using, or combination, or hybrids, or phased ranking, or ColBERT, or SPLADE, or whatnot. You can evaluate any type of system. If it's using NumPy or files or whatnot, it doesn't really matter.

Jo: The basic idea of evaluating such a system is that you take a query and you retrieve documents, and then you can have a human annotator, for example, judge the quality of each of the documents. There are different ways of doing this. We can do it with a binary judgment, saying that this document is relevant for the query or not. Or we can have a graded judgment, where 0 means that the document is irrelevant for the query, 1 means it's slightly relevant, and 2 means it's highly relevant, right? We can also use this to judge the ranked lists that are coming out of recommendation systems or personalization, and many other systems that produce a ranked list.
Jo: In information retrieval, this goes back decades, and there are a lot of researchers working on this. You have TREC, the Text REtrieval Conference, which spans multiple different topics each year: news retrieval, all kinds of different retrieval tasks. MS MARCO, which maybe some of you are familiar with, is one of the largest data sets that you can publish research on. It's from Bing, actual real data, which is annotated, and a lot of these embedding models are trained on this data set.
Jo: Then we have BEIR, from Nils Reimers et al., which evaluates these types of models without actually using the training data, in the zero-shot setting. So there are many different collections.
Jo: And then there are metrics that can measure how well the retrieval system is actually working. Recall at K, for example, where K means a position in the ranked list, so K could be 10 or 20 or 100 or 1,000. It's a metric that focuses on this: you know that there are, say, 6 items that are relevant for this query, and are we actually retrieving those 6 relevant documents into the top K? In most systems, you don't actually know how many relevant documents there are in the collection. At web scale, there might be millions of documents that are relevant to the query. So unless you have really good control of your corpus, it's difficult to know which documents in the collection are actually relevant.
Jo: Precision is much easier, because we can look at the results and ask: are there any irrelevant hits in the top K? But precision is not really rank-aware. It doesn't care whether a relevant hit is placed at position 1 or 10; the precision at 10 would be the same. It doesn't depend on the position.
Jo: nDCG is a very complicated metric, but it tries to incorporate the graded labels and also awareness of the rank position. If you want to look it up, you can basically go to Wikipedia, but it's quite an advanced metric.
Jo: Reciprocal rank measures where the first relevant hit is in the ranking. If you place a relevant hit at position 1, you have a reciprocal rank of 1. If the first relevant hit is at position 2, you have a reciprocal rank of 0.5.
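To make the metrics mentioned here concrete, the following is a minimal sketch of precision at K, recall at K, reciprocal rank, and nDCG over one ranked list. The document ids and grades are made up, and the nDCG gain `2**grade - 1` with a `log2` discount is one common formulation, chosen here as an assumption.

```python
import math

def precision_at_k(ranked, judgments, k):
    """Fraction of the top-k hits that are judged relevant (grade > 0)."""
    return sum(1 for d in ranked[:k] if judgments.get(d, 0) > 0) / k

def recall_at_k(ranked, judgments, k):
    """Fraction of all known relevant documents that appear in the top k."""
    relevant = {d for d, g in judgments.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def reciprocal_rank(ranked, judgments):
    """1 / position of the first relevant hit, or 0 if there is none."""
    for pos, d in enumerate(ranked, start=1):
        if judgments.get(d, 0) > 0:
            return 1.0 / pos
    return 0.0

def dcg(grades):
    """Discounted cumulative gain: graded gain, discounted by rank position."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked, judgments, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [judgments.get(d, 0) for d in ranked[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded judgments for one query: 0 = irrelevant, 1 = slightly, 2 = highly relevant.
judgments = {"d1": 2, "d2": 0, "d3": 1, "d4": 0}
ranked = ["d2", "d1", "d4", "d3"]  # what the retrieval system returned

print(precision_at_k(ranked, judgments, 2))
print(recall_at_k(ranked, judgments, 2))
print(reciprocal_rank(ranked, judgments))
print(ndcg_at_k(ranked, judgments, 4))
```

Unjudged documents are treated as irrelevant here, which is a simplification; with a changing collection that choice can hide real changes in ranking quality.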
Jo: Then, of course, you have LGTM, "looks good to me", maybe the most common metric used in the industry.
Jo: And of course, in industry you also have other evaluation metrics like engagement: clicks, if you're measuring what users are actually interacting with in the search; dwell time; or in e-commerce, add to cart. All these kinds of signals you can feed back. And of course, revenue. E-commerce search, for example, is not only about relevancy; you also have some objectives for your business.
Jo: I'd also like to point out that most of the benchmarks are comparing just a flat list of queries. When you're evaluating each of these queries, you get a score for each query, and then you take the average to come up with one number for the whole retrieval method. But in practice, in production systems, you will see that maybe 20% of the queries actually contribute 80% of the volume. So you have to think a little bit about that when you're evaluating systems.
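A small sketch of why the averaging choice matters: the same per-query scores give different overall numbers depending on whether each query counts once or is weighted by how often it is issued. The scores and traffic counts below are invented for illustration.

```python
def macro_average(scores):
    """Benchmark-style average: every distinct query counts equally."""
    return sum(scores.values()) / len(scores)

def traffic_weighted_average(scores, counts):
    """Average weighted by how many times each query was actually issued."""
    total = sum(counts[q] for q in scores)
    return sum(scores[q] * counts[q] for q in scores) / total

# Hypothetical per-query metric values (for example nDCG) and query volumes.
scores = {"q_head": 0.9, "q_tail_1": 0.3, "q_tail_2": 0.3}
counts = {"q_head": 800, "q_tail_1": 100, "q_tail_2": 100}

print(macro_average(scores))
print(traffic_weighted_average(scores, counts))
```

Neither number is the right one in general. Looking at both shows whether a change helps the few high-volume queries, the long tail, or both.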
Jo: So, to do better than "looks good to me", you really have to measure how you're doing. You have all these benchmarks, MTEB and whatnot, but they don't necessarily transfer to your domain or your use case. If you're building a RAG application or a retrieval application over code, or documentation, or a specific health domain, or products, those are different domains and different use cases: your data, your queries.
Jo: The solution to do better is to measure, by building your own relevancy data set. And it's actually not that hard. If you have a service in production, look at what users are actually searching for, look at the results, and put in a few hours and judge the results. Are they actually relevant? Are you producing relevant results? It doesn't really need to be fancy at all. And if you don't have traffic, if you haven't launched yet, you have obviously played around with the product.

Jo: Or you can also present some of your content to a large language model and ask it: what's a question that would be natural for a user to ask to retrieve this passage? So you can bootstrap even before you have any user queries.
Jo: And, as I said, it doesn't need to be fancy. There are some fancy tools for doing this, with user interfaces and Docker and whatnot, but a simple TSV, a tab-separated file, will do the trick. Preferably you will have a static collection, but not everybody has that luxury. The reason why you would like to have a static collection is this: when you are judging the results, you say that for this query, for instance query ID 3, document D5 is relevant. Later, when we compute the metric for the query, a new document might suddenly appear in the ranking, relevant or irrelevant, that has no judgment. So the ranking might change without us being able to pick it up. That's why you prefer to have static collections. All the information retrieval data sets are usually static, right? They don't change, so we can evaluate methods and practices over time.
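A minimal sketch of the tab-separated judgment file idea follows. The column names (`query_id`, `doc_id`, `grade`) and the ids are assumptions for this example, not a format prescribed in the talk. The second helper flags documents in a ranking that have no judgment, which is the situation that arises when the collection is not static.

```python
import csv
import io
from collections import defaultdict

# Hypothetical contents of a judgments.tsv file, inlined so the sketch is self-contained.
TSV = "query_id\tdoc_id\tgrade\nq3\td5\t2\nq3\td9\t0\nq7\td1\t1\n"

def load_judgments(handle):
    """Read judgments into {query_id: {doc_id: grade}}."""
    judgments = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        judgments[row["query_id"]][row["doc_id"]] = int(row["grade"])
    return dict(judgments)

def unjudged(ranked, query_judgments):
    """Documents in the ranking that nobody has judged yet."""
    return [d for d in ranked if d not in query_judgments]

judgments = load_judgments(io.StringIO(TSV))
new_docs = unjudged(["d5", "d12", "d9"], judgments["q3"])
print(judgments["q3"], new_docs)
```

With a real file, `open("judgments.tsv", newline="")` would replace the `io.StringIO` wrapper. Tracking the unjudged documents per query gives a short list of what to label next.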
Jo: But you can also use language models in this process to judge the results. There's been interesting research coming out of the Microsoft Bing team over the last year, where they find that with some prompting techniques, large language models can be pretty good at judging queries and passages. Given a passage that is retrieved for the query, they ask the language model: is this relevant or not relevant? And they find that this actually correlates pretty well with human judgments. If you find a prompt combination that correlates with your golden data set, then you can start using this at a more massive scale.

Jo: And here's a very recent paper, which came out 8 days ago, where they also demonstrated that this prompt could work very well to assess the relevancy of passages for queries.

Jo: This can free us from having this static golden data set, because we could instead start sampling real user queries and then ask the language model to evaluate the results. So I think this is a very interesting direction.
Jo: In our Vespa RAG documentation search, I built a small golden set with about 90 query-passage judgments. I just ran them through with this prompt, or a similar prompt, and I'm getting quite good correlation between how I'm judging the results and how GPT-4 is judging them. That's good, because it means that I can now judge more results much more cheaply, and then potentially also use this to adjust ranking.
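One way to check whether a language-model judge agrees with a hand-labeled golden set is a chance-corrected agreement score such as Cohen's kappa. The sketch below is an illustration under assumptions: the prompt wording is hypothetical (not the prompt from the research mentioned), and the `llm` labels are hard-coded stand-ins for what a model call would return.

```python
from collections import Counter

def build_judge_prompt(query, passage):
    """Hypothetical relevance-judging prompt; tune the wording against your golden set."""
    return (
        "Given a query and a passage, answer 1 if the passage is relevant "
        "to the query, otherwise answer 0.\n"
        f"Query: {query}\nPassage: {passage}\nAnswer:"
    )

def cohen_kappa(a, b):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Binary judgments for the same query-passage pairs (made-up values).
human = [1, 1, 0, 0, 1, 0, 1, 0]
llm = [1, 1, 0, 1, 1, 0, 1, 0]  # stand-in for parsed model answers

kappa = cohen_kappa(human, llm)
print(kappa)
```

If the agreement is high on the golden set, the same prompt can be applied to a larger sample of real queries; if it is low, the prompt or the grading scale needs another iteration first.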

Jo: Because when you have this kind of data set, you can also iterate and make changes, and then you can see how it's actually performing. So instead of saying, "Oh, we changed something", you can say, "We deployed this change, and it increased nDCG by 30%." And this is from our Vespa documentation search, which is relevant for us; it's our domain. You see here that "semantic" is off-the-shelf vector embedding models, and then there are different ways in Vespa to do hybrids. I won't go into the details, but now I actually have numbers on it when I'm making changes.

Jo: So that's about evaluation. Independent of the method or technique that we're using, we can evaluate the results coming out of the retrieval system.
Jo: Now I want to talk a little bit about the representational approaches and scoring functions that can be used for efficient retrieval.



Jo: The motivation for having this kind of representational approach is that you want to avoid scoring all the documents in the collection. Some of you might have heard about the Cohere reranking service, or these kinds of ranking services, where you basically input the query and all the documents, and they go and score everything. But then you have everything in memory; you've already retrieved the documents. Imagine doing that at web scale, or with 100 million documents. It's not possible, right? It's similar to doing a grep.
Jo: So instead, we would like to have some technique for representing these documents so that we can index them. Then, when the query comes in, we can efficiently retrieve over this representation, in sublinear time, the top-ranked docs, and feed those into subsequent ranking phases.

Jo: There are 2 primary representations. One is the sparse representation, where the total vocabulary is the dimensionality of the whole sparse vector, but for a given query or a given document, only the words that actually occur in that document or query have a non-zero weight. This can be efficiently retrieved over using algorithms like WAND or MaxScore and inverted indexes. You're all familiar with Elasticsearch or other keyword search technologies; they build on this. More recently, we also have neural sparse embedding models, so that instead of an unsupervised weighting that is just based on your corpus statistics, you can use transformer models to learn the weights of the words in the queries and the documents.
Jo: And then you have dense representations. This is where you have text embedding models, where you take some text and encode it into a latent embedding space, and you compare queries and documents in this latent space using some kind of distance metric. There you can build indexes using different techniques, vector databases, different types of algorithms. In this case you can also accelerate search quite significantly, so that you can search even billion-scale data sets in milliseconds, single-threaded.
Jo: But the downside is that there are a lot of trade-offs, related to the fact that the search is not exact. It's an approximate search, so you might not retrieve exactly the ones that you would if you did a brute-force search over all the vectors in the collection.
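The exact brute-force search that approximate indexes are compared against can be sketched in a few lines. The 2-dimensional vectors and the "approximate" result below are invented for illustration; no real index is built here, only the exact baseline and an overlap measure against it.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def exact_top_k(query, vectors, k):
    """Brute force: score every vector in the collection, keep the k best."""
    ranked = sorted(vectors, key=lambda doc_id: cosine(query, vectors[doc_id]), reverse=True)
    return ranked[:k]

def overlap_at_k(approx_ids, exact_ids):
    """Share of the exact top-k that an approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy document embeddings.
vectors = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
exact = exact_top_k([1.0, 0.1], vectors, k=2)

# Pretend an approximate index returned these ids for the same query.
approx = ["a", "c"]
print(exact, overlap_at_k(approx, exact))
```

Brute force is linear in the collection size, which is why it only serves as a reference point at scale; the overlap number quantifies what an approximate index gives up for speed.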
Jo: These representations are mainly supervised through transfer learning, because you're typically using an off-the-shelf embedding model that's been trained on some other data sets, and then you're trying to apply that to your data. You can fine-tune it if you have relevancy data and so forth. Then it's no longer zero-shot or transfer learning, but it's still a learned representation.
Jo: And I think these representations, and the whole ChatGPT and OpenAI embeddings wave, really opened the world of embeddings to a lot of developers. And this stuck for quite some time, and it's still stuck, I think, because people think that this will give you a magical AI-powered representation.

Jo: It's not bad. And you can now use a lot of different technologies for implementing search: vector databases, regular databases. Everybody now has vector search support, which is great, because you can now use a wider landscape of different technologies to solve search.
Jo: But there are some challenges with these text embedding models, especially because of the way they work.

Jo: Most of them are based on an encoder-style transformer model, where you take the input text and tokenize it into a fixed vocabulary. From the pre-training stage and the fine-tuning stage, you have learned representations of each of these fixed tokens. Then you feed them through the encoder network, and for each of the input tokens you have an output vector. And then there's a pooling step, typically averaging, into a single vector representation. This is how you represent not only one word but a full sentence, or even, with the embedding models coming out today, several books encoded as one vector.
Jo: But the issue with this is that the representation becomes quite diluted when you average everything into one vector, which has proven not to work that well for high-precision search.
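The dilution effect of average pooling can be shown with toy numbers. In this sketch the 2-dimensional "token vectors" are invented (one on-topic token among off-topic filler); real encoder outputs are contextual and high-dimensional, so this only illustrates the averaging arithmetic, not a real model.

```python
import math

def mean_pool(token_vectors):
    """Average the per-token output vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

topic = [1.0, 0.0]   # toy vector for the one token the query cares about
filler = [0.0, 1.0]  # toy vector for unrelated tokens
query = [1.0, 0.0]

short_text = mean_pool([topic, filler])       # 2 tokens
long_text = mean_pool([topic] + [filler] * 9)  # 10 tokens

short_sim = cosine(query, short_text)
long_sim = cosine(query, long_text)
print(short_sim, long_sim)
```

The on-topic token is identical in both texts, yet its contribution to the pooled vector shrinks as more tokens are averaged in, so the longer text scores lower against the query.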
Jo: So you have to have some kind of chunking mechanism in order to have a better representation for search. And then there's this fixed vocabulary. Especially for BERT-based models, you're basing it off a vocabulary that was trained in 2018, so there are a lot of words that it doesn't know. We had one issue here with a user that was searching for our recently announced support for running inference with GGUF models in Vespa. This has a lot of out-of-vocabulary words, so it gets mapped to different concepts, and this might produce quite weird results when you are mapping it into the latent embedding space.
Jo: And then the final question is: does this actually transfer to your data, to your queries? The evaluation routines that I talked about earlier will give you the answer to that, because then you can actually test whether they're working or not.
Jo: Also, I think it's quite important to establish some baselines. In the information retrieval community, the de facto baseline is BM25. BM25 is a scoring function where you tokenize the text, with linguistic processing and so forth. It's well known and implemented in multiple mature technologies like Elasticsearch, Vespa, Tantivy, whatnot. I think there was even a library announced today, BM25 in Python.

Jo: It builds a model, unsupervised, from your data, looking at the words that occur in the collection: how many times a word occurs in a document, and how frequent the word is in the total collection. That is the scoring function. It's very cheap, with a small index footprint. And most importantly, you don't have to invoke a transformer embedding model like a 7B Llama model or something like that, which is quite expensive.
Jo: It has limitations, but it can avoid these spectacular failure cases of embedding retrieval related to out-of-vocabulary words.
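For readers who have not seen the scoring function written out, here is a minimal BM25 sketch. It assumes whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`; production engines add linguistic processing and inverted indexes that are left out here. The documents are made up.

```python
import math
from collections import Counter

def tokenize(text):
    """Whitespace tokenization; real engines add stemming, language handling, etc."""
    return text.lower().split()

class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]       # term counts per document
        self.lengths = [sum(tf.values()) for tf in self.tfs]  # document lengths
        self.avg_len = sum(self.lengths) / len(docs)
        df = Counter(t for tf in self.tfs for t in tf)        # document frequency per term
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf, length = self.tfs[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def rank(self, query):
        """Document indexes ordered by descending score."""
        return sorted(range(len(self.tfs)), key=lambda i: self.score(query, i), reverse=True)

docs = [
    "running inference with gguf models in vespa",
    "hybrid search combines keyword and vector retrieval",
    "vector databases index dense embeddings",
]
bm25 = BM25(docs)
ranking = bm25.rank("gguf models")
print(ranking, bm25.score("gguf models", 0))
```

Because matching is on exact tokens, a rare term such as "gguf" gets a high IDF weight and finds its document directly, with no dependence on what an embedding model's vocabulary happens to contain.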
Jo: The huge downside is that if you want to make this work in CJK languages, or Turkish, or other types of languages, you need to have some kind of tokenization integrated, which you will find in engines like Elasticsearch, OpenSearch, or Vespa.

Jo: And long context. We made an announcement earlier this year about supporting ColBERT in a specific way. I'm just including this to show you results on long-context documents; I think they're around 3K tokens long. The researchers evaluated different models, and they were presenting results for M3, which scores 48.9 in this diagram, comparing it with OpenAI embeddings, Mistral, and different types of embedding models. And then we realized that this is actually quite easy to beat using just a vanilla BM25 implementation, whether Lucene, Vespa, Elasticsearch, or OpenSearch.
Jo: So having the mindset that you can evaluate and actually see what works, and remembering that BM25 can be a strong baseline, I think that's an important takeaway.
Jo: Then there's the hybrid alternative. You see a lot of enthusiasm around that, where you combine these representations, and it can overcome this fixed vocabulary issue with regular embedding models.

Jo: But it's also not a silver bullet. There's reciprocal rank fusion, and other methods to fuse these different methods. It really depends on the data and the type of queries. But if you have built your own evals, then you don't have to listen to me about what you should do, because you can actually evaluate and test things out, and you can iterate on it. I think that's really critical to be able to build better RAG, to improve the quality of the retrieval phase.
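Reciprocal rank fusion, mentioned here as one way to fuse rankings, is short enough to sketch in full. Each document gets `1 / (k + rank)` from every ranking it appears in, and the sums are sorted. The constant `k=60` is the value commonly used with this method; the two input rankings are invented for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists using only rank positions, not raw scores."""
    scores = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["d1", "d2", "d3"]  # e.g. from a keyword ranking
dense = ["d3", "d1", "d4"]    # e.g. from an embedding ranking
fused = reciprocal_rank_fusion([keyword, dense])
print(fused)
```

Working on ranks sidesteps the problem that keyword scores and embedding similarities live on different scales, but it also discards how confident each method was. Whether that trade-off helps is something to check against your own evals.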

Jo: Yeah. And of course, I talked about long context. With the long-context models, we all want to get rid of chunking. We all want to get rid of all the videos about how to chunk. But the short answer is that you do need to chunk in order to have meaningful representations of text for high-precision search.
Jo: Nils Reimers, the de facto embedding expert, says that if you go above 256 tokens, you start to lose a lot of precision, right? There are other use cases that you can use these embeddings for, like classification; there are a lot of different things. But for high-precision search it becomes very diluted because of these pooling operations. And also, there are not that many great data sets with longer texts that you can actually train models on.
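A minimal chunking sketch follows, capping each chunk at 256 tokens as in the rule of thumb above. The overlap between neighbouring chunks is an assumption added for this example, and the tokens here are placeholder strings; in practice the embedding model's own tokenizer would define the token counts.

```python
def chunk_tokens(tokens, max_tokens=256, overlap=32):
    """Split a token list into windows of at most max_tokens, with some overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

tokens = [f"t{i}" for i in range(600)]  # placeholder for a tokenized document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk would then be embedded separately. The chunks can still belong to one logical document, for example as several vectors stored with the same record, rather than becoming separate rows.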
Jo: And even if you're chunking to have a meaningful representation, it doesn't mean that you have to split this into multiple rows in your database. There are technologies that allow you to index multiple vectors per row, so that's possible.
Jo: Finally, real-world RAG. I'm not sure if you've seen this, but there was a huge Google leak earlier in May, where a lot of different signals were revealed. In real-world search, it's not only about text similarity. It's not only about BM25 or single-vector cosine similarity. There are things like freshness, authority, quality, PageRank, which you've heard about, and also revenue. So there are a lot of different features.

Jo: And GBDT is still a simple, straightforward method, and it's still the king of tabular features, where you have specific named features and values for them. So combining GBDT with these new neural features is quite effective when you're starting to operate in the real world.
Jo: So, quick summary. I think that information retrieval is more than just a single vector representation. If you want to improve your retrieval stage, you should look at building your own evals. And please don't ignore the BM25 baseline. Choosing some technology that has hybrid capabilities, meaning that you can do exact search for the exact tokens and still have matches, and also combine the signals from text search via keywords and text search via embeddings, can avoid some of these failure modes that I talked about. And finally, real-world search is more than text similarity.



Jo: So that's what I had, and I'm hoping for questions. If you want to check out some resources, I do a lot of writing on blog.vespa.ai, so you can check that out. And if you hated it, you can tweet me at @jobergum on Twitter. I'm quite active there, so I'd appreciate it: if you hated it, you can mention me there. You can also contact me on Twitter. I love getting questions.
Hamel Husain: That's a that's a bold call to action. If you hated it.
Jo: It is.

Hamel Husain: We definitely have a lot of questions. I'll just go through some of them.
Jo: Food.
Hamel Husain: What kind of metadata is most valuable to put into a vector DB for doing RAG?
Jo: Yeah, if you look at only the text domain, if you're only concerned about text, you have no freshness component and you don't have any authority. But if you, for example, are building a healthcare application where users are asking health questions, you definitely want to have some kind of filtering: what are the authoritative sources within health? You don't want to drag up Reddit or things like that, right? And title and other metadata, of course. But it really depends on the use case, whether it's a text-only use case or more like the real world, where you have different types of signals.
Hamel Husain: Makes sense.
Hamel Husain: Do you have any thoughts on calibration of the different indices? Not only are different document indices not aligned in terms of similarity scores, but it's also nice to have confidence scores for how likely the recommendation is to be good.
Jo: Yeah, I think that's a very tough question. These different methods, or scoring functions, you could call them, have different distributions, different shapes, and different score ranges. So it's really hard to
Jo: combine them, and they're not probabilities either. So it's very difficult to map them into a probability that a hit is actually relevant, or to use them for filtering. People want something like, "Oh, I have a cosine similarity filter at 0.8," but that differs between types of models. Combining them
Jo: is also a learning task. You need to learn the parameters, and GBDT is quite good at that, because you're learning a nonlinear combination of these different features. But in order to do that, you also have to have training data.
Jo: But the way I described here for doing evaluation can also help you
Jo: generate training data for
Jo: training ranking models.
Hamel Husain: So does the calibration really turn into a hyperparameter tuning exercise with your eval set? Or does that kind of...
Jo: Yeah, you could do that, right? If you don't have any data that you can train a model on to learn those parameters, you could do a hyperparameter sweep and then basically check whether your eval is improving or not.
Jo: But if you want to apply more of an ML technique to this, then you would either do what Google is doing, gathering searches, clicks, and interactions, or, as we now see more and more, use large language models to generate synthetic training data. So you can distill the powers of the larger models into smaller models that you can use for ranking purposes. But it's a very broad topic, I think,
Jo: so it's very difficult to deep dive. And it's very difficult to say, "Oh, you should have a cutoff of 0.8 on the vector similarity," or, "You can do this transformation." There are no really great tricks for doing this without having some kind of training data, and at least some evaluations.
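
The sweep Jo describes can be sketched roughly as follows. This is a minimal illustration, not anything shown in the talk: the tiny eval set, the min-max normalization, the single blending weight `alpha`, and the choice of MRR as the metric are all assumptions made for the example.

```python
# Hypothetical eval set: per query, candidate docs with (bm25, cosine) scores
# and the set of docs judged relevant. All values are made up.
EVAL = [
    {"scores": {"d1": (12.0, 0.82), "d2": (3.0, 0.91), "d3": (9.0, 0.40)},
     "relevant": {"d1"}},
    {"scores": {"d4": (1.0, 0.88), "d5": (7.5, 0.86), "d6": (8.0, 0.30)},
     "relevant": {"d5"}},
]


def min_max(values):
    """Rescale scores to [0, 1] within one query (an assumed normalization)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank(scores, alpha):
    """Order docs by alpha * bm25 + (1 - alpha) * cosine, both normalized."""
    docs = list(scores)
    bm25 = min_max([scores[d][0] for d in docs])
    cos = min_max([scores[d][1] for d in docs])
    combined = {d: alpha * b + (1 - alpha) * c for d, b, c in zip(docs, bm25, cos)}
    return sorted(docs, key=lambda d: combined[d], reverse=True)


def mrr(alpha):
    """Mean reciprocal rank of the first relevant doc over the eval set."""
    total = 0.0
    for q in EVAL:
        for pos, doc in enumerate(rank(q["scores"], alpha), start=1):
            if doc in q["relevant"]:
                total += 1.0 / pos
                break
    return total / len(EVAL)


def sweep(steps=10):
    """Evaluate every alpha on a grid and return {alpha: metric}."""
    return {i / steps: mrr(i / steps) for i in range(steps + 1)}


results = sweep()
best_alpha = max(results, key=results.get)
```

On this toy data neither signal alone ranks both relevant documents first, while a blend does. The best weight is a property of the eval set, not a constant to carry over to other data.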
Hamel Husain: What are your observations on
Hamel Husain: the efficacy of rerankers? And do you usually recommend using a reranker?
Jo: Yeah. The great thing about rerankers is that in phased retrieval and ranking pipelines, you're gradually throwing away hits using this kind of representational approach, and then you can have a gradual approach where you're investing more compute into fewer hits and still stay within your latency budget. And the great thing about rerankers like Cohere's, and you can deploy them in Vespa as well,
Jo: is that they offer this kind of token-level interaction.
Jo: Because you input both the query and the document at the same time through the transformer network, you have token-level interactions. You're no longer interacting between the query and the document through a vector representation; you're actually feeding all the tokens of the query and the document into the model. So yeah, that definitely can help accuracy. But then it becomes a question about cost and latency and so forth.
Jo: So there are a lot of tradeoffs in this. But if you're only looking at accuracy and you can afford the additional cost, then yeah, they definitely can help.
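
The phased idea, cheap scoring over many hits and then expensive scoring over a few, can be sketched as below. The two scoring functions are toy stand-ins invented for this example: `cheap_score` plays the role of a first-phase ranker such as BM25, and `expensive_score` plays the role of a cross-encoder reranker. Neither is a real model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def cheap_score(query: str, doc: str) -> float:
    """Toy first-phase score: number of doc tokens that occur in the query."""
    q_terms = set(query.lower().split())
    return float(sum(1 for t in doc.lower().split() if t in q_terms))


def expensive_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: counts query bigrams found in the doc,
    so it sees word order, which the cheap score does not."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return float(sum(1 for bg in zip(q, q[1:]) if bg in doc_bigrams))


def phased_rank(query: str, docs: List[str], first: Scorer, second: Scorer,
                rerank_count: int) -> List[str]:
    """Rank everything with the cheap scorer, then re-order only the top
    `rerank_count` hits with the expensive scorer."""
    candidates = sorted(docs, key=lambda d: first(query, d), reverse=True)
    head = sorted(candidates[:rerank_count],
                  key=lambda d: second(query, d), reverse=True)
    return head + candidates[rerank_count:]


docs = [
    "bank of the river",
    "river bank erosion",
    "the bank raised rates",
    "weather report",
]
ranked = phased_rank("river bank", docs, cheap_score, expensive_score,
                     rerank_count=2)
```

Here `rerank_count` is the knob that trades accuracy against cost and latency: the expensive scorer runs on only that many hits per query.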

Jo: Yep.
Hamel Husain: William Horton is asking: do you have advice on combining usage data along with semantic similarity? Like, if I have a number of views or some kind of metadata like that for a document.
Hamel Husain: Yeah, I want to say.
Jo: Goal is.
Hamel Husain: To, the.
Jo: Yeah, it goes more into... if you have interaction data, it becomes more of a learning-to-rank problem.
Jo: You first need to come up with labels from those interactions, because there are going to be multiple interactions. There's going to be add-to-cart, and different actions will have different weights. So the standard procedure is that you convert that data into a labeled dataset, similar to what I showed here in the eval.

Jo: So when you convert that to a labeled dataset, then you can train a model, for instance a GBDT model, where you can include
Jo: the semantic score as a feature as well.
Jo: Yeah.
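
As a rough illustration of turning interaction logs into labels, here is a minimal sketch. The action names, the weights, and the grade thresholds are all hypothetical; in practice they would be chosen and validated against your own data.

```python
from collections import defaultdict

# Hypothetical weights: stronger actions count more toward relevance.
ACTION_WEIGHTS = {
    "impression": 0.0,
    "click": 1.0,
    "add_to_cart": 3.0,
    "purchase": 5.0,
}


def grade(total: float) -> int:
    """Bucket an accumulated weight into a 0-3 relevance grade.
    The thresholds are arbitrary choices for this example."""
    if total <= 0:
        return 0
    if total < 2:
        return 1
    if total < 5:
        return 2
    return 3


def interactions_to_labels(events):
    """Aggregate (query, doc, action) events into {(query, doc): grade}."""
    totals = defaultdict(float)
    for query, doc, action in events:
        totals[(query, doc)] += ACTION_WEIGHTS.get(action, 0.0)
    return {pair: grade(total) for pair, total in totals.items()}


events = [
    ("shoes", "a", "impression"),
    ("shoes", "a", "click"),
    ("shoes", "a", "purchase"),
    ("shoes", "b", "impression"),
    ("shoes", "c", "click"),
    ("shoes", "c", "add_to_cart"),
]
labels = interactions_to_labels(events)
```

Each labeled (query, document) pair can then become one training row for a ranking model, with the semantic score included as one feature among others.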
Hamel Husain: Alright. Someone's asking the question
Hamel Husain: that you may not be familiar with, but I was going to give it a shot. It's a reference to someone else. What are your thoughts on Jason Liu's post about the value of generating structured summaries and
Hamel Husain: reports for decision makers, instead of doing RAG the way it's commonly done today? Have you seen that? Are you familiar with it?
Jo: Yeah, I mean, Jason is fantastic. I love Jason, but he's a high-volume tweeter, so I don't read everything. I haven't caught up on that yet. No, sorry.
Hamel Husain: Okay, no worries. I don't wanna try to rehash it from my memory either.
Jo: Sounds.
Hamel Husain: Just skip that one.
Hamel Husain: What are some of your favorite advancements recently in text embedding models or other search technologies
Hamel Husain: that people
Hamel Husain: I'll just stop the question there, yeah, what are your?
Hamel Husain: Yep. What? Yeah. Yeah.
Jo: Yeah, I think embedding models will
Jo: become better. What I do hope is that we can have models that have a larger vocabulary,
Jo: like the Llama vocabulary. BERT models have this old vocabulary from 2018. I would love to see a new kind of DeBERTa model trained on more recent data, with more recent techniques, with a pre-training stage that includes a larger vocabulary for tokenization.
Jo: As I said, I'm not
Jo: too hyped about increasing the context length, because all the information retrieval research shows that
Jo: these models are not that good at generalizing a long text into a good representation for high-precision search.
Jo: So I'm not so excited about the direction of just larger and larger context windows for embedding models, because I think it's the wrong direction. I would rather see
Jo: larger vocabularies and better pre-trained models like DeBERTa. It's still a good model for embeddings.
Hamel Husain: Someone's asking: does query expansion of out-of-vocabulary words with BM25 work better at search? And to add on to that, do you think people are
Hamel Husain: going as far with classical search techniques as they should,
Hamel Husain: you know, things like query expansion and all kinds of other stuff that has been around for a while?

Hamel Husain: Like, what's your feeling about the spectrum?
Jo: I think you can get really good results by starting with BM25 and classical search, and adding a reranker on top of that.
Jo: You won't get the magic: if you have a single-word query
Jo: and that word isn't in your collection, then you might fail at recall. But you don't get into the really
Jo: nasty failure modes of embedding vector search alone.
Jo: And yeah, definitely, there are techniques like query expansion and query understanding, and language models are also quite good at this. There's a paper from Google where they did query expansion with Gemini, and it worked
Jo: pretty well, though not amazingly well compared to the size of the model and the additional latency. But we have much better tools for doing query expansion now, all kinds of fancy techniques involving prompting of large language models.
Jo: So that, too, is really interesting for expansion. That's another way. But
Jo: like in the diagram, with this machine and all these components, what I'm hoping people can take away from this is that if you're wondering about this technique or that technique you read about,
Jo: if you put that into practice in a more systematic way, having your own eval, you will be able to answer those questions on your data, on your queries, without me saying that the threshold should be 0.6, which is bullshit, because I don't know your queries or your domain or your data.
Jo: So by building these evals, you can actually iterate and get the answers.
Hamel Husain: In one slide you mentioned limitations of fixed vocabulary with text that is chunked poorly. How do you overcome these sorts of limitations in a domain that uses a lot of jargon and doesn't tokenize well with an out-of-the-box model?
Jo: Yeah, then you're out of luck with the regular embedding models. And that's why the hybrid capability matters, where you can actually combine keyword search with the embedding retrieval mechanism. But the hard thing is to understand when to completely ignore the embedding results,
Jo: because embedding hits will be retrieved no matter how far away they are in the vector space, right? So when you're asking for the 10 nearest neighbors, they might not be so near, but you're still retrieving some junk.
Jo: And then it's important to understand that this is actually junk, so that you don't use techniques like reciprocal rank fusion, which some vendors sell as the full-blown solution to all of this. Because then you're just blending rubbish into something that could be reasonable from the keyword search.
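
To make the point concrete, here is a minimal sketch of reciprocal rank fusion, using the common formulation where each list contributes 1 / (k + rank). The document IDs, the similarity values, the constant `k=60`, and the `gate` helper are assumptions for illustration, and the gating threshold is exactly the kind of number that would have to come from your own eval.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per document.
    Only rank positions are used; the underlying scores are ignored."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def gate(hits, min_sim):
    """Drop embedding hits below a similarity threshold before fusing.
    `min_sim` is a placeholder that must be tuned on your own data."""
    return [doc for doc, sim in hits if sim >= min_sim]


# Keyword search found two exact matches for a jargon-heavy query.
keyword_hits = ["k1", "k2"]

# Nearest-neighbor search always returns something, even when nothing is
# close: every similarity here is low (made-up values).
embedding_hits = [("junk1", 0.21), ("junk2", 0.20), ("k2", 0.19)]

# Plain RRF sees only ranks, so the far-away neighbors are blended in.
fused = rrf([keyword_hits, [doc for doc, _ in embedding_hits]])

# Gating first keeps the junk out of the fused list.
gated = rrf([keyword_hits, gate(embedding_hits, min_sim=0.5)])
```

Because rank fusion discards the scores, the top embedding hit gets the same contribution whether its similarity is 0.95 or 0.21, which is why the junk survives.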

Jo: Another alternative, which might be a bit of a stopgap, is to fine-tune your own embedding model. But you still have the vocabulary issues. If you have the resources to do the pre-training stage
Jo: on your data, with a vocabulary that better matches your domain, that might work. But then you have a training job that starts from scratch. I hear it's a lot easier to train BERT from scratch nowadays than in 2018, so it might be a viable option for some organizations.
Jo: Most of the companies are doing this anyway. In all their semantic search papers, they basically say: here's our pipeline, we pre-trained and built a tokenizer on the whole Amazon corpus. Right? They don't use BERT-base from 2018,
Jo: right.
Hamel Husain: That makes sense. Okay, last question.
Hamel Husain: Would you see ColBERT-based methods get around, or at least improve, retrieval when we're concerned with tokenizer problems?
Jo: Yeah. So ColBERT, to introduce it, is basically another neural method where, instead of learning one vector representation of the full passage or the full query,
Jo: you are learning token-level vector representations. This is a bit more expensive compute-wise at serving time than the regular single-embedding models, but
Jo: it has close to the accuracy of a regular reranker.
Jo: But it still suffers from the vocabulary problem, because it uses the same kind of vocabulary as other models. If we can get better
Jo: pre-trained models that are trained with a larger vocabulary, I hope that's a path towards better neural search with ColBERT and other embedding models as well.
Hamel Husain: Great.
Hamel Husain: Yeah, that's it. There are certainly more questions, but we don't want to go on for an infinite amount of time. I think we hit the more important ones.
Hamel Husain: So yeah, I was a little.
Jo: There are a lot of great questions. So if you want to,
Jo: throw them at me on Twitter, and I'll try to answer as best as possible.
Dan Becker: Thanks, Jo.
Hamel Husain: Yeah, thank you.
Jo: Yeah. Great being here. Great seeing you guys, and have a great day.
Jo: Bye-bye.