Webinar URL: https://exchange.scale.com/home/events/fine-tuning-open-ais-advanced-base-model-gpt-3-5-2023-11-08
Recording: https://exchange.scale.com/home/videos/fine-tuning-open-ais-gpt-35-to-unlock-enterprise-use-cases-2023-11-08
## Transcript
Hi everyone.
Hi, my name is Chloe Ho from Scale AI, where I lead our go-to-market efforts with our strategic accounts.
And today I am very excited to welcome you all to our webinar.
Fine-Tuning OpenAI's GPT-3.5 to Unlock Enterprise Use Cases.
At Scale, our mission is to accelerate the development of AI applications.
This is why we are excited about our strategic partnership with OpenAI, providing GPT 3.5 fine-tuning for the world's largest enterprises.
Today, I am joined by our speakers Colin and Love.
Colin is a solutions architect at OpenAI, where he works with strategic customers across Europe and North America to put machine learning products into production.
Lav Cthari is head of product for all enterprise efforts at Scale, where he is focused on developing and deploying cutting-edge generative AI solutions for enterprises.
So in today's session, Colin and Lav are going to be diving deep into what fine-tuning GPT-3.5 is, best practices for fine-tuning, and real-world examples of what fine-tuning can do for enterprises.
As a bit of background as to why we're co-hosting this webinar today: Scale has partnered with OpenAI since 2019, building better LLMs through training datasets and reinforcement learning with human feedback, or RLHF.
Now we are proud to be OpenAI's preferred partner for fine-tuning GPT 3.5.
And at Scale, we believe that fine-tuning is the key to unlocking the performance of LLMs for any organization's most critical use cases.
And so our partnership brings together OpenAI's advanced base model, GPT-3.5, with Scale's fine-tuning expertise and our industry-leading data engine, to help every company create custom, state-of-the-art models for their specific business needs.
Then with that, I'll hand it off to our speakers to kick off this webinar.
Awesome.
Thanks a lot, Chloe.
I wanted to share a little bit.
The reason OpenAI chose to work with Scale is that Scale are expert users of our models.
At OpenAI, we're focused on building the best models that we can, while partners like Scale bring expertise in deploying them in the real world.
Right here, we're looking at some benchmarks of GPT-4 and GPT-3.5 performance on some standard exams which are designed to be taken by humans.
You can see here that GPT-4 outperforms GPT-3.5 by a significant margin in a number of technical disciplines.
For example, physics, chemistry, and law.
However, if you look at English literature and history, you actually see that GPT-4 doesn't improve on GPT-3.5 at all.
And that is why the focus of today is fine tuning.
So these base models, despite being extremely capable, still fall short in certain domains, or require the addition of more training data to actually help them perform better.
And that's why Lav's going to talk through why fine-tuning, and really the value that it brings.
Thanks, Colin.
Yeah, why fine tuning?
So regardless of the use case, we have found that customers come to Scale and OpenAI for help with fine-tuning for three main reasons.
The first and foremost is improving performance.
That is, to teach the models how to respond to a large variety of prompts representing real-world scenarios in the particular domain of the use case.
This enables our customers to build generative AI applications that work really well for the particular use case they have in mind.
The next one ends up being more about improving confidence: the ability of the models to do the right thing, with the right tonality and the right voice.
Scale's expertise in human-in-the-loop testing and evaluation gives our customers confidence that their customized, fine-tuned models will continue to perform really well, and will do so safely and responsibly.
And the last is around maximizing the ROI.
As we have seen, prompts end up becoming very long and very complicated to manage.
And then, of course, they consume tokens as they are passed around in the context to the LLM.
So customers want to scale their use cases to production as quickly as possible, while reducing API latency as well as token inference consumption, through the lower prompt sizes that are made possible if you are able to fine-tune, you know, GPT-3.5 or GPT-4.
We can go to the next slide, Chloe.
Fine-tuning works.
There are lots and lots of areas and use cases where we have been seeing fine-tuning really improve the performance of 3.5 and 4.
This is one example where we fine-tuned both 3.5 and 4 for a text-to-SQL use case and tested them on the Spider dev set benchmark.
And we were able to get a significant uplift in performance for both of those models, getting very close to the state-of-the-art models.
And to show you what that looks like in practice, this is just a SQL query generated from a natural-language prompt.
And the base model gets the job done, but as you can see, it's difficult to read through, and it's overly complicated for the task it's trying to accomplish.
Once you fine-tune, the model learns the terminology used to reference the database features and schema, which would otherwise require extensive prompt engineering to feed into the model.
And later in the presentation, we'll walk through a more extensive example of how a text-to-SQL use case can be tackled with RAG and fine-tuning.
Fine-tuning greatly improved the query structure, as you can see on the right-hand side, given the user-provided examples in this particular case.
And there are a variety of use cases for fine-tuning, from domain- and industry-specific applications to advanced capabilities like text-to-SQL, and enabling better retrieval-augmented generation with RAG fine-tuning, which we'll talk about in a minute.
Awesome.
Thanks a lot, Lav.
So before diving into fine tuning and the details of how to actually squeeze the most performance out of fine tuning, I wanted to take a step back and contextualize where it sits in the overall topic of LLM optimization.
So optimizing LLM applications often sounds linear.
So this diagram is kind of a typical example of what you might see.
You start with prompt engineering, you go to RAG, retrieval-augmented generation, and then you move on to fine-tuning.
But this can be problematic because there's no guarantee that the optimization that you're choosing, retrieval augmented generation or fine tuning, is fixing the problem that you've actually got.
So if we move ahead, there's a matrix that we've developed at OpenAI, which helps you think about how to approach optimizing your LLMs. On the axis going upwards, you can see context optimization.
So what the model needs to know, like what's the actual information or domain knowledge that it needs to know to solve the task you've given it.
And on the x-axis, you can see LLM optimization.
So how does the model need to act?
What's like the methodology?
It needs to build a SQL query.
So that's the thing it needs to do.
What it needs to know are the columns and the schema, which it's actually going to use to produce that SQL query.
So if we click ahead, what we often see is folks starting in the bottom left corner.
And this is sort of where I'd recommend everybody start with optimization.
You start with a simple prompt, and you put some context into it and some explanation of what the task is.
And you can very quickly iterate and see kind of whether your prompt is actually giving you any success at all.
And what I'd recommend you do is start with prompt, get to an evaluation so you have a baseline, and then you can decide what to do next.
If the model needs more information, then if we move ahead, we will see retrieval augmented generation come in.
So if, for example, adding few-shot examples, prompt-output pairs, actually increases the performance over plain prompt engineering, that tells you that more context is actually helping you.
And in fact, you might want to use retrieval augmented generation to actually industrialize that process and bring contextually useful knowledge into the prompt to help the model solve that problem.
But maybe the problem it's having is actually it's creating particularly incorrect SQL queries with the context that it's getting.
And in this case, you might consider fine tuning.
And these things are not mutually exclusive.
So because they solve different problems, they actually work really well together.
So we see with a lot of customers, and in fact in what Lav is going to show you today, often the best of both worlds is to have retrieval-augmented generation to bring context, plus a fine-tuned model.
So given that these are the options you have, the approach we generally see is something like this.
You start at the bottom left with a prompt.
Sorry, if you click ahead, you start with a prompt.
You might then add few shot examples.
If you add those, then you might say, "Okay, cool, those few shot examples helped.
Let's add retrieval." And then now that we've got retrieval, maybe the syntax isn't quite correct.
So maybe now we're going to choose to fine-tune a model.
And then now that we fine-tune a model, maybe our accuracy is still not quite what we need.
So we're going to jump back over to the retrieval augmented generation and we're going to optimize that by maybe adding like hypothetical document embeddings for retrieval or adding a fact-checking step.
And now that we've added the retrieval, we're going to go back and fine-tune our model again, so that our model now expects RAG context in its input.
And this is an example of the typical kind of optimization approach that you might see.
So it can be sort of difficult, though, given this, to like know how to proceed.
And I guess like the first thing that I'll leave you with is that the flow to think about is very simple.
You start with something.
And sorry, Chloe, if you just jump ahead.
So, yeah, try something, evaluate, and try something else.
And this is the most critical thing.
Often people jump all the way to fine-tuning, or all the way to RAG, without a labeled set, without an evaluation set that they're actually going to use.
You should always make deliberate steps.
You should never just jump to fine tuning just because, or jump to retrieval augmented generation, just because.
You should always have a baseline, and then decide which of these two things you're optimizing for.
So if we jump ahead, this is like a model that again, we've drawn up to kind of help you think through the process of like, what do I actually need to optimize for here?
So in this case, again, start with prompt engineering.
Evaluate, identify the gap.
And then decide: is this a short-term memory problem, because the model needs more information specific to the question, or is this a long-term memory problem, where it actually needs to learn a particular style, format, or structure to deliver the right thing?
And I'll give you a quick metaphor to think about this.
Imagine you're going to take an exam.
The prompt tells you how to answer the exam.
The long-term memory, or the fine-tuning, teaches the methodology.
So if this is a math exam, it teaches you the concepts that you need to know to actually answer.
And then RAG is like giving it an open book.
So now that the model knows the method, it can open up the book and actually look up the specific piece of information because it knows where to look.
And this is how you can think of those things working together.
And that's why they're additive, not exclusive.
These will stack for optimal performance, which Lav will tell you a lot more about as we move on.
So I'm going to dive very quickly into RAG and fine-tuning.
I know a lot of the folks on the call will know these things, so I'll just try to zip through these.
But I just want to quickly recap exactly what I'm talking about.
So with RAG: RAG is all about giving the model access to domain-specific context or relevant information.
So, a typical RAG cycle: you might start off with some knowledge base.
Let's just imagine we have some documents; we embed them, and we make a knowledge base.
I know that folks out there will have existing search services, metadata tools, whatever it is.
For RAG, you need some kind of external source where this extra context is going to come from.
Then you're going to have a user and they're going to ask a question.
So let's say what is the population of Canada?
And in this case, rather than just giving that question to the LLM and trusting it to answer, we're going to ground it with information from the knowledge base.
So in this case, we pull some information, and luckily our knowledge base does tell us the population of Canada.
And if we move ahead, we take the question and the content that we've just retrieved from the retrieval-augmented generation solution, stick those into the prompt, and then give that to the LLM, which gives us a nicely worded answer.
So that's a quick recap of what RAG is.
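To make that loop concrete, here is a minimal sketch in Python using the OpenAI client. The two documents, the in-memory knowledge base, and the dot-product retrieval are illustrative stand-ins for whatever search service you already have.

```python
# Minimal RAG sketch: embed a tiny knowledge base, retrieve the closest
# document for a question, and ground the chat completion on it.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [  # illustrative stand-in for a real knowledge base
    "Canada's population was estimated at about 40 million in 2023.",
    "The capital of Canada is Ottawa.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)

def answer(question: str) -> str:
    q_vec = embed([question])[0]
    # ada-002 vectors are unit-normalized, so a dot product is cosine similarity.
    best_doc = documents[int(np.argmax(doc_vectors @ q_vec))]
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            # The grounding instruction discussed below: answer ONLY from context.
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{best_doc}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

print(answer("What is the population of Canada?"))
```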
Now, there's some intuition we've developed at OpenAI as to when you should use RAG.
If you want to give your LLM domain knowledge, then RAG is likely the best next step.
What RAG is good for is introducing new information to the model to update its knowledge.
So the model was trained on a huge amount of information, but it doesn't know your company and your information.
And that's why RAG is so strong.
It will bring that new information to the model and allow it to work with it.
It's also great at reducing hallucinations by controlling the content.
A very common use case is to give the model the instruction only use this content.
Do not use your built-in knowledge.
I only want you to use this content to answer the question.
And then you put guardrails around it to stop it from using any other information.
What is it not good for?
First of all, embedding understanding of a broad domain.
You're limited by the context window here: you can fit in as much knowledge as you can fit into the context window, but there are scalability issues there, of course.
So RAG is great for bringing in some relevant context, but it doesn't scale to hundreds or thousands of examples.
It's also not good for teaching the model to learn a new language, format, or style, again for the same reason.
You can't show it enough examples for SQL syntax, as Lav mentioned; I mean, there are so many combinations.
You really need a larger dataset to actually embed that knowledge in the model.
And lastly, reducing token usage.
With RAG, inevitably, you're going to start adding more and more context, or more specific context, if you're having accuracy issues.
And so, RAG is not great for efficiency.
It's often that folks have a problem and they're hill-climbing to just try and get the accuracy solved, planning to do the efficiency later; that's often why they start with RAG.
So, clicking ahead.
I also want to give a quick cautionary tale on RAG.
RAG is a great solution, but there's a problem that we all face, which is that search is unfortunately still not a solved problem.
And I'll give you a great example from one of my customers.
So, they were trying to use RAG to reduce hallucinations.
In their case, they told the model: make sure you only use the context given to answer the user's question.
And the user thought they were being funny.
And they said, what's a great tune to get pumped up to?
And they got an answer saying, "Don't Stop Believin'" by Journey.
And this was identified by a labeler as a hallucination and reported.
Unfortunately, their content actually contained an article written by a junior financial analyst that said, "When I want to get fired up for financial analysis, I listen to Don't Stop Believin'."
So, the model did exactly what it was told to do.
And it was actually the search that was the problem, or the content that was the problem.
And so, RAG is great when you have relevant context, but the method of pulling that context in, the search, and then the content itself, are incredibly important.
If you're making the model ground itself on the content, the content had better be right.
And this is something to keep in mind when you're working with RAG.
So, jumping forward to the next topic: what is fine-tuning?
Again, just a quick recap, along with some intuition.
Fine-tuning is continuing the training process on a smaller, domain-specific dataset to optimize a model for a specific task.
We saw before that the GPT-4 base model, while very strong, is not great at English literature.
Can we train it by showing it English literature?
So some of the benefits: you can improve the model's performance on a specific task.
So again, fine tuning isn't the best solution for introducing brand new knowledge to the model, but these base models have an incredibly wide amount of expertise.
And by showing them more examples, it will adjust the weights of that model to make it much stronger at that particular task.
So long-term, it's often a more effective way of improving model performance than prompt engineering or few-shot learning slash RAG.
It's also great for improving the model efficiency.
So let's say we've done RAG.
We've added tons of context.
We've now got right answers, but we've also got very expensive, slow prompts.
You can reduce the number of tokens needed to get the model to perform well in your task.
And you can also distill the expertise of a large model into a smaller one.
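For reference, fine-tuning data for the chat models is a JSONL file of example conversations that you upload and train on. Here is a minimal sketch of that flow; the brand-voice examples are purely illustrative.

```python
# Sketch: prepare a chat-format JSONL training file and start a 3.5 fine-tune.
import json
from openai import OpenAI

client = OpenAI()

# Each line is one training example: a full conversation showing the model
# exactly the behavior you want (illustrative support-tone examples).
examples = [
    {"messages": [
        {"role": "system", "content": "You are Acme's support assistant. Be warm and concise."},
        {"role": "user", "content": "My order hasn't arrived."},
        {"role": "assistant", "content": "Sorry about the wait! Could you share your order number so I can track it down?"},
    ]},
    # ...in practice, hundreds to thousands of high-quality, diverse examples
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-3.5-turbo")
print(job.id)  # when the job finishes, you get a model id like ft:gpt-3.5-turbo:...
```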
So again, sharing a little bit of intuition.
So if prompt engineering isn't helping, fine-tuning likely isn't right for your use case either.
You're kind of seeing that like more examples, more instructions are not helping.
So fine-tuning is great for emphasizing knowledge that already exists in the model.
It's also great for customizing the structure or tone of responses.
Again, SQL is a great example or Python or any coding language because there's already that knowledge exists in the model.
It was trained on tons of code.
So we're just going to show it a bunch of specific examples of that code, and it will learn; it's almost like a show-and-tell approach.
You don't need to tell it in the prompt; you're showing it with a huge variety of examples.
And that's why it can also help for teaching the model very complex instructions.
So what is it not good for?
Again, adding new knowledge to the base model.
So currently the best method of doing that is RAG.
That also has its limitations, as we've said before, namely the context window.
And that is where the custom models that were announced at DevDay by OpenAI start to come in.
There, you actually do further training, introducing more knowledge into the base model in the pre-training process, before then layering on RAG, fine-tuning, and these other approaches.
So fine tuning will not do that for you, but it will emphasize that knowledge that already exists very well.
And it's also not great for quickly iterating.
So if you jump straight to fine tuning, you're now putting yourself in a cycle where every time you have to retrain a model, do evaluation, all this kind of thing.
So that's why we do suggest starting with prompt engineering.
And again, a cautionary tale.
So this comes from a great blog, which is referenced at the bottom; a little bit of context on this.
The writer of this blog wanted an AI writing assistant that could better replicate his writing style, to help write social media posts, blog posts, all this kind of thing.
So what he did is he downloaded 140,000 messages of his from Slack and made a data set out of them.
Great, right?
Like that's going to learn exactly his writing style.
And then what he did was give it the instruction to write a 500-word blog post on prompt engineering.
And the model responded.
Sure.
I'll work on that in the morning.
So interestingly here, I mean, sort of what went wrong here.
He gave it-- I mean, he gave it a ton of examples of his texts.
But what he didn't give it examples of is actually What do you want it to get out of the model?
We see kind of what the next interaction like tells it right now and he says okay Because this is what is slack examples actually look like and this is like kind of one of the really key things with fine tuning Again, I mean with rag we saw that like the content really matters with fine tuning the data that you feed it with having high quality Examples is absolutely the most important thing quantity then is the other like very important lever that you've had and loves Gonna tell you loads about this but really like for it for fine tuning LLMs High quality examples that resemble what you're looking for is the most important thing and in this case Obviously what he might have done is is maybe use a bunch of emails as a as as as like he gives an instruction like write a 500 word blog post and the completion that he's looking for is like a nicely written email in his style This is the example so the the the prompting completion that you're selecting for your task is very important and That is that is kind of it on the cautionary tail so just something to keep in mind when you're working with fun too So before I hand over to Lav, I just also wanted to quickly recap a little bit of the news from earlier in the week for folks that might not have seen the Leap-to-Deb Day livestream.
So these are the new costs for input and output tokens.
This is for GPT 3.5 Turbo 4K and 16K.
So these are both the same price now going forward.
And so we've seen a 75% drop in the input token price for 3.5 Turbo and a 62% drop in the output price.
We're very hopeful that this will make more use cases economical for you, and increase the amount of fine-tuning that we see happening out in the world.
And with that, I will hand you on to Lav to tell you exactly how to get the best out of fine tuning.
Well, those cautionary tales are amazing.
So thank you for sharing those. And the cost benefit, the token price coming down by such a huge percentage, is amazing for everyone who is trying to make more and more use of these models and build their applications.
What I want to talk a little bit about is what it actually takes to fine-tune the models and what the process looks like.
And this has happened a ton, so I'm speaking from some experience here: a lot of the time, you look at the space and you start thinking, "Oh, OpenAI has a fine-tuning API; how hard would it be to just throw some data at it and, voilà, get a better fine-tuned model out of it?" That seems naive, but it's the kind of expectation people come in with.
But the reality, if you go to the next slide, is that it is actually quite a calibrated, quite an iterative process: you go from the base model and try lots of different techniques, as we saw in Colin's charts, as he moved around between RAG and fine-tuning.
And within those, you're trying different experiments and different datasets, even augmenting those datasets after doing evaluations.
I think the first step is always thinking through the data. If you go back one slide, yes, this one, thank you.
Building out a really high-quality dataset, and I'll cover in a bit what we even mean by that, matters a ton; the data could come from your existing datasets, from subject matter experts within your own enterprise, as well as from Scale.
Then you go through the process of training and fine-tuning, and then evaluating the performance.
So figuring out what your North Star metric is for a given use case is quite important.
Defining the shape of the product, and the shape of the expected outcome from the model, is quite important.
You then evaluate the performance: models like GPT-4 can come in handy in evaluating the performance of a GPT-3.5 fine-tuned model, and it can also be done by humans, subject matter experts, across the variety of dimensions on which you want to evaluate the model.
And then you try out these different techniques, prompt engineering, fine-tuning, different optimizations, RAG, and tool usage, to hill-climb on quality and see your model perform better and better over time.
What do we mean by data?
What does it look like?
So quality matters a ton.
And there's a few dimensions to that.
High quality data means it's representative of how you want your model to behave in the real world.
Is your data accurate?
Is it free from formatting errors?
Does it reflect the brand voice and tone that you want the model to perform with?
Would you feel comfortable using this data to train an employee on your processes or your product and have a reasonable expectation that they'll be successful in their roles?
This is the same level of attention to quality that is needed when improving the model through fine-tuning.
We have seen time and again that even small parts of a dataset that are corrupt, that have formatting errors, or that are not clean are actually quite detrimental to the performance of the model at the end of the day.
And the next piece ends up being more on diversity.
So this is kind of saying: cover the bases, cover the use cases that you want your model to perform at, right?
So appropriately cover the problem space and include a very wide variety of unique examples.
So high quality and also diverse.
And the third thing that's important is having a sufficient amount of data to improve the performance.
You can start out with a few hundred examples very easily and see quite a bit of performance improvement right off the bat.
What we have noticed is that as you take these applications into production, the order of magnitude tends to be in the thousands of prompt-response pairs to get to a really high-quality fine-tuned model.
So again, to recap: it's quality, diversity of the data, and then the right amount of quantity that's needed to fine-tune the model.
Again, where can this data come from?
In many cases, you might already be sitting on a treasure trove of data.
You might have to go through a data cleaning exercise to make sure you get back to high-quality and diverse data, but you might already have it, and it can be cleaned up and formatted into nice prompt-completion pairs to train fine-tuned models.
You could also generate new data and this is something we have seen a ton of as well.
It could be starting completely brand new, getting access to subject matter experts within your enterprise or through Scale, or it could be identifying gaps in your existing data and then filling those through SMEs in the process.
Regardless, these are some of the techniques through which we have seen people build out really amazing datasets.
And again, as I've sort of alluded to, it is not a one-time process.
Typically, you start out with something, you fine-tune, you evaluate the performance of the model, and then you come back, see where the gaps are in your dataset, and continue to augment the dataset to help you improve the quality.
We'll pivot a little bit and talk about some customer use cases that we have seen across RAG and fine-tuning, and how we have seen performance for these use cases improve.
The first one is from Brex, which is a fintech company.
The challenge that Brex had was to build a product that could generate high-quality expense memos, to ease the burden of the compliance requirements placed on employees. Who here hasn't spent time filling out expense reports at the last moment, trying to get them in, and then getting asked follow-up questions by the finance team and the accounting team?
So they started out using GPT-3.5 as the base solution, and then looked at how much that product could be further improved, both for cost and latency, while still maintaining the quality of the solution.
So we ended up taking quite a few examples from Brex's existing dataset, augmented them with our subject matter experts on the Scale data engine, and built out a really diverse, high-quality training dataset for 3.5. We fine-tuned 3.5 with it, went through a few iterations, and the outcome was that the fine-tuned model outperformed base 3.5 66% of the time, enabling greater automation of expense memo generation.
This is something that Brex has been able to take and already put into production, and it has been quite beneficial for them, again from a performance as well as a cost-profile point of view.
For the next example, Colin, you'll talk to this one.
Yes, absolutely.
Thanks, love.
So this example I was really keen to share, because this one actually mainly uses prompt engineering plus RAG.
But I want to bring you back to that kind of optimization flow that we discussed earlier.
What we did here was start with prompt engineering, and every time we evaluated, the problem 99% of the time was actually the information the model was using to answer the question.
The LLM was doing fine.
It was answering as best it could with the information it was given, but giving it more specific or more precise information was really the problem we were trying to solve for here.
So the customer problem was that they had two knowledge bases and one LLM that would effectively fire a semantic search query and try to bring back the right information from these two knowledge bases to answer questions.
And these questions were pretty nuanced.
So sometimes people would ask a fairly long question that actually involved both domains.
And what we found, perhaps predictably, is that when we started with prompt engineering and basic retrieval as our baseline, we were about 45% accurate.
And this was as measured by human labelers, who scored each answer as one of three conditions: correct; correct but incomplete; and incorrect.
So what we then tried, and this kind of shows you the like hill climbing aspect of this, you may start and think how are we ever going to get to production, and that's what we were thinking in this case.
But we stuck with it, and we decided, right, we've got a context problem.
So let's see if we can bring better context for the model to answer with.
So the first thing we tried was hypothetical document embeddings for retrieval.
So instead of putting in the user's question to retrieve the most similar research or operations information, we instead got the 3.5 model to generate a hypothetical answer, and then we used that to search for the most similar content.
That actually works pretty well in Q&A-type situations, but given the variety of research content that this customer had, we ran an experiment and found that it actually made the performance worse.
So we removed that.
That's why you see the ticks and crosses here; this shows what the optimization flow often looks like.
You try a ton of things, and this is why it's so critical to evaluate every time and be very clear what baseline you're working from and what you're changing each time, and test each change systematically.
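For reference, the hypothetical document embeddings idea boils down to two calls; a minimal sketch, with the vector index left as a stand-in:

```python
# HyDE sketch: embed a model-generated hypothetical answer instead of the
# raw question, then search the index with that vector.
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    # 1. Ask 3.5 to write a plausible (possibly wrong) answer.
    draft = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {question}"}],
    ).choices[0].message.content
    # 2. Embed the hypothetical answer; real answers in the corpus are often
    #    closer to this vector than to the question itself.
    return client.embeddings.create(
        model="text-embedding-ada-002", input=draft
    ).data[0].embedding

# 3. Use the vector against your search index (stand-in, not a real API):
# results = vector_index.search(hyde_query_vector("What drove Q3 margins?"), k=8)
```

As the anecdote above shows, whether this helps is an empirical question; in this engagement it hurt accuracy and was removed.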
So the next thing we tried was to fine-tune the embeddings.
So you might have seen this process: you provide Q&A pairs and further fine-tune the embeddings, so that you change the embedding space to return more relevant results.
And we found that that actually worked pretty well from a prediction perspective, but failed for non-functional reasons: it was slow and expensive, and it basically made the whole thing uneconomical.
So in that case we had to eliminate that option even though it increased the accuracy.
And what we then tried was a whole bunch of chunking and embedding strategies.
So for the retrieval-augmented generation, should we chunk the documents into 300-token chunks, or should we chunk them intelligently?
For example, if we find a section header, we break there into a chunk, and then we retrieve those.
We played around a ton with that, and what we found was that we ended up supplying about 8,000-token chunks, and that was getting the best accuracy.
But still, only 65 percent.
Still could be a lot better.
So what we tried next was then a couple of other techniques.
So re-ranking.
So we started off trying to train a cross-encoder to actually re-rank the results.
Our retrieval would over-fetch: we'd pull maybe a hundred results, and then we'd use this cross-encoder, which was really good at figuring out, of those hundred, which were most relevant, and push those to the top, which we then passed to the LLM.
And we got pretty good results, actually not with a cross-encoder, but with just a rules-based re-ranker.
Because it's research, generally the most recent stuff is the best, and generally the most popular stuff is, well, that's maybe self-reinforcing as a thing, but we did find that, in this case, it resulted in better results.
So we added a recency metric and a popularity metric, and we used those to deterministically re-rank the hundred results, and we got an accuracy bump.
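A deterministic re-ranker like that can be very small. A sketch, where the similarity, recency, and popularity fields and the weights are illustrative assumptions to be tuned against your evals:

```python
# Rules-based re-ranking sketch: over-fetch ~100 results, then combine
# semantic similarity with recency and popularity before passing the top
# few to the LLM.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Result:
    text: str
    similarity: float   # score from the vector search
    published: datetime # timezone-aware publish date
    views: int          # hypothetical popularity signal

def rerank(results: list[Result], top_k: int = 8) -> list[Result]:
    now = datetime.now(timezone.utc)
    max_views = max(r.views for r in results) or 1

    def score(r: Result) -> float:
        age_days = (now - r.published).days
        recency = 1.0 / (1.0 + age_days / 365)  # decays over roughly a year
        popularity = r.views / max_views
        # Illustrative weights; tune these against your evaluation set.
        return 0.6 * r.similarity + 0.25 * recency + 0.15 * popularity

    return sorted(results, key=score, reverse=True)[:top_k]
```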
We also tried a classification step.
So instead of having the same GPT-4 prompt handle both domains, we would first do a very quick classification, then change the context of the prompt slightly, and actually change the re-ranking approach as well, depending on the domain.
And we again got a bump in performance, 85%.
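The classification step itself can be a cheap routing call made before retrieval; a sketch, with the two domain labels assumed for illustration:

```python
# Domain-routing sketch: classify the question first, then swap in the
# domain-specific prompt and re-ranking strategy. Domain names are illustrative.
from openai import OpenAI

client = OpenAI()

def classify_domain(question: str) -> str:
    label = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   "Classify this question as 'research' or 'operations'. "
                   f"Reply with one word.\n\nQuestion: {question}"}],
        temperature=0,
    ).choices[0].message.content.strip().lower()
    return label if label in ("research", "operations") else "research"

SYSTEM_PROMPTS = {
    "research": "You answer from research reports. Cite the report title.",
    "operations": "You answer from operations manuals. Quote the exact step.",
}
# domain = classify_domain(q); then retrieve and re-rank with that domain's settings.
```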
So that's pretty great already.
I mean, we've already gained 40 percentage points of accuracy.
We're kind of within striking distance of production, but unfortunately, still not good enough for production.
So what got us to 98% was actually going all the way back to the start and doing a whole bunch of prompt engineering.
We said, OK, now we've got pretty good search, so let's actually jump back to the LLM.
And this actually might have been where we considered fine-tuning.
And actually, to be honest, in the future, we may actually add fine tuning to this process to even iterate further.
So everything up till then was actually search optimization.
That was the search problem.
And going back to the flow from the start, that's why it's so critical, when you evaluate, to look at the examples and assess what problem you're actually trying to solve.
Because for this customer, it was just constantly: what's the best context?
The other thing we did was look at the incorrect answers.
We saw that there were some structured-data questions which it was answering poorly, because it was trying to pull the numbers out of PDFs.
So what we did was add a tool where it could fire a query against a SQL database and pull back structured information to answer those structured-data questions.
And the last thing was query expansion, which is a fairly simple but extremely useful approach, where you interpret the user's question into a list of questions.
And again, this is where fine tuning can work really well.
You show it user questions, and you make it very good at expanding them.
Say someone asks, "What's the latest on Tesla and Microsoft?" You would split that into two questions: "What's the latest on Tesla?" and "What's the latest on Microsoft?" You pull two sets of context and then give them both to the model.
Because actually, if you embed the original question, you're probably going to land somewhere in the embedding space between those two topics and get some not-so-good content.
And this actually bumped us all the way from 85 to 98% accurate.
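Query expansion is compact enough to sketch; here the model is asked for a JSON array of standalone sub-questions, and the retrieval for each is left as a stand-in:

```python
# Query-expansion sketch: split a compound question into sub-questions,
# retrieve context for each, and answer from the combined context.
import json
from openai import OpenAI

client = OpenAI()

def expand(question: str) -> list[str]:
    raw = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   "Split this into standalone search questions. "
                   f'Reply as a JSON array of strings.\n\n"{question}"'}],
        temperature=0,
    ).choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return [question]  # fall back to the original question

subs = expand("What's the latest on Tesla and Microsoft?")
# e.g. ["What's the latest on Tesla?", "What's the latest on Microsoft?"]
# contexts = [vector_index.search(q, k=4) for q in subs]  # stand-in retrieval
```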
So again, it's worth taking a step back here: we're counting both correct and correct-but-incomplete answers in that 98% correctness number.
So there's still a lot of room for improvement here.
And actually, that's where fine tuning is then probably the logical next step for this customer.
But I also want to show you this to give you confidence that if your problem is still context and you're only doing prompting and RAG, you should not just instinctively jump to fine-tuning if that's not your problem.
If the LLM is doing fine, but the context is the problem, continue to optimize your RAG; that is a very critical thing to keep in mind.
But the next example really illustrates the power of fine-tuning.
So this is a leading graphic design company.
Effectively, they have a user who is going to type into a box.
They're going to say something like, "I want a red gradient, I want a profile photo, I want an Instagram post," and this will generate some beautiful graphic design assets.
What it gets converted to by this customer is then this structured format on the right.
And they want the model to get very good at taking these generic inputs from users and constructing these very creative outputs to then go into their template and generate beautiful graphic design assets.
And the reason I love this case is because this was evaluated by a team of graphic designers who looked at the output and scored them.
So instead of doing something esoteric, like scoring next-token prediction or a loss function, this was actually expert labelers looking at these assets and saying, "This is a more beautiful asset than this one," and they didn't know which model created each one.
So what we found, if we move ahead one, is one of those funny ones, going back to that slide at the very start:
GPT-3.5 and GPT-4 actually performed very similarly at this use case.
And this is why it's important to consider: this shows that there's some capability that already exists in the model.
So what fine-tuning needs to do is really find the things that are most appropriate for this use case and bring them out of the model, adjusting the weights of the parameters so that it brings out the particular features which will actually improve this performance.
There are so many times that I see this case, where GPT-4 and GPT-3.5 are both not great.
You can see here on this scale that goes up to 2.0, they both score like just over 1.0.
So not too great.
I think they considered 1.5 like the lowest that they would go or like 1.4.
And if you click ahead, you'll see that the fine-tuned GPT-3.5 actually vastly outperformed GPT-4 at this use case.
And this was, I think, starting with only 100 labeled examples.
But as Lav alluded to earlier, these were extremely high-quality labeled examples: they showed a breadth of different user inputs, and the very creative outputs they were turned into.
This was like an expertly curated dataset.
And since then they've continued to fine tune on top of this model.
With the fine-tuning endpoint, there is the ability to fine-tune already fine-tuned models.
So we see a lot of customers start small, with around 100 distilled or manually curated outputs.
And then they just keep adding more data, and they see the model continue to improve.
And this is really like the power of fine tuning.
And that is, I guess, the story of that use case.
So, moving on slightly.
We've been talking a lot about 3.5 fine-tuning.
But what we're very excited to share, for those folks who weren't at the livestream on Monday, is that we've started experimental access to GPT-4 fine-tuning, and we've also opened up the GPT-3.5 16K model for fine-tuning, which is available at the same cost as 4K, as I mentioned before.
The one thing I wanted to quickly highlight, before handing over to Lav to show you how GPT-4 fine-tuning actually performs in practice, is that, in case you missed it as well, you're now able to include function calls in your fine-tuning too.
So if you're using function calling and you find, for example, that your model is very poor at identifying the parameters for a function, or that it never calls it, so the function selection is poor, this is a great way to improve that knowledge in the model.
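A function-calling training example is a normal chat example whose assistant turn is a function call, plus the function definitions. Roughly, one JSONL line looks like this (using the generic weather example, not one from this webinar):

```python
# One JSONL line for function-calling fine-tuning: the example teaches the
# model when to call the function and how to fill in its parameters.
example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Toronto in celsius?"},
        {"role": "assistant", "function_call": {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"Toronto, Canada\", \"format\": \"celsius\"}",
        }},
    ],
    "functions": [{
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "format"],
        },
    }],
}
```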
And with that, I will hand over to Lav, who will take you through the power of GPT-4 fine-tuning.
Yeah, thanks, Colin.
That example use case is phenomenal, and it also goes back to the point you've been making throughout this webinar: these are all tools in your toolkit, essentially.
These are different techniques you can try out, given the context, to figure out a way to get a better performance and cost trade-off out of these models.
And since Scale has been a partner of OpenAI for such a long time, we've also had early access to the GPT-4 fine-tuning APIs.
So we've been playing with this for a while, providing product feedback back to the OpenAI product team, as well as creating the right recipes with which to fine-tune models.
So this is an example of text-to-SQL.
As you'll see on the next slide, I probably don't have to make the case for why this is important.
Converting natural-language queries into functional SQL statements allows you to all of a sudden use just plain language to query structured data.
An example right at the bottom of the slide is: which car models are below average in MPG?
The LLM takes that, looks at it, understands all the possible schemas and databases it has access to, writes a beautiful, functional SQL query, and then goes to the database and pulls back the right data that you're looking for.
It just makes it so much easier for anybody in the organization to play with data, use data, and generate insights from the data to do whatever they're trying to do in their day-to-day lives.
So a very plain, baseline approach for this could be building a system where the entire schema of your database is given to the model in the prompt.
This allows the model to see the possible tables and the column names, and to understand what the structure of the data might look like, in order to generate SQL.
Now, these models have been trained on so much information out there that they are already pretty good at writing SQL from natural language.
But what they don't know is: what are the different terms that are used? What does the schema look like to begin with?
So this first approach gets you somewhere: you feed the whole schema inside the prompt, and the model actually performs pretty well.
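As a concrete picture of that baseline, here is a minimal sketch with a toy schema stuffed into the system prompt; the schema and question are illustrative:

```python
# Baseline text-to-SQL sketch: stuff the entire schema into the prompt.
# Fine for toy schemas; tokens and latency grow with real database size.
from openai import OpenAI

client = OpenAI()

# Illustrative toy schema; a real database would be far larger.
SCHEMA = "CREATE TABLE cars (model TEXT, maker TEXT, mpg REAL, year INTEGER);"

def to_sql(question: str) -> str:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content":
             f"Write a single SQLite query for this schema:\n{SCHEMA}"},
            {"role": "user", "content": question},
        ],
        temperature=0,
    ).choices[0].message.content

print(to_sql("Which car models are below average in MPG?"))
# Expected shape: SELECT model FROM cars WHERE mpg < (SELECT AVG(mpg) FROM cars);
```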
However, it has a couple of shortcomings, as you can imagine.
Databases in real life are pretty large for the context window.
You're also stuffing all this context into the prompt, which makes everything slow, and of course the token cost starts to factor in as well; your token costs go up as part of that.
And many times the business terms don't actually match the column names.
You all know this: a column is named X, but what it really means is something else, and that knowledge is not even in the schema.
So the model is not going to be able to pick up on that easily, and the output might not be right.
So how do you evolve this further with all the possible tools in the toolkit?
What we tried was literally doing RAG on the schema first.
So the question comes in, and before you pass in the whole schema, you just find the right tables, and the right columns within those tables, that are relevant for the particular question the user asked.
So instead of putting the whole schema in there, you're giving just the relevant slice, pointing the model to the right place in the database to start looking.
That's the intuitive way to think about it.
In addition, you provide some user examples in the context.
This is kind of saying: for this kind of question, this is how you write a SQL query. Just like few-shot examples in a prompt, this enables the model to create better SQL queries.
We took it one step further and also fine-tuned the model itself, giving it many more examples of what a question-to-SQL-query pair could look like.
So this is augmenting the knowledge and steering the model: in addition to the few-shot examples, you're giving it far more examples up front, so it is fine-tuned towards that particular task of writing SQL queries.
So now, if you put it all together, starting with RAG, plus in-context learning through some few-shot examples, which is prompt engineering, plus, on top of all this, fine-tuning, it gives you a much more robust and much more performant system, with as few tokens as possible in the context window.
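Putting the three pieces together, the final call looks something like this sketch; the retrieval helpers are stand-ins and the fine-tuned model id is hypothetical:

```python
# Combined sketch: RAG over the schema, few-shot examples, and a fine-tuned
# model in one call.
from openai import OpenAI

client = OpenAI()

def retrieve_schema(question: str) -> str:
    # Stand-in for vector search over table/column descriptions.
    return "CREATE TABLE cars (model TEXT, maker TEXT, mpg REAL);"

def retrieve_examples(question: str, k: int = 3) -> list[tuple[str, str]]:
    # Stand-in for retrieving similar (question, SQL) pairs.
    return [("Which cars exceed 40 MPG?",
             "SELECT model FROM cars WHERE mpg > 40;")]

def to_sql(question: str) -> str:
    shots = "\n".join(f"Q: {q}\nSQL: {s}"
                      for q, s in retrieve_examples(question))
    return client.chat.completions.create(
        model="ft:gpt-3.5-turbo:acme::abc123",  # hypothetical fine-tuned model id
        messages=[
            {"role": "system", "content":
             f"Relevant schema:\n{retrieve_schema(question)}\n\nExamples:\n{shots}"},
            {"role": "user", "content": question},
        ],
        temperature=0,
    ).choices[0].message.content
```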
Then you take this approach and try it on both 3.5 Turbo and 4, which is what we did.
On the next slide, you'll see how much of a performance boost we were able to get on the Spider dev set benchmark.
The red bar is essentially plain vanilla: you just stuff in the whole schema.
But the moment you add RAG plus in-context learning, you see a pretty decent boost for both 3.5 and 4.
And if you fine-tune the models, you see an even bigger boost in performance, with both 3.5 and 4 performing really well compared to the state of the art.
I think the state of the art on the leaderboard is around 84%, and those models have been customized quite a bit for this use case.
So the approach that we're trying out here is quite generalizable to any text-to-SQL application that you might be building.
So we're pretty proud of the kind of improvement we were able to get on both 3.5 and 4, and it also proves out the methodology of starting with a base model, then prompt engineering, RAG, plus fine-tuning.
And I think the unsung hero in all of this is the evaluation piece.
You can't do these iterations without it, and I can't stress this enough: you really need a robust evaluation methodology.
Now, in this case it's of course very easy, because you actually have a benchmark to begin with.
Real-life use cases are more complicated, as you saw in the financial services example that Colin was showing.
You really have to do that hill-climbing exercise, get to the right place, and feel comfortable with the results before you put this out in front of your employees or your users.
So identifying exactly what the use case is, and how you will evaluate the performance of the system, becomes quite a critical piece.
So with this, I think we can sum up the webinar and talk through how you even get started on this.
I think we've been leaving breadcrumbs along the way during the webinar; Chloe, if you want to close us out with some insights on this.
Yeah, absolutely, and thanks so much to Colin and Lav for walking us through these details.
I think in particular the use cases, this step-by-step process, and hearing the thought process behind each of these things has been very illuminating for me.
So, to recap what we have talked about so far: how would one get started?
Of course, the first thing you would do is start with a state-of-the-art LLM; this could be GPT-3.5 or, as Colin mentioned, GPT-4, which now also supports fine-tuning capabilities.
Then, as your next step, you would convert your raw proprietary data into high-quality and diverse training data.
And of course, the last element that Lav mentioned was quantity as well.
So bring in your domain experts; these could be your own, or you could leverage Scale experts to generate new data, to make sure that you have enough data to train on.
The next step is to fine-tune your model, whether that's 3.5 or 4, for improved performance on the use cases that are top of mind for you.
And then the next step is to implement RAG, enabling GPT-3.5 or 4 to accurately reference your knowledge base in its responses.
And then, to make this usable, the next step after that is to integrate this with enterprise tools and the OpenAI plugins you have available, and build it into generative AI applications for your end use case within your enterprise.
And then, as Lav mentioned with the unsung hero piece, you definitely have to test and evaluate your model to make sure that you have ultimate confidence in performance, safety, and reliability before you roll it out to users, whether that's within your organization or more broadly to a wider base of users and consumers.
And of course lastly, you want to deploy your customized GPT-3.5 or 4 and generative AI applications to unlock your key use cases, continue to learn about user needs, and perhaps revisit parts of this process over time as well.
So that's a quick recap of how you might approach this process. With that, we'd love to open this up to Q&A.
## Q&A Session
So I know that folks have been putting some questions into the chat, and we have about 10 minutes left.
we're going to pick a couple that have been coming through there.
So starting with one that we see: when should we use RAG versus inserting information in context, especially now that we have a 128K context length model?
Colin, do you want to take this one?
Yeah, yeah, totally.
Yeah, this is a great question.
And yeah, I'm glad the folks asked this.
I'm actually annoyed because I had a slide that sort of touches on this, so I'll cover it verbally and take you through.
So basically, going back to that optimization flow framework: if you're starting with prompt engineering, then definitely get your evaluation prepared first.
And then I would just start stuffing things into context and continue evaluating.
Now, the reason why a lot of my customers in production actually prefer RAG is connected to kind of two things.
So the first one is this problem called the lost in the middle problem.
There's a paper which was written on this, which I'd recommend you read.
With these long-context models, they're sort of like people: they can't pay attention to everything equally across all that context you're adding.
And you might start to find that as you stuff more information in, they lose some of the details.
And if it's a use case where accuracy is super critical, that could be too much.
But what's going to tell you that is your evaluations.
And the reason I say this is that RAG can often give you a smaller amount of more relevant context.
So the LLM has less to pay attention to and less room to make mistakes.
And I'm going to give a shout-out to the open source community here, because there are some folks at Exploding Gradients who created a framework called Ragas.
Now, there are many frameworks for assessing RAG applications.
So this is just one of them.
But I love this one because it has a number of metrics, which help you assess effectively the signal to noise of the context that you're providing.
And they have a metric called context precision, where you effectively measure how much of each bit of retrieved context was actually used to answer the question.
And actually, with that financial services example I shared, that's one of the examples where we're currently going through the process of looking at that.
And what we're finding is that our signal-to-noise ratio is very low: there's a very high ratio of noise to signal.
What we're finding is that even though we're only giving eight chunks of context, we could probably get by with giving even less and still get the same accuracy.
But taking a step back, the way we know that is because we have an evaluation approach using the Ragas framework.
And we're taking the approach of giving it only a small amount of very relevant context.
And that means that our search needs to be very, very strong to actually support this.
But that's just one example of the difference between inserting all the information in context, some of the issues you might run into, and why some folks go with RAG instead.
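The Ragas library implements this properly; in the same spirit, here is a hand-rolled sketch of a context-precision-style check that asks a judge model whether each retrieved chunk was actually needed (the exact Ragas formulation differs):

```python
# Context-precision-style sketch: for each retrieved chunk, ask a judge model
# whether the chunk was needed for the answer; the fraction of useful chunks
# approximates the signal-to-noise of your retrieval.
from openai import OpenAI

client = OpenAI()

def context_precision(question: str, answer: str, chunks: list[str]) -> float:
    useful = 0
    for chunk in chunks:
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content":
                       f"Question: {question}\nAnswer: {answer}\n"
                       f"Context chunk: {chunk}\n"
                       "Was this chunk needed to produce the answer? yes or no."}],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        useful += verdict.startswith("yes")
    return useful / len(chunks) if chunks else 0.0
```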
But again, the way you tell is by constantly evaluating, because you may find that putting it all in context works for your use case.
Great, thank you, Colin.
Looking at our list of questions again, and moving away from RAG to fine-tuning: does fine-tuning reduce the general capabilities of the models?
Lav, do you want to take this one?
Yeah, I can take a crack at it, and Colin, feel free to add your take at the end as well.
Yes, we have seen some evidence; there have been papers out there that study this.
We also saw some examples, like the one Colin had in his presentation, where the model was fine-tuned on something narrow, just Slack messages, and then asked to do something outside of that.
So this is a very important piece here for fine tuning in general.
I think this is a common pitfall as well.
While fine-tuning really helps hone the model in on a particular use case, a particular problem, a particular way of solving a particular problem, it might have effects on other capabilities of the model.
And the more fine-tuning you do, and the narrower the domain is, the more impact you might have on anything that is outside that domain.
And again, this has to be a very conscious, very mindful decision as part of building out your generative AI application.
And it's a common pitfall that we should worry about.
The way to catch this, and to make sure that you have the right framework in place, again goes back to the eval piece I was talking about earlier.
I think it's really important that you set the guardrails for what the use case is and what the application is supposed to be doing.
And then you have a way to monitor it: first evaluate it before it goes out to production, but then also monitor it over time to ensure that the model is performing the way it's supposed to perform.
And if at any point you have issues with safety guidelines or safety guardrails, that is the point where you double down on assessing the quality and doing a bit of red teaming, to make sure that the model is performing the way it's supposed to.
Thank you, Lav.
Colin, anything to add here?
No, no, that was very comprehensive.
Great.
I guess next question, let's see.
What are the hyper parameters that are available with the fine tuning endpoint, and are there any new ones with the GPT-4 endpoint?
Yes, great question.
So with 3.5 fine-tuning to date, there were only epochs available.
We've added two more: a learning rate and a batch size.
These function very much as they would for those of you familiar with libraries like PyTorch and with training your own models.
So these have been added.
And if you go into the OpenAI documentation for fine tuning, we've also got a couple of recommendations as to when some of the parameters make sense to use.
The one thing I would encourage you to do, and this is probably extremely obvious, is to evaluate each time.
And it's useful to make use of the validation file, which you can upload with the fine-tuning endpoint, so that you can evaluate on your validation set every time you tweak one of these, if you do want to do a grid search across these parameters, for example.
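In the API, those hyperparameters and the validation file hang off the job-creation call; a sketch with placeholder file ids:

```python
# Sketch: a fine-tuning job with explicit hyperparameters and a validation
# file, so each tweak can be scored against the same validation set.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file="file-train123",        # placeholder ids from client.files.create
    validation_file="file-valid456",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 8,                  # newly exposed, alongside...
        "learning_rate_multiplier": 0.1,  # ...the learning rate multiplier
    },
)
print(job.status)
```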
Thanks, Colin.
Let's see.
Okay, here's another one.
It's an interesting one given some of the recent announcements.
What's the difference between fine-tuning and just creating a new GPT assistant from the latest OpenAI DevDay releases?
Yeah, yeah, yeah, for sure.
So I'd say that they're sort of like different things.
So let's just firstly clarify GPT assistants.
So there's a first-party version called GPTs, which you can create through ChatGPT itself.
So you can go in and use that fun UI, which Sam used to create the startup mentor one, which was super fun and cool.
But you don't have control over which model that GPT uses; it's basically just 3.5 or 4, and I believe it just uses 4, but I'd need to double-check that.
But basically this is more of like a kind of on rails experience.
You also have the third-party version that you build on the API, which is the Assistants API, and that's meant to eventually be quite a similar experience.
But you are able to write code for all of it and build your own UI and experience around it, rather than being chained to the ChatGPT one.
So where fine-tuning comes in: an assistant has some tools and some instructions, but it's also going to need a model to power it.
So if you look at the Assistants API documentation, for example, it's probably the easiest way to visualize it, because when you actually create an assistant there are three parameters.
There's the instructions; there's the tools, which are like code interpreter or whatever it might be, and for GPTs, you could imagine hooking it up to some other API and getting it to use that.
And then the third thing is the model.
And currently, on the Assistants API, you're only allowed 3.5 and 4, but what you could do in future is fine-tune a model and then have the assistant use that fine-tuned model.
So let's take an example.
I'm using the Assistants API.
I'm going to give my assistant the code interpreter tool, and I'm also going to give it a function, where it'll return a function call which I'm then going to use on one of my other systems. If the function selection, or the parameters being produced, are not accurate, I could fine-tune a model and then have that assistant use the fine-tuned model, which would allow it to make the best use of the function and the tool that I've given it access to.
This is a slightly speculative example right now, just because we don't yet allow you to use fine-tuned models there, but I believe the longer-term plan is that you should be able to attach your own fine-tuned models to those assistants.
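Those three parameters map directly onto the create call; a sketch, where the commented-out last line is the hypothetical future case just described, not something the API accepted at the time:

```python
# Assistants API sketch: instructions + tools + model are the three pieces.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Data helper",
    instructions="Help users analyze the files they upload.",
    tools=[{"type": "code_interpreter"}],  # plus your own functions via "function" tools
    model="gpt-4-1106-preview",
    # model="ft:gpt-3.5-turbo:acme::abc123",  # hypothetical: fine-tuned models not yet supported here
)
print(assistant.id)
```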
Yeah.
Got it.
Yeah, that's something.
Thanks, Colin.
I think we have time for one last quick question.
One question we have here, and Lav, maybe you can take this one to round us out, is: how many examples do you need for fine-tuning?
How large do these datasets need to be?
Yeah, I think you can start with something small.
You know, around a hundred examples is pretty sufficient to start experimenting with this.
Again, remember quality and diversity matters more than quantity, right?
So you want high quality and diverse sort of examples.
What do we see as good for production?
Again, it varies use case to use case.
Something in the range of, you know, a thousand to ten thousand examples is where you really get the right performance and, you know, cost trade-off out of these models.
So I think that's what you should be planning for as you go to production, but you can start out with, you know, a couple of hundred examples.
Good quality ones.
Great.
All right, I think with that we are at time.
Thank you everybody for tuning in.
I know I definitely learned a lot and hope to see you at the next webinar.
Great.
Thanks a lot folks, I really appreciate it.
Thank you.