> Welcome to subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse the [[Generative AI Engineer 知識庫]]

![[Pasted image 20230927193002.png]]

> The following is a transcript of the talk with the slides, generated with Whisper and then lightly edited; it has not been fully proofread against the recording.

## Host's Opening Introduction

I will give a brief introduction to Mark Chen. Mark Chen is the head of multimodal and frontiers research at OpenAI. He graduated from MIT with a bachelor's degree in mathematics with computer science. He contributed to the development of GPT-3 and created ImageGPT. He led the teams that produced DALL·E 2 and introduced vision into GPT-4. And by the way, DALL·E 2 is a big step for text-to-image generation: given a prompt, it generates original, realistic images that combine concepts, attributes, and styles. Applications include in-painting, out-painting, and variations for designers, video games, urban planning, et cetera. And Mark, as a research scientist at OpenAI, also led the development of Codex, which is the AI system that powers GitHub Copilot. Codex parses natural language and generates code in response. Codex is a descendant of OpenAI's GPT-3 model, fine-tuned for useful programming applications. He led that work and published a paper evaluating large language models trained on code in July 2021. Before joining OpenAI, he was a quantitative trader at several proprietary trading firms on Wall Street, including Jane Street Capital, where he built machine learning algorithms for equities and futures trading. And Mark also has a very unique side role: he is the coach of the USA Computing Olympiad team.

All right, so in the talk Mark will give later, he will offer a look into how ChatGPT was created. He will begin with an introduction to GPT models and the reasons OpenAI invested so heavily in them, focusing on compression and scaling laws. He will then delve into the use of reinforcement learning from human feedback to transform these GPT models into the user-friendly ChatGPT, and finally discuss how ChatGPT can be trained to use external tools such as a browser. All right, so let's welcome Mark to give us this talk.

## What's GPT (talk: ~30 minutes)

![[Pasted image 20230927185223.png]]

Yeah, thank you very much for that kind introduction. Thank you for inviting me, Dean Chu and President, of course. And thank you all for coming here today to see the talk. Just before I get started, how many of you guys have used ChatGPT before? Oh, great.

> Almost everyone raised their hands

That's really encouraging to see. And how many of you have any knowledge of the transformer architecture?

> A few dozen raised their hands

OK. How many of you have trained your own transformers and written them up in PyTorch? Oh, great.

> Only a handful raised their hands

OK, cool. I think this talk is at the right level, so let's just get started. So my original title of the talk was "From Compression to ChatGPT." And I realized after some time I only have 30 minutes to give a talk, so I used some compression. And now we're just really going to talk about ChatGPT today. I think compression itself could be another half-hour talk. So I've taken the liberty to change the title to "Getting the Most Out of GPT." And I think one of the lessons that I really want you to take away today is that these GPT models are behaving more and more like humans. And sometimes you can coax a lot of abilities out of them by treating them as if they are the kind of agents that you could talk to, almost like a human counterpart.

![[Pasted image 20230927185441.png]]

So let's start with what is GPT. At its core, GPT is a very simple idea.
It's all built on next word prediction. It's a probabilistic model that takes a bunch of words as a prefix and then calculates a distribution over probabilities for the next word. So we model the probability of the next word given all the previous words. For instance, if you have a sentence, you would decompose the probability of the sentence into the probability of the first word, times the probability of the second word given the first word, times the probability of the third word given the first and second words, and so on. And the reason we do this decomposition is there are over 100,000 words in the English language. And what we're doing is we're trying to constrain the output space so that we don't have combinatorial blow-up in terms of what we're modeling.

So, one cool thing you can do with language models is that they can complete a sentence that you start, right?

![[Pasted image 20230927185528.png]]

And I can show you an example here. This is called prompting.

![[Pasted image 20230927185605.png]]

So when GPT-2 first came out, this was the result that really kind of blew me away. And you can see an example of prompting here. When you give what's highlighted in yellow to the model, the model produces the text that's written in white. So you have some controllability off the bat. So here, the prompt is, "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English." And you can see the completion, it makes sense, right? They talk about an evolutionary biologist named George Perez. So this is very, very fitting to the prompt, right?

![[Pasted image 20230927185812.png]]

It really kind of understands what the setting is and tries to create characters that are consistent with the setting. But even more importantly, what GPT-2 really taught us was that almost every single task can be described using language. And so this model, even though it was trained to do no task in particular, is kind of able to do every task that can be described in language. So for instance, we used to have machine learning benchmarks that are question-answering benchmarks. Usually you would have a passage and then some questions, and you would have to answer in natural language. And what we can frame this as is a prompt, right? You can copy the entire context into the prompt, you ask a question, and then to coax an answer out you can put "Answer:", right? And the model should understand that its task is to produce an answer in this situation.

![[Pasted image 20230927185829.png]]

Now jumping forward to GPT-3, we really see the importance of scale. So GPT-2 was just a couple billion parameters. GPT-3 took it another 100x in terms of parameter count. So what you're seeing on the x-axis here is the number of parameters. And on the y-axis, you're seeing the ability of a human to detect the difference between computer-generated text and human-generated text in this specific domain of generated news articles. And you can see that as the models get bigger, it becomes harder and harder for humans to distinguish between what's real and what's fake. And GPT-3, as a milestone, was one of the first models that came very close to being able to fool humans, which is this 50% random-chance bar right here. So capability-wise, that is what GPT was able to accomplish.
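> Editor's note: a minimal restatement of the next-word factorization Mark describes at the start of this section, added here for reference. For a sentence of words $w_1, \dots, w_n$:

$$
P(w_1, w_2, \dots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$

> So the model only ever has to output a distribution over the next word (on the order of 100,000 options) rather than over all possible sentences.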
![[Pasted image 20230927185853.png]]

But one other insight that we had with GPT-3 was that new tasks can be described in context. What I mean here is, even if the model has never seen a task before in pre-training, we are able to, within the context or the prompt that we provide to the model, specify a task that the model has never seen before. And the model should pick up what is required of it as it processes these characters. So here, I made up a task. I have some English words, and then I insert garbage characters into the words. So I'm adding some exclamation marks, some slashes. And the idea is we want the model to be able to remove these characters and produce the actual English word. So if we give the model a couple of examples, we find that the model is able to generalize to a new example here, even though it's never seen these types of examples in the training data before. So this kind of in-context learning, first of all, gets much better as you scale the model size. And secondly, it scales with the number of examples in context. You can see that it's learning as it sees more and more examples. That's the x-axis here. As you show more examples, it's learning by seeing these examples and getting better performance within the same model size.

![[Pasted image 20230927185905.png]]

And of course, earlier this year, we released GPT-4. And this was our latest experiment in scaling the models. Now, the reason that we've scaled these models isn't just kind of throwing a bunch of compute at them and hoping that good behavior emerges. We found that we can very accurately predict the performance characteristics of models as we scale them. So it's actually a very precise formula. We know, when we train GPT-4, whether it is matching up with expectations or not. And what we were finding was that GPT-4 is a much better reasoner than previous models. So you can see that in a bunch of these standardized examinations for the United States, the model is oftentimes able to score in roughly the 90th percentile of human test takers. And this is versus GPT-3, which often is in the 10th to 50th percentile. So we're seeing big improvements in specific reasoning-style datasets.

![[Pasted image 20230927185925.png]]

Another thing that we did with GPT-4, and some of my teams were responsible for putting this in, was introducing vision into the models. So the models are able to take visual inputs like pictures, and they're able to process the text in them.

![[Pasted image 20230927185952.png]]

I think at the most basic level, it's able to do OCR and just extract the text from an image. But you can also ask some questions. So you can, let's say, give it this chart, and then ask the question: what's the sum of the average daily meat consumption for, let's say, Georgia and Western Asia? And it's able to read these numbers, the 79.84 and 69.62, and actually do some mathematics here. And even further, these models can use all the base capabilities of GPT models. So they have all the language understanding that GPT-4 does. And we can even present a physics problem in a different language, ask the model to think about the solution, and it will generate the solution for you in English.
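> Editor's note: a minimal sketch of the in-context learning idea described above, in the spirit of the garbage-character-removal task. The example words, the noise characters, and the use of the `openai` Python SDK (>= 1.0) are assumptions for illustration, not taken from the talk.

```python
# Few-shot "in-context learning": the task is specified entirely inside the
# prompt via a couple of worked examples, and the model generalizes to a new
# input it has never seen. Example words/symbols are made up.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_prompt = (
    "Remove the extra symbols to recover the original English word.\n"
    "Input: s!t/u!d/e!n!t  Output: student\n"
    "Input: c/o!m?p*u/t!e?r  Output: computer\n"
    "Input: l?a/n!g*u/a!g?e  Output:"
)

resp = client.chat.completions.create(
    model="gpt-4",  # any capable chat model works here
    messages=[{"role": "user", "content": few_shot_prompt}],
    max_tokens=5,
    temperature=0,
)
print(resp.choices[0].message.content)  # expected completion: "language"
```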
![[Pasted image 20230927190013.png]]

This model is also able to explain jokes to people. So this is a typically cited meme about how deep learning has nothing novel, we just scale layers to infinity. But the model understands what we're doing. It says the humor comes from the contrast between the complexity and specificity of statistical learning versus the simplicity and generality of the neural network approach. So it's able to take these oftentimes humorous things, which humans contextualize and understand why they're funny, and explain them to us.

## GPT -> ChatGPT

So now I'm going to talk about how GPT becomes ChatGPT.

![[Pasted image 20230927190152.png]]

So language models, what they're trained to do is predict the next word. They're not trained to be helpful. So if you have a prompt that says, "explain the moon landing to a six-year-old in a few sentences," and you just run a GPT model, it's not going to give you what you want. It's going to say, OK, what's the most likely text that comes after? And in many online news articles, there are just lists of questions. So it might say something like, oh, now explain the theory of gravity to a six-year-old. Explain the theory of relativity. So this isn't really what you want. What the human wants is: people went to the moon, they took pictures and sent them back to Earth, so we can all see them.

![[Pasted image 20230927190230.png]]

So our solution here is to use this technique called reinforcement learning from human feedback. And what this does is it uses a small amount of feedback to teach the model how to achieve a goal without having a human hard-code what the goal is. So here is the interface that we provide to a human, for instance, to teach a little stick figure how to do a backflip. And in each case, all the human has to do is say, is the left side better, or is the right side better? They don't have to say what a backflip is. That's actually very difficult to specify, because maybe there's something about the center of mass moving in a circular motion. So you don't have to specify that at all. And you can see that just by rating roughly 50 comparisons, we can teach a model to produce this backflip motion without actually having to say what a backflip is. And so this is a very powerful way of learning from just a small number of examples.

Now, how do we apply this same kind of idea to language models? Well, reinforcement learning from human feedback at scale. So we want to run this at a larger scale than just having humans sit there and click which language completion is better. So what do we do here? Well, we train a model to mimic the preferences of the human. So what we do is we have a GPT model. It produces some responses.

![[Pasted image 20230927190257.png]]

And then those go to train a reward model that tells you how good a response is from a human perspective. In this way, we can replace the human with the reward model, and the reward model does the scoring instead of humans sitting there clicking. So here's the pipeline. And the reason we use RLHF for language models is that aligning a language model and having it do what you want is a very fuzzy goal. What does it mean for the model to follow my instruction perfectly? The right way to do it is to show it a bunch of examples and say, hey, this one looks a little bit more correct to me. So what we do is first we collect comparison data and train a reward model. We have the model produce several responses to "explain the moon landing to a six-year-old." And then we have real humans create a ranking of their preferences across the responses. And then this is used to train a reward model. So this reward model will learn the same preference model as the humans.
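> Editor's note: a minimal sketch of the pairwise comparison loss commonly used to train this kind of reward model (the InstructGPT-style formulation); the function name and tensor shapes are illustrative assumptions, not details given in the talk.

```python
# Pairwise preference loss for reward-model training: the reward assigned to
# the human-preferred response should exceed the reward of the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar rewards per comparison, each of shape (batch,)."""
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
    # response is scored higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: a batch of three comparisons scored by the reward model.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 0.9]),
                         torch.tensor([0.4, 0.5, -0.1]))
print(loss.item())
```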
And then we can optimize against this reward model. We use an algorithm called PPO. It uses this reward model to provide scores for the different outputs, and this is fed back into the model to improve it. So the way that people generally score language completions is known as a Likert score. It's a number between one and seven. And it's literally, you ask a human, hey, was this response what you wanted? Can you rank it from one to seven? And you can see that for a GPT model, if you don't do any reinforcement learning from human feedback, the Likert scores usually hover a little bit over two out of seven, which is pretty bad.

![[Pasted image 20230927190422.png]]

But after the RLHF procedure, you're bringing it up to oftentimes more than five out of seven. So we see that running RLHF really improves the usability of these models, especially from a human perspective. And this is also able to mitigate a lot of other side effects that we care about with language modeling, right? So there's toxicity: are the responses toxic? Do they reflect bad behavior on the internet? There's TruthfulQA: is the response I'm getting accurate or truthful?

![[Pasted image 20230927190437.png]]
![[Pasted image 20230927190444.png]]

Is it hallucinating? It's hallucinating a lot less than base GPT models. And we can also see, even in certain situations, it's better for customer assistance. So we see that across a bunch of metrics, even though we don't specify what it means to be better at these things, just by having humans compare, we are improving on all these metrics.

Just for kicks, I used our vision model.

![[Pasted image 20230927190453.png]]
![[Pasted image 20230927190459.png]]

I fed in the paper, the PDF, just pictures of the PDF. And I fed it to GPT and asked it, hey, can you read and summarize it for me? So it actually does a pretty good job here. And I won't go too much over the details here, but it's another piece of proof that the vision models are starting to get quite good. And you can also say, oh, can you explain the process described in figure two, which is the one that I just showed you. So here, you can see that there is a reward model training step and then a reinforcement learning using PPO step, and then they are iteratively performed to train the final ChatGPT model.

So just to reinforce what I just said, the procedure for turning GPT into ChatGPT is to first fine-tune GPT with conversational data. So this way, it kind of understands, like, hey, there's an assistant interacting with a user. And then a reward model is trained by having humans rank completions from this model. And then we run RLHF, this procedure, at scale.

![[Pasted image 20230927190521.png]]
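> Editor's note: for reference, the quantity PPO optimizes in this kind of RLHF setup is usually the learned reward minus a KL penalty that keeps the policy close to the supervised fine-tuned model. This is the standard InstructGPT-style objective, stated here as an assumption rather than a detail from the talk:

$$
\max_{\pi}\;\; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)}\left[\, r_\theta(x, y) \;-\; \beta \,\log \frac{\pi(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} \,\right]
$$

> Here $r_\theta$ is the reward model trained from human comparisons, $\pi_{\mathrm{SFT}}$ is the conversationally fine-tuned starting model, and $\beta$ controls how far the policy is allowed to drift from it.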
## Extracting Reasoning from GPT

So now I want to move on to some results that have really started to emerge this past year. How do you make the most out of ChatGPT? And how do you extract the most reasoning from ChatGPT? Again, one of the criticisms is that these models seemingly know a lot of information, but they make small reasoning mistakes in ways that humans wouldn't. So one of the big seminal papers here was an idea called chain-of-thought prompting.

![[Pasted image 20230927190601.png]]
![[Pasted image 20230927190649.png]]

And the core idea is that when humans solve problems, we often decompose tasks into simpler tasks, and then we solve each of the small steps before returning the final answer. But what we're asking our model to do often is to give the final answer immediately. So you can see there's a bunch of standard benchmarks in AI: for instance, math word problems, common-sense question answering, strategy question answering, et cetera. And you can see they usually consist of questions and then some kind of explanation before the answer. So typically, when we prompt our model to solve these kinds of questions, what we do is we give an example question, an example answer, and then we give a new question, and then we ask it to come up with the answer. And we ask it to come out with it immediately. So before, the answer was 11, so now it says, OK, I see what the structure is. The answer is 27. But with chain-of-thought prompting, instead of asking the model for the answer immediately, we show it examples of a person reasoning through how to solve the problem and doing the intermediate steps. So here, in our prompt on the right side, before we provide the answer, we actually show the mathematics and the step-by-step breakdown. And the model's like, oh, hey, I understand the format now. I should do some reasoning before I produce the final answer. And so we see that this makes the model much more accurate at producing correct answers, just by showing it examples of doing reasoning.

Here's a very cool plot.

![[Pasted image 20230927190718.png]]

So this only works if your base model is smart enough. So on the x-axis here, across the plots, is the number of parameters in your model. And benchmark accuracy is on the y-axis. And we see that, unfortunately, the labels are missing, but only after roughly a 50-billion-parameter model does this chain-of-thought prompting work. So this is the difference between the blue circle here and the black line over here. So we see that small models don't really understand what reasoning is well enough. But once the model is large enough, all of a sudden it's, hey, I know what I'm supposed to do. I know I'm supposed to lay out my reasoning. And suddenly, the performance improves much more. So the model has to be powerful enough to become a good reasoner. And this kind of supports our GPT-4 analysis.

Now, sometimes it's unsatisfying. What if you don't have examples of reasoning for the task that you want? Well, the answer here is also very simple.

![[Pasted image 20230927190748.png]]
![[Pasted image 20230927190800.png]]

We can just add "let's think step by step" to the prompt, instead of showing examples of other reasoning. So what we do is almost a two-step procedure: we extract the reasoning, and then we extract the answer. So you can see this on the left side. We have the chain-of-thought idea from before, where we show an example with reasoning and the model mimics the reasoning. Instead, let's just tell the model to think step by step, and you don't need a previous example. And so the model will kind of say, OK, yeah, I know what I'm supposed to do. I'm going to do some reasoning and really organize it as steps. So you can see this performs fairly well. And you can see a comparison of different types of instructions to extract reasoning here. So "let's think step by step" is the best one. You can see some other ones that work pretty well, like "let's think about this logically." You can also see some counterexample-style prompts. So if you have misleading prompts like "don't think, just feel," these don't really improve the model too much. And irrelevant prompts, of course, like "it's a beautiful day," are going to decrease the performance relative to the baseline.
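> Editor's note: a minimal sketch of the two-step zero-shot chain-of-thought procedure described above (first elicit the reasoning, then extract the answer). The question, prompt wording, and use of the `openai` SDK are illustrative assumptions.

```python
# Zero-shot chain of thought: step 1 elicits the reasoning with
# "Let's think step by step", step 2 extracts the final answer.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

question = ("A juggler has 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

# Step 1: extract reasoning.
reasoning = complete(f"Q: {question}\nA: Let's think step by step.")

# Step 2: extract the answer, conditioned on the reasoning just produced.
answer = complete(
    f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
    "Therefore, the answer (arabic numerals) is"
)
print(answer)  # expected: 4
```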
So before, we tried searching over human-written reasoning prompts. One thing we can do instead is actually ask the model to write these prompts for us.

![[Pasted image 20230927190828.png]]

Like, why do humans have to try a bunch of these alternatives? So what we can do is start with a human-written prompt and then ask the model to come up with different alternatives that might work better than this prompt. So you're seeing in this diagram that the model proposes a bunch of different alternatives. We take the best-scoring ones and then ask the model to come up with more things like them. And we can run an iterative loop. And the result is actually something that improves on "let's think step by step." It sounds like a small improvement, but these things do matter. So "let's work this out in a step-by-step way to be sure we have the right answer" works better than "let's think step by step."

Now, how else can we imagine how humans solve problems?

![[Pasted image 20230927190848.png]]

Well, sometimes humans don't just solve problems by themselves. If you go to a classroom, maybe you're in a classroom with other students. And the best way, maybe, in that sense, to get a reliable answer is to ask everyone to try to solve the problem, and take the majority answer. So this is an idea called self-consistency. We just imagine we're asking a classroom full of students the same problem, with reasoning, of course, and then we take the majority answer. So everyone reasons slightly differently. Some of them are correct. Some of them are incorrect. Remember, these large language models are trained over the full distribution of the internet, so some samples are going to be incorrect. And we just take the most common answer here. So again, this improves the performance on reasoning tasks as well.

And one other thing we can do on top of this is use debate to improve reasoning.

![[Pasted image 20230927190916.png]]
![[Pasted image 20230927190923.png]]

Let's say you have a classroom of students now. Instead of just stopping at one round, why not have multiple rounds? So let's have a bunch of students try to produce reasoning traces. And then we feed the different traces back to the students, and they all reflect on them, and then they try another round of coming up with a solution. And the interesting thing you can see here is that sometimes both students get it wrong on the first try. But by seeing each other's work, on the next round, they both get it right. And the idea here is they can combine the parts of the solution that really made forward progress and integrate them together. They often have this reflective capability. And so, yeah, you can see, for instance, in the top left corner, two people are trying to solve a simple arithmetic problem, and they get it wrong on the first try, but then they converge on the correct answer.

So we've seen a lot of examples in the previous section of how, just by thinking, hey, what would humans do in a certain situation, we can coax the model to improve its behavior.
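> Editor's note: a minimal sketch of the self-consistency idea described above (sample several independent chain-of-thought solutions, then take the majority answer). The naive regex-based answer extraction and the example question are illustrative assumptions.

```python
# Self-consistency: sample N diverse reasoning paths and majority-vote
# over the extracted answers.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Q: {question}\nA: Let's think step by step."}],
        temperature=0.7,  # diversity matters: each "student" reasons differently
    )
    text = resp.choices[0].message.content
    numbers = re.findall(r"-?\d+", text)     # crude: last number = final answer
    return numbers[-1] if numbers else text.strip()

def self_consistency(question: str, n_samples: int = 10) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]        # majority answer wins

print(self_consistency("If there are 3 cars and each car has 4 wheels, "
                       "how many wheels are there in total?"))
```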
## GPT + Tools

Now let me talk a little bit about GPT plus tools. This is another way to improve the performance of GPT models. One early notable example of this is WebGPT.

![[Pasted image 20230927190946.png]]

> live demo https://openai.com/research/webgpt

WebGPT is basically GPT with an external browsing tool, something that can look on the internet and get more reliable answers by virtue of being grounded in real web pages. So let me show you an example of what WebGPT looks like. So here, you can see, if a user types in "how does a neural network work," it goes and it browses. It clicks on web pages. It reads relevant paragraphs. So it's going and scrolling, it's clicking into other links, it's scrolling there as well. And eventually, it comes up with this answer that cites various sources. So this basically says: why force your model to memorize all this information when you could have an interplay with external databases? And techniques like this help the model stay current as well. It's impossible for a model to know, for every single restaurant, when it's open and when it's closed. And so tools like this allow us to interface with the real world in a reliable way.

There's also another important paper that came out this year called Toolformer. So models can solve new tasks from only a few examples, but they struggle with arithmetic and factual lookups.

![[Pasted image 20230927191051.png]]
![[Pasted image 20230927191058.png]]

That was one of the motivations of this paper. And their solution is to fall back on external tools for these types of problems. So for instance, if we have something like "The New England Journal of Medicine is a registered trademark of," at this point you might want to issue a request to browse and figure out the answer from a database, and then just extract that answer and put it into the context. So how do you train a system like this? Well, we use the technique that we just discussed in the previous section: we can ask the model to do it through prompting. So what we do is we take real input and output examples, and then we just ask the model, hey, where would you put an API call to browse the internet? Where would you use a calculator? And we have the model insert that wherever it thinks it belongs. So you can see here, if the input is "Joe Biden was born in Scranton, Pennsylvania," the model will just know, hey, at this point, before returning "Scranton," I should probably make an API call. And then what we do is we check which of these calls actually improve accuracy and fine-tune on those.

And finally, something that we launched earlier this year was plugins.

![[Pasted image 20230927191113.png]]

And here, what the developer can do is simply specify a behavior that they want, and the model will decide if calling the API is appropriate. So this is actually a very lightweight way of introducing APIs into GPT. You just produce what's called a manifest document, which tells the model, hey, this API is used for this purpose. And the model will just read that and be like, hey, OK, I think it's an appropriate time to call the API.
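> Editor's note: a schematic illustration of the "manifest" idea described above: the developer describes what an API is for, and the model reads that description and decides whether to call it. This is a simplified sketch of the general pattern, not OpenAI's actual plugin protocol; the tool, prompt format, and parser are all made up.

```python
# Toy tool-use loop: describe a tool, let the model choose between calling it
# (by emitting JSON) or answering directly, then dispatch accordingly.
import json
from openai import OpenAI

client = OpenAI()

manifest = {
    "name_for_model": "weather",
    "description_for_model": "Use this tool to look up the current weather for a city.",
}

def run_weather_tool(city: str) -> str:
    return f"Sunny, 28°C in {city}"  # stand-in for a real API call

prompt = (
    f"You can call this tool when appropriate:\n{json.dumps(manifest)}\n"
    'To call it, reply with JSON like {"tool": "weather", "city": "..."}; '
    "otherwise answer directly.\n\nUser: What's the weather in Taipei right now?"
)

reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

try:
    call = json.loads(reply)              # the model chose to use the tool
    print(run_weather_tool(call["city"]))
except (json.JSONDecodeError, KeyError, TypeError):
    print(reply)                          # the model answered directly
```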
So I think two other notable recent tool-use examples came out in the last month, and I would like to demo those very quickly. So DALL·E 3 is also a project that I was involved with, and I would like to show you an example of what DALL·E 3 is able to do.

> DALL-E 3 live demo

So it's an integration with ChatGPT, and it generates images. So let me play it. OK, there is some issue with the sound, but I can just read it out. So we're trying to create variations of a character in different settings.

(Laughter) So unfortunately, the sound is not working. But there's also a voice mode now that we launched yesterday. And so you can actually have the model narrate these stories, and you can speak to the model directly. And these are all implemented through tool use. Basically, these are external tools that the model can call when it's appropriate.

Lastly, I want to give a quick glimpse into the future. So there is this, I think, quite prophetic paper published from Percy Liang's lab called Generative Agents.

![[Pasted image 20230927191216.png]]

What they do is they create this 2D world where there's a bunch of agents. They can move around, and when they run into other agents, they can talk to them. They can have a conversation and interact. And what they do is they provide backstories for each of these agents: let's say, for this person, what are their characteristics like, what is their personality like. And then they just unleash them into the world. And they find a lot of surprising emergent behavior here. It turns out a lot of the agents will come together, and one will say, hey, I'm planning a party, and it's at this time. And they'll tell other agents, and they all converge in the same place. So we actually see a lot of human-like behaviors emerging just by giving the agents backgrounds and the ability to interact with each other. And they also have a reflection stage where agents review all their observations and then bake that into their personality as well. Now, I think this is a fairly undirected environment where there's no goal. But think about what you could do with the reasoners from the previous section, capable reasoning models that are goal-directed. Could you imagine having your own agent interacting with other people's agents? Or let's say a company builds an agent that interacts with your agent. I really think the future we're moving towards is an agentic world where we can trust the models' reasoning capabilities and have them act on our behalf much of the time.

## OpenAI

![[Pasted image 20230927191248.png]]

Finally, I want to say a couple of words about OpenAI. So our goal is twofold. First, we want to build safe and beneficial artificial general intelligence. And we want to ensure that AGI benefits all of humanity.

![[Pasted image 20230927191255.png]]

One thing that we care a lot about is safety. So we know that GPT models are capable of generating outputs that are untruthful, toxic, or reflect harmful sentiments. And these kinds of things can be increasingly dangerous in a world where we have elections, right? A lot of discourse in the public. And one thing that's notable is we used the six months after GPT-4 was trained to address these concerns by red-teaming it with external researchers and users.

![[Pasted image 20230927191319.png]]

So we did not launch it immediately. I think we spent a lot of time trying to make it more and more safe. And this really is the key to our strategy. We want to make state-of-the-art AI broadly available and easily accessible. And we think that this is the right way to allow people to adapt to the change that AI brings. Safety, you can't solve it by just sitting in a room and thinking about it. You really have to have the models contact the real world, have people adjust to them, and then evolve the models and address all the harms that you currently see.

![[Pasted image 20230927191327.png]]

So this is our strategy.
We feel like we've done probably more than any other company to bring the most powerful models to people and allow them to use them. And we hope to keep doing that in the future. Thank you.

## Q&A Session (1 hour)

![[Pasted image 20230927191343.png]]

## Q1

Hi. Thanks for the wonderful talk. But I'm very curious about the compression you wanted to talk about. Would it be possible for you to take 10 minutes to talk about that? For example, what kind of knowledge or wisdom do you try to compress from human language, the corpus, something like that?

Right, right. So I think I was trying to give some motivation for why we believed so much in scaling these models before other people did. And I think one of our motivations was from compression. So good predictors are good compressors as well. I think from an information theory perspective, there's an equivalence there. And in some sense, what the models are doing is they're searching over programs to produce a particular output. And I think, in some sense, having the minimum, shortest program that produces an output is the best compressor in a certain way. So there's a talk by Ilya, I think within the last month, where he articulates a lot of these points quite clearly. And I would refer you to that one for a more detailed description.

## Q2

Thanks for the great talk. It's really impressive to see everything build up, and also to see this really strong approach of iteratively releasing, making progress, and being beneficial to the real world. So I'm from the computer vision community. We went through building more-than-a-thousand-layer models, but then realized they cannot be deployed, and went down the path of compression, something related to your topic. So I'm just wondering, do you think there's a bright future for making models much, much smaller but still capable of doing everything that language can describe? Or what will that future look like? Can we imagine that?

Yeah, yeah, that's a very good question. So all the evidence so far has suggested you need a base amount of capacity in the model to do really good textual reasoning. So if you want to lean on the textual reasoning abilities of the model, you do need it to be a certain size. But there's a lot of work actively distilling these models into smaller models. And I think that's actually one of the most interesting research directions now: how do you make a small model act like a bigger model? And maybe it is something like chain of thought, like what I just talked about. Perhaps you can imagine the big model as something that's very powerful and knows a lot of things, and a small model needs a lot of steps of reasoning to get to the same understanding. So I think it's an active research direction, definitely worth pushing on: how do you give smaller models the capabilities of bigger models?

## Q3

As a professor or teacher, I'm interested in whether you think there is a correspondence between training and engineering these models and educating students, especially educating a child. Because training a model from scratch seems very similar to educating a child. Do we have any experience, or can you recall anything, connecting these two? Thank you.

Yeah, so I do have an almost two-year-old. I guess it's too early to tell what the similarities are. But honestly, at OpenAI, I think this analogy holds a lot of water. We see the pre-training phase, where we're just dumping a lot of data into the model, as almost a kind of childhood phase for the model.
It's just learning a lot of things. Maybe teaching it reasoning is more of an adolescence phase or something like that. But I think the analogy does hold water. And unfortunately, I don't have enough experience from real life.

## Q4

Hi. So my name is Susanta. I'm from India. So thanks for your wonderful talk. And I have two questions, one related to an edge version of ChatGPT, and another related to code generation. So the first one: is there any plan from OpenAI for an edge version, a smaller version, or any quantized model that we can implement on an embedded system? Maybe this one first.

Yeah, I can't talk too much about our future plans, but it isn't out of the question.

So the next one is related to code generation. As we know, this is a language model, not a dedicated code generator. But we do generate code with it, and we get help from it. But most of the time, we find that it fails after a certain level of complexity. So do you have any comment on that? Is there any chance of improvement?

Yeah, yeah. So I do think, like you said, there are gaps in reasoning. And also, as the length of the code increases, you're more likely to hit bugs. And I do think part of the issue is that humans also don't generate code character by character, line by line. There's a lot of iterative testing that you do. You complete a module. You try out certain cases. You also hold examples in your head. So I think mimicking the human prior here should do a lot. And I don't think that necessarily means you have to move out of language. You can do stuff almost like we've seen with chain of thought, like, hey, let's imagine simulating this input, for instance. So I just think a lot of these techniques will carry forward and really improve the abilities of these models. Now with code in particular, I think one thing we showed in the Codex paper was that various types of discriminators, meaning models that go and judge the full completion, the whole output of code, stacking those with the generators leads to much improved performance. And that actually isn't necessarily a given. You could imagine maybe the model just thinks of one incorrect approach and tries it all the time. But it turns out it does try a diversity of approaches. And if you did have a perfect model to pick out which one is correct, you could really improve the performance. So I think we're almost there: the model is able to come up with the right approach often enough.

Thank you.

Thank you for your presentation. I'll ask the same question from a different perspective. So after you train this big model, is there any way you can do compression, to use your word, and make it hierarchical and useful for small companies that have limited resources? This is another way of thinking about it, and it looks like mission impossible, but I still want to ask whether there is any way, after your training, to compress it hierarchically.

Yeah, yeah, that's a good question. I can tell you our philosophy strategically. What we try to do is always ship the most intelligent models at any given point in time, and that involves building the largest artifact. But we've also found a lot of demand, for instance with ChatGPT, for the faster-inference version of it. And so we understand that's important.
I feel like we'll invest more into things like that in the future. From a usability point of view, we understand it's not just that there's a lot of demand for the smartest model; there's also a lot of demand for fast, kind of edge-style models.

## Q5

I have two quick questions. The first one: these models are mostly doing language, text-based learning. I wonder if we can learn by observing the physical world. For example, would the model be smaller or converge faster, or do we still require billions of parameters to get the same kind of reasoning ability, where we can drop something like an apple and know it will fall? That's my first question. The second is on the usability side: it still requires a certain level of language ability to be able to write the prompt and do that kind of thing. Is there a plan, or in the future, how do we see this being opened up to people who maybe are not as capable of doing that, but can still get the same kind of services and tools? That's out of curiosity.

Yeah, yeah. Those are both really good questions. So I think on the first one, we're talking about learning from vision. And one interesting trend over the last couple of years is that the architectures people use for training vision are converging with the ones they're using for training language, right? So the vision transformers use very similar architectures to the text transformers. And my view is actually a little bit the opposite. So I do think it's probably easiest to learn a base amount of real knowledge from text, but at a certain point, maybe the human text abstraction is no longer enough, right? And we have to start giving visual inputs, and the model can start looking at the world and making abstractions for itself, not filtered through this bottleneck of what humans have invented. So I do think maybe in the future, to really go beyond text, we'll need to do a lot of unsupervised learning from vision.

In terms of making the models more usable for people with different backgrounds, I think one interesting thing with a model like ChatGPT is that the behavior you get is very sensitive to the backgrounds of the people who are ranking the completions. And I think we've discovered that maybe we should make this set very diverse: have kids do this, and have people from places where they aren't as literate do this. I think that could go a long way towards making the models produce the outputs that you would expect, even if you weren't as literate.

## Q6

Hello. Thanks for the wonderful talk. And I have a question. There are many concerns that the performance of GPT is dropping. Maybe it's due to long exposure to the internet, or maybe it's because it's learning from content that is generated by AI. So is that real? And if it is real, does that mean the community is not ready to completely live in the real internet world, or is it something else? Thank you.

That's a good question. So actually, I've seen a lot of articles of this form, like, it was deployed, and then this month the performance dropped 40%. Generally, a lot of that isn't accurate. I think if you look at the follow-up investigations, sometimes they tried some different formatting, and it breaks the model in a certain way. We do try to be very rigorous when we deploy. There's a pre-training pipeline and a post-training pipeline, and we run extensive A/B tests before we deploy something to the general masses.
And while I can't promise that in every single benchmark you'll never have a regression, overall we do find we are shipping models that are more capable. And I think some of those news articles can be a little bit overblown.

One more question, please. Is there any limitation on the amount of knowledge GPT can learn? Has it reached a limit in the so-called amount of knowledge, or is there no limit?

So yeah, one of the great discoveries in terms of GPTs is these scaling laws. Basically, you can predict very accurately, if you keep scaling the data and keep scaling the number of parameters, what the level of performance is, basically the amount of compressibility you can get at a certain scale. So far, we have not seen that trend break. And our intent is to push it until it does break. But hopefully it doesn't break. Thank you.

## Q7

Hi. So thanks for the talk. So previously, when GPT-4 was released, the blog post said that OpenAI decided not to open source the model. And I'm wondering, because for doing research, reproducibility is a very important factor, how would you recommend current researchers do their research, now that many other models and companies are following OpenAI's steps and not open-sourcing their models? And my second question is: what is your definition of AGI? Because it seems that every other famous researcher on Twitter has a different definition for AGI. Thanks.

Yeah. I think on the first question, there's a very tight rope that any company developing these large language models has to walk. On one hand, I feel like many researchers themselves are quite aligned with open source. But there is also a strong safety community. And I think when we did GPT-2, some people were like, hey, this is irresponsible, you can't open source this. Some people said, well, what's the danger? You need to accelerate research. And I think we tried to walk this balance by making things available, and then, after a period of time, considering whether to open source them. And I think we will continue to follow this strategy. I know it's probably not the best strategy from a researcher's point of view. But I do think it's also important to be aware of and careful about the harms. At GPT-2's scale, we realized, hey, it wasn't really a big deal, and we subsequently open-sourced the model. I would venture to say even GPT-3 probably hasn't panned out to have long-lasting harms. We may follow similar strategies there.

## Q7-2

I think to your second question — can you remind me quickly? AGI. Oh, definitions of AGI. Yeah. I mean, of course AGI is always kind of a moving goalpost. There have been many AGI tests over history, like the Turing test. And then I think Gary Marcus now has maybe one test a month. And, yeah, unfortunately it is different even within OpenAI. Some people define it as something that can accomplish economically valuable tasks at the same level a human can. But again, it does vary from person to person, and it's a little bit in the eye of the beholder.

## Q8

I have a question about the reasoning ability of [INAUDIBLE] you believe in. So the other day, I asked [INAUDIBLE] to play a game of Go against AlphaGo. How many parameters do you think ChatGPT would need to be able to play Go?

That's a very good question. So one particular instance that's not Go is chess. I remember seeing on Twitter that someone measured the rating of just native text chess against GPT-4.
And it was somewhere like 1,700 or 1,800. Yeah, it is hard to say. Sometimes these abilities suddenly shoot up, like with chain of thought. So it's hard to estimate, but we do see predictable scaling in most cases. So unfortunately, it's hard for me to answer.

## Q9

Thanks, Mark, for being here with us today. I have a couple of questions. The first one is: from my experience working with simple neural networks, after training, some of the connections and parameters are sparse or close to zero. And to me, at least intuitively, that can be an indication of how much smaller the model can be, of how much you can compress the model. I'm curious, when you guys are working with GPT at OpenAI, are there any insights or metrics you see that can lead to an estimation of how much smaller we can make a large language model? So that's my first question. My second question is: is there any current or future work on teaching ChatGPT how to ask the right questions? That could be for automatic scientific discovery, or for asking the person it is talking to for additional context, things like that. Thank you.

Yeah, both really good questions. So I think the first one — it's hard for me to remember, can you just give me a — yeah, I'm just curious if there are any metrics or — so I can give a concrete example. And this is from the sparse transformer paper. So one thing that we did there was we trained a dense transformer without much of an inductive bias. And then we went in and looked at the attention masks. We found that oftentimes the model, even though it could attend to all positions, decomposes into this kind of row-column-style attention over the images. And so I believe strongly in approaches like this. We continue to do things like this at OpenAI. And so the idea is: give the model the flexibility and see what it uses and what it doesn't. And that can give you efficiency wins.

I think — sorry, you need to — is this still relating to the first question, or — oh, yeah, the second. Oh, the second question is, yeah, is there any current work along the lines of teaching ChatGPT to ask questions, to look out into the real world and look for information and data and symbols and things?

Yeah, so actually, we care a lot about data collection. I think ChatGPT is an amazing vehicle for data collection. I think just the volume of data we're getting is immense, right? And we can think about ways of directing that data collection towards the active learning boundary of some of our models. So there's a lot of active research in that. I think we also care a lot about experts. So we've teamed up with a bunch of experts in mathematics, medicine, and other domains. And they help us interact with the model and tell us, hey, qualitatively, these are areas of improvement. So I think that's another way to drive at this. I think there are also very good datasets that are rich in this style of questions. So for example, published papers. They often have very leading questions or interesting research questions that people ask. So we could figure out how to amplify that, too.

## Q10

[INAUDIBLE] Hi, Mark. It's Jim from [INAUDIBLE]. So I was at the Monday forum as well. And thank you for a great talk at both events. And my question is related to the gentleman at the front. So if you compare OpenAI's solution with Spark or others, or what Tesla is doing, how do you compare the pros and cons, the accuracy, or the commercial value in the future?

Yeah, yeah.
I guess, yeah, right now it's hard to have unified benchmarks, but I think there have been a couple of open source efforts, a lot of leaderboards. And consistently, we still come out on top on the open source leaderboards. Yeah, at the end of the day, what some of the recent papers have shown is that if you really want to target an LLM towards a particular capability, you can get away with smaller models and training on very small and narrow distributions. But I do think you sacrifice a lot of generality. And you do need that generality for extrapolation and reasoning. So we still take the approach of trying to create the most general-purpose model. But yeah, again, there's no one gold-standard benchmark right now. Thank you.

## Q11

Hello. A lot of information on the internet is generally all rights reserved. So what is ChatGPT's approach towards copyright in general? There's a surprising amount of material that's not available for usage by a model, even though certain online communities merely treat that as a suggestion. This is getting more publicized now. So what is the approach towards that?

So I think we want to play by the book, right? We have a large team of lawyers. We lean towards the cautious side wherever we feel there's ambiguity. So for instance, you might have noticed there's this GPTBot thing posted on Hacker News a while ago. So when we were running our large web scrapes, we tell people, hey, look, this is the kind of traffic you can expect to see, and this is how to opt out of being scraped. I do think in the US, the copyright law is such that you can kind of train on copyrighted content, and derivative content is OK. And I guess my personal view is we also take inspiration from a lot of copyrighted works: we will read a book, we will understand things from the book, and we'll create. And so it's probably going to be an interesting legal landscape over the next couple of years as well. But I do think it is simplistic to say that just because something is copyrighted means you can never view it in training systems.

## Q12

Hello. Nice talk. I would like to ask two questions. The first one is about how you are constraining the large general model. How far do you think we are from having the scientific method inside the constraints of the large language model? Right now, you mentioned something like chain of thought. Is there a way to constrain how the model is performing so that it forms a hypothesis, looks for evidence, accesses the internet, and tries to come around with a new understanding and new knowledge? How far are we from having that in the large language model? What is missing? And from your perspective, what can we do?

Yeah, so I actually think you hit the nail on the head. Right now, the model doesn't really interact with the world much and isn't able to carry out the experiments it wants to. And your suggestion of using tools, I think, is the right way towards that. So I think as we move more into this agentic world, the model comes up with an experiment, and then it says, hey, now I'm going to interact with the real world and run the experiment. And then it comes back and reflects. So I think moving to this agentic framework is the right first step, and also just giving it as many tools as possible. Right now, I think it knows the experiments it wants to run, but it can't actually carry them out.
And I have a second question. So with large language models, when we train them to predict the next word and embed a large amount of data in the model, these emergent properties of new thoughts, or of prompting the model in certain directions, are kind of like how the human brain thinks or conceives of the world. What are the lessons that we can learn from large language models about our own intelligence as a human species? Is there any lesson we have learned from that? Or is it something similar to human beings but nothing we can learn from, in terms of our own language and our own thoughts? Do you have any thoughts about this?

Yeah, yeah. It is very interesting. The way that the models learn is still very different from how humans learn. And so I don't know of any experiment where we've done some kind of neuroscience — maybe that's a loaded term — on our language models and had it generalize to some kind of process for figuring out how humans work, unfortunately. But it is very exciting. I wish we had more progress in doing things like that.

## Q13

I have a question based on the previous demo of DALL·E 3. So I'm wondering what's your perspective on possible directions of multimodal development, since DALL·E 3 already uses voice, image, and text, the common sensory inputs. So I wonder if, as you just mentioned, maybe we should take more information from vision, for example video, or motion cues, or optical flow, et cetera. Thank you.

Yeah, yeah. That's a great question, because I feel like one of the things that makes multimodal so exciting is that, unlike language models, where we've more or less converged to a base architecture, it's still very much the wild west. Diffusion models are gaining in popularity. Potentially GANs are coming back and rivaling diffusion models. And so there's a really big space of model architectures to try out there. I think it's a great place to do a lot of basic research. We do a lot of it ourselves in-house. So, for example, Yang Song has been pushing a new class of models called consistency models. Kind of a close cousin of diffusion in motivation, but actually quite different. So I think it's quite exciting. It's one of the biggest reasons I love working on multimodal.

## Q14

And I just want to ask: how far are we from AI integrating into our world seamlessly, where we don't even know that it's AI? And to what extent do we trust this AI, so that it's capable of doing this, but not so much that we overly rely on it?

Yeah, that's a really good question. So I think the reality is AI is already in our world. I think when we look at collecting data these days, we can actually detect that a lot of it was AI generated. And I do think the fact that we've released things in this iterative way means people are aware that AI is in our world, AI content is in our world. And I think people have to adapt their mindsets as well, to have this prior that when they see some information, it could be AI generated. So I think that's a very important part of it, just that education, that gradual soaking in. And you had a second part of your question? The trust part. Yeah, yeah, trust. So I think it's a multi-dimensional problem. I think one thing is, you can't build AI just for one demographic or one type of person.
So I think having customizability is super, super important. I think actually handing off control to the user in many cases is also very important. The user should be making the key decisions. It should feel like a tool that's aiding you. And the model should be able to explain to you what it's doing, right? It should be in this chain-of-thought world: you want to know what the model's thinking and what it wants to do. So I think there are so many different things to explore, even on the interface side. But I do hope we can trust these models more.

## Q15

So following up from the previous question, I've been wondering if there have been any measures taken by OpenAI to actually prevent the misuse of [INAUDIBLE] and of the articles it produces, from creating fake news for the public, especially when we saw from your presentation that, with GPT-3, the generated news articles brought human detection down to about 50%, barely distinguishable as fake or real. Thank you.

Yeah. So we have online monitoring of usage all the time. We're always looking live: what are the usage trends? What are people using it for? And we set out a strict content policy as well. Like, if you want to have a relationship with GPT, we restrict use cases like that. And we try to be very careful about certain dangerous territories. And we spell that out in the content policy.

## Q16

I have a question about how ChatGPT prevents itself from generating harmful and toxic content. If you let ChatGPT interact with the real world, there can't be human supervision everywhere to ensure that ChatGPT won't generate any harmful and toxic content. So how does OpenAI make sure of that? Yeah, that's my question.

Yeah. So there's a kind of two-pronged approach, right? There's some very, very harmful content, or very bad content, you know, like maybe child sexual content, that you just prevent by default: you prevent it in the content policy, you block it in as many ways as possible. And then, when it comes to toxicity in the more general setting, it's very hard to precisely spell out what that means. So we use the RLHF approach. And we really have humans grade these things and guide the model away from this space of harmful generations. So unless you can really spell it out in words — and oftentimes for the most egregious content you can — you have to rely on these softer metrics, like RLHF.

## Q17

Thank you for a lovely talk. So here I have three questions, I think. First of all, the previous questions mentioned the similarities between artificial intelligence, the neural network models, and the human brain. So is there any planned research on applying techniques from neuroscience to the training of artificial intelligence? The second question is —

Maybe I can answer that question first. Yeah, so I do think this is still a very active and exciting area of research. But in general, recent attempts to take things that are biologically inspired and build their digital equivalents haven't been as fruitful as people hoped. So I think what has been interesting is trying to do interpretability work on the neural nets themselves. And there's really good work from, for instance, Chris Olah, showing that you can actually interpret some circuits within the neural network. Like you have a car detector, and it's composed of a window detector and a wheel detector, convolved in a certain way. And I do think that kind of stuff is very exciting.
It's kind of like how neuroscientists split the human brain into areas with different responsibilities, something like that.

Yeah, the interpretability work does try to do that. I think one difference, and maybe a neuroscientist should tell me if I'm wrong, is that there's a lot of redundancy built into neural networks. So even if you say, hey, this part is responsible for this kind of behavior, when you do the knockout experiment you often find that the behavior is also baked in somewhere else. So it is a little bit hard to say this compartment is responsible for this, and that one is responsible for that.

So the second question: you mentioned filtering toxic content, but the term "toxic" itself is already very vague. For example, if I want to create fake news, I can tell ChatGPT that I just want to write a story based on some real event, and then publish that content as fake news. There are a lot of real-life examples of this kind of thing. So is it really possible to filter out all toxic content? Is the current RLHF approach just a temporary measure, or is it actually working to make the model better?

That's a good question. I think it does a lot to make the model quite a bit better. I don't think anyone at OpenAI thinks RLHF is the final algorithm for doing this, so there's a lot of active research, and I think we should keep targeting this kind of content. RLHF is what we have now.

So the final question: a lot of articles talk about the financial situation of OpenAI. If I remember right, OpenAI's current income, especially for ChatGPT, is collected from subscriptions. Is that enough to pay for it? Is this actually a stable financial model for keeping the company alive?

Well, yeah, I can't comment on our revenue numbers unfortunately, but we do continue to receive investment and we're very optimistic in our outlook. There should be no problem at all.

I have one more question, about safety. What kind of restrictions do you have on religious content? Do you have a specific position on it, or some kind of strong rule about it?

That's a good question. Unfortunately, I'm not the right person to answer; I'm sure there's something there, but I just do research in multimodal and reasoning, so I think someone else sets that policy.

## Q18

I'm curious about your opinion on hallucination. It's a safety issue, though not one that physically injures anyone, but it can still be harmful. People already treat ChatGPT as very trustworthy, so if it says something wrong or dangerous, they may take it as the truth and try to act on it. I'm curious what long-term or future approaches OpenAI may take to address this issue.

Yeah, so I think there are a couple of reasons hallucinations arise. One thing is, again, we train on text that people write on the internet, and when people don't know the answer, they don't usually spend the text writing down that they don't know. So a lot of the training data confidently plows forward toward answering something, and our models inherit that behavior, because humans do say "I don't know," but they don't often write "I don't know." So part of this is giving the models the capability to be calibrated and to say "I don't know"
when they genuinely don't know. I think that's one important thing we want to give them. Another thing is that reasoning also helps a lot with hallucination. The more the model can reflect on what it's trying to say and catch any logical fallacies, the more we can remove certain types of hallucinations there as well. And I think the third thing is browsing and retrieval. The more we can lean on something like a web page that gives you accurate information, and maybe do some consistency checking, like looking at five or six reputable sources, the more this kind of thing will help reduce hallucinations.

## Q19

I'm here. Thank you for being with us today. I have a question. The way ChatGPT is used nowadays is short-lived: for example, I have a math question, I ask ChatGPT, and then I close the tab. My question is about ChatGPT's ability to keep the context of a whole conversation. For example, if I want to build a chatbot that can remember what I say in long-term memory, I think ChatGPT is currently lacking this ability. Is there any progress or method toward it?

Yeah, so I think long context is a very big research direction, probably across all of the labs. Even with GPT-4 we've released long-context versions, which developers are using for persisting information. And I do think tool use is another way to persist that information: if you build tools or plug-ins that can retrieve from external databases, that's another way to maintain long-term state for people. OK, thanks.

## Q20

Hi, Mark. In your slides you talked about using reinforcement learning to improve GPT. I want to ask: if our lab doesn't have such a big model, only a small model, maybe we can't provide a good environment for reinforcement learning. Can I still use reinforcement learning to improve my model and make the agent perform better?

Yeah, I mean, in the language modeling setting it's probably a little bit more difficult, but reinforcement learning has been hugely successful in narrow domains. In a lot of the work we've done in the past, there's a project we published a while back, if you try to create a diverse kind of agent that you train with reinforcement learning across a diverse set of environments, you really need a lot of training data for that. So if you have a narrow, well-defined environment, you're still able to do that at a small scale. But scaling it to a very diverse set of environments is a little bit harder. Oh, OK.

## Q21

Thank you for your wonderful talk. I am also from India. I had one question regarding one of the slides, the one on chain-of-thought prompting. Can you explain in simple terms how it is built into the model, like you showed with LaMDA, PaLM, and those other models? What's the basic--

OK, so this is pretty model agnostic. Basically, you can apply chain of thought to any language model. Very simply stated, before giving the model a question, you show it another related question along with a long reasoning trace that produces the answer. And the model, just by virtue of prompting, will try to mimic that same type of behavior in producing the next answer.
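As a concrete illustration of that description, here is a minimal sketch of a few-shot chain-of-thought prompt. The worked example, the new question, and the `build_cot_prompt` helper are all invented for illustration; the point is only that the prompt contains a related question plus its full reasoning trace before the question you actually care about.

```python
# Minimal chain-of-thought prompting sketch (illustrative, not OpenAI code).
def build_cot_prompt(new_question: str) -> str:
    # A related question with a full reasoning trace that the model can mimic.
    worked_example = (
        "Q: A farmer has 3 pens with 4 sheep each. He buys 5 more sheep. "
        "How many sheep does he have?\n"
        "A: Each pen has 4 sheep and there are 3 pens, so 3 * 4 = 12 sheep. "
        "Buying 5 more gives 12 + 5 = 17. The answer is 17.\n\n"
    )
    return worked_example + f"Q: {new_question}\nA:"

prompt = build_cot_prompt(
    "A class has 6 rows of 5 desks. 3 desks are removed. How many desks remain?"
)
print(prompt)
# Sending this prompt to a text-completion model tends to elicit a
# step-by-step trace before the final answer, mirroring the worked example.
```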
So it's agnostic to whether it's PaLM or some other type of model; any language model should have this capability.

I have one more question, if I'm not out of time. About what you said on small models giving good performance compared to larger models: can transfer learning also play a big role in that? Maybe compressing the models while still getting good performance?

Yeah, I mean, transfer learning I see a little bit as an orthogonal thing. Transfer learning will help your big models and your small models roughly equally. There may be, yeah, I should reflect more on whether there's some kind of distillation angle to transfer learning, but nothing comes to mind right now. I think there's one more over there. So that's the last one.

## Q22

[INAUDIBLE] OK, thank you for giving me the last question. There is an attack method against LLMs called prompt injection, where the attacker tries to trick the model into outputting what they want. A couple of months ago, some users tricked Microsoft's Bing chat into outputting activation keys for Windows 10 and Windows 11 by fishing for sympathy. So what is OpenAI's approach to protecting your models, like ChatGPT, against these kinds of attacks?

Honestly, even RLHF as a base technique, and continuing to tune on that, is quite a good defense mechanism. And I think over time what we've seen is that the prompts you need to get prompt injection to work have become more convoluted and more involved. Ultimately, I'm not sure this is necessarily the right attack vector to defend at this point in time, because it's not like you accidentally trigger a prompt injection attack; you really have to be a malicious user to come up with the strings you need these days. So in some sense, you're already violating the content policy. I do think we'll continue to make it harder and harder, but it doesn't feel as relevant as a user accidentally producing bad behavior. OK, thank you.

## Q23

Actually, I have a question myself. (from the moderator) (audience laughing) Have you ever thought about how AI tools like ChatGPT can shape people's future lives?

Yeah, I mean, from my own personal perspective, I see it as a very useful tool today already. I use it in my daily coding, and it helps me very effectively catch up on, let's say, some new framework that I don't know. Even as an educational tool, I feel like it's very useful for me. If I want to learn something in biology or in chemistry, it's often easy for me to have that conversation, and it allows me to drill into the questions at the right level of depth. So I do think for education in the future, it'll be very useful as a companion for learning something like coding. And I do think one underrated use case is also just conversation. A lot of people are lonely in this world, and it could be a good conversational partner that keeps that state and maybe tries to put you in a better mood or something like that.

But there is a worry in the world that many jobs will be replaced by robots. Do you have any comment on this?

Yeah, I mean, with technology there's always uncertainty about the future, and I don't think brushing that under the rug is the right approach either. But right now, it is most useful as a tool.
And I think we can also lean on regulatory frameworks. A lot of what Sam Altman did on his world tour was about that: setting up the right kind of frameworks to make sure we're developing the right ecosystem and figuring out what level of responsibility the AIs can take on, right? So I do trust the world to regulate AI in a way that prevents massive job loss. And honestly, generally I'm a tech optimist. I am optimistic about a future where we have more tech, and I think that's generally been the pattern in history too: tech improves our lives.

[Pause]

So, I think it's time for your class. Let's thank Mark again for this very wonderful talk.