What We Learned From A Year of Building With LLMs

Youtube: https://www.youtube.com/live/vaIiNZoXymg?t=28569s

請注意,本網頁為程式自動產生,可能會有錯誤,請觀賞原影片做查核。網頁產生方式為影片每5秒截圖、去除重複的影像,使用 whisper 模型做語音辨識字幕、使用 gpt-4o 做中文翻譯,以及 Claude 做摘要。

  1. 戰略考量
    1. 模型不是護城河
      • 專注於產品專業知識和現有產品
      • 找到並深耕自己的利基市場
      • 構建模型提供商沒有的東西
    2. 將模型視為 SaaS 產品
      • 當有明顯更好的競爭對手時迅速切換
      • 高 MMLU 分數不等於產品
    3. 持續改進循環
      • 評估和數據是核心
      • 類似於 MLOps、DevOps 和精益創業的迭代改進
    4. 展望未來
      • 每 12-18 個月成本降低一個數量級
      • 為未來的經濟可行性做準備
  2. 運營考量
    1. 避免過度依賴工具
      • 不要認為更多工具就能解決所有問題
      • 平衡探索和利用現有機制
    2. 謹慎雇用機器學習工程師
      • 不要急於微調模型
      • 先考慮如何構建有價值的產品
    3. AI 工程師角色定義
      • 避免使用模糊的工作頭銜
      • 明確所需技能和期望
      • 將評估和數據素養納入核心技能
    4. 人才管理
      • 根據公司成熟度階段雇用合適人才
      • 強調評估和數據素養的重要性
  3. 戰術考量
    1. 建立有效的評估系統
      • 將複雜任務分解為可測試的小部分
      • 使用斷言式測試
      • 考慮使用評估器模型
    2. LLM 作為評判者的考量
      • 易於原型設計,但難以精確對齊標準
      • 考慮資源和應用開發階段
    3. 定期查看數據
      • 部署後持續監控
      • 尋找易於表徵的數據切片
      • 追蹤代碼庫和模型版本變化
    4. 實施自動化防護欄
      • 基本檢查:毒性、個人信息、版權、預期語言
      • 開發無參考評估作為防護欄
    5. LLM 應用的技術債務
      • 傳統 MLOps 原則仍然適用
      • 需要維護模型周圍的大量基礎設施

圖片

[MUSIC PLAYING] [APPLAUSE]

【音樂播放中】 【掌聲】

圖片

Hey, everyone. So you're about to experience something of a strange talk,

嘿,大家好。你們即將體驗一場有點奇怪的演講,

圖片

and not just because Brian and I are strange, but because something kind of strange happened. Over the last year, a bunch of us were posting things

不只是因為 Brian 和我很奇怪,而是因為發生了一些奇怪的事情。過去一年裡,我們一群人一直在發布一些東西

圖片

on Twitter. We were writing blog posts complaining about LLMs. And we formed a little group chat.

在 Twitter 上。我們寫了抱怨 LLM 的部落格文章。然後我們組成了一個小群聊。

圖片

And we were continuing to complain about LLMs

而且我們一直在抱怨 LLMs

圖片

to each other and sharing what we were working on, when we realized we were all about to write

彼此分享我們正在做的事情,當我們意識到我們都即將寫作時

圖片

the exact same blog post, what we learned in the last year. So we got together, and we turned

完全相同的部落格文章,我們在去年學到了什麼。所以我們聚在一起,然後我們轉變了

圖片

what was initially a couple of short blog posts

最初只是幾篇簡短的部落格文章

圖片

into a long white paper on O'Reilly,

成為 O'Reilly 上的一篇長篇白皮書,

圖片

combining our lessons across strategic, operational, and tactical levels of building LLM applications.

結合我們在戰略、操作和戰術層面構建 LLM 應用的經驗教訓。

圖片

And the response to that white paper was overwhelmingly positive.

而對於那份白皮書的回應是壓倒性地正面。

圖片

We heard from everybody, from people

我們聽到了每個人的聲音,來自各種人們

圖片

who contribute to Postgres, to venture capitalists, to tool builders, saying, we loved

對 Postgres 做出貢獻的人、風險投資家、工具建造者說,我們喜歡

圖片

what you wrote in that article. I felt that pain, too.

你在那篇文章中寫的內容。我也感受到了那種痛苦。

圖片

And we were invited on the strength of that to give this keynote address.

基於此,我們受邀發表這場主題演講。

圖片

And so we faced a kind of funny challenge,

於是我們面臨了一種有趣的挑戰,

圖片

which is that part of the appeal of this blog post, of this article, was that the six of us all came together to write it. As Scott Condren put it, it was like an Avengers team-up.

這篇部落格文章的吸引力之一在於我們六個人一起合作撰寫。正如 Scott Condren 所說,這就像復仇者聯盟的合作。

圖片

So we had to figure out a way to deliver one keynote talk from six people.

所以我們必須想出一個方法,讓六個人一起發表一場主題演講。

圖片

So we pulled the Avengers together for one night

所以我們把復仇者聯盟召集在一起過了一晚

圖片

only to deliver some of the most important insights

僅僅是為了傳達一些最重要的見解

圖片

from that 30-page article, to add some of our spicy extra takes that ended up on the cutting room floor,

從那篇 30 頁的文章中,加入一些我們辛辣的額外觀點,這些觀點最終被剪掉了,

圖片

and to respond to the allegations. I'd like to state, unequivocally,

並且回應這些指控。我想明確地說明,

圖片

that we are not, in fact, crypto bros who just found out

事實上,我們並不是那種剛剛才發現

圖片

that GPT-4 was the new Web3. We all trained our first neural networks back

GPT-4 是新 Web3 的加密貨幣兄弟。我們都訓練了我們的第一個神經網絡,

圖片

when you had to write the gradients by hand.

當你必須手寫梯度的時候。

圖片

So we split the article up into three pieces. We split the talk into three pieces. First, you're going to hear from me and Brian, talking about the strategic considerations for building LLM applications.

所以我們把文章分成三個部分。我們把演講分成三個部分。首先,你會聽到我和 Brian 談論構建 LLM 應用的策略考量。

圖片

How do you look to the future? How do you see around corners? How do you make big decisions?

你如何展望未來?你如何預見未來的轉折點?你如何做出重大決策?

圖片

Then we're going to hand the clickers and the stage over to Hamel Husain and Jason Liu, who are going to share the operational considerations.

接下來,我們將把遙控器和舞台交給 Hamel Husain 和 Jason Liu,他們將分享營運方面的考量。

圖片

How do you put together processes? How do you put together teams?

如何組織流程?如何組建團隊?

圖片

How do you think about workflows around delivering LLM applications? And then they will hand over the clickers and the stage to Shreya Shankar and Eugene Yan,

你如何看待交付 LLM 應用程序的工作流程?然後他們會將點擊器和舞台交給 Shreya Shankar 和 Eugene Yan。

圖片

who will talk about the tactical considerations for building LLM applications.

將討論構建 LLM 應用的戰術考量。

圖片

What are the specific techniques, tactics, and moves that have stood the test of one year's time

經過一年的時間,哪些具體的技術、戰術和動作經受住了考驗

圖片

for building LLM applications?

用於構建 LLM 應用程式?

圖片

All right, so Brian, how do you build an LLM application

好吧,Brian,你是如何建立一個 LLM 應用程式的

圖片

without getting outmaneuvered and wasting everybody's time and money? Yes, yes.

不會被巧妙地超越並浪費大家的時間和金錢嗎?是的,是的。

圖片

Well, many of you may be thinking that there's really only one way to win

嗯,許多人可能會認為只有一種方法可以獲勝

圖片

in this new, exciting, dynamic, and very scary industry.

在這個新穎、令人興奮、充滿活力且非常可怕的行業中。

圖片

And that, of course, is to train your own custom model--

當然,這就是訓練你自己的自訂模型--

圖片

pre-training, fine-tuning, a little RLHF here and there. You better start from scratch, buddy. Not quite. The model is actually not your moat.

預訓練、微調、這裡那裡一點 RLHF。你最好從頭開始,夥計。不完全是。這個模型其實不是你的護城河。

圖片

For almost no one in this audience, the model is the moat.

對於這個觀眾群中的幾乎所有人來說,這個模型並不是護城河。

圖片

You all, as AI engineering devotees,

你們這些 AI 工程愛好者,

圖片

should be building in your zone of genius.

應該在你的天才區域內建設。

圖片

You should be leveraging your product expertise or your existing product. Maybe you've got one.

你應該利用你的產品專業知識或現有的產品。也許你已經有一個了。

圖片

And you should be finding your niche and digging into that niche, exploiting it.

而且你應該找到你的利基市場並深入挖掘,充分利用它。

圖片

You should be building what the model providers are not.

你應該建構模型提供者沒有的東西。

圖片

There's a high likelihood that the model providers

模型提供者很有可能

圖片

have to build a lot of things for all of their customers. Don't waste your calories on building these things.

必須為所有客戶建造很多東西。不要浪費你的精力在建造這些東西上。

圖片

The Sam Altman phrase of steamrolling is appropriate here.

Sam Altman 所說的「壓倒性」在這裡是恰當的。

圖片

And you should be treating the models like any other SaaS product.

而且你應該像對待其他 SaaS 產品一樣對待這些模型。

圖片

You should be quickly dropping them when there's a competitor that's clearly better.

當有明顯更好的競爭對手時,你應該迅速放棄他們。

圖片

No offense to GPT-4o, but Sonnet 3.5 is looking pretty sharp.

無意冒犯 GPT-4o,但 Sonnet 3.5 看起來相當不錯。

圖片

It's important to keep in mind that a model with high MMLU scores, that's not a product.

請記住,一個擁有高 MMLU 分數的模型,並不是一個產品。

圖片

87% on Spider SQL, that doesn't automate all data requests,

在 Spider SQL 上達到 87%,這並不能自動化所有數據請求,

圖片

or even 87% of them. You can't sell a HumanEval pass rate of 67.

甚至連其中的 87% 都做不到。你也無法靠 67 分的 HumanEval 通過率來賣產品。

圖片

At least my GTM team doesn't know how.

至少我的 GTM 團隊不知道怎麼做。

圖片

An excellent LLM-powered application

一個優秀的 LLM 驅動應用程式

圖片

is an excellent product. It's well-designed.

是一個優秀的產品。設計得很好。

圖片

It solves a job to be done.

它解決了一個需要完成的工作。

圖片

And it enhances your user. Why are we so excited about AI?

而且它提升了你的用戶體驗。我們為什麼對 AI 如此興奮?

圖片

Human enhancement. So what should you build if not all these things?

人類增強。那麼,如果不建造這些東西,你應該建造什麼呢?

圖片

Things that generalize to smarter and faster models.

能夠推廣到更聰明和更快速模型的事物。

圖片

Things that help you maintain your product's quality

幫助您維持產品品質的事項

圖片

bar under uncertainty.

在不確定性下的標準。

圖片

And things that help you continuously improve. Whoa, Brian.

還有那些幫助你持續改進的東西。哇,Brian。

圖片

Continuous improvement. That's my trigger phrase. The idea of continuous improvement

持續改進。這是我的觸發詞。持續改進的理念

圖片

has been brought to the world of LLM applications

已被引入 LLM 應用的世界

圖片

by this shift in focus that we've all felt since the previous AI Engineer Summit

自從上次的 AI Engineer Summit 以來,我們都感受到這種重心轉移。

圖片

to focus on evaluation and data.

專注於評估和數據。

圖片

It's nicely captured by this diagram from our co-author, Hamel Husain,

我們的共同作者 Hamel Husain 的這張圖表很好地呈現了這一點,

圖片

showing this virtuous cycle of improvement. It has evals and data at the center,

顯示這個改進的良性循環。它以評估和數據為中心,

圖片

but the core reason to create those evals, the core reason to collect that data,

但創建那些評估的核心原因,收集那些數據的核心原因,

圖片

is to drive forward this loop of continuous improvement.

是推動這個持續改進的循環。

圖片

And despite what your expensive consultants or many of the LinkedIn influencers posting about LLM

儘管你的昂貴顧問或許多在 LinkedIn 上發文的影響者都在談論 LLM

圖片

apps might say, this is not actually the first time that engineers have tried to tame a complex system

應用程式可能會說,這其實不是工程師第一次嘗試駕馭複雜系統。

圖片

and make it useful and valuable. This same loop of iterative improvement was also at the core of ML Ops, at the operationalization

並使其有用且有價值。這種相同的迭代改進循環也是 ML Ops 的核心,在運營化過程中

圖片

of machine learning models before LLMs. This figure from our co-author, Shreya Shankar's paper, had that same loop of iterative improvement centered also on evaluation and on data collection. ML Ops was also not the first time

在 LLM 之前的機器學習模型。這張來自我們共同作者 Shreya Shankar 論文的圖表,也有相同的迭代改進循環,重點同樣放在評估和數據收集上。ML Ops 也不是第一次

圖片

that engineers faced this problem, the problem of complexity, the problem of non-determinism

工程師面臨這個問題,複雜性的問題,非決定性的問題

圖片

and uncertainty. The DevOps movement that gave ML Ops its name also focused on this kind of iterative improvement

和不確定性。給 ML Ops 命名的 DevOps 運動也專注於這種迭代改進

圖片

and on monitoring information in production

以及監控生產中的資訊

圖片

to turn into improvements to products. But, dear reader, DevOps was not the first time

變成產品改進。但是,親愛的讀者,DevOps 並不是第一次

圖片

that engineers tackled this problem of uncertainty

工程師們解決了這個不確定性的問題

圖片

and solved it with iterative improvement.

並通過反覆改進解決了它。

圖片

DevOps built on the ideas of the lean startup movement from Eric Ries that was focusing not just

DevOps 建立在 Eric Ries 的精實創業運動理念之上,這些理念不僅專注於

圖片

on building an application, not just

在構建應用程式,而不只是

圖片

on building a machine learning model or an LLM agent, but on building the entire business.

而是建立整個業務,而不是建立機器學習模型或 LLM 代理。

圖片

And it used this same loop centered on measurement and data

並且它使用了同樣以測量和數據為中心的循環

圖片

to drive the improvement and building of a business.

推動業務的改進和建設。

圖片

This idea itself was not invented in Northern California, despite what some people might say. It has its roots in the Toyota production system

這個想法本身並不是在北加州發明的,儘管有些人可能會這麼說。它的根源在於豐田生產系統。

圖片

and in the idea of Kaizen, or continuous improvement. Genchi genbutsu is one of the core principles from that movement that we can take forward into the development of LLM applications. It means real things, real places. And at Toyota, that meant sending executives out to factory floors, getting their khakis a bit dirty. For LLM applications, the equivalent is looking at your data. That data is the real information about how your LLM application is delivering value to users.

並且在改善或持續改進的理念中。現地現物是這個運動中的核心原則之一,我們可以將其應用於 LLM 應用程式的開發。這意味著真實的事物,真實的地方。在豐田,這意味著派高層管理人員到工廠車間,讓他們的卡其褲沾上一點灰塵。對於 LLM 應用程式來說,這相當於查看你的數據。這些數據是真實的信息,顯示你的 LLM 應用程式如何為用戶提供價值。

圖片

There's nothing that is more valuable than that.

沒有什麼比那更有價值的了。

圖片

Finally, there's lots of people selling tools

最後,有很多人在賣工具

圖片

at this conference, including myself. It's easy to get overly excited about the tools and the construction of this iterative loop of improvement and to forget where value actually comes from. And there's a great pithy, earthy statement from the Toyota production system,

在這次會議上,包括我自己在內,很容易對這些工具和這個迭代改進循環的構建過於興奮,而忘記了價值實際上來自哪裡。豐田生產系統有一句非常簡潔、樸實的話,

圖片

from Shigeo Shingo, that I really like. Value is only created when metal gets bent.

來自新鄉重夫的一句話,我非常喜歡。價值只有在金屬被彎曲時才會產生。

圖片

So we have to make sure that we don't get lost just building our evals and calculating concept drift,

所以我們必須確保我們不會只顧著建立我們的評估和計算概念漂移而迷失方向,

圖片

and we instead make sure that we continue

而我們則確保我們繼續

圖片

to get out there and bend metal and create value for our users. - Not gonna lie, I might have misunderstood it earlier

出去彎曲金屬並為我們的用戶創造價值。- 不會說謊,我之前可能誤解了

圖片

when you said let's get bent. Okay, so right off the bat,

當你說讓我們放縱一下。好吧,那麼一開始,

圖片

we need to spin that data flywheel, Bob. Oh wait, sorry, wrong game show.

我們需要轉動那個數據飛輪,Bob。哦,等等,抱歉,搞錯節目了。

圖片

Point is, we need to get this moving. We need to get this in front of users and human beings. We need to express the goals for our system. And how do we do that? With evals. Remember, evals are not convenient, weird, bespoke metrics.

重點是,我們需要讓這個進展起來。我們需要把這個展示給用戶和人類。我們需要表達我們系統的目標。我們怎麼做到這一點呢?通過評估。記住,評估不是方便的、奇怪的、定制的指標。

圖片

Evals are objectives. They're what we want our system to do. Any system for capturing this behavior is good enough.

評估是目標。它們是我們希望系統執行的任務。任何能夠捕捉這種行為的系統都足夠好。

圖片

I don't have an evals framework to sell you, but what I do have to sell you is this idea

我沒有一個 evals 框架要賣給你,但我有一個想法要賣給你。

圖片

that you should be getting out there. You should be getting started. But wait, Brian, I'm really nervous. What if this isn't good enough for my customers? Fear is the mind killer.

你應該要出去行動了。你應該要開始了。但是等等,Brian,我真的很緊張。如果這對我的顧客來說不夠好怎麼辦?恐懼是心靈的殺手。

圖片

(audience laughing) Put it out there in beta.

(觀眾笑聲)把它以 beta 版推出。

圖片

If it's good enough for these incredible companies like Apple Intelligence, Photoshop, and Hex, that's me,

如果這對 Apple Intelligence、Photoshop 和 Hex(那就是我)這些了不起的公司來說都夠好,

圖片

it's good enough for you. You need to collect this data.

這對你來說已經足夠了。你需要收集這些數據。

圖片

You need to put something in the wild. You need to start looking at your user interactions. The real user interactions, LLMs, responses,

你需要將某些東西放到野外。你需要開始觀察你的用戶互動。真正的用戶互動、LLMs、回應,

圖片

deserve human eyes.

值得人類的眼睛。

圖片

You can give it some AI eyes too,

你也可以給它一些 AI 眼睛,

圖片

but definitely look at it with your human eyes. Binary human feedback is valuable.

但一定要用你的肉眼來看。二元人類反饋是有價值的。

圖片

It's nice to add some rich feedback too. That can be interesting. But start with binaries.

一開始先從二元選項開始,但加入一些豐富的反饋也不錯,這樣會更有趣。
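To make the binary-feedback idea concrete, here is a minimal sketch of what a logged interaction record might look like so that thumbs-up/down can later drive evals; the field names are hypothetical, not from the talk.

```python
# A minimal sketch of a per-interaction log record so binary human feedback
# can later drive evals. Field names are hypothetical, not from the talk.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class InteractionLog:
    trace_id: str
    prompt_version: str        # which prompt template produced this response
    model: str                 # the pinned model snapshot that was called
    user_input: str
    llm_output: str
    thumbs_up: Optional[bool]  # binary feedback first; richer feedback can come later
    created_at: datetime
```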

圖片

And finally, user requests will reveal the PMF opportunities that lie below your product substrate. Where is your PMF? Everybody wants to know. It's in your user interactions.

最後,用戶請求將揭示出在你的產品基質下的 PMF 機會。你的 PMF 在哪裡?每個人都想知道。答案就在你的用戶互動中。

圖片

What are they asking your chatbot that you haven't yet implemented? That's a really nice direction to skate

他們問你的聊天機器人什麼問題是你還沒有實現的?這是一個非常好的發展方向。

圖片

if that's where the puck's going. - And despite the focus on the user interactions

如果那是冰球要去的地方。- 儘管專注於用戶互動

圖片

that you can have today,

你今天可以擁有的,

圖片

the things that you can ship right now, it's important to also think about the future. The best way to predict the future is to look at the past, find people predicting the present, and copy what they did. In designing many of the components

目前你可以運送的東西,重要的是也要考慮未來。預測未來的最佳方法是回顧過去,找到那些預測現在的人,並模仿他們的做法。在設計許多元件時

圖片

of the personal computing revolution, Alan Kay and others at PARC adopted as a core technique projecting Moore's law out into the future. They built expensive, unmarketable, slow and buggy systems themselves so they could experience what it was like and build for that future and create it. We don't have quite the industrial scaling information that Moore had when he wrote down his predictions, but we do have the beginnings of those same laws. There's been an order of magnitude decrease in cost every 12 to 18 months at three distinct levels of capability. At the capability of davinci, the original GPT-3 API that excited a lot of us about the idea of building on foundation models. The capabilities of text-davinci-002, the model lineage underlying ChatGPT that brought the rest of the world to excitement about this technology. And the latest and greatest level of capabilities with GPT-4 and Sonnet. In each case, around 15 months is enough time

在個人電腦革命中,Alan Kay 和 PARC 的其他人採用了一種核心技術,即將摩爾定律投射到未來。他們自己構建了昂貴的、無法上市的、緩慢且有漏洞的系統,以便親身體驗並為未來構建和創造它。我們沒有摩爾當初寫下預測時所擁有的那種工業規模信息,但我們確實有這些相同定律的開端。每隔 12 到 18 個月,在三個不同的能力層級上,成本都下降了一個數量級。在 davinci 的能力層級上,原始的 GPT-3 API 讓我們許多人對基於基礎模型構建應用感到興奮。在 text-davinci-002 的能力層級上,這是 ChatGPT 背後的模型譜系,讓世界其他地方對這項技術感到興奮。以及最新最強大的 GPT-4 和 Sonnet 的能力層級。在每種情況下,大約 15 個月的時間就足夠了。

圖片

to drop the cost by an entire order of magnitude.

將成本降低整整一個數量級。

圖片

This is faster than Moore's law. And so the appropriate way to plan for the future

這比摩爾定律還快。因此,規劃未來的適當方式

圖片

is to think what this implies

要思考這意味著什麼

圖片

for what applications that are not economical today will be economical at the time that you need to raise your next round. So in 2023, it cost about $625 an hour to run a video game where all the NPCs were powered by a chat bot. That's pretty expensive. In 1980, it cost about $6 an hour to play Pac-Man, inflation adjusted. That suggests that if we just wait

對於哪些應用今天不經濟,但在你需要籌集下一輪資金時會變得經濟實惠。因此,在 2023 年,運行一個所有 NPC 都由聊天機器人驅動的視頻遊戲每小時大約需要 $625。這相當昂貴。在 1980 年,玩 Pac-Man 每小時大約需要 $6,經過通貨膨脹調整。這表明如果我們只是等待

圖片

for two orders of magnitude reduction or about 30 months from mid-2023,

從 2023 年中起,大約 30 個月內實現兩個數量級的減少,

圖片

it should be possible to deliver a compelling video game experience with chat bot NPCs at about $6 an hour

應該可以以每小時約 6 美元的價格提供具有聊天機器人 NPC 的引人入勝的電子遊戲體驗

圖片

and people will probably pay for it.

而且人們可能會為此付費。
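As a rough check on the arithmetic above, here is a back-of-the-envelope projection using the talk's figures; the 15-months-per-order-of-magnitude rate is the assumption stated earlier.

```python
import math

# Back-of-the-envelope projection from the talk's figures: ~$625/hour for
# chatbot-driven NPCs in mid-2023, targeting Pac-Man's ~$6/hour (inflation adjusted),
# assuming one order of magnitude of cost reduction every ~15 months.
cost_mid_2023 = 625.0
target_cost = 6.0
months_per_10x = 15

orders_of_magnitude = math.log10(cost_mid_2023 / target_cost)  # ~2.0
months_needed = orders_of_magnitude * months_per_10x           # ~30 months from mid-2023
print(f"{orders_of_magnitude:.1f} orders of magnitude, ~{months_needed:.0f} months")
```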

圖片

So you can't sell it now, but you could live it and you can design it and you can be ready

所以你現在還不能賣它,但你可以先親身體驗它,你可以設計它,你可以做好準備。

圖片

when the time comes. So that's how to think about the future and how to think strategically when building LLM applications. I'd like to call to the stage my co-authors,

當時機來臨時。所以這就是在構建 LLM 應用程式時如何思考未來和如何進行戰略思考的方法。現在我想請我的共同作者上台,

圖片

Jason Liu and Hamel Husain, to talk about the operational aspects.

Jason Liu 和 Hamel Husain,來談談營運方面的考量。

圖片

Let's give them a hand. (audience applauding)

讓我們給他們掌聲鼓勵。(觀眾鼓掌)

圖片

- All right, so Hamel and I have basically

- 好的,所以 Hamel 和我基本上

圖片

been doing a lot of AI consulting in the past year.

過去一年來一直在進行大量的 AI 諮詢。

圖片

We've worked with about 20 companies so far and we've done something from pre-seed

到目前為止,我們已經與大約 20 家公司合作過,並且我們從種子前階段就開始做一些事情。

圖片

all the way to public companies and I'm pretty bored of giving generic good advice, especially because there's such a range of operators here. And so instead, I'm gonna invert.

一直到上市公司,我對於給出一般性的好建議感到相當厭倦,特別是因為這裡有各種各樣的經營者。所以,取而代之,我要反其道而行。

圖片

My goal today is to tell you how to ruin your business. First of all, everyone knows that in the gold rush, you sell shovels and so if you wanna get gold, you gotta buy shovels too, right? You know, if you wanna find more gold, keep buying shovels.

我今天的目標是告訴你如何毀掉你的生意。首先,大家都知道在淘金熱中,你賣鏟子,所以如果你想要得到黃金,你也得買鏟子,對吧?你知道的,如果你想找到更多的黃金,就繼續買鏟子。

圖片

Where do I dig? Keep buying shovels. How do I know when to stop digging?

我應該在哪裡挖?繼續買鏟子。我怎麼知道什麼時候該停止挖掘?

圖片

The shovel will tell you. And how do I dig one deep hole versus making investments in a plenty of shallow holes?

鏟子會告訴你。而我該如何挖一個深洞,而不是在很多淺洞上投資?

圖片

Again, the answer is more shovels, clearly, right?

再說一次,答案顯然是更多的鏟子,對吧?

圖片

And this might be generic, so I'll give you some more specific advice. If your rag app doesn't work, try a vector database, a different vector database. If the methodology doesn't work, implement a new paper. And maybe if you update the embedding model, you'll finally find product market fit. Because truth be told, success does not lie

這可能有點籠統,所以我會給你一些更具體的建議。如果你的 rag 應用程式無法運作,試試向量資料庫,換一個不同的向量資料庫。如果方法行不通,實施一篇新的論文。也許如果你更新嵌入模型,你最終會找到產品市場契合點。因為說實話,成功並不在於

圖片

in developing expertise or processes.

在發展專業知識或流程方面。

圖片

Try more tools. There's no need to balance between exploring and exploiting the mechanisms that work for you.

嘗試更多工具。你不需要在探索和利用對你有效的機制之間取得平衡。

圖片

Change the tools. And the processes and the decision-making frameworks

更換工具。以及流程和決策框架

圖片

don't matter. The right tool will solve everything. Number two, find a machine learning engineer who can fine-tune as quickly as possible. A $2,000 per month OpenAI bill is very expensive. And instead, hire someone for a quarter of a million dollars, give them 1% of your company, to fight CUDA build errors and figure out server cold starts, right? Because what's the point of growing your company

不重要。正確的工具會解決一切。第二,找到一個能夠盡快微調的機器學習工程師。每月 $2,000 的 OpenAI 帳單非常昂貴。相反,雇用一個年薪 25 萬美元的人,給他們公司 1% 的股份,來解決 CUDA 構建錯誤和伺服器冷啟動問題,對吧?因為發展你的公司有什麼意義呢?

圖片

if you're just a wrapper? And if your margins are too low, try fine-tuning.

如果你只是個包裝器?如果你的利潤太低,試著微調。

圖片

It's much easier than figuring out how to build something worth charging for.

要比想出如何建造值得收費的東西容易得多。

圖片

I cannot reiterate this enough. It's very important to hire a machine learning engineer as quickly as possible, right? Even if you have no data generating products. They love fixing Vercel TypeScript build errors.

我無法再強調這一點了。盡快聘請一位機器學習工程師非常重要,對吧?即使你沒有生成數據的產品。他們喜歡修復 Vercel TypeScript 構建錯誤。

圖片

And generally, if you hire a full-stack engineer

而且一般來說,如果你雇用一位全端工程師

圖片

who's really caught the LLM bug, they're gonna lack real experience. And this is because Python is a dead language, right? Machine learning engineers, research engineers, can easily pick up TypeScript. And the ecosystem that exists in Python could be quickly re-implemented in a couple of weekends. Right? The people who wrote Python code for the past 10 years doing data analysis, they're gonna easily be able to transition their tools.

真正迷上 LLM 的人,他們會缺乏真正的經驗。這是因為 Python 是一個過時的語言,對吧?機器學習工程師、研究工程師可以輕鬆學會 TypeScript。而且 Python 中存在的生態系統可以在幾個週末內快速重新實現,對吧?過去 10 年來用 Python 進行數據分析的人,他們將能夠輕鬆轉換他們的工具。

圖片

And if anything, it's really easy to teach things like product sense and data literacy

而且如果有什麼的話,教導產品感知和數據素養其實非常容易。

圖片

to the JavaScript community. And most important of all,

給 JavaScript 社群。而最重要的是,

圖片

in order to find this kind of magic talent, we need to create a very catch-all job title.

為了找到這種魔法般的人才,我們需要創造一個非常包羅萬象的職稱。

圖片

Let's use words like ninja and wizard, or data scientist, or prompt engineer,

讓我們使用像忍者和巫師,或數據科學家,或提示工程師這樣的詞語,

圖片

or even the AI engineer. In the past 10 years, we've known that this works really well, right? Every time, we know exactly who we want. As long as we catch a very wide net of skills, it doesn't really matter whether or not

甚至是 AI 工程師。在過去的 10 年裡,我們知道這個方法非常有效,對吧?每次,我們都確切知道我們想要的是誰。只要我們捕捉到非常廣泛的技能範圍,是否真的重要並不重要。

圖片

we don't know what outcomes we're looking for. Anyways, to dig me out of this hole,

我們不知道我們在尋找什麼結果。不管怎樣,讓我從這個困境中脫身,

圖片

I'll have Hamel explain, and you know, take a deep breath, think out loud, step by step.

我會讓哈梅爾解釋,你知道的,深呼吸,大聲思考,一步一步來。

圖片

- Thank you, Jason. (audience applauds)

- 謝謝你,Jason。(觀眾鼓掌)

圖片

So, that was really good.

所以,那真的很好。

圖片

I mean, let's just step back from the cliff a little bit.

我的意思是,我們先稍微退一步,不要那麼接近懸崖。

圖片

And let's kind of linger on the topic of AI engineer. I heard some booing in the audience.

那我們就稍微停留在 AI 工程師這個話題上。我聽到觀眾席中有些噓聲。

圖片

And so, I love the term AI engineer, like much props to Swyx for kind of popularizing this term.

所以,我喜歡 AI 工程師這個詞,感謝 Swyx 推廣這個詞。

圖片

It allows us all to get together and have conversations like this.

它讓我們大家能聚在一起進行這樣的對話。

圖片

But I think that there's a misunderstanding of the skills of AI engineer,

但我認為人們對 AI 工程師的技能存在誤解,

圖片

what skills you need to be successful, and there's a lot of inflated expectations.

你需要哪些技能才能成功,還有很多過高的期望。

圖片

As a founder or engineering leader, your talent is the most important lever that you have. And so, what I'm gonna do is,

作為創辦人或工程領導者,你的才能是你最重要的槓桿。因此,我要做的是,

圖片

I'm gonna talk about some of the problems, and perhaps some solutions, when it comes to this talent misunderstanding. So, just to review, what is an AI engineer? So, this is a diagram that everyone has probably seen.

我要談談一些問題,或許還有一些解決方案,關於這種人才誤解。所以,讓我們回顧一下,什麼是 AI 工程師?這是一個大家可能都看過的圖表。

圖片

There's a spectrum of skills in the AI space, and there's this API dividing line in the middle.

在 AI 領域中有一系列的技能,而中間有一條 API 分界線。

圖片

And kind of to the right of the API dividing line, we have AI engineer.

在 API 分界線的右邊,我們有 AI 工程師。

圖片

The AI engineer skills are focused on things like chains,

AI 工程師的技能專注於像鏈條這樣的東西,

圖片

agents, tooling, and infra. And conspicuously missing from the AI engineer

代理、工具和基礎設施。並且 AI 工程師中顯著缺少

圖片

are tools like evals and data.

像 evals 和 data 這樣的工具。

圖片

And I think a lot of people have taken this diagram too literally, and taken it to heart, and say,

而且我認為很多人把這個圖表看得太過字面意思,並且銘記於心,然後說,

圖片

hey, we don't really need to know about evals, for example. The problem is that you can go

嘿,我們其實不需要了解評估,例如。問題是,你可以去

圖片

from zero to one really fast.

從零到一真的很快。

圖片

In fact, you can go to zero to one faster than ever before, with all the great tools out there. Just by using vibe checks, and implementing the tools

事實上,現在有這麼多優秀的工具,你可以比以往更快地從零到一。只需使用 vibe checks 並實施這些工具。

圖片

that we talked about. However, without evals, you can't make progress.

然而,沒有 evals,你無法取得進展。

圖片

That quickly leads to stagnation, because if you can't measure

很快就會導致停滯,因為如果你無法衡量

圖片

what you're doing, you can't make your system better. And you can't go beyond zero to one.

你在做什麼,你無法讓你的系統變得更好。而且你無法從零到一。

圖片

So, what can we do about this?

那麼,我們可以怎麼做呢?

圖片

About this evals, skill set, and data literacy? So, Jason and I have found that you can actually get really good at writing evals and data literacy,

關於這些評估、技能組合和數據素養?所以,Jason 和我發現你其實可以在撰寫評估和數據素養方面變得非常出色,

圖片

with just four to six weeks of deliberate practice. In fact, very effective. And we think that these skills, evals and data, should be brought more into the core of AI engineer. And really, it helps solve this problem. And it's something that we see over and over again. So, the next thing I wanna talk about

只需四到六週的刻意練習。事實上,非常有效。我們認為這些技能、評估和數據應該更多地融入 AI 工程師的核心。這確實有助於解決這個問題。而且這是我們一再看到的。所以,接下來我想談的是

圖片

is the AI engineer job title itself. And so, vague job titles can be problematic.

是 AI 工程師這個職稱本身。因此,模糊的職稱可能會引發問題。

圖片

What we see over and over again in our consulting,

我們在顧問工作中一再看到的是,

圖片

is that this kind of catch-all role has very inflated expectations.

這種包羅萬象的角色有非常膨脹的期望。

圖片

Anytime anything goes wrong with the AI,

每當 AI 出現任何問題時,

圖片

people look towards that role to fix it. And sometimes, that role doesn't have all the skills they need to move forward.

人們期望那個角色來解決問題。但有時候,那個角色並沒有所有他們需要的技能來推進。

圖片

And we've seen this before, with the role of data scientists.

而我們以前也見過這種情況,數據科學家的角色。

圖片

Titles and names really matter. And what I wanna emphasize,

標題和名字真的很重要。我想強調的是,

圖片

I think AI engineer is very aspirational. And you should keep learning.

我認為 AI 工程師是非常有抱負的。而且你應該持續學習。

圖片

And it's a good thing to strive towards. But you need to have reasonable expectations.

而這是一個值得努力的目標。但你需要有合理的期望。

圖片

And just to kind of bring it back to data science,

而且只是想把話題拉回到數據科學,

圖片

we've seen this before in data science as well. Where we had, kind of a decade ago,

我們在資料科學中也見過這種情況。大約十年前,我們曾經有過這樣的情況,

圖片

when this role was coined. It was a unicorn that had all these skills.

當這個角色被創造出來時,它是一個擁有所有這些技能的獨角獸。

圖片

Software engineering skills, statistics, math, domain expertise.

軟體工程技能、統計學、數學、領域專業知識。

圖片

We found out as an industry that we had to unroll that into many other different roles,

我們作為一個行業發現,我們必須將其展開到許多其他不同的角色中。

圖片

such as decision scientists, machine learning engineer, data engineer, so on and so forth.

例如決策科學家、機器學習工程師、數據工程師等等。

圖片

And I think similar things may be happening with the role of AI engineer. And it's good to keep that in mind.

而且我認為類似的情況可能也會發生在 AI 工程師的角色上。記住這一點是好的。

圖片

And what I see, or what we both see in consulting, is that it's helpful to be more specific.

而我所看到的,或者我們在諮詢中都看到的,是更具體會更有幫助。

圖片

To be more deliberate about what skills you need, and at what time.

更有意識地了解你需要哪些技能,以及在什麼時候需要這些技能。

圖片

And depending on your maturity, it's very helpful to not only specify what the skills are,

而且根據你的成熟度,不僅要具體說明這些技能是什麼,

圖片

but what kinds of products you'll be working on. So these are some job titles from GitHub Co-Pilot,

但你將會從事哪些類型的產品呢?這些是來自 GitHub Co-Pilot 的一些職位名稱:

圖片

that kind of are very specific about the skills you need

那種對你所需技能有非常具體要求的

圖片

at that time. And really it's important to hire the right talent at the right time, on the maturity curve.

在那個時候。而且在成熟曲線上,確實在適當的時間雇用合適的人才是很重要的。

圖片

So when you're first starting out, you only need application development,

所以當你剛開始時,你只需要應用程式開發,

圖片

software engineering, and/or AI engineering to go from zero to one.

從零到一的軟體工程和/或 AI 工程。

圖片

Then you need platform and data engineering to capture that data.

然後你需要平台和數據工程來捕捉這些數據。

圖片

And then only after that, you should hire a machine learning engineer. Do not hire a machine learning engineer without having any data.

然後只有在那之後,你才應該聘請機器學習工程師。在沒有任何數據的情況下,不要聘請機器學習工程師。

圖片

But again, you can get a lot more mileage out of your AI engineer,

但同樣地,你可以從你的 AI 工程師那裡獲得更多的效益,

圖片

with deliberate practice on evals and data. We usually find four to six weeks practice does the job.

透過對評估和數據的刻意練習。我們通常發現四到六週的練習就能達到效果。

圖片

So, in recap, one of the biggest failure modes is talent.

所以,總結來說,最大的失敗模式之一是人才。

圖片

We think that AI engineers are often over-scoped

我們認為 AI 工程師的範圍經常被過度擴大

圖片

but under-specified, but we can fix that by learning evals.

但規範不足,但我們可以通過學習 evals 來解決這個問題。

圖片

Next, I wanna give it over to Shreya Shankar and Eugene Yan to talk about,

接下來,我想把時間交給 Shreya Shankar 和 Eugene Yan 來討論,

圖片

dive into this evals and data literacy. (audience applauding)

深入探討這些評估和數據素養。(觀眾鼓掌)

圖片

- Thanks.

- 謝謝。

圖片

Question. - Thank you, Jason. Thank you, Hamel.

問題。- 謝謝,Jason。謝謝,Hamel。

圖片

Next up, Shreya and I are gonna share with you about the tactical aspects of building with LLMs in production, specifically evals,

接下來,Shreya 和我將與你們分享在生產環境中使用 LLMs 的戰術方面,特別是評估。

圖片

monitoring, and guardrails. So, here's a Hacker News quote.

監控和防護措施。所以,這裡有一段 Hacker News 的引用。

圖片

How important evals are to the team is a differentiator between teams shipping out hot garbage

評估對團隊的重要性是區分團隊是否推出劣質產品的關鍵因素。

圖片

and those building real products. I would agree. I think here's an example of Apple's recent LLM, where they shared about how they actually collected 750 summaries of push notification and email summarizations,

以及那些正在打造真正產品的人。我同意。我認為這裡有一個蘋果最近 LLM 的例子,他們分享了他們實際上如何收集了750個推送通知和電子郵件摘要,

圖片

because these are data sets that are representative of the actual use case.

因為這些是代表實際使用案例的數據集。

圖片

So, how do we build evals for our own products? Well, I think the simple thing is to just make it simpler. For example, if you're trying to extract product attributes from a product description, break it down into title, price, rating,

那麼,我們如何為自己的產品建立評估呢?嗯,我認為簡單的方法就是讓它更簡單。例如,如果你正在嘗試從產品描述中提取產品屬性,可以將其分解為標題、價格、評分,

圖片

and then you can just simply do assertions. Similarly, for summarization,

然後你就可以簡單地進行斷言。同樣地,對於摘要,

圖片

instead of trying to eval that amorphous blob of a summary, break it down into dimensions,

與其嘗試評估那個模糊不清的摘要,不如將其分解成不同的維度,

圖片

such as factual inconsistency, relevance, and informational density.

例如事實不一致性、相關性和資訊密度。

圖片

And once you've done that, assertion-based tests can go a long way. Are we extracting the correct price?

一旦完成這些,基於斷言的測試可以發揮很大作用。我們是否提取了正確的價格?

圖片

Are we extracting the correct title? Or if you're doing natural language to SQL generation,

我們是否提取了正確的標題?或者如果你正在進行自然語言到 SQL 生成,

圖片

is it using the expected table? Is it using the expected columns? These are very simple to eval,

它是否使用了預期的表格?它是否使用了預期的欄位?這些都很容易評估,

圖片

and reiterate what Hamel has mentioned about keeping it simple.

並重申 Hamel 所提到的保持簡單。
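As a concrete illustration of the assertion-based checks just described, here is a minimal sketch; extract_product() and generate_sql() are hypothetical wrappers around your LLM calls, not functions from the talk.

```python
# Minimal sketch of assertion-based evals. extract_product() and generate_sql()
# are hypothetical LLM-call wrappers; run these with pytest or plain asserts.
def test_product_extraction_fields():
    description = "Acme Travel Mug, $12.99, 4.5 stars (1,024 ratings)"
    result = extract_product(description)   # hypothetical wrapper returning a dict
    assert result["title"] == "Acme Travel Mug"
    assert result["price"] == 12.99
    assert 0.0 <= result["rating"] <= 5.0

def test_sql_uses_expected_table_and_columns():
    sql = generate_sql("How many orders were placed last week?").lower()  # hypothetical wrapper
    assert "orders" in sql          # expected table
    assert "created_at" in sql      # expected column
```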

圖片

Lastly, assertions can do a lot, but they can only go so far. So, therefore, consider evaluator models.

最後,斷言能做很多事,但終究有其極限。因此,可以考慮使用評估模型。

圖片

Maybe training a classifier for factual inconsistency

也許訓練一個分類器來檢測事實不一致性

圖片

or reward model for relevance. This is easier if your evals are classification and regression-based. But that said, I don't know how I feel about LLM as a judge. - What do you mean you don't like LLM as a judge?

或相關性的獎勵模型。如果你的評估是基於分類和回歸的,這會更容易。但話說回來,我不知道我對 LLM 作為評判的感覺如何。- 你是什麼意思,你不喜歡 LLM 作為評判嗎?

圖片

I personally am super bullish on LLM as a judge,

我個人對 LLM 作為法官非常看好,

圖片

and I'm curious how many of you are exploring LLM as judge

而且我很好奇有多少人正在探索 LLM 作為法官

圖片

or have implemented it? - No. - Yeah?

或已經實施了嗎?- 沒有。- 是嗎?

圖片

There's a judge right here. You wanna stand up? - No. - Actual judge, LLM judge here, yeah.

這裡有一位法官。你想站起來嗎?- 不。- 真正的法官,LLM 法官在這裡,對。

圖片

Anyways, we're gonna go through some points on what to consider when deploying LLM as judge.

總之,我們將討論在部署 LLM 作為裁判時需要考慮的一些要點。

圖片

First of all, it's a no-brainer.

首先,這是顯而易見的。

圖片

LLM as judge is the easiest to prototype. You just have to write a prompt to check for the criteria

LLM 作為裁判是最容易原型化的。你只需要寫一個提示來檢查標準。

圖片

or metric that you want, and you can even align this towards your own preferences

或您想要的度量標準,甚至可以根據自己的偏好進行調整

圖片

by providing few-shot examples of good and bad for that criteria.

通過提供一些針對該標準的好與壞的 few-shot 範例。
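A minimal sketch of that kind of judge prompt, assuming the OpenAI Python client; the criterion, few-shot examples, and model name are illustrative placeholders to adapt to your own data.

```python
# A minimal LLM-as-judge sketch for one narrow criterion, aligned with a couple
# of few-shot examples. Client, model name, and examples are illustrative only.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You judge whether a summary is factually consistent with its source.
Answer with exactly one word: PASS or FAIL.

Source: "The meeting moved to Tuesday at 3pm."
Summary: "Meeting rescheduled to Tuesday afternoon."
Answer: PASS

Source: "The meeting moved to Tuesday at 3pm."
Summary: "The meeting was cancelled."
Answer: FAIL

Source: "{source}"
Summary: "{summary}"
Answer:"""

def judge_consistency(source: str, summary: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # pin a dated snapshot, as discussed later in the talk
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```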

圖片

On the other hand, fine-tuned models or LLMs, where you have to collect a lot of data and set up a pipeline to train your evaluator, are not super easy to prototype

另一方面,需要收集大量數據並建立訓練管道作為評估器的微調模型或 LLM,就不那麼容易原型化,

圖片

and have a lot of upfront investment. - Yeah, but that said, LLM as a judge is pretty difficult to align to your specific criteria in the business. Who here has not had any difficulty aligning

並且需要大量的前期投資。- 是的,但話說回來,作為法官的 LLM 很難與您的業務特定標準對齊。這裡有誰沒有遇到過對齊困難的問題

圖片

the LLM as a judge to your criteria?

根據你的標準,將 LLM 作為裁判?

圖片

Anyone? Okay, we're gonna talk later, Shreya. I think that if you just have a few hundred

有人嗎?好吧,我們等會再聊,Shreya。我認為如果你只有幾百

圖片

to a few thousand samples, it's very easy to fine-tune a simple model

對於幾千個樣本來說,微調一個簡單的模型非常容易。

圖片

that can do it more precisely. Second, if you wanna do LLM as a judge

誰能更精確地完成這項工作。其次,如果你想作為法官使用 LLM

圖片

and have it fairly precise, you sort of need to use chain of thought. And chain of thought is gonna be,

並且要相當精確地做到這一點,你需要使用思維鏈。思維鏈將會是,

圖片

I don't know, five to eight seconds long. On the other hand, if you have a simple classifier or reward model, every request is maybe 10 milliseconds long.

我不知道,大約五到八秒鐘。另一方面,如果你有一個簡單的分類器或獎勵模型,每個請求可能只有 10 毫秒。

圖片

That's two orders of magnitude lower

那低了兩個數量級

圖片

and would improve throughput. Next, we wanna think about technical debt.

並且會提高吞吐量。接下來,我們要考慮技術債務。

圖片

Okay, when we're implementing our validators in production,

好的,當我們在生產環境中實施我們的驗證器時,

圖片

even if they run asynchronously or they run in the critical path, how much effort do we need to put in to keep these up to date? With LLM as judge, if you don't make sure your few-shot examples are dynamic

即使它們是異步運行或在關鍵路徑中運行,我們需要投入多少努力來保持這些更新?有 LLM 作為判斷者,如果你不確保你的 few-shot 示例是動態的

圖片

or some way of making sure your judge kind of prompt

或某種方式確保你的評審提示

圖片

aligns with your definition of good and bad, then you're toast.

符合你對好壞的定義,那你就完蛋了。

圖片

And kind of, the effect is not as pronounced for fine-tuned models, but if you don't continually

而且,這種效果對於微調模型來說並不那麼明顯,但如果你不持續地

圖片

fine-tune your validators on new data, on new production data, then they will also be susceptible to drift.

在新的數據、新的生產數據上微調你的驗證器,那麼它們也會容易出現漂移。

圖片

So overall, when do you wanna use LLM as judge?

總的來說,什麼時候你會想要使用 LLM 作為判斷者?

圖片

It's honestly a resources question and where you are in your application development.

這老實說是個資源問題,還有你在應用程式開發中的位置。

圖片

If you're starting to prototype it, you need quick evals with minimal dev effort and need something, you have a lowish volume of evals, start with LLM as a judge and kind of invest

如果你開始製作原型,你需要快速評估並且開發工作量最小,而且需要一些東西,你的評估量較低,可以從 LLM 作為評判開始並進行一些投資。

圖片

in the infrastructure to align that over time. If you have more resources or you know

在基礎設施中隨著時間的推移進行調整。如果你有更多的資源或你知道

圖片

that your product is gonna be sticky, go for a fine-tuned model. Next, I'm gonna talk about looking at the data.

如果你的產品要具有黏性,那就選擇一個微調過的模型。接下來,我要談談如何查看數據。

圖片

Eugene mentioned you should create evals on your custom or bespoke criteria, but how do you know what criteria you want? Simple answer, look at your data.

Eugene 提到你應該根據自訂或專屬標準來創建評估,但你怎麼知道你想要什麼標準呢?簡單的答案是,查看你的數據。

圖片

The line is "great AI researchers look at their data," but we changed that to engineers: great AI engineers look at their data.

原話是「優秀的 AI 研究人員會查看他們的數據」,但我們把它改成了工程師:優秀的 AI 工程師會查看他們的數據。

圖片

So how do we do this? The first question actually before how

那麼我們該怎麼做呢?其實在問「怎麼做」之前的第一個問題是

圖片

is when do you look at this? I know people who never look at their data at all or people who look at it initially after deployment,

什麼時候你會看這個?我知道有些人從來不看他們的數據,或者有些人在部署後最初看一下。

圖片

wrong answer, you wanna look at it regularly. I work with a startup that whenever they ship

錯誤的答案,你應該定期查看。我與一家初創公司合作,每當他們發貨時

圖片

a new LLM agent, they create a new Slack channel with all of the agent's outputs that come in real time. After a couple of weeks, they transition this

一個新的 LLM 代理,他們會創建一個新的 Slack 頻道,將所有代理的輸出實時傳送到該頻道。幾週後,他們會轉換這個

圖片

to kind of daily batch jobs and make sure that they're not running into errors that they didn't anticipate. Second thing is what specifically are you looking for?

要處理日常批次作業,並確保它們沒有遇到未預期的錯誤。第二件事是,你具體在尋找什麼?

圖片

You wanna find slices of the data that are pretty simple or easy to characterize in some way. For example, data that comes from a particular source or data that has a certain keyword or phrase

你想找到一些數據片段,這些片段在某種程度上是相對簡單或容易描述的。例如,來自特定來源的數據或包含某個關鍵字或短語的數據。

圖片

or is about a certain topic. Simply just saying all of these are bad, but having no way of characterizing them

或是關於某個特定主題。僅僅說這些都是不好的,但沒有辦法對它們進行描述

圖片

and then improving your pipeline based on that is not gonna help.

然後根據這些改進你的流程是沒有幫助的。
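A minimal sketch of what finding easy-to-characterize slices can look like over logged traces; the file name and column names are hypothetical, not from the talk.

```python
# Minimal sketch: slice logged traces by easy-to-characterize attributes
# (source, keyword) and compare failure rates. File and column names are hypothetical.
import pandas as pd

traces = pd.read_parquet("llm_traces.parquet")
traces["flagged"] = traces["thumbs_down"] | traces["guardrail_failed"]

# Failure rate by request source: an easy slice to characterize and act on.
print(traces.groupby("source")["flagged"].mean().sort_values(ascending=False))

# Another cheap slice: requests mentioning a specific keyword or topic.
refunds = traces[traces["user_message"].str.contains("refund", case=False, na=False)]
print("refund-related failure rate:", refunds["flagged"].mean())
```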

圖片

Finally, some things to keep in mind throughout this whole looking at your data experience is that your code base is very rapidly changing over time probably. Your prompts, components of the pipeline, and et cetera. When you're inspecting traces, it's super helpful

最後,在整個查看數據的過程中需要記住的一些事情是,你的代碼庫可能會隨著時間迅速變化。你的提示、管道的組件等等。在檢查追蹤時,這非常有幫助。

圖片

to be able to know what GitHub commit or what model version or prompt version

能夠知道是哪個 GitHub commit 或是哪個模型版本或提示版本

圖片

did this correspond to. I think this is one of the very successful things that traditional MLOps tools did, like MLflow, for example.

這是否對應。我認為這是傳統 MLOps 工具(例如 MLflow)非常成功的事情之一。

圖片

They made it very easy to trace back and then hopefully you could replay something.

他們讓追溯變得非常容易,然後希望你可以重播一些東西。

圖片

I see the judge shaking his head. (laughs)

我看到法官搖頭。(笑)

圖片

Great. Finally, when using LLMs as APIs, pin model versions.

太好了。最後,在將 LLM 作為 API 使用時,請固定模型版本。

圖片

LLM APIs are known to exhibit different behavior

LLM API 以行為不同而聞名

圖片

that is very hard to quantify for certain tasks. Pin gpt-4-1106, pin GPT-4o,

這對某些任務來說很難量化。釘選 gpt-4-1106,釘選 GPT-4o,

圖片

whatever it is that you're using.

無論你正在使用的是什麼。
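For example, with the OpenAI Python client this just means requesting a dated snapshot rather than a floating alias; the snapshot id below is an example, so check your provider's current list.

```python
from openai import OpenAI

client = OpenAI()

# Pin a dated snapshot so behavior doesn't silently change under your evals.
PINNED_MODEL = "gpt-4o-2024-05-13"   # example snapshot id
# PINNED_MODEL = "gpt-4o"            # floating alias: may point to a newer model over time

response = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
```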

圖片

- Shreya mentioned that we need to look at our data, but how do we look at our data all the time?

- Shreya 提到我們需要查看我們的數據,但我們怎麼能一直查看我們的數據呢?

圖片

I think the way to do this is via an automated guardrail. Here's Brandolini's law adapted. The amount of energy to catch and fix defects is an order of magnitude larger than needed to produce it.

我認為解決這個問題的方法是通過自動化的防護措施。這裡是改編過的布蘭多里尼法則。捕捉和修復缺陷所需的能量比產生缺陷所需的能量大一個數量級。

圖片

And that's true. It's really easy to call an LLM API and just get something.

而這是真的。調用 LLM API 並獲取一些東西真的很容易。

圖片

But how do we know if it's actually bad? I think it's really important that we do have some basic form of guardrails, and some of them are just table stakes. Toxicity, personally identifiable information,

但是我們怎麼知道它是否真的不好呢?我認為我們確實需要一些基本的防護措施,這一點非常重要,其中一些只是基本的規範。毒性、個人識別信息,

圖片

copyright, and expected language. Now you may imagine that this is pretty straightforward,

版權和預期語言。現在你可能會認為這是相當簡單的,

圖片

but sometimes you don't actually have control over the context. For example, if someone's posting an ad

但有時候你實際上無法控制上下文。例如,如果有人在發布廣告

圖片

on your English website that's in a different language, and you're asking your LLM to extract the attributes

在您的英文網站上使用不同語言,並要求您的 LLM 提取屬性

圖片

or to summarize it, you may be surprised that for some non-zero proportion of the time,

或者總結一下,你可能會驚訝地發現,有一部分時間並非零。

圖片

it's actually in a different language. Similarly, hallucinations happen more often

事實上,它實際上是用不同的語言。類似地,幻覺更常發生。

圖片

than we would like. So imagine you're trying to summarize a movie based on a description,

比我們想像的還要多。所以,想像一下你正在嘗試根據描述來總結一部電影,

圖片

you just have a description for the trailer. It may actually include spoilers, because it's trying so hard to be helpful.

你只是有一個預告片的描述。它可能實際上包含劇透,因為它非常努力地想要幫助。

圖片

But that's actually a bad user experience. So sometimes you will include information that's not in there. Here's a tip. If we spend a little bit more time building reference-free evals, we can use them as guardrails. So reference-based evals are when we generate some kind of output and we compare it to some ideal sample. This is pretty expensive, and you actually have to collect all these gold samples. On the other hand, if we have these labels, we can train an evaluator model and just compare it to the source document. So for example, if you're comparing summarizations, we can just check if the summary entails or contradicts the source document,

但這其實是一個糟糕的用戶體驗。所以有時候你會包含一些不在其中的信息。這裡有個提示。如果我們花多一點時間建立無參考的評估,我們可以用它們作為護欄。所以基於參考的評估是當我們生成某種輸出並將其與某個理想樣本進行比較。這相當昂貴,而且你實際上必須收集所有這些黃金樣本。另一方面,如果我們有這些標籤,我們可以訓練一個評估模型,並只將其與源文件進行比較。所以例如,如果你在比較摘要,我們可以只檢查摘要是否包含或矛盾於源文件。

圖片

and now we have a hallucination eval. So therefore, if we spend some time

現在我們有一個幻覺評估。因此,如果我們花一些時間

圖片

building reference-free evals once, we can use it to guardrail all new output. - Thanks, Eugene.

建立無參考評估系統後,我們可以用它來規範所有新的輸出。- 謝謝,Eugene。
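A minimal sketch of the entailment check Eugene describes, using an off-the-shelf NLI model from Hugging Face; the model choice and threshold are assumptions, not the setup used in the talk.

```python
# Minimal reference-free hallucination guardrail: score whether a summary is
# entailed by its source document with an off-the-shelf NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "facebook/bart-large-mnli"   # assumption: any NLI model with an "entailment" label works
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def entailment_score(source: str, summary: str) -> float:
    """Probability that the summary is entailed by (grounded in) the source."""
    inputs = tokenizer(source, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    label_to_idx = {v.lower(): k for k, v in model.config.id2label.items()}
    return probs[label_to_idx["entailment"]].item()

# Guardrail usage: flag outputs below an (assumed) threshold.
if entailment_score(source_doc, generated_summary) < 0.5:  # source_doc, generated_summary: your data
    print("possible hallucination: summary not supported by the source")
```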

圖片

So we're gonna wrap up next minute or so

所以我們大概在下一分鐘結束。

圖片

on some high-level, bird's-eye view, 2,000-foot view, whatever you wanna call it, takeaways.

從某種高層次、鳥瞰、2000 英尺的視角,不管你怎麼稱呼它,重點。

圖片

First off, how many of you remember this figure

首先,有多少人記得這個數字

圖片

from this pretty seminal paper in MLOps that came out maybe 10 years ago? 2015, so nine years ago. Yeah, so I think this paper really communicated the idea that the model is a small part, and when you're productionizing ML systems, there's so much more around the model that you have to maintain over time.

這篇在 MLOps 領域相當具有開創性的論文大約在十年前發表?2015 年,所以是九年前。對,我認為這篇論文真正傳達了這個概念,即模型只是其中的一小部分,當你將 ML 系統投入生產時,還有很多圍繞模型的部分需要隨著時間進行維護。

圖片

Data verification, feature engineering, monitoring your infrastructure, et cetera. So you might be wondering, we have LLMs.

資料驗證、特徵工程、監控基礎設施等等。所以你可能會想,我們有 LLMs。

圖片

Does any of this matter? Yeah?

這有關係嗎?有嗎?

圖片

Yeah, I'm seeing a few nods here. Absolutely. When we have LLMs, all of these tech debt principles

是的,我看到這裡有幾個點頭。沒錯。當我們擁有 LLMs 時,所有這些技術債務原則

圖片

still apply, and you can even think of the exact mapping for every single component in here to the LLM equivalent. For example, maybe we don't have feature engineering pipelines, but cast in a new light, it's RAG. We're looking at context, we're trying to retrieve what's relevant, engineer that to not distract the LLM too much.

仍然適用,你甚至可以為這裡的每個組件找到對應的 LLM 等價物。例如,也許我們沒有特徵工程管道,但換個角度看,它就是 RAG。我們在看上下文,試圖檢索相關的內容,並對其進行工程處理,以避免過多干擾 LLM。

圖片

We have a ton of experimentation around that. All of this is something that needs to be maintained over time, especially as models change under the hood. Similarly for data validation and verification, we have evals, we have guardrails that need to be deployed. It's not just simply wrap your model or GPT

我們在這方面進行了大量的實驗。所有這些都需要隨著時間的推移進行維護,特別是當模型在背後發生變化時。同樣地,對於數據驗證和核實,我們有評估,我們有需要部署的防護措施。這不僅僅是簡單地包裝你的模型或 GPT。

圖片

in some software and ship it.

在某些軟體中並將其發佈。

圖片

No, there's a lot of investment that needs to happen around the model.

不,還需要對這個模型進行大量投資。

圖片

- All right, so I'd like to end with this quote from Karpathy-senpai. There's a large class of problems

- 好的,那麼我想以 Karpathy 前輩的這句話作結。有一大類問題

圖片

that are really easy to imagine and build demos for, but it's extremely hard to make products out of.

這些東西很容易想像和製作示範,但要將其變成產品卻非常困難。

圖片

For example, Charles dug up this paper

例如,Charles 挖出了這篇論文

圖片

of the first car driven by a neural network. That was 1988. 25 years later, Andrej Karpathy took his first demo drive

由神經網路駕駛的第一輛車。那是1988年。25年後,Andrej Karpathy 進行了他的第一次示範駕駛。

圖片

of Waymo, 2013.

Waymo,2013 年。

圖片

10 years later, I hope all of you had a chance to try the Waymo. We got the driverless permit for Waymo in San Francisco. Maybe in a couple more years, we'll have it for the whole of California.

10 年後,我希望你們都有機會試乘 Waymo。我們已經獲得了在舊金山運行 Waymo 的無人駕駛許可。也許再過幾年,我們就能在整個加州使用它。

圖片

The point is, going from demo to production takes time. So therefore, that's all we had.

重點是,從演示到生產需要時間。所以,這就是我們所擁有的一切。

圖片

Thank you. Let's build. (audience applauding)

謝謝。讓我們一起努力。(觀眾鼓掌)

圖片

(audience cheering) (audience applauding)

(觀眾歡呼)(觀眾鼓掌)

圖片
[ Applause ]