> You're welcome to subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse my [[Generative AI Engineer 知識庫]]
The subtitle of the talk is Going from prototype to production
Video: https://www.youtube.com/watch?v=XGJNo8TpuVA
* Now at 2M developers
* ChatGPT: 2022/11
* GPT-4: 2023/3
* GPT has gone from a toy -> tool -> capability
* Prototypes are easy. Demos are cool. Production is hard.
* Scaling non-deterministic apps from prototype to production is difficult without a framework
* This talk gives you a framework and a set of strategies
## 1. User Experience
![[Pasted image 20231115161141.png]]
### Control for uncertainty
* AI assistant UX should augment the user's abilities rather than replace human judgment
![[Pasted image 20231115161530.png]]
### Manage expectations with AI notices
Let users know what the AI can and cannot do
![[Pasted image 20231115161624.png]]
![[Pasted image 20231115161650.png]]
![[Pasted image 20231115161718.png]]
### Build guardrails for steerability and safety
* Guardrails = safety controls for LLMs
* Great UX brings the best of steerability and safety
![[Pasted image 20231115162355.png]]
![[Pasted image 20231115162433.png]]
DALL·E 3 expands and rewrites the user's prompt; the rewrite also serves as a safety mechanism, revising the request rather than outright refusing the user
![[Pasted image 20231115162637.png]]
* Guardrails are essential for UX, especially for applications in regulated industries
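One concrete guardrail, as a hedged sketch: screen user input with OpenAI's Moderation API before it ever reaches the chat model. The refusal message, policy, and model name below are illustrative assumptions, not from the talk.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def guarded_reply(user_input: str) -> str:
    # Input guardrail: run the Moderation API before calling the chat model
    mod = client.moderations.create(input=user_input)
    if mod.results[0].flagged:
        # Steer the user instead of silently failing (placeholder message)
        return "Sorry, I can't help with that request."
    return client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": user_input}],
    ).choices[0].message.content
```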
## 2. Model Consistency
![[Pasted image 20231115161149.png]]
Once you're in production and start to scale, you receive all kinds of queries and run into model consistency issues
### Constrain model behavior (at the model level)
* It's often difficult to manage the inherent probabilistic nature of LLMs
* Today we introduced two model-level features to help constrain model behavior
* JSON mode
* JSON mode allows you to force the model to output JSON (see the sketch after this list)
* ![[Pasted image 20231116163834.png]]
* Still not a 100% guarantee, but it greatly reduces the error rate
* Reproducible outputs
* You can get significantly more reproducible output using the seed parameter
* Three things play into the model's non-deterministic behavior
* temperature / top_p
* seed
* system_fingerprint (returned in the response; records the backend configuration that served the request)
* ![[Pasted image 20231116164931.png]]
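A minimal sketch of both model-level features, JSON mode (`response_format`) and reproducible outputs (`seed`), using the OpenAI Python SDK; the model name, seed value, and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    seed=42,                                  # same seed + same params -> far more reproducible output
    temperature=0,                            # low temperature further reduces variance
    response_format={"type": "json_object"},  # JSON mode: constrain the model to emit valid JSON
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Reply in JSON."},
        {"role": "user", "content": "Extract the city and date from: 'Meet me in Paris on May 3.'"},
    ],
)

print(response.choices[0].message.content)  # a JSON string, e.g. {"city": "Paris", "date": "May 3"}
print(response.system_fingerprint)          # identifies the backend configuration that served the request
```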
### Ground the model (with a knowledge base or your own tools)
Reduce model inconsistency by giving the model additional factual knowledge
* When it's on its own, the model can "hallucinate" information
* Left to itself, a big reason the model hallucinates is that we force it to answer, so even when it doesn't know, it will make something up
* "Grounding" the model
* ![[Pasted image 20231116171102.png]]
* Idea: In the input context, explicitly give the model "grounded facts" to reduce the likelihood of hallucinations
* ![[Pasted image 20231116171219.png]]
In other words, RAG
* ![[Pasted image 20231116171310.png]]
It doesn't have to be RAG; it can also be your own service called via function calling (a minimal sketch follows at the end of this section)
![[Pasted image 20231116171437.png]]
![[Pasted image 20231116171447.png]]
* The fact source doesn't have to be a vector database either; it can be a search index, a database, internet browsing, and so on
* The OpenAI Assistants API offers an out-of-the-box setup to use retrieval
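A minimal grounding sketch, assuming a hypothetical `search_knowledge_base` helper standing in for your vector database, search index, or internal service; the retrieved facts are injected into the prompt context and the model is told to rely on them only.

```python
from openai import OpenAI

client = OpenAI()

def search_knowledge_base(query: str) -> list[str]:
    # Hypothetical retrieval step: replace with your vector DB, search index, or internal API
    return ["Our refund window is 30 days.", "Refunds go back to the original payment method."]

def answer_with_grounding(question: str) -> str:
    facts = search_knowledge_base(question)
    context = "\n".join(f"- {fact}" for fact in facts)
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the facts below. If they are insufficient, say you don't know.\n"
                        f"Facts:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer_with_grounding("How long do I have to request a refund?"))
```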
## 3. Evaluating Performance
![[Pasted image 20231115161155.png]]
How to deliver a consistent experience without regressions
### Create eval suites
Build evals for your application's specific scenarios
* Lack of evaluations has been a key challenge for deploying to production
* Model evals are unit tests for the LLM
* ![[Pasted image 20231116173807.png]]
* Types of mistakes to build evals for
* Bad output formatting
* Inaccurate responses/actions
* Going off the rails
* Bad tone
* Hallucinations
![[Pasted image 20231116175013.png]]
![[Pasted image 20231116175045.png]]
* When human feedback is impractical or costly, automated evaluations allow developers to monitor progress and detect regressions
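A minimal eval-suite sketch that treats evals like unit tests: a few (prompt, check) cases are run against the model and a pass rate is reported, which can be tracked in CI to catch regressions. The cases, checks, and model name are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

EVAL_CASES = [
    # Accuracy: the answer must contain the expected fact
    {"prompt": "What is the capital of France? Answer with just the city name.",
     "check": lambda out: "Paris" in out},
    # Staying on the rails: short, on-topic summary
    {"prompt": "Summarize in one sentence: 'The meeting moved to Friday.'",
     "check": lambda out: "Friday" in out and len(out) < 200},
]

def run_evals(model: str = "gpt-3.5-turbo-1106") -> float:
    passed = 0
    for case in EVAL_CASES:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content
        try:
            passed += bool(case["check"](out))
        except Exception:
            pass  # a parsing/check error counts as a failed case
    return passed / len(EVAL_CASES)

print(f"pass rate: {run_evals():.0%}")
```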
### Use model-graded evals
Use AI to evaluate AI
* GPT-4 is actually smart enough to grade evals for you
* ![[Pasted image 20231116175732.png]]
* ![[Pasted image 20231116180105.png]]
* ![[Pasted image 20231116180155.png]]
* ![[Pasted image 20231116180356.png]]
You can fine-tune GPT-3.5 to save costs
* ![[Pasted image 20231116180448.png]]
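A minimal model-graded eval sketch: GPT-4 grades another model's answer against a reference answer. The grading prompt and PASS/FAIL rubric are assumptions, not OpenAI's official eval format; a fine-tuned GPT-3.5 grader could be swapped in to lower cost.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}
Reply with a single word: PASS if the assistant answer is factually consistent with the reference, otherwise FAIL."""

def grade(question: str, reference: str, answer: str) -> bool:
    # GPT-4 as the grader; temperature 0 keeps the verdict stable
    verdict = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("PASS")

print(grade("When did ChatGPT launch?", "November 2022", "ChatGPT launched in Nov 2022."))
```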
## 4. Managing Latency & Cost
![[Pasted image 20231115161207.png]]
### Use semantic caching
![[Pasted image 20231116181355.png]]
![[Pasted image 20231116181414.png]]
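A minimal semantic-caching sketch: embed each query, and if an earlier query is similar enough (by cosine similarity), return the cached answer instead of calling the LLM again. The in-memory store, embedding model, and 0.92 threshold are assumptions for illustration.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def _embed(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(data.data[0].embedding)

def cached_answer(question: str, threshold: float = 0.92) -> str:
    q = _embed(question)
    for emb, answer in _cache:
        similarity = np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb))
        if similarity >= threshold:
            return answer  # cache hit: skip the LLM call entirely
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    _cache.append((q, answer))
    return answer
```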
### Route to cheaper models
* GPT-3.5-Turbo is cheap and fast, but it's not as smart as GPT-4
* ![[Pasted image 20231116204526.png]]
Compared to GPT-4, a fine-tuned GPT-3.5-Turbo reduces both cost and latency
* ![[Pasted image 20231116204727.png]]
* But this requires preparing a dataset of hundreds or even thousands of examples, which is expensive to build by hand
* ![[Pasted image 20231116204735.png]]
* You can use GPT-4 to generate the training data, i.e., distill GPT-4 into 3.5 Turbo, making it nearly as good as GPT-4
* ![[Pasted image 20231116204951.png]]
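A minimal distillation sketch: GPT-4 (the teacher) answers a set of prompts, and the results are written in the chat fine-tuning JSONL format to train a cheaper GPT-3.5-Turbo student. The task, prompts, and file name are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = "You are a support assistant for Acme Inc."  # hypothetical task
prompts = ["How do I reset my password?", "Can I change my billing date?"]

with open("distilled_train.jsonl", "w") as f:
    for p in prompts:
        teacher_answer = client.chat.completions.create(
            model="gpt-4-1106-preview",  # teacher model
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": p}],
        ).choices[0].message.content
        # One training example per line, in the chat fine-tuning format
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher_answer},
        ]}) + "\n")

# Then upload the file and start a fine-tuning job for the cheaper student model:
# client.files.create(file=open("distilled_train.jsonl", "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file="<file id>", model="gpt-3.5-turbo-1106")
```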
## The new stack and ops for AI
![[Pasted image 20231116205100.png]]
* These strategies are evolving into LLMOps
![[Pasted image 20231116205256.png]]
* LLMOps enables scaling to O(1000s) of applications and O(millions) of users
* Let's build, together