> Subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse the [[Generative AI Engineer 指南]]

The talk's subtitle is "Going from prototype to production". Video: https://www.youtube.com/watch?v=XGJNo8TpuVA

* Now at 2M developers
* ChatGPT: 2022/11
* GPT-4: 2023/3
* GPT has gone from a toy -> tool -> capability
* Prototypes are easy. Demos are cool. Production is hard.
* Scaling non-deterministic apps from prototype to production is difficult without a framework
* This talk gives you a framework and some strategies

## 1. User Experience

![[Pasted image 20231115161141.png]]

### Control for uncertainty

* AI assistant UX should augment the user's abilities rather than replace human judgment

![[Pasted image 20231115161530.png]]

### Manage expectations

AI notices let users know what the AI can and cannot do

![[Pasted image 20231115161624.png]]
![[Pasted image 20231115161650.png]]
![[Pasted image 20231115161718.png]]

### Build guardrails for steerability and safety

* Guardrails = safety controls for LLMs
* Great UX brings the best of steerability and safety

![[Pasted image 20231115162355.png]]
![[Pasted image 20231115162433.png]]

DALL·E 3 expands the user's prompt, which also doubles as a safety rewrite, steering the request rather than rejecting the user outright

![[Pasted image 20231115162637.png]]

* Guardrails are essential for UX, especially for applications in regulated industries
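To make the guardrail idea concrete, here is a minimal sketch of an input guardrail that steers instead of flatly refusing, in the spirit of the DALL·E 3 prompt-rewriting pattern above. Everything here (`BLOCKED_TOPICS`, `guard_input`, the keyword check) is illustrative, not a real moderation API; in production you would call an actual moderation endpoint or classifier.

```python
# Illustrative guardrail sketch: block-listed topics trigger a suggested
# rewrite instead of a bare rejection. The keyword check stands in for a
# real moderation model; all names here are hypothetical.

BLOCKED_TOPICS = {"violence", "weapons"}

def guard_input(prompt: str) -> tuple[bool, str]:
    """Return (allowed, message). On a policy hit, offer a steered
    rewrite of the request rather than refusing outright."""
    hits = [t for t in BLOCKED_TOPICS if t in prompt.lower()]
    if not hits:
        return True, prompt
    safe = prompt.lower()
    for t in hits:
        safe = safe.replace(t, "[removed]")
    return False, f"Request touches on {', '.join(hits)}; suggested rewrite: {safe}"

allowed, msg = guard_input("Tell me about weapons maintenance")
```

The point of the design is the second return value: the user gets an actionable suggestion, which keeps the UX steerable rather than a dead end.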
## 2. Model Consistency

![[Pasted image 20231115161149.png]]

Once you're in production and start to scale, you receive all kinds of queries and run into model consistency problems

### Constrain model behavior (at the model level)

* It's often difficult to manage the inherent probabilistic nature of LLMs
* Today we introduced two model-level features to help constrain model behavior
* JSON mode
	* JSON mode allows you to force the model to output JSON
	* ![[Pasted image 20231116163834.png]]
	* Still not a 100% guarantee, but it greatly reduces the error rate
* Reproducible outputs
	* You can get significantly more reproducible output using the seed parameter
	* Three things affect inconsistent model behavior:
		* temperature / top_p
		* seed
		* system_fingerprint (recorded in the response; captures the server-side configuration at the time of the request)
	* ![[Pasted image 20231116164931.png]]

### Ground the model (with a knowledge base or your own tools)

Giving the model additional factual knowledge reduces inconsistency

* When it's on its own, the model can "hallucinate" information
	* A big reason is that we force the model to answer: even when it doesn't know, it makes something up for you
* "Grounding" the model
	* ![[Pasted image 20231116171102.png]]
	* Idea: In the input context, explicitly give the model "grounded facts" to reduce the likelihood of hallucinations
	* ![[Pasted image 20231116171219.png]] i.e. RAG
	* ![[Pasted image 20231116171310.png]] it doesn't have to be RAG; it can be your own service, used via function calling

![[Pasted image 20231116171437.png]]
![[Pasted image 20231116171447.png]]

* The fact source doesn't have to be a vector database; it can be a search index, a database, browsing the internet, etc.
* The OpenAI Assistants API offers an out-of-the-box setup to use retrieval
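The grounding idea above can be sketched as follows: retrieve the most relevant facts for a query, inject them into the prompt, and instruct the model to answer only from those facts. The keyword-overlap retriever and the `FACTS` list are stand-ins (a real system would use a vector search, search index, or your own service); all names here are illustrative.

```python
# Grounding sketch: build a prompt that carries retrieved facts, so the
# model doesn't have to guess. Word-overlap ranking stands in for a real
# embedding-based retriever; FACTS stands in for your knowledge base.

FACTS = [
    "DevDay 2023 introduced JSON mode and the seed parameter.",
    "The Assistants API offers built-in retrieval.",
]

def retrieve(query: str, facts: list[str], k: int = 1) -> list[str]:
    """Rank facts by naive word overlap with the query (embeddings stand-in)."""
    qwords = set(query.lower().split())
    scored = sorted(
        facts,
        key=lambda f: len(qwords & set(f.lower().split())),
        reverse=True,
    )
    return scored[:k]

def grounded_prompt(query: str) -> str:
    """Assemble the final prompt: grounded facts + an 'answer only from
    these facts, else say you don't know' instruction."""
    context = "\n".join(f"- {f}" for f in retrieve(query, FACTS))
    return (
        "Answer using ONLY the facts below. "
        "If the facts are insufficient, say you don't know.\n"
        f"Facts:\n{context}\n\nQuestion: {query}"
    )

prompt = grounded_prompt("What does the Assistants API offer?")
```

The explicit "say you don't know" instruction addresses the hallucination cause noted above: the model is no longer forced to invent an answer.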
## 3. Evaluating Performance

![[Pasted image 20231115161155.png]]

How to deliver a consistent experience without regressions

### Create eval suites

Evaluate against the specific scenarios of your application

* Lack of evaluations has been a key challenge for deploying to production
* Model evals are unit tests for the LLM
* ![[Pasted image 20231116173807.png]]
* Types of mistakes to build evals for:
	* Bad output formatting
	* Inaccurate responses/actions
	* Going off the rails
	* Bad tone
	* Hallucinations

![[Pasted image 20231116175013.png]]
![[Pasted image 20231116175045.png]]

* When human feedback is impractical or costly, automated evaluations allow developers to monitor progress and detect regressions

### Use model-graded evals

Use AI to evaluate AI

* GPT-4 is actually smart enough to grade evals for you
* ![[Pasted image 20231116175732.png]]
* ![[Pasted image 20231116180105.png]]
* ![[Pasted image 20231116180155.png]]
* ![[Pasted image 20231116180356.png]] you can fine-tune 3.5 as the grader to save cost
* ![[Pasted image 20231116180448.png]]

## 4. Managing Latency & Cost

![[Pasted image 20231115161207.png]]

### Use semantic caching

![[Pasted image 20231116181355.png]]
![[Pasted image 20231116181414.png]]

### Route to cheaper models

* GPT-3.5-Turbo is cheap and fast, but it's not as smart as GPT-4
* ![[Pasted image 20231116204526.png]] compared to GPT-4, a fine-tuned GPT-3.5-Turbo reduces cost and latency
* ![[Pasted image 20231116204727.png]]
	* But you need to prepare a dataset of hundreds or even thousands of examples, which is expensive to build by hand
* ![[Pasted image 20231116204735.png]]
	* You can use GPT-4 to generate the training data, i.e. distill GPT-4 into 3.5 Turbo, making it nearly as good as GPT-4
* ![[Pasted image 20231116204951.png]]

## The new stack and ops for AI

![[Pasted image 20231116205100.png]]

* These strategies are evolving into LLMOps

![[Pasted image 20231116205256.png]]

* LLMOps enables scaling to O(1000s) of applications and O(millions) of users
* Let's build, together