LLM Engineering Structured Outputs

https://www.wandb.courses/courses/steering-language-models https://www.wandb.courses/courses/take/steering-language-models/lessons/51841451-jason-s-introduction > (2024/2) 精通 function callling 必看的課程!! 拿 function callling 來做結構化輸出，並且做 validation > 講者沒啥投影片，就是講 colab 而已 * https://docs.pydantic.dev/latest/ * 作者 * https://twitter.com/jxnlco * https://www.jxnl.co/ * https://jxnl.github.io/instructor/ * 課程 github: https://github.com/wandb/edu/tree/main/llm-structured-extraction ## 1. Asking LLMs for Structured Data ### Jason ![[Pasted image 20240225162612.png]] ![[Pasted image 20240225162630.png]] ### What are we covering in this chapter? ![[Pasted image 20240225162818.png]] * [ ] Better Data Extraction Using Pydantic and OpenAI Function Calls * https://wandb.ai/jxnlco/function-calls/reports/Better-Data-Extraction-Using-Pydantic-and-OpenAI-Function-Calls--Vmlldzo0ODU4OTA3 * 這篇介紹 https://jxnl.github.io/instructor/ ### JSON & dictionaries - issue1 & issue 2 * [ ] colab: Working with structured outputs * https://github.com/wandb/edu/blob/main/llm-structured-extraction/1.introduction.ipynb * Pydantic: https://docs.pydantic.dev/latest/ * **model_validate_json 方法** * **model_json_schema 方法** pydantic 轉 json schema ### Function calling * [ ] Building a Virtual Assistant with Google Gemini Function Calling * https://wandb.ai/byyoung3/ml-news/reports/Building-a-Virtual-Assistant-with-Google-Gemini-Function-Calling--Vmlldzo2MzE1NTY1 * [ ] Using LLMs to Extract Structured Data: OpenAI Function Calling in Action * https://wandb.ai/darek/llmapps/reports/Using-LLMs-to-Extract-Structured-Data-OpenAI-Function-Calling-in-Action--Vmlldzo0Nzc0MzQ3 ### Instructor & other libraries * https://jxnl.github.io/instructor/ * https://www.askmarvin.ai/ * langchain * https://docs.llamaindex.ai/en/latest/examples/output_parsing/openai_pydantic_program.html ## 2. Prompting LLMs * [ ] colab: General Tips on Prompting * https://github.com/wandb/edu/blob/main/llm-structured-extraction/2.tips.ipynb * Tip 1: Classification - 使用 Enums 列舉或 Literals 文字常數 - Tip 2: Arbitrary properties - 定義 class Property 有 key 跟 value - 定義 `List[Property]`，甚至可以在 description 寫長度限制 - Tip 3: Defining multiple entities - 定義 `Iterable[Character]` - Tip 4: Streaming - Tip 5: Relationships - 定義 `List[int]` 是 Relationships to their friends using the id - Tip 6: The Maybe pattern - 使用 Optional type - 允許 LLM 回傳 missing data，避免幻覺 > 這 json schema 其實蠻複雜的，手寫會很難啊，難怪得用 Pydantic 轉 > https://json-schema.org/understanding-json-schema/structuring#structuring-a-complex-schema > 這裡還用到 anyOf, allOf, oneOf 用法 JSON和字典的根本問題不在於它們的不相容性，而在於確保數據的有效性和類型正確性的困難。 Pydantic的主要優勢是提供類型檢查和驗證 ## 3. RAG Applications - [ ] colab: Applying Structured Output to RAG applications - https://github.com/wandb/edu/blob/main/llm-structured-extraction/3.0.applications-rag.ipynb * Applying structured output to RAG applications * [ ] A Gentle Introduction to Retrieval Augmented Generation (RAG) * https://wandb.ai/cosmo3769/RAG/reports/A-Gentle-Introduction-to-Retrieval-Augmented-Generation-RAG---Vmlldzo1MjM4Mjk1 * Improving extractions * 擷取出 topic, keywords, hypothetical_questions, summary 可以作為 embedding 索引之用 * Adding temporal context * 使用結構化輸出來理解查詢，以識別用戶查詢的意圖 * 擷取日期區間，加上 chain_of_thought 欄位 * 擷取(改寫)出 rewritten_query * Experiment tracking * https://docs.wandb.ai/guides * https://wandb.ai/wandbot/wandbot_public/reports/RAGs-To-Riches-Bringing-Wandbot-into-Production--Vmlldzo1ODU5ODk0 * Parallel processing * 產生 queries 查詢條件，裡面有多個 query 針對不同後端去做查詢 * Decomposing questions * 做 query plan 拆解子問題 * 這個竟然可以擷取出 subquestions 來判斷是否子問題有依賴性!! > 可惜這章沒有串完 RAG，只做到 query generation ## 4. Validating LLM Outputs - [ ] colab: Understanding Validators and controlling responses - https://github.com/wandb/edu/blob/main/llm-structured-extraction/3.1.validation-rag.ipynb - 驗證和根據回饋再產生更好的回應 - Defining validator functions - 加上驗證方法，使用 Annotated type - 使用 pydantic 的 AfterValidator > 發現這兩種寫法是等價的: > FullName = Annotated[ str, WithJsonSchema( { "type": "string", "description": "The user's full name", } )] class UserDetail(BaseModel): name: FullName 跟 class UserDetail(BaseModel): name: : str = Field( escription="The user's full name") - Using Field object - 也可以用 Field 搭配 Annotated 來寫驗證方法 - Providing Context - 使用 pydantic 的 ValidationInfo 動態傳參數 - Using OpenAI moderation - 直接用 instructor 的 openai_moderation 方法 - https://platform.openai.com/docs/guides/moderation/overview - Using LLM validator - 使用 instructor 的 llm_validator - Avoiding hallucination with citations - 檢查回傳的 citation 文字有在當初 prompt 裡面的 context 之中 - 呼叫 client.chat.completions.create 時透過 validation_context 參數將 context 傳入給 validator 使用 - 實務上，可以用 regex 或 semantic similarity 來檢查 - Reasking with validators - 當有錯誤時，要求 LLM 重寫 - 使用 max_retries 參數 - https://github.com/openai/openai-python - Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors are all retried by default. - 這個 openai library 碰到例外本來就會 retry ## 5. Course Project 要交一份 Weights & Biases report 作業可以拿到上課證書還會有 Part 2 課程