https://www.wandb.courses/courses/steering-language-models
https://www.wandb.courses/courses/take/steering-language-models/lessons/51841451-jason-s-introduction
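> (2024/2) A must-watch course for mastering function calling!! It uses function calling to produce structured output and then validates it
> The instructor has almost no slides; he just walks through the Colab notebooks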
* https://docs.pydantic.dev/latest/
* Author
* https://twitter.com/jxnlco
* https://www.jxnl.co/
* https://jxnl.github.io/instructor/
* Course GitHub: https://github.com/wandb/edu/tree/main/llm-structured-extraction
## 1. Asking LLMs for Structured Data
### Jason
![[Pasted image 20240225162612.png]]
![[Pasted image 20240225162630.png]]
### What are we covering in this chapter?
![[Pasted image 20240225162818.png]]
* [ ] Better Data Extraction Using Pydantic and OpenAI Function Calls
* https://wandb.ai/jxnlco/function-calls/reports/Better-Data-Extraction-Using-Pydantic-and-OpenAI-Function-Calls--Vmlldzo0ODU4OTA3
* This article introduces https://jxnl.github.io/instructor/
### JSON & dictionaries - issue 1 & issue 2
* [ ] colab: Working with structured outputs
* https://github.com/wandb/edu/blob/main/llm-structured-extraction/1.introduction.ipynb
* Pydantic: https://docs.pydantic.dev/latest/
* **`model_validate_json` method**: parses a JSON string and validates it against the model
* **`model_json_schema` method**: converts a Pydantic model to JSON Schema
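
A minimal sketch of the two methods above, assuming Pydantic v2. The `UserDetail` model and its fields are hypothetical and only serve as an illustration; they are not taken from the course notebook.

```python
from pydantic import BaseModel, ValidationError


class UserDetail(BaseModel):
    """Hypothetical model used only for illustration."""

    name: str
    age: int


# Parse and validate a JSON string in one step.
user = UserDetail.model_validate_json('{"name": "Jason", "age": 25}')
print(user.name, user.age)

# Invalid data raises ValidationError instead of silently passing through.
try:
    UserDetail.model_validate_json('{"name": "Jason", "age": "not a number"}')
except ValidationError as exc:
    print(exc.error_count(), "validation error")

# Convert the Pydantic model into a JSON Schema dictionary.
schema = UserDetail.model_json_schema()
print(schema["required"])
```

The schema dictionary is what would typically be handed to a function-calling API as the parameter definition, which is why the model only has to be written once.
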
### Function calling
* [ ] Building a Virtual Assistant with Google Gemini Function Calling
* https://wandb.ai/byyoung3/ml-news/reports/Building-a-Virtual-Assistant-with-Google-Gemini-Function-Calling--Vmlldzo2MzE1NTY1
* [ ] Using LLMs to Extract Structured Data: OpenAI Function Calling in Action
* https://wandb.ai/darek/llmapps/reports/Using-LLMs-to-Extract-Structured-Data-OpenAI-Function-Calling-in-Action--Vmlldzo0Nzc0MzQ3
### Instructor & other libraries
* https://jxnl.github.io/instructor/
* https://www.askmarvin.ai/
* langchain
* https://docs.llamaindex.ai/en/latest/examples/output_parsing/openai_pydantic_program.html
## 2. Prompting LLMs
* [ ] colab: General Tips on Prompting
* https://github.com/wandb/edu/blob/main/llm-structured-extraction/2.tips.ipynb
* Tip 1: Classification
- Use Enums or Literals
- Tip 2: Arbitrary properties
- Define a `Property` class with `key` and `value`
- Define `List[Property]`; you can even state a length limit in the description
- Tip 3: Defining multiple entities
- Define `Iterable[Character]`
- Tip 4: Streaming
- Tip 5: Relationships
- Define a `List[int]` field to express relationships to friends by id
- Tip 6: The Maybe pattern
- Use the `Optional` type
- Lets the LLM return missing data instead of hallucinating a value
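> The resulting JSON schema is actually quite complex and would be hard to write by hand, so no wonder Pydantic is used to generate it
> https://json-schema.org/understanding-json-schema/structuring#structuring-a-complex-schema
> It also uses `anyOf`, `allOf`, and `oneOf`

The fundamental problem with JSON and dictionaries is not that they are incompatible, but that it is hard to guarantee the data is valid and correctly typed.
Pydantic's main advantage is that it provides type checking and validation.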
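
The tips above can be sketched with plain Pydantic v2 models. All class and field names below (`Label`, `Classification`, `Character`, `MaybeCharacter`, and so on) are assumptions made for illustration and may differ from the notebook; no LLM call is made, the dictionaries simply stand in for model output.

```python
from enum import Enum
from typing import List, Literal, Optional

from pydantic import BaseModel, Field


class Label(str, Enum):
    SPAM = "spam"
    NOT_SPAM = "not_spam"


class Classification(BaseModel):
    # Tip 1: an Enum (or a Literal) restricts the answer to known labels.
    label: Label
    language: Literal["en", "zh"]


class Property(BaseModel):
    # Tip 2: arbitrary properties as explicit key/value pairs.
    key: str
    value: str


class Character(BaseModel):
    id: int
    name: str
    properties: List[Property] = Field(
        default_factory=list,
        description="Arbitrary properties; keep this list under 6 items",
    )
    # Tip 5: relationships expressed as ids of other characters.
    friends: List[int] = Field(default_factory=list)


class MaybeCharacter(BaseModel):
    # Tip 6: the Maybe pattern lets the model report missing data
    # instead of inventing a value.
    result: Optional[Character] = None
    error: bool = False
    message: Optional[str] = None


classified = Classification.model_validate({"label": "spam", "language": "en"})
found = MaybeCharacter.model_validate(
    {"result": {"id": 1, "name": "Harry", "friends": [2, 3]}}
)
missing = MaybeCharacter.model_validate(
    {"error": True, "message": "No character mentioned in the text"}
)

# The Optional field is what introduces `anyOf` into the generated schema.
schema = MaybeCharacter.model_json_schema()
print(schema["properties"]["result"])
```

Note that the length limit in the `description` is only a hint to the LLM; Pydantic does not enforce it unless a real constraint or validator is added.
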
## 3. RAG Applications
- [ ] colab: Applying Structured Output to RAG applications
- https://github.com/wandb/edu/blob/main/llm-structured-extraction/3.0.applications-rag.ipynb
* Applying structured output to RAG applications
* [ ] A Gentle Introduction to Retrieval Augmented Generation (RAG)
* https://wandb.ai/cosmo3769/RAG/reports/A-Gentle-Introduction-to-Retrieval-Augmented-Generation-RAG---Vmlldzo1MjM4Mjk1
* Improving extractions
* Extract `topic`, `keywords`, `hypothetical_questions`, and `summary`, which can then be used to build the embedding index
* Adding temporal context
* Use structured output to understand the query and identify the user's intent
* Extract a date range, with an added `chain_of_thought` field
* Extract (rewrite) a `rewritten_query`
* Experiment tracking
* https://docs.wandb.ai/guides
* https://wandb.ai/wandbot/wandbot_public/reports/RAGs-To-Riches-Bringing-Wandbot-into-Production--Vmlldzo1ODU5ODk0
* Parallel processing
* Generate `queries`: a set of search requests containing multiple queries, each targeting a different backend
* Decomposing questions
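* Build a query plan that breaks the question into sub-questions
* Surprisingly, this can extract subquestions and also determine whether they depend on one another!!
> Unfortunately this chapter does not wire up a complete RAG pipeline; it stops at query generation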
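
A rough sketch of the query-understanding models described above, assuming Pydantic v2. Only `chain_of_thought`, `rewritten_query`, and the idea of subquestions with dependencies come from these notes; every other name (`DateRange`, `published_daterange`, `SubQuestion`, `execution_order`) is hypothetical. The dictionaries stand in for what an LLM would return.

```python
from datetime import date
from typing import List

from pydantic import BaseModel, Field


class DateRange(BaseModel):
    chain_of_thought: str = Field(
        description="Reasoning about which dates the question refers to"
    )
    start: date
    end: date


class Query(BaseModel):
    rewritten_query: str = Field(description="Query rewritten for search")
    published_daterange: DateRange


class SubQuestion(BaseModel):
    id: int
    question: str
    dependencies: List[int] = Field(
        default_factory=list,
        description="Ids of sub-questions that must be answered first",
    )


class QueryPlan(BaseModel):
    subquestions: List[SubQuestion]


def execution_order(plan: QueryPlan) -> List[int]:
    """Return sub-question ids so that dependencies come first."""
    done: List[int] = []
    pending = {q.id: set(q.dependencies) for q in plan.subquestions}
    while pending:
        ready = sorted(i for i, deps in pending.items() if deps <= set(done))
        if not ready:
            raise ValueError("circular or unknown dependency")
        done.extend(ready)
        for i in ready:
            del pending[i]
    return done


query = Query.model_validate(
    {
        "rewritten_query": "wandb releases",
        "published_daterange": {
            "chain_of_thought": "'last week' relative to 2024-02-25",
            "start": "2024-02-18",
            "end": "2024-02-25",
        },
    }
)

plan = QueryPlan.model_validate(
    {
        "subquestions": [
            {"id": 3, "question": "Which is larger?", "dependencies": [1, 2]},
            {"id": 1, "question": "What is the population of Canada?"},
            {"id": 2, "question": "What is the population of Japan?"},
        ]
    }
)
print(execution_order(plan))
```

Sub-questions without dependencies (ids 1 and 2 here) could be sent to the backends in parallel, which ties in with the parallel processing point.
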
## 4. Validating LLM Outputs
- [ ] colab: Understanding Validators and controlling responses
- https://github.com/wandb/edu/blob/main/llm-structured-extraction/3.1.validation-rag.ipynb
- Validate the output and regenerate a better response based on the feedback
- Defining validator functions
- Add validation functions using the `Annotated` type
- Use pydantic's `AfterValidator`
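
> I found that these two ways of writing it are equivalent:
>
>     from typing import Annotated
>     from pydantic import BaseModel, Field, WithJsonSchema
>
>     FullName = Annotated[
>         str,
>         WithJsonSchema(
>             {
>                 "type": "string",
>                 "description": "The user's full name",
>             }
>         ),
>     ]
>
>     class UserDetail(BaseModel):
>         name: FullName
>
> and
>
>     class UserDetail(BaseModel):
>         name: str = Field(description="The user's full name")
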
- Using Field object
- `Field` can also be combined with `Annotated` to write validators
- Providing Context
- Use pydantic's `ValidationInfo` to pass parameters in dynamically
- Using OpenAI moderation
- Use instructor's `openai_moderation` method directly
- https://platform.openai.com/docs/guides/moderation/overview
- Using LLM validator
- Use instructor's `llm_validator`
- Avoiding hallucination with citations
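- Check that the returned citation text actually appears in the context given in the original prompt
- When calling `client.chat.completions.create`, pass the context to the validator through the `validation_context` parameter
- In practice, the check can be done with a regex or with semantic similarity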
- Reasking with validators
- When validation fails, ask the LLM to rewrite its answer
- Use the `max_retries` parameter
- https://github.com/openai/openai-python
- Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors are all retried by default.
- The openai library itself already retries when it hits these errors
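
The validation ideas in this chapter can be sketched with Pydantic v2 alone. This is not instructor's implementation: instructor handles `validation_context` and `max_retries` internally, whereas here the context is passed straight to `model_validate` and the retry loop is written by hand. `name_must_contain_space`, `AnswerWithCitation`, `extract_with_retries`, `fake_llm`, and the `text_chunk` key are all hypothetical names.

```python
from typing import Annotated

from pydantic import (
    AfterValidator,
    BaseModel,
    ValidationError,
    ValidationInfo,
    field_validator,
)


def name_must_contain_space(value: str) -> str:
    """Hypothetical rule: a full name needs a first and a last name."""
    if " " not in value:
        raise ValueError("name must contain a space")
    return value.title()


# Validator attached to the type through Annotated + AfterValidator.
FullName = Annotated[str, AfterValidator(name_must_contain_space)]


class UserDetail(BaseModel):
    name: FullName
    age: int


class AnswerWithCitation(BaseModel):
    answer: str
    citation: str

    @field_validator("citation")
    @classmethod
    def citation_exists(cls, value: str, info: ValidationInfo) -> str:
        # The source text arrives through the validation context.
        context = info.context or {}
        text = context.get("text_chunk")
        if text is not None and value not in text:
            raise ValueError(f"citation {value!r} not found in text chunk")
        return value


def extract_with_retries(generate, context, max_retries=2):
    """Sketch of re-asking: feed the validation error back to `generate`."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            return AnswerWithCitation.model_validate(raw, context=context)
        except ValidationError as exc:
            feedback = str(exc)
    raise RuntimeError("validation failed after retries")


# Stand-in for an LLM: the first answer cites text that is not in the context.
context = {"text_chunk": "Jason moved to New York after college."}
attempts = iter(
    [
        {"answer": "He lives in Paris", "citation": "Jason lives in Paris"},
        {"answer": "He moved to New York", "citation": "moved to New York"},
    ]
)


def fake_llm(feedback):
    return next(attempts)


result = extract_with_retries(fake_llm, context)
print(result.citation)
```

The exact substring check is the strictest option; as noted above, a regex or a semantic-similarity threshold could replace it when the model paraphrases its citations.
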
## 5. Course Project
Submitting a Weights & Biases report as the assignment earns the course certificate.
There will also be a Part 2 course.