> Welcome to subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse my [[Generative AI Engineer 知識庫]] knowledge base
Course URL: https://www.edx.org/professional-certificate/databricks-large-language-models
Lab code: https://github.com/databricks-academy/large-language-models
This is a certification course from Databricks in partnership with edX, aimed at developers, data scientists, and engineers, with the goal of building LLM-centric applications.
Note: the course mainly uses Hugging Face and open-source LLMs rather than the OpenAI API, although for practical applications, calling the OpenAI API directly is simpler and works better.
The course itself is free, but you have to pay to use the online Lab environment, submit assignments, and get the certificate.
I previously took the edX/Databricks courses [Introduction to Big Data with Spark](https://ihower.tw/blog/archives/8495) and [Scalable Machine Learning](https://ihower.tw/blog/archives/8502), which were quite good; I learned a lot from the lab assignments, so I signed up for this one too.
> 2023/7/1: Verified Certificate earned: https://courses.edx.org/certificates/80fba6c099774532a8d6dd04c7349844
> 2023/9: Professional Certificate earned: https://credentials.edx.org/credentials/89c341dce41244548a69aed8b4236e88
## Module 0: Course Introduction
* Why LLMs?
* Primer on NLP
* Introduction to NLP
* Language Model
* What makes a language model "Large" compared to other language models
* Tokenization
* Sentence, Word, Sub-word, Character
* Byte Pair Encoding (BPE) is a popular sub-word approach (see the tokenizer sketch after this list)
* The OpenAI API's tokens use BPE
* Word Embeddings for finding similar text
* How to represent context with vectors, e.g. word2vec
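Below is a minimal sketch of sub-word tokenization in practice, using the GPT-2 tokenizer from Hugging Face `transformers` (GPT-2 uses byte-level BPE); the sample sentence and the token splits mentioned in the comments are my own illustrations, not from the course.

```python
from transformers import AutoTokenizer

# GPT-2 uses a byte-level BPE tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization splits text into sub-word units."
tokens = tokenizer.tokenize(text)   # sub-word pieces, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)        # the integer ids actually fed to the model
decoded = tokenizer.decode(ids)     # decoding round-trips back to the original text

print(tokens)
print(ids)
print(decoded)
```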
> Starting from NLP is an interesting choice, but it does lay a solid foundation. After all, LLMs grew out of NLP, and many of the concepts and terms come from NLP.
> Surprising, though, that the word embeddings part doesn't mention Google T5 or OpenAI embeddings?
* Introduction Resources
* NLP
* https://online.stanford.edu/courses/xcs224n-natural-language-processing-deep-learning
* https://huggingface.co/learn/nlp-course/chapter1/1 the NLP course from Hugging Face
* Language Modeling
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf
* https://www.kaggle.com/code/vipulgandhi/bag-of-words-model-for-beginners
* https://colah.github.io/posts/2015-08-Understanding-LSTMs/
* https://web.stanford.edu/~jurafsky/slp3/
* Word Embeddings
* https://www.tensorflow.org/tutorials/text/word2vec
* https://www.tensorflow.org/text/guide/word_embeddings
* Tokenization
* https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
* https://github.com/google/sentencepiece
* https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html
> You have to wait 14 days before you can use the Lab Environment in Modules 4 & 5 (the vendor is [Vocareum](https://www.vocareum.com/)).
The notebook code is at https://github.com/databricks-academy/large-language-models
## Module 1: Applications with LLMs
### Introduction
* This module uses Hugging Face datasets, pipelines, tokenizers, and models
* Learn to find useful models on the Hugging Face Hub
* NLP tasks: https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt
* NLP ecosystem
* https://github.com/huggingface/transformers for pre-trained DL models and pipelines
* Classic NLP tools: NLTK, SpaCy, Gensim, Spark NLP
* OpenAI, LangChain
### Hugging Face
* Hub
* Model: https://huggingface.co/models
* Datasets: https://huggingface.co/datasets
* Spaces https://huggingface.co/spaces
* Hugging Face Pipeline (see the sketch after this list)
1. Prompt construction (optional for some models)
2. Tokenizer (encoding)
3. Model (LLM)
4. Tokenizer (decoding)
* Datasets
* https://huggingface.co/docs/datasets/index
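A minimal sketch of the four pipeline stages above, using the high-level `pipeline` API from `transformers`; the choice of `t5-small` and the sample text are mine.

```python
from transformers import pipeline

# For T5 the pipeline also handles prompt construction (the "summarize: " prefix),
# then tokenizes the input, runs the model, and decodes the generated tokens.
summarizer = pipeline("summarization", model="t5-small")

article = (
    "The Hugging Face Hub hosts thousands of pre-trained models, datasets, "
    "and demo apps that can be combined into end-to-end NLP applications."
)

result = summarizer(article, min_length=5, max_length=30)
print(result[0]["summary_text"])
```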
### Model Selection
* Using Summarization https://huggingface.co/tasks/summarization as the example
* There are thousands of models on the Hugging Face Hub; how do you choose?
* Filter by task, license, language, model size
* Sort by popularity, updates
* Is the model a generalist or fine-tuned for a specific task?
* You can pick among variants of models: different sizes or fine-tuned variants of the same base model
* Start testing with a small model, then move to a bigger one
* Also consider the quality of the documentation, examples, and datasets
* Table of LLMs: https://crfm.stanford.edu/ecosystem-graphs/index.html
* Many models belong to a family or share a pre-training dataset
* Some are models fine-tuned for specific tasks
* A bigger model size does not necessarily mean a better model
![[Pasted image 20230701173008.png]]
### Common NLP Tasks
* Summarization
* Sentiment analysis
* Translation
* Zero-shot classification
* Few-shot learning: instead of fine-tuning the model for a specific task, give it an instruction plus a few examples
* Conversation/chat
* (Table) Question-answering
* Text / token classification
* Text generation
> Fairly dry so far; it mostly introduces LLMs from the NLP angle.
### Prompts
* Foundation models and instruction-following models are different
* The former only continues the text; the latter follows your instructions
* A prompt is the input or query given to an LLM to elicit a response
### Prompt Engineering
* **Prompt engineering is model-specific!**
* Different models need different prompts
* Many of today's guidelines target ChatGPT (or OpenAI models)
* They may not apply to non-ChatGPT models
* Different use cases may need different prompts, so iterative development is key
* A good prompt includes: 1. Instruction, 2. Context/Example, 3. Input/Question, 4. Output type/format
* Use clear instructions
* Use specific keywords: Classify, Translate, Summarize, Extract
* Include detailed instructions
* Try different prompt variants on different samples and see which prompt performs best on average
* How to get better results?
* Ask it not to make things up
* Do not make things up if you do not know. Say 'I do not have that information'
* Ask it not to use sensitive information
* Do not make assumptions based on nationalities
* Do not ask the user to provide their SSNs
* Ask it to spend more time thinking (Chain of Thought for reasoning)
* Explain how you solve this math problem
* Do this step-by-step. Step 1: Summarize into 100 words. Step 2: Translate from English to French...
* Prompt formatting techniques (see the sketch after this list)
* Use delimiters to separate the instruction from the context
* Pound signs, backticks, braces, brackets, dashes, etc.
* Guard against prompt injection, prompt hacking, prompt leaking, jailbreaking, etc.
* Other techniques for handling prompt hacking
* Use another LLM to filter the output
* Repeat the instruction once more at the end of the prompt
* Wrap the user input in a random string
* ihower: but I think the string-replacement trick in [[Building Systems with the ChatGPT API]] is better
* Ask for the output format you want: HTML, JSON, table, Markdown, etc.
* Provide examples, e.g.
* Return the movie name mentioned in the form of a Python dictionary. The output should look like {'Title': 'In and Out'}
* References (some are OpenAI-only)
* https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
* https://www.promptingguide.ai/
* https://learn.deeplearning.ai/chatgpt-prompt-eng/lesson/1/introduction
* https://learnprompting.org/docs/intro
* https://github.com/brexhq/prompt-engineering
* Tools
* https://coefficient.io/ai-prompt-generator
* https://replicate.com/kyrick/prompt-parrot
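A small sketch of a prompt that combines the parts above (instruction, example, delimited input, output format) and uses triple backticks to fence off untrusted user input; the wording and variable names are my own, not from the course.

```python
user_review = "The battery died after two days and support never replied."

prompt = f"""Classify the sentiment of the product review delimited by triple backticks
as 'positive', 'negative', or 'neutral'.

Example:
Review: ```Love it, works exactly as described.```
Sentiment: positive

Review: ```{user_review}```

Return the answer as a Python dictionary, e.g. {{'Sentiment': 'positive'}}.
"""

print(prompt)
```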
### Module 1 Resources
* NLP tasks
- [Hugging Face tasks page](https://huggingface.co/tasks)
- [Hugging Face NLP course chapter 7: Main NLP Tasks](https://huggingface.co/course/chapter7/1?fw=pt)
- Background reading on specific tasks
- Summarization: [Hugging Face summarization task page](https://huggingface.co/tasks/summarization) and [course section](https://huggingface.co/learn/nlp-course/chapter7/5)
- Sentiment Analysis: [Blog on “Getting Started with Sentiment Analysis using Python”](https://huggingface.co/blog/sentiment-analysis-python)
- Translation: [Hugging Face translation task page](https://huggingface.co/docs/transformers/tasks/translation) and [course section](https://huggingface.co/learn/nlp-course/chapter7/4)
- Zero-shot classification: [Hugging Face zero-shot classification task page](https://huggingface.co/tasks/zero-shot-classification)
- Few-shot learning: [Blog on “Few-shot learning in practice: GPT-Neo and the 🤗 Accelerated Inference API”](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api)
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Models](https://huggingface.co/models)
- [Datasets](https://huggingface.co/datasets)
- [Spaces](https://huggingface.co/spaces)
* Hugging Face libraries
- [Transformers](https://huggingface.co/docs/transformers/index)
- Blog post on inference configuration: [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
- [Datasets](https://huggingface.co/docs/datasets)
- [Evaluate](https://huggingface.co/docs/evaluate/index)
* Models
- Base model versions of models used in the demo notebook
- [T5](https://huggingface.co/docs/transformers/model_doc/t5)
- [BERT](https://huggingface.co/docs/transformers/model_doc/bert)
- [Marian NMT framework](https://huggingface.co/docs/transformers/model_doc/marian) (with 1440 language translation models!)
- [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta) (Also see [DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2))
- [GPT-Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo) (Also see [GPT-NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox))
- [Table of LLMs](https://crfm.stanford.edu/ecosystem-graphs/index.html)
* Prompt engineering
- [Best practices for OpenAI-specific models](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
- [DAIR.AI guide](https://www.promptingguide.ai/)
- [ChatGPT Prompt Engineering Course](https://learn.deeplearning.ai/chatgpt-prompt-eng) by OpenAI and DeepLearning.AI
- [🧠 Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts) for fun examples with ChatGPT
### Lab 01: LLMs with Hugging Face
https://github.com/databricks-academy/large-language-models
* Implemented with Hugging Face pipelines
* Summarization: the demo uses a Google T5 model; for the exercise you pick any model you like
* Sentiment analysis: the demo uses a Google BERT model
* Translation: the demo uses Helsinki-NLP/opus-mt-en-es; for the exercise you pick your own
* The lab recommends NLLB, the No Language Left Behind model https://huggingface.co/docs/transformers/model_doc/nllb
* Zero-shot classification
* Few-shot learning uses the gpt-neo-1.3B model, the most powerful model in this notebook (see the sketch after this list)
* The exercise is to write a prompt yourself
* With just a few lines of code against this API, the lab server downloaded a 5.31 GB model and took about 8 minutes in total to get the gpt-neo model running, which is quite impressive
* Exercise: try adjusting the tokenizer and model parameters of t5-small
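A minimal few-shot sketch in the spirit of this lab, using the text-generation pipeline with gpt-neo-1.3B (note the model download is several GB); the prompt itself is my own illustrative example, not the lab's.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

# A few labeled examples followed by the case we want the model to complete.
few_shot_prompt = (
    "For each sentence, name the emotion in a single word.\n\n"
    "Sentence: I just won the lottery!\nEmotion: joy\n\n"
    "Sentence: My flight was cancelled again.\nEmotion: frustration\n\n"
    "Sentence: The sunset over the ocean was breathtaking.\nEmotion:"
)

output = generator(few_shot_prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
```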
## Module 2: Embeddings, Vector Databases, and Search
### Introduction
* knowledge-based question answering (QA)
* The most common application at Databricks; many companies have their own datasets and knowledge
* Understand vector search strategies and how to evaluate them, vector databases, best practices, and how to improve performance
* How do language models learn knowledge?
* Through model training or fine-tuning
* Suited to specialist tasks
* Analogy: studying hard before an exam
* Through model inputs: insert the knowledge into the input and have the LLM answer with it
* Use vectors to search for relevant content to give to the LLM
* Analogy: an open-book exam
* Downside: context token length limits
* Downside: a longer context also means higher API costs and longer processing time
* Besides text, images and audio can also be turned into vectors
* Use cases for vector databases
* Similarity search: text, images, audio
* Semantic search rather than keyword matching; a great fit for knowledge-based QA
* Recommendation engines
* https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
* Finding security threats
* Vectorize virus binaries, then look for similar ones
* Explains the Search and Retrieval-Augmented Generation architecture and workflow
### How does Vector Search work?
* Exact search
* Brute-force method
* KNN (k-nearest neighbors)
* Approximate nearest neighbors (ANN)
* Less accurate but fast
* It is an indexing algorithm
* The output is a vector index
* Examples: (see the FAISS sketch after this list)
* Tree-based: ANNOY by Spotify https://github.com/spotify/annoy
* Proximity graphs: HNSW https://arxiv.org/abs/1603.09320
* Clustering: FAISS by Facebook https://github.com/facebookresearch/faiss
* Hashing: LSH https://en.wikipedia.org/wiki/Locality-sensitive_hashing
* Vector compression: ScaNN by Google https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html
* FAISS and HNSW are the algorithms most commonly implemented by today's vector stores
* How to measure vector similarity
* Distance metrics: L2 (Euclidean)
* Similarity metrics: cosine
* These are the two most common choices
* Product Quantization (PQ)
* Compresses vectors to save memory
* FAISS downside: not well suited to sparse vectors
* With this you can search for similar items, not just the fuzzy-text or exact matching of the past
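A minimal FAISS sketch with toy data: build a flat L2 index over random vectors and query the nearest neighbors. A flat index is exact brute-force search; FAISS's approximate indexes (e.g. IVF or HNSW variants) trade accuracy for speed on larger collections.

```python
import faiss
import numpy as np

dim = 64
np.random.seed(0)
vectors = np.random.random((1000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) search
index.add(vectors)              # build the vector index

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)  # 5 nearest neighbors
print(ids, distances)
```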
### Filtering
* How do vector databases implement filtering?
* Every vendor implements it a bit differently
* There are three main strategies
* Post-query
* Run the similarity search on the user query first, then filter the top-k results
* But you don't know how many results will remain; after filtering there may be nothing left that satisfies the conditions
* In-query
* The algorithm does the filtering and the similarity search at the same time
* Besides the vectors, the data also needs metadata to filter on
* Suited to row-based data
* Pre-query
* Worse performance, because it first needs a brute-force filtering pass
> I found that Pinecone supports metadata filtering: https://docs.pinecone.io/docs/metadata-filtering
> FAISS does not; see the discussion thread: https://github.com/facebookresearch/faiss/issues/1079
### Vector Stores
* Broadly speaking, vector stores include vector databases, libraries, plugins, etc.
* Why?
* Query time
* Scalability
* Is a library or plugin enough?
* Many do not support filter queries (WHERE)
* An ANN library is enough to build vector indices
* But there is no CRUD support, so whenever the data changes the whole index has to be rebuilt
* In-memory
* Using plugins (i.e. adding a plugin on top of an existing database)
* Elasticsearch
* https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
* pgvector
* But features are usually more limited, e.g. fewer metric and ANN choices
* The APIs are less friendly
* Though this area is moving fast
* Use a vector database?
* Pros
* Scalability: when you will have millions of records or more
* Speed
* Full-fledged database properties
* When you need filtering
* When the data changes frequently
* Cons: learning curve and operating cost
![[Pasted image 20230615141955.png]]
### Best Practices
* Vector stores extend LLMs with knowledge
* Return relevant documents to serve as the LLM's context
* Context can reduce hallucination
* Use cases that do not need context augmentation
* Summarization, Text classification, Translation
* How to improve retrieval performance
* Choosing the embedding model
* Was your embedding model trained on data similar to yours?
* The embedding model should be able to represent both your queries and your documents
* Make sure queries and documents share the same embedding space
* Use the same embedding model for queries and documents
* And make sure the vector database actually contains the relevant data you need
* For example, if the query is about movies but the vector store holds medical content, results will be poor
* Document storage strategy
* This area is relatively new and not yet well defined
* Should you split into chunks? (see the chunking sketch after this list)
* How relevant is each resulting chunk to the prompt?
* If the results will be fed to an LLM, they also have to fit the LLM's token limit
* This is very use-case specific
* It takes iterative experimentation
* How long are your documents?
* If 1 chunk == 1 sentence, the embedding represents a specific meaning
* If 1 chunk == multiple paragraphs, the embedding represents a broader topic
* For example, split by headers
* What about user behavior? How long will their queries be?
* Long queries are more likely to match relevant chunks
* Short queries pair better with short chunks
* Resources
* [ ] https://python.langchain.com/en/latest/modules/indexes/text_splitters.html
* [ ] https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
* [ ] https://www.pinecone.io/learn/chunking-strategies/
* How do you know when it has failed?
* On the user side, you can add explicit instructions, e.g.:
* Tell me the top 3 hikes in California. If you do not know the answer, do not make it up. Say 'I don't have information for that.'
* On the engineering side, you can:
* Add failover logic
* When the distance x exceeds some threshold y, show a canned message instead of letting the user see nothing
* Put a toxicity classification model in front, so that bad user input never gets stored in the vector DB (and retrieved later)
* Give the vector DB a timeout
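A minimal chunking sketch using LangChain's RecursiveCharacterTextSplitter (one of the tools linked in the Resources above); the chunk size, overlap, and input file name are illustrative, and the right values depend on your documents, your embedding model, and the LLM's token limit.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_document = open("my_document.txt").read()  # hypothetical input file

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # overlap so sentences aren't cut off at chunk boundaries
)

chunks = splitter.split_text(long_document)
print(len(chunks), chunks[:2])
```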
### Module 2 Resources
* Research papers on increasing context length limitation
- [Pope et al 2022](https://arxiv.org/abs/2211.05102)
- [Fu et al 2023](https://arxiv.org/abs/2212.14052)
- Industry examples on using vector databases
- FarFetch
- [FarFetch: Powering AI With Vector Databases: A Benchmark - Part I](https://www.farfetchtechblog.com/en/blog/post/powering-ai-with-vector-databases-a-benchmark-part-i/)
- [FarFetch: Powering AI with Vector Databases: A Benchmark - Part 2](https://www.farfetchtechblog.com/en/blog/post/powering-ai-with-vector-databases-a-benchmark-part-ii/)
- [FarFetch: Multimodal Search and Browsing in the FARFETCH Product Catalogue - A primer for conversational search](https://www.farfetchtechblog.com/en/blog/post/multimodal-search-and-browsing-in-the-farfetch-product-catalogue-a-primer-for-conversational-search/)
- [Spotify: Introducing Natural Language Search for Podcast Episodes](https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/)
- [Vector Database Use Cases compiled by Qdrant](https://qdrant.tech/use-cases/)
- Vector indexing strategies
- Hierarchical Navigable Small Worlds (HNSW) [Malkov and Yashunin 2018](https://arxiv.org/abs/1603.09320)
- Facebook AI Similarity Search (FAISS) [Meta AI Blog](https://ai.facebook.com/tools/faiss/)
- Product quantization [PQ for Similarity Search by Peggy Chang](https://towardsdatascience.com/product-quantization-for-similarity-search-2f1f67c5fddd)
* Cosine similarity and L2 Euclidean distance
- [Cosine and L2 are functionally the same when applied on normalized embeddings](https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance)
* Filtering methods
- [Filtering: The Missing WHERE Clause in Vector Search by Pinecone](https://www.pinecone.io/learn/vector-search-filtering/)
* Chunking strategies
- [Chunking Strategies for LLM applications by Pinecone](https://www.pinecone.io/learn/chunking-strategies/)
- [Semantic Search with Multi-Vector Indexing by Vespa](https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/)
* Other general reading
- [Vector Library vs Vector Database by Weaviate](https://weaviate.io/blog/vector-library-vs-vector-database)
- [Not All Vector Databases Are Made Equal by Dmitry Kan](https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)
- [Open Source Vector Database Comparison by Zilliz](https://zilliz.com/comparison)
- [Do you actually need a vector database? by Ethan Rosenthal](https://www.ethanrosenthal.com/2023/04/10/nn-vs-ann/)
### Lab 02
* Hands-on with embeddings, vector databases, and search (see the sketch after this list), using:
* A dataset from https://newscatcherapi.com/
* SentenceTransformers as the encoder
* FAISS and ChromaDB
* A Hugging Face model (gpt2 works) or OpenAI for the QA step
* Pinecone or Weaviate (optional)
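A minimal sketch of the lab's building blocks with toy data of my own: encode a few documents with SentenceTransformers, store them in ChromaDB, and run a semantic search for the closest document.

```python
import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The central bank raised interest rates to fight inflation.",
    "A new vaccine shows promising results in clinical trials.",
    "The local football team won the championship last night.",
]

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("news")
collection.add(
    ids=[f"doc-{i}" for i in range(len(docs))],
    documents=docs,
    embeddings=encoder.encode(docs).tolist(),
)

results = collection.query(
    query_embeddings=encoder.encode(["sports results"]).tolist(),
    n_results=1,
)
print(results["documents"])  # semantic match, even without shared keywords
```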
## Module 3: Multi-stage Reasoning
* Multi-stage reasoning, using prompt templates, CoT, chaining prompts, etc.
* LLM pipelines
* LLM Tasks vs. LLM-based Workflows
* A Task is a single interaction with an LLM
* A Workflow is an application with more than a single interaction
* Templating: Summarization template
* LLM Chains
* LangChain: for multi-stage reasoning and LLM-based workflows
* Multi-stage LLM Chains
* Demo of a sequential flow: summarize first, then run sentiment analysis (see the sketch after this list)
* LLMMath chain
* LLM Agents: Building reasoning loops
* Reason + Act loop (ReAct): let the LLM pick tools
* LLM plugins are emerging, for example
* LangChain
* Transformers Agents
* ChatGPT plugins
* Google PaLM 2 with workspace
* AutoGPT: self-directing agents
![[Pasted image 20230701172919.png]]
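A minimal sketch of the sequential flow described above (summarize, then classify sentiment) using LangChain's SimpleSequentialChain; I use the OpenAI LLM wrapper here for brevity, while the course labs wire up Hugging Face models instead, and the imports follow the 2023-era LangChain API.

```python
from langchain.chains import LLMChain, SimpleSequentialChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)  # requires OPENAI_API_KEY in the environment

summarize_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template(
        "Summarize this review in one sentence:\n{review}"
    ),
)
sentiment_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template(
        "Is this summary positive or negative?\n{summary}"
    ),
)

# The output of the first chain becomes the input of the second.
workflow = SimpleSequentialChain(chains=[summarize_chain, sentiment_chain], verbose=True)
print(workflow.run("The food took an hour to arrive and was cold, but the staff were apologetic."))
```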
> This module is supposed to cover CoT and chaining prompts, but it covers very little, even less than [[Building Systems with the ChatGPT API]]. Rather disappointing :(
### Module Resources
* LLM Chains
- [LangChain](https://docs.langchain.com/)
- [OpenAI ChatGPT Plugins](https://platform.openai.com/docs/plugins/introduction)
* LLM Agents
- [Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents)
- [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT)
- [Baby AGI](https://github.com/yoheinakajima/babyagi)
- [Dust.tt](https://dust.tt/)
* Multi-stage Reasoning in LLMs
- [CoT Paradigms](https://matt-rickard.com/chain-of-thought-in-llms)
- [ReAct Paper](https://react-lm.github.io/)
- [Demonstrate-Search-Predict Framework](https://github.com/stanfordnlp/dsp)
### Lab 03
* Implemented with LangChain
* JekyllHyde: an AI self-moderating system for social media
* Uses SequentialChain to chain two chains together
* Uses the HuggingFace API (OpenAI also works)
* DaScie: a data science AI agent over a vector DB
* Uses ReAct with a vector DB to carry out data science tasks from instructions
* LangChain tools used: wikipedia, serpapi, python_repl, terminal
* Uses LangChain's create_pandas_dataframe_agent to load an existing Kaggle dataset and analyze it with pandas
* Demonstrates training a random forest model
* Builds a question-answering (QA) LLMChain with ChromaDB (see the sketch after this list)
* https://huggingface.co/inference-api
* The embedding model is Hugging Face sentence-transformers/all-MiniLM-L6-v2
* Via LangChain's HuggingFaceEmbeddings
* The LLM is Hugging Face google/flan-t5-large
* Via LangChain's HuggingFacePipeline
* The data comes from https://www.gutenberg.org (LangChain has a GutenbergLoader)
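A minimal sketch of that QA chain: ChromaDB as the vector store, HuggingFaceEmbeddings for the embeddings, and a flan-t5 pipeline as the LLM. The document texts are my own toy examples (the lab loads Project Gutenberg books), and the imports follow the 2023-era LangChain API used in the course.

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from transformers import pipeline

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "Project Gutenberg offers over 70,000 free eBooks.",
    "Moby-Dick was written by Herman Melville and published in 1851.",
]
vectordb = Chroma.from_texts(texts, embedding=embeddings)

llm = HuggingFacePipeline(pipeline=pipeline(
    "text2text-generation", model="google/flan-t5-large", max_new_tokens=64
))

# "Stuff" the retrieved chunks into the prompt and let flan-t5 answer.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())
print(qa.run("Who wrote Moby-Dick?"))
```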
## Module 4: Fine-tuning and Evaluating LLMs
* How to improve model quality: few-shot learning and fine-tuning
* LLM fine-tuned versions
* base model
* chat model
* instruct model
### Applying Foundation LLMs
* Comparing the LLM choices within an LLM pipeline
* Few-shot learning with an open-source LLM
* An open-source instruction-following LLM
* A paid LLM-as-a-Service
* Build your own
### Fine-Tuning: Few-shot learning
* Pros: fast to develop, fast to run, low cost
* Cons
* Needs example data (the examples in the prompt)
* Size effect: if a large model is required, the hardware requirements go up
### Fine-Tuning: Instruction-following LLMs
* Pros:
* Works zero-shot, no examples needed
* Performance should be fast enough and cost stays low
* Cons:
* If the model was only fine-tuned on similar tasks, quality may suffer
* The same size effect applies
### Fine-Tuning: LLMs-as-a-Service
* Pros
* Fast to develop
* No worries about using a big model; performance is handled on the server side
* Cons
* Cost
* Data Privacy/Security
* Vendor lock-in
### Fine-tuning: DIY
* Building from scratch up to a base model is basically infeasible; the resources required are enormous
* Instead, fine-tune an existing base model
* Pros
* You can build a task-specific model for your own task
* Inference cost can be made lower
* Better control
* Cons
* Time and compute cost
* Requires a large dataset
* Requires specialized skill sets
* Examples
* Self-instruct (Alpaca and Dolly v1)
* Use another LLM to generate the dataset
* High-quality fine-tune (Dolly v2)
* An instruction-following LLM
* The base model is EleutherAI's Pythia 12B, fine-tuned on the databricks-dolly-15k dataset
* Open source and commercially usable
### Evaluating LLMs
* Perplexity: a good LM should have high accuracy and low perplexity
* accuracy = whether the next word is right or wrong
* perplexity = how confident that choice was
* Different NLP tasks use different metrics (see the perplexity sketch after this list)
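A small sketch of how perplexity reflects the model's confidence: it is the exponential of the average negative log-likelihood of the observed tokens. The probabilities below are made up for illustration.

```python
import math

def perplexity(probs):
    nll = [-math.log(p) for p in probs]   # negative log-likelihood per token
    return math.exp(sum(nll) / len(nll))  # exp of the average NLL

# Probability the model assigned to each correct next token in a sequence.
confident = [0.9, 0.8, 0.95, 0.85]
unsure = [0.3, 0.2, 0.25, 0.4]

print(perplexity(confident))  # low perplexity: confident, accurate predictions
print(perplexity(unsure))     # high perplexity: low confidence in the correct tokens
```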
### Task-specific Evaluations
* Translation uses BLEU
* BiLingual Evaluation Understudy
* Summarization uses ROUGE (see the sketch after this list)
* Benchmarks built from datasets, e.g. SQuAD
* Stanford Question Answering Dataset, for reading comprehension
* Metrics used for ChatGPT
* Target application
* NLP tasks
* Queries chosen to match the API distribution (?)
* Human preference
* Alignment: Helpful, Honest, Harmless
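A minimal sketch of computing ROUGE with Hugging Face's `evaluate` library (the same metric the Module 4 lab uses for summaries); the texts are toy examples.

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```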
### Guest Lecture from Harrison Chase (Creator of LangChain)
A talk on evaluation of LLM chains and agents
* Why is it hard?
* Lack of data
* Lack of metrics
* Possible solutions
* For the lack of data
* Generate data with an LLM
* Accumulate it gradually in production
* For the lack of metrics
* Visualize the process to help you inspect it
* Use an LLM to do the evaluation
* User feedback
* Offline evaluation
* Build a dataset first, run against it, inspect the results visually, and auto-grade with an LLM
* Online evaluation
* For every incoming datapoint: 1. direct user feedback (thumbs up/down) 2. indirect feedback (did they click the link?) 3. keep tracking feedback over time
### Module Resources
* Fine-tuned models
- [HF leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [MPT-7B](https://www.mosaicml.com/blog/mpt-7b)
- [Stanford Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html)
- [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
- [DeepSpeed on Databricks](https://www.databricks.com/blog/2023/03/20/fine-tuning-large-language-models-hugging-face-and-deepspeed.html)
* Databricks’ Dolly
- [Dolly v1 blog](https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html)
- [Dolly v2 blog](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm)
- [Dolly on Hugging Face](https://huggingface.co/databricks/dolly-v2-12b)
- [Build your own Dolly](https://www.databricks.com/resources/webinar/build-your-own-large-language-model-dolly)
* Evaluation and Alignment in LLMs
- [HONEST](https://huggingface.co/spaces/evaluate-measurement/honest)
- [LangChain Evaluate](https://docs.langchain.com/docs/use-cases/evaluation)
- [OpenAI’s post on InstructGPT and Alignment](https://openai.com/research/instruction-following)
- [Anthropic AI Alignment Papers](https://www.anthropic.com/index?subjects=alignment)
### Lab 04
* Hands-on fine-tuning starting from a base model
* The demo uses the T5-small model as the base; the exercise uses [pythia-70m-deduped](https://huggingface.co/EleutherAI/pythia-70m-deduped)
* Datasets used
* The demo uses IMDB movie reviews labeled positive, negative, or neutral
* The exercise uses databricks/databricks-dolly-15k
* Uses Nvidia CUDA
* TrainingArguments from Hugging Face (see the Trainer sketch after this list)
* TensorBoard to visualize the training run
* The T5-small demo takes about 6 minutes for 1 epoch on a g5.2xlarge cluster
* The pythia exercise takes about 30 minutes for 10 epochs
* Uses Microsoft DeepSpeed https://github.com/microsoft/DeepSpeed
* To speed things up on multiple GPUs
* Uses ROUGE to evaluate the summaries
* https://en.wikipedia.org/wiki/ROUGE_(metric)
* For evaluating summarization: comparing reference summaries with generated summaries
* The dataset is cnn_dailymail from Hugging Face datasets
* Evaluates and compares t5-base, t5-small, and gpt-2
* As expected, t5-base gets a better ROUGE score than t5-small
* The ranking is t5-base > t5-small > gpt-2
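A minimal sketch of fine-tuning t5-small with the Hugging Face Trainer and TrainingArguments, in the spirit of this lab; the dataset preprocessing is heavily simplified and the hyperparameters are illustrative.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

dataset = load_dataset("imdb", split="train[:1000]")

def preprocess(batch):
    # Toy seq2seq objective: "summarize" the review back into (truncated) text.
    inputs = tokenizer(["summarize: " + t for t in batch["text"]],
                       truncation=True, max_length=256, padding="max_length")
    labels = tokenizer(batch["text"], truncation=True, max_length=64,
                       padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="t5-small-finetuned",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_dir="logs",        # TensorBoard reads its event files from here
    report_to="tensorboard",
)

Trainer(model=model, args=args, train_dataset=tokenized).train()
```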
> The assignment is quite fun; you are basically building a scaled-down Dolly model.
## Module 5: Society and LLMs
### Risks and Limitations
* Risks and limitations
* The training data contains biases and errors and lacks diversity, especially data scraped from the web
* Misuse
* Hallucination: confidently making things up
* Impact on society: creative industries, jobs, the environment
### Hallucination
* Two kinds
* Intrinsic: contradicts the facts
* Extrinsic: cannot be verified and may be wrong
* Causes
* Problems in the data
* Problems in the model
* How to evaluate it
### Mitigation Strategies
* Better data: build faithful datasets
* Keep researching better models
* Three-layered audit: Governance audit, Model audit, Application audit
### Module Resources
* Social Risks and Benefits of LLMs
- [Weidinger et al 2021 (DeepMind)](https://arxiv.org/pdf/2112.04359.pdf)
- [Bender et al 2021](https://dl.acm.org/doi/10.1145/3442188.3445922)
- [Mokander et al 2023](https://link.springer.com/article/10.1007/s43681-023-00289-2)
- [Rillig et al 2023](https://pubs.acs.org/doi/pdf/10.1021/acs.est.3c01106)
- [Pan et al 2023](https://arxiv.org/pdf/2305.13661.pdf)
* Hallucination
- Paper: Survey of Hallucination in Natural Language Generation [Ji et al 2022](https://arxiv.org/pdf/2202.03629.pdf)
* Bias evaluation metrics and tools
- [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)
- [Guardrails.ai](https://shreyar.github.io/guardrails/)
- [Liang et al 2022](https://arxiv.org/pdf/2211.09110.pdf)
* Other general reading
- [All the Hard Stuff Nobody Talks About when Building Products with LLMs by Honeycomb](https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm)
- [Science in the age of large language models by Nature Reviews Physics](https://www.nature.com/articles/s42254-023-00581-4)
- [Language models might be able to self-correct biases—if you ask them by MIT Technology Review](https://www.technologyreview.com/2023/03/20/1070067/language-models-may-be-able-to-self-correct-biases-if-you-ask-them-to/)
### Lab 05
* Uses Hugging Face's Disaggregators https://github.com/huggingface/disaggregators
* To assess bias in data
* dataset: https://huggingface.co/datasets/wiki_bio
* Uses Hugging Face's https://github.com/huggingface/evaluate (see the toxicity sketch after this list)
* Evaluating toxicity https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target
* Evaluating HONEST https://huggingface.co/spaces/evaluate-measurement/honest
* Evaluating Regard
* dataset: https://huggingface.co/datasets/AlexaAI/bold
* https://huggingface.co/spaces/evaluate-measurement/regard
* Uses SHAP (SHapley Additive exPlanations) https://github.com/slundberg/shap
* To explain model outputs
* Visualizes which input tokens contributed, and with what weight, to each output token
* Another interpretability tool: https://github.com/kayoyin/interpret-lm
* Explains why the model predicted this word rather than another
* https://nlptest.org/ for evaluating models
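A minimal sketch of the toxicity measurement used in this lab, via Hugging Face `evaluate` (which loads facebook/roberta-hate-speech-dynabench-r4-target under the hood); the sample sentences are my own.

```python
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

samples = [
    "Thank you so much for your help!",
    "You are completely useless and everyone hates you.",
]

results = toxicity.compute(predictions=samples)
print(results["toxicity"])  # one toxicity score per input sentence
```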
## Module 6: LLMOps
### Traditional MLOps
* MLOps = DevOps + DataOps + ModelOps
* The processes and automation for managing machine learning code, data, and models
* For details see the ebook: https://www.databricks.com/resources/ebook/the-big-book-of-mlops
![[Pasted image 20230702184134.png]]
### LLMOps
* What changes in MLOps once LLMs come into the picture?
* In the model training part, since you no longer train the base model yourself, the work shifts to
* Lighter-weight model fine-tuning
* Pipeline tuning
* Prompt engineering
* User feedback becomes more important, from dev all the way to prod
* Automated testing becomes very hard, because human evaluation is needed
* Unlike traditional ML, you can't just test offline on batch datasets first
* LLM releases use incremental rollouts: give a portion of users access first, then open it up once you're confident
* Production tooling changes
* More GPUs are needed
* A vector database is needed
* Cost and performance
* Things get more complicated, especially with 3rd-party LLM APIs
### LLMOps Details
* What prompt engineering needs from Ops
* Track: log queries and responses to iterate during development (see the MLflow sketch after this list)
* e.g. MLflow https://mlflow.org/
* Template: standardized prompt formats
* e.g. LangChain, LlamaIndex
* https://python.langchain.com/en/latest/index.html
* https://gpt-index.readthedocs.io/en/stable/
* Automate: automated tuning
* e.g. DSP https://github.com/stanfordnlp/dsp
* Packaging models or pipelines for deployment
* Tool: MLflow
* Standardizes how models and pipelines are deployed
* Scaling out
* Large data and models require distributed computation
* Relevant tools include Distributed TensorFlow, Distributed PyTorch, DeepSpeed, Apache Spark, Ray, Delta Lake, etc.
* Managing cost/performance tradeoffs
* Metrics you can optimize
* Cost of queries and training
* Cost of development
* ROI of the LLM product
* Accuracy/metrics of model
* Query latency
* Tips
* Start with prompt engineering on existing models; once you have data, consider fine-tuning
* Get human feedback as early as possible
* Human feedback, testing, and monitoring
* Very important; build it into the application from the start
* Feedback needs to be logged for analysis and tuning
* Deploying models vs. deploying code
* Code lives in version control and models in a model registry, both separated into dev, staging, and prod
* "Deploying code" means only code is deployed to staging and prod; the models are produced in staging and prod
* Prompt engineering and pipeline tuning can use the deploy-models pattern, because MLflow makes it easy to wrap them as a model and push it to the registry
* Service infrastructure can be used, e.g. vector databases and the web service are deployed separately
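A minimal sketch of tracking prompt-engineering iterations with MLflow, as suggested above: log the prompt template, the parameters, and a quality metric for each run. The parameter and metric names are illustrative.

```python
import mlflow

prompt_template = "Summarize the following support ticket in one sentence:\n{ticket}"

with mlflow.start_run(run_name="prompt-v2"):
    mlflow.log_param("model", "google/flan-t5-large")
    mlflow.log_param("temperature", 0.2)
    mlflow.log_text(prompt_template, "prompt_template.txt")
    # In practice this score would come from human review or an LLM judge.
    mlflow.log_metric("avg_human_rating", 4.1)
```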
### Module Resources
* General MLOps
- [“The Big Book of MLOps”](https://www.databricks.com/resources/ebook/the-big-book-of-mlops) (eBook overviewing MLOps)
- Blog post (short) version: [“Architecting MLOps on the Lakehouse”](https://www.databricks.com/blog/2022/06/22/architecting-mlops-on-the-lakehouse.html)
- MLOps in the context of Databricks documentation ([AWS](https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/mlops/mlops-workflow), [GCP](https://docs.gcp.databricks.com/machine-learning/mlops/mlops-workflow.html))
* LLMOps
- Blog post: Chip Huyen on “[Building LLM applications for production](https://huyenchip.com/2023/04/11/llm-engineering.html)”
* [MLflow](https://mlflow.org/)
- [Documentation](https://mlflow.org/docs/latest/index.html)
- [Quickstart](https://mlflow.org/docs/latest/quickstart.html)
- [Tutorials and examples](https://mlflow.org/docs/latest/tutorials-and-examples/index.html)
- Overview in Databricks ([AWS](https://docs.databricks.com/mlflow/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/mlflow/), [GCP](https://docs.gcp.databricks.com/mlflow/index.html))
* [Apache Spark](https://spark.apache.org/)
- [Documentation](https://spark.apache.org/docs/latest/index.html)
- [Quickstart](https://spark.apache.org/docs/latest/quick-start.html)
- Overview in Databricks ([AWS](https://docs.databricks.com/spark/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/spark/), [GCP](https://docs.gcp.databricks.com/spark/index.html))
* [Delta Lake](https://delta.io/)
- [Documentation](https://docs.delta.io/latest/index.html)
- Overview in Databricks ([AWS](https://docs.databricks.com/delta/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/delta/), [GCP](https://docs.gcp.databricks.com/delta/index.html))
- [Lakehouse Architecture (CIDR paper)](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
### Lab 06
Uses MLflow to demonstrate building an LLM pipeline
* It can be viewed as a data augmentation pipeline
* Uses the MLflow library
* Track LLM development
* Uses the MLflow tracking server
* Uses the MLflow registry
* Test the LLM pipeline
* Promote it to the staging stage
* Promote it to the production stage
* Production workflow for batch or streaming inference, or a serving endpoint