edX LLM Application through Production

> Subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse the [[Generative AI Engineer 知識庫]].

Course URL: https://www.edx.org/professional-certificate/databricks-large-language-models

Lab code: https://github.com/databricks-academy/large-language-models

A certificate course from Databricks and edX aimed at developers, data scientists, and engineers who want to build LLM-centric applications. Note: the content mainly uses Hugging Face and open-source LLMs rather than the OpenAI API, although for practical applications calling the OpenAI API directly is still simpler and gives better results. The course is free, but submitting assignments in the online Lab environment and earning the certificate require payment. I previously took [Introduction to Big Data with Spark](https://ihower.tw/blog/archives/8495) and [Scalable Machine Learning](https://ihower.tw/blog/archives/8502), both offered by edX and Databricks. They were good and I learned a lot from the lab assignments, so I came back for this one.

> 2023/7/1: Verified Certificate earned: https://courses.edx.org/certificates/80fba6c099774532a8d6dd04c7349844
>
> 2023/9: Professional Certificate earned: https://credentials.edx.org/credentials/89c341dce41244548a69aed8b4236e88

## Module 0: Course Introduction

* Why LLMs?
* Primer on NLP
* Introduction to NLP
* Language Model
    * What makes a language model "Large" compared with other language models
* Tokenization
    * Sentence, Word, Sub-word, Character
    * Byte Pair Encoding (BPE) is a popular sub-word method
        * OpenAI API tokens use BPE
* Word Embeddings for finding similar text
    * How to use vectors to represent context, e.g. word2vec

> Starting from NLP is an unusual choice, but it does lay a good foundation. After all, LLMs grew out of NLP, and many of the concepts and terms come from NLP.
>
> Still, the word embeddings part surprisingly never mentions Google T5 or OpenAI embeddings?

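
To make the BPE bullet above more concrete, here is a minimal sketch of how byte pair encoding learns sub-word merges. It is a toy version (character-level start, no end-of-word marker, no byte fallback), not the tokenizer used by OpenAI or Hugging Face; the function names `learn_bpe` and `apply_bpe` and the tiny corpus are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a list of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    # Each word starts as a tuple of single characters; the Counter keeps frequencies.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the pair seen first
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe(corpus, 2)          # -> [('l', 'o'), ('lo', 'w')]
tokens = apply_bpe("lowest", merges)   # -> ['low', 'e', 's', 't']
```

Frequent character sequences such as `low` end up as single tokens, while rarer suffixes stay split into smaller pieces, which is the sub-word behavior the notes describe.
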
* Introduction Resources
    * NLP
        * https://online.stanford.edu/courses/xcs224n-natural-language-processing-deep-learning
        * https://huggingface.co/learn/nlp-course/chapter1/1 (the NLP course by Hugging Face)
    * Language Modeling
        * https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        * https://www.kaggle.com/code/vipulgandhi/bag-of-words-model-for-beginners
        * https://colah.github.io/posts/2015-08-Understanding-LSTMs/
        * https://web.stanford.edu/~jurafsky/slp3/
    * Word Embeddings
        * https://www.tensorflow.org/tutorials/text/word2vec
        * https://www.tensorflow.org/text/guide/word_embeddings
    * Tokenization
        * https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
        * https://github.com/google/sentencepiece
        * https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html

> You have to wait 14 days before you can use the Lab Environment (provided by [Vocareum](https://www.vocareum.com/)) for Modules 4 & 5.
>
> The notebook code is at https://github.com/databricks-academy/large-language-models

## Module 1: Applications with LLMs

### Introduction

* This module uses Hugging Face datasets, pipelines, tokenizers, and models
* Learn to find useful models on the Hugging Face Hub
* NLP tasks: https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt
* NLP ecosystem
    * https://github.com/huggingface/transformers for pre-trained DL models and pipelines
    * Classic NLP tools: NLTK, SpaCy, Gensim, Spark NLP
    * OpenAI, LangChain

### Hugging Face

* Hub
    * Models: https://huggingface.co/models
    * Datasets: https://huggingface.co/datasets
    * Spaces: https://huggingface.co/spaces
* Hugging Face Pipeline
    1. Prompt construction (optional for some models)
    2. Tokenizer (encoding)
    3. Model (LLM)
    4. Tokenizer (decoding)
* Datasets
    * https://huggingface.co/docs/datasets/index

### Model Selection

* Using summarization (https://huggingface.co/tasks/summarization) as the example
* The Hugging Face Hub hosts thousands of models; how do you choose?
    * Filter by task, license, language, model size
    * Sort by popularity, updates
    * Is it a generalist model or one fine-tuned for a specific task?
    * Consider model variants: the same base model in different sizes, or fine-tuned variants
    * Start testing with a small model, then switch to a larger one
    * Also consider the quality of the documentation, examples, and datasets
* Table of LLMs: https://crfm.stanford.edu/ecosystem-graphs/index.html
    * Many models belong to a family or share a pre-training dataset
    * Some are models fine-tuned for a specific task
    * A larger model is not necessarily a better one

![[Pasted image 20230701173008.png]]

### Common NLP Tasks

* Summarization
* Sentiment analysis
* Translation
* Zero-shot classification
* Few-shot learning: no task-specific fine-tuning; instead you give the model an instruction and a few examples
* Conversation/chat
* (Table) Question-answering
* Text / token classification
* Text generation

> So far this is rather dull; it mostly introduces LLMs from an NLP angle.

### Prompts

* Foundation models differ from instruction-following models
    * The former only continue the text; the latter follow your instructions
* A prompt is the input or query given to an LLM to elicit a response

### Prompt Engineering

* **Prompt engineering is model-specific!**
    * Different models need different prompts
    * Many current guidelines target ChatGPT (or OpenAI models)
    * They may not apply to non-ChatGPT models
* Different use cases may need different prompts, so iterative development is key
* A good prompt includes: 1. Instruction, 2. Context/Example, 3. Input/Question, 4. Output type/format
* Use clear instructions
    * Use specific keywords: Classify, Translate, Summarize, Extract
    * Include detailed instructions
* Try prompt variations on different samples and see which prompt performs best on average
* How to get better results?
    * Ask the model not to make things up
        * Do not make things up if you do not know. Say 'I do not have that information'
    * Ask it not to use sensitive information
        * Do not make assumptions based on nationalities
        * Do not ask the user to provide their SSNs
    * Ask it to take more time to think (Chain of Thought for Reasoning)
        * Explain how you solve this math problem
        * Do this step-by-step. Step 1: Summarize into 100 words. Step 2: Translate from English to French...
* Prompt formatting tips
    * Use delimiters to separate the instruction from the context
        * Pound signs, backticks, braces, brackets, dashes, etc.
        * This helps guard against prompt injection, prompt hacking, prompt leaking, jailbreaking, etc.
    * Other techniques for handling prompt hacking
        * Filter the output with another LLM
        * Repeat the instruction at the end of the prompt
        * Wrap the user input in a random string
            * ihower: I think the string-replacement trick in [[Building Systems with the ChatGPT API]] works better
    * Ask for a return format: HTML, JSON, table, markdown, etc.
    * Provide an example, e.g.
        * Return the movie name mentioned in the form of a Python dictionary. The output should look like {'Title': 'In and Out'}
* References (some are OpenAI-only)
    * https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
    * https://www.promptingguide.ai/
    * https://learn.deeplearning.ai/chatgpt-prompt-eng/lesson/1/introduction
    * https://learnprompting.org/docs/intro
    * https://github.com/brexhq/prompt-engineering
* Tools
    * https://coefficient.io/ai-prompt-generator
    * https://replicate.com/kyrick/prompt-parrot

### Module 1 Resources

* NLP tasks
    - [Hugging Face tasks page](https://huggingface.co/tasks)
    - [Hugging Face NLP course chapter 7: Main NLP Tasks](https://huggingface.co/course/chapter7/1?fw=pt)
    - Background reading on specific tasks
        - Summarization: [Hugging Face summarization task page](https://huggingface.co/tasks/summarization) and [course section](https://huggingface.co/learn/nlp-course/chapter7/5)
        - Sentiment Analysis: [Blog on “Getting Started with Sentiment Analysis using Python”](https://huggingface.co/blog/sentiment-analysis-python)
        - Translation: [Hugging Face translation task page](https://huggingface.co/docs/transformers/tasks/translation) and [course section](https://huggingface.co/learn/nlp-course/chapter7/4)
        - Zero-shot classification: [Hugging Face zero-shot classification task page](https://huggingface.co/tasks/zero-shot-classification)
        - Few-shot learning: [Blog on “Few-shot learning in practice: GPT-Neo and the 🤗 Accelerated Inference API”](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api)
    - [Hugging Face
Hub](https://huggingface.co/docs/hub/index)
        - [Models](https://huggingface.co/models)
        - [Datasets](https://huggingface.co/datasets)
        - [Spaces](https://huggingface.co/spaces)
* Hugging Face libraries
    - [Transformers](https://huggingface.co/docs/transformers/index)
        - Blog post on inference configuration: [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
    - [Datasets](https://huggingface.co/docs/datasets)
    - [Evaluate](https://huggingface.co/docs/evaluate/index)
* Models
    - Base model versions of models used in the demo notebook
        - [T5](https://huggingface.co/docs/transformers/model_doc/t5)
        - [BERT](https://huggingface.co/docs/transformers/model_doc/bert)
        - [Marian NMT framework](https://huggingface.co/docs/transformers/model_doc/marian) (with 1440 language translation models!)
        - [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta) (Also see [DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2))
        - [GPT-Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo) (Also see [GPT-NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox))
    - [Table of LLMs](https://crfm.stanford.edu/ecosystem-graphs/index.html)
* Prompt engineering
    - [Best practices for OpenAI-specific models](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
    - [DAIR.AI guide](https://www.promptingguide.ai/)
    - [ChatGPT Prompt Engineering Course](https://learn.deeplearning.ai/chatgpt-prompt-eng) by OpenAI and DeepLearning.AI
    - [🧠 Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts) for fun examples with ChatGPT

### Lab 01: LLMs with Hugging Face

https://github.com/databricks-academy/large-language-models

* Implemented with Hugging Face Pipelines
    * Summarization: the demo uses the Google T5 model; for the assignment you pick any model you like
    * Sentiment analysis: the demo uses the Google BERT model
    * Translation: the demo uses Helsinki-NLP/opus-mt-en-es; for the assignment you pick any model you like
        * The lab recommends NLLB, the
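No Language Left Behind model https://huggingface.co/docs/transformers/model_doc/nllb
    * Zero-shot classification
    * Few-shot learning with the gpt-neo-1.3B model, the most powerful model in this notebook
        * For the assignment you write your own prompt
        * The API takes only a few lines of code; the Lab server downloaded a 5.31 GB model and took 8 minutes in total to get running. The gpt-neo model is quite impressive
* Practice tuning the tokenizer and model parameters of t5-small

## Module 2: Embeddings, Vector Databases, and Search

### Introduction

* Knowledge-based question answering (QA)
    * The most common application at Databricks; many enterprises have their own datasets and knowledge
* Goals: understand vector search strategies and how to evaluate them; understand vector databases, best practices, and how to improve performance
* How do language models learn knowledge?
    * Through model training or fine-tuning
        * Suited to specialized tasks
        * Analogy: studying hard before an exam
    * Through model inputs: insert the knowledge into the input so the LLM can answer with it
        * Use vectors to search for relevant content to give the LLM
        * Analogy: an open-book exam
        * Drawback: context token length limits
        * Drawback: longer contexts mean higher API costs and longer processing time
* Besides text, images and audio can also be converted to vectors
* Uses of vector databases
    * Similarity search: text, images, audio
        * Semantic search rather than keyword matching; a great fit for knowledge-based QA
    * Recommendation engines
        * https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
    * Finding security threats
        * Vectorize virus binaries, then look for similar ones
* Explains the Search and Retrieval-Augmented Generation architecture and workflow

### How does Vector Search work?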
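
As a concrete reference for the exact-search idea covered in this section, here is a minimal sketch of brute-force k-nearest-neighbor search together with the two metrics these notes name, L2 distance and cosine similarity. It uses plain Python lists so it runs without any vector library; the function names and the toy vectors are assumptions for illustration, not code from the course lab.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_exact(query, vectors, k):
    """Brute force: score the query against every vector, keep the k closest by L2."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return order[:k]


# Toy 2-D "embeddings"; real ones would come from an embedding model.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]

nearest = knn_exact(query, vectors, k=2)  # -> [0, 2]
```

Every query scans the whole collection, so the cost grows linearly with the number of stored vectors. That is the motivation for the approximate (ANN) indexes listed below, which give up some accuracy for speed.
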
* Exact search
    * Uses a brute-force method
    * KNN (k-nearest neighbors)
* Approximate nearest neighbors (ANN)
    * Less accurate but faster
    * It is a kind of indexing algorithm
    * The output is a vector index
    * Examples:
        * Tree-based: ANNOY by Spotify https://github.com/spotify/annoy
        * Proximity graphs: HNSW https://arxiv.org/abs/1603.09320
        * Clustering: FAISS by Facebook https://github.com/facebookresearch/faiss
        * Hashing: LSH https://en.wikipedia.org/wiki/Locality-sensitive_hashing
        * Vector compression: ScaNN by Google https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html
    * FAISS and HNSW are currently the algorithms most commonly implemented in vector stores
* How to compute vector similarity
    * Distance metrics: L2 (Euclidean)
    * Similarity metrics: Cosine
    * These are the two most common
* Product Quantization (PQ)
    * Compresses vectors to save memory
* FAISS drawback: not suitable for sparse vectors
* At this point we can search for similar items, not just do the fuzzy text or exact match of the past

### Filtering

* How do vector databases implement a filtering function?
    * Each implementation is a bit different
    * There are three main strategies
        * Post-query
            * After the user query, run the similarity search first, then filter the top-k results
            * But the number of results is unpredictable; after filtering there may be nothing left that satisfies the condition
        * In-query
            * The algorithm does filtering and similarity search at the same time
            * Besides vectors, the data also needs metadata to filter on
            * Suited to row-based data
        * Pre-query
            * Slower, because it first has to run a brute-force filtering pass

> Pinecone turns out to support metadata filtering: https://docs.pinecone.io/docs/metadata-filtering
>
> FAISS does not; discussion thread: https://github.com/facebookresearch/faiss/issues/1079

### Vector Stores

* Broadly speaking, vector stores include vector databases, libraries, plugins, etc.
* Why?
    * Query time
    * Scalability
* Is a library or plugin enough?
    * Many do not support filter queries (WHERE)
    * An ANN library is enough to build vector indices
        * But there is no CRUD support, so if the data changes the whole index has to be rebuilt
        * In-memory
    * Plugins (adding a plugin on top of an existing DB architecture)
        * elasticsearch
            * https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0 ?
        * pgvector
        * But they usually have fewer features, e.g. fewer metric choices and ANN choices
        * Less friendly APIs
        * But this area is moving fast
* Use a vector database?
    * Pros
        * Scalability: millions of records or more
        * Speed
        * Full-fledged database properties
            * When you need filtering
            * When the data changes frequently
    * Cons: learning cost, usage cost

![[Pasted image 20230615141955.png]]

### Best Practices

* Vector stores extend LLMs with knowledge
    * The relevant documents returned become the LLM context
    * Context can reduce hallucination
* Use cases that do not need context augmentation
    * Summarization, Text classification, Translation
* How to improve retrieval performance
    * Choosing the embedding model
        * Was your embedding model trained on data similar to yours?
        * The embedding model should be able to represent both your queries and your documents
    * Ensure the embedding space is the same for both queries and documents
        * Use the same embedding model for queries and documents
        * Also make sure the vector database contains the relevant data you want
            * e.g. if the query is about movies but the vector store holds medical content, results will be poor
* Document storage strategy
    * This area is relatively new and not yet well defined
    * Should you split into chunks?
        * How relevant is each chunk to the prompt?
        * If the results go to an LLM, they must fit the LLM's token limit
        * This is very use-case specific
            * It takes iterative experimentation
        * How long are the documents?
            * If 1 chunk == 1 sentence, the embedding represents a specific meaning
            * If 1 chunk == multiple paragraphs, the embedding represents a broader topic
                * e.g. split by headers
        * What about user behavior? How long are their queries?
            * Long queries find relevant chunks more easily
            * Short queries work better paired with short chunks
    * Resources
        * [ ] https://python.langchain.com/en/latest/modules/indexes/text_splitters.html
        * [ ] https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/
        * [ ] https://www.pinecone.io/learn/chunking-strategies/
* How do you know when it has failed?
    * On the user side, add explicit instructions, e.g.:
        * Tell me the top 3 hikes in California. If you do not know the answer, do not make it up. Say 'I don’t have information for that.'
    * On the engineering side, you can:
        * Add failover logic
            * When distance-x exceeds some threshold y, show a canned message instead of showing the user nothing
        * Put a toxicity classification model in front, so that bad messages sent by users are not stored in the VDB (and retrieved later)
        * Give the vector DB a timeout

### Module 2 Resources

* Research papers on increasing context length limitation
    - [Pope et al 2022](https://arxiv.org/abs/2211.05102)
    - [Fu et al 2023](https://arxiv.org/abs/2212.14052)
* Industry examples on using vector databases
    - FarFetch
        - [FarFetch: Powering AI With Vector Databases: A Benchmark - Part I](https://www.farfetchtechblog.com/en/blog/post/powering-ai-with-vector-databases-a-benchmark-part-i/)
        - [FarFetch: Powering AI with Vector Databases: A Benchmark - Part 2](https://www.farfetchtechblog.com/en/blog/post/powering-ai-with-vector-databases-a-benchmark-part-ii/)
        - [FarFetch: Multimodal Search and Browsing in the FARFETCH Product Catalogue - A primer for conversational search](https://www.farfetchtechblog.com/en/blog/post/multimodal-search-and-browsing-in-the-farfetch-product-catalogue-a-primer-for-conversational-search/)
    - [Spotify: Introducing Natural Language Search for Podcast Episodes](https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/)
    - [Vector Database Use Cases compiled by Qdrant](https://qdrant.tech/use-cases/)
* Vector indexing strategies
    - Hierarchical Navigable Small Worlds (HNSW) [Malkov and Yashunin 2018](https://arxiv.org/abs/1603.09320)
    - Facebook AI Similarity Search (FAISS) [Meta AI Blog](https://ai.facebook.com/tools/faiss/)
    - Product quantization [PQ for Similarity Search by Peggy Chang](https://towardsdatascience.com/product-quantization-for-similarity-search-2f1f67c5fddd)
* Cosine similarity and L2 Euclidean distance
    - [Cosine and L2 are functionally the same when applied on normalized embeddings](https://stats.stackexchange.com/questions/146221/is-cosine-similarity-identical-to-l2-normalized-euclidean-distance)
* Filtering methods
    - [Filtering: The Missing WHERE Clause in Vector Search by
Pinecone](https://www.pinecone.io/learn/vector-search-filtering/)
* Chunking strategies
    - [Chunking Strategies for LLM applications by Pinecone](https://www.pinecone.io/learn/chunking-strategies/)
    - [Semantic Search with Multi-Vector Indexing by Vespa](https://blog.vespa.ai/semantic-search-with-multi-vector-indexing/)
* Other general reading
    - [Vector Library vs Vector Database by Weaviate](https://weaviate.io/blog/vector-library-vs-vector-database)
    - [Not All Vector Databases Are Made Equal by Dmitry Kan](https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696)
    - [Open Source Vector Database Comparison by Zilliz](https://zilliz.com/comparison)
    - [Do you actually need a vector database? by Ethan Rosenthal](https://www.ethanrosenthal.com/2023/04/10/nn-vs-ann/)

### Lab 02

* Hands-on with embeddings, vector databases, and search, using
    * A dataset from https://newscatcherapi.com/
    * SentenceTransformers as the encoder
    * FAISS and ChromaDB
    * A Hugging Face model (gpt2 works) or OpenAI for QA
    * Pinecone or Weaviate (optional)

## Module 3: Multi-stage Reasoning

* Multi-stage reasoning using prompt templates, CoT, chaining prompts, etc.
* LLM pipelines
* LLM Tasks vs.
LLM-based Workflows
    * A task is a single interaction with an LLM
    * A workflow is an application with more than a single interaction
* Templating: Summarization template
* LLM Chains
    * LangChain: for multi-stage reasoning and LLM-based workflows
* Multi-stage LLM Chains
    * Demo of a sequential flow: summarize first, then run sentiment analysis
    * LLMMath chain
* LLM Agents: Building reasoning loops
    * Reason + Action loop (ReAct): let the LLM choose tools
* LLM plugins keep emerging, for example
    * LangChain
    * Transformers Agents
    * ChatGPT plugins
    * Google PaLM 2 with workspace
* AutoGPT: self-directing agents

![[Pasted image 20230701172919.png]]

> This module is supposed to cover CoT and chaining prompts, but it covers so little, even less than [[Building Systems with the ChatGPT API]]. Rather disappointing :(

### Module Resources

* LLM Chains
    - [LangChain](https://docs.langchain.com/)
    - [OpenAI ChatGPT Plugins](https://platform.openai.com/docs/plugins/introduction)
* LLM Agents
    - [Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents)
    - [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT)
    - [Baby AGI](https://github.com/yoheinakajima/babyagi)
    - [Dust.tt](https://dust.tt/)
* Multi-stage Reasoning in LLMs
    - [CoT Paradigms](https://matt-rickard.com/chain-of-thought-in-llms)
    - [ReAct Paper](https://react-lm.github.io/)
    - [Demonstrate-Search-Predict Framework](https://github.com/stanfordnlp/dsp)

### Lab 03

* Implemented with LangChain
* JekyllHyde: an AI self-moderating system for social media
    * Uses SequentialChain to connect two chains
    * HuggingFace API (OpenAI also works)
* DaScie: a vector DB data science AI agent
    * Uses ReAct and a vector DB to carry out data science tasks as instructed
    * langchain tools used: wikipedia, serpapi, python_repl, terminal
    * Uses langchain's create_pandas_dataframe_agent method to load a ready-made Kaggle dataset for pandas analysis
    * Demonstrates random forest machine learning
    * Builds a Question-Answer (QA) LLMChain with ChromaDB
        * https://huggingface.co/inference-api
        * Embedding model: HuggingFace sentence-transformers/all-MiniLM-L6-v2
            * Via langchain's HuggingFaceEmbeddings
        * LLM: HuggingFace google/flan-t5-large
            * Via langchain's
HuggingFacePipeline
        * Uses the https://www.gutenberg.org dataset (langchain has a GutenbergLoader)

## Module 4: Fine-tuning and Evaluating LLMs

* How to improve model quality: few-shot learning and fine-tuning
* LLM fine-tuned versions
    * base model
    * chat model
    * instruct model

### Applying Foundation LLMs

* Comparing the LLM choices in an LLM pipeline
    * Few-shot learning with an open-source LLM
    * Open-source instruction-following LLM
    * Paid LLM-as-a-Service
    * Build your own

### Fine-Tuning: Few-shot learning

* Pros: fast development, fast performance, low cost
* Cons
    * Needs example data (the examples in the prompt)
    * Size effect: if the task needs a large model, hardware requirements are higher

### Fine-Tuning: Instruction-following LLMs

* Pros:
    * Can do zero-shot learning, no examples needed
    * Performance should be fast enough, and cost is low
* Cons:
    * If the model was not fine-tuned on similar tasks, quality may be poor
    * The same size effect applies

### Fine-Tuning: LLMs-as-a-Service

* Pros
    * Fast development
    * No worries about using a large model; performance is handled on the server side
* Cons
    * Cost
    * Data Privacy/Security
    * Vendor lock-in

### Fine-tuning: DIY

* Training a base model from scratch is hardly feasible; the resources required are enormous
* Fine-tune an existing base model instead
    * Pros
        * You can build a task-specific model for your task
        * Inference cost can be made lower
        * Better control
    * Cons
        * Time and compute cost
        * Needs a large dataset
        * Needs specialized skill sets
    * Examples
        * Self-instruct (Alpaca and Dolly v1)
            * Uses another LLM to generate the dataset
        * High-quality fine-tune (Dolly v2)
            * An instruction-following LLM
            * Fine-tuned from EleutherAI's Pythia 12B base model on the databricks-dolly-15k dataset
            * Open source and licensed for commercial use

### Evaluating LLMs

* Perplexity: a good LM should have high accuracy and low perplexity
    * accuracy = next word is right or wrong
    * perplexity = how confident was that choice
* Different NLP tasks have different metrics

### Task-specific Evaluations

* Translation uses BLEU
    * BiLingual Evaluation Understudy
* Summarization uses ROUGE
* Benchmarks built on datasets: SQuAD
    * Stanford Question Answering Dataset - reading comprehension
* Metrics used for ChatGPT
    * Target application
        * NLP tasks
        * Queries chosen to match the API distribution (?)
    * Human preference
        * Alignment: Helpful, Honest, Harmless

### Guest Lecture from Harrison Chase (Creator of LangChain)

On Evaluation of LLM Chains and Agents

* Why hard?
    * Lack of data
    * Lack of metrics
* Possible solutions
    * Lack of data
        * Generate data with an LLM
        * Accumulate it gradually in production
    * Lack of metrics
        * Visualize the process to aid inspection
        * Evaluate with an LLM
        * User feedback
* Offline evaluation
    * Build a dataset first, run it, inspect visually, and score automatically with an LLM
* Online evaluation
    * Each time a datapoint comes in
        1. Direct user feedback (thumbs up/down)
        2. Indirect feedback (whether a link was clicked)
        3. Keep tracking feedback over time

### Module Resources

* Fine-tuned models
    - [HF leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
    - [MPT-7B](https://www.mosaicml.com/blog/mpt-7b)
    - [Stanford Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html)
    - [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
    - [DeepSpeed on Databricks](https://www.databricks.com/blog/2023/03/20/fine-tuning-large-language-models-hugging-face-and-deepspeed.html)
* Databricks’ Dolly
    - [Dolly v1 blog](https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html)
    - [Dolly v2 blog](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm)
    - [Dolly on Hugging Face](https://huggingface.co/databricks/dolly-v2-12b)
    - [Build your own Dolly](https://www.databricks.com/resources/webinar/build-your-own-large-language-model-dolly)
* Evaluation and Alignment in LLMs
    - [HONEST](https://huggingface.co/spaces/evaluate-measurement/honest)
    - [LangChain Evaluate](https://docs.langchain.com/docs/use-cases/evaluation)
    - [OpenAI’s post on InstructGPT and Alignment](https://openai.com/research/instruction-following)
    - [Anthropic AI Alignment Papers](https://www.anthropic.com/index?subjects=alignment)

### Lab 04

* Actually fine-tune a base model
    * Base model: the demo uses the T5 small model; the assignment uses [pythia-70m-deduped](https://huggingface.co/EleutherAI/pythia-70m-deduped)
    * Datasets
        * The demo uses IMDB movie reviews labeled positive, negative, or neutral
        * The assignment uses databricks/databricks-dolly-15k
    * Uses Nvidia CUDA
    * TrainingArguments from HuggingFace
    * TensorBoard to visualize the process
    * The T5 small demo takes about 6 minutes for 1 epoch on a g5.2xlarge cluster machine
    * The pythia assignment takes about 30 minutes for 10 epochs
* Uses Microsoft DeepSpeed
https://github.com/microsoft/DeepSpeed
    * Speeds up training on multiple GPUs
* Evaluates summaries with ROUGE
    * https://en.wikipedia.org/wiki/ROUGE_(metric)
    * Used to evaluate summarization by comparing reference summaries with generated summaries
    * The dataset is cnn_dailymail from Hugging Face datasets
    * Compares t5-base, t5-small, and gpt-2
        * As expected, t5-base scores higher on ROUGE than t5-small
        * Scores: t5-base > t5-small > gpt-2

> The assignment is quite fun; you are basically building a slimmed-down Dolly model.

## Module 5: Society and LLMs

### Risks and Limitations

* Risks and limitations
    * Training data has biases and errors and lacks diversity, especially data from the web
    * Misuse
    * Hallucination: the model makes things up
    * Impact on society: creative industries, employment, the environment

### Hallucination

* Two types
    * Intrinsic: the output contradicts the source
    * Extrinsic: the output cannot be verified from the source, and may be wrong
* Causes
    * The data itself is flawed
    * Model problems
* How to evaluate it

### Mitigation Strategies

* Better data: build faithful datasets
* Keep researching better models
* Three-layered audit: Governance audit, Model audit, Application audit

### Module Resources

* Social Risks and Benefits of LLMs
    - [Weidinger et al 2021 (DeepMind)](https://arxiv.org/pdf/2112.04359.pdf)
    - [Bender et al 2021](https://dl.acm.org/doi/10.1145/3442188.3445922)
    - [Mokander et al 2023](https://link.springer.com/article/10.1007/s43681-023-00289-2)
    - [Rillig et al 2023](https://pubs.acs.org/doi/pdf/10.1021/acs.est.3c01106)
    - [Pan et al 2023](https://arxiv.org/pdf/2305.13661.pdf)
* Hallucination
    - Paper: Survey of Hallucination in Natural Language Generation [Ji et al 2022](https://arxiv.org/pdf/2202.03629.pdf)
* Bias evaluation metrics and tools
    - [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)
    - [Guardrails.ai](https://shreyar.github.io/guardrails/)
    - [Liang et al 2022](https://arxiv.org/pdf/2211.09110.pdf)
* Other general reading
    - [All the Hard Stuff Nobody Talks About when Building Products with LLMs by Honeycomb](https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm)
    - [Science in the age of large language models by Nature Reviews Physics](https://www.nature.com/articles/s42254-023-00581-4)
    - [Language models might be able to self-correct biases—if you ask them by MIT Technology
Review](https://www.technologyreview.com/2023/03/20/1070067/language-models-may-be-able-to-self-correct-biases-if-you-ask-them-to/)

### Lab 05

* Uses HuggingFace Disaggregators https://github.com/huggingface/disaggregators
    * Evaluates bias in data
    * dataset: https://huggingface.co/datasets/wiki_bio
* Uses HuggingFace https://github.com/huggingface/evaluate
    * Evaluates Toxicity https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target
    * Evaluates HONEST https://huggingface.co/spaces/evaluate-measurement/honest
    * Evaluates Regard
        * dataset: https://huggingface.co/datasets/AlexaAI/bold
        * https://huggingface.co/spaces/evaluate-measurement/regard
* Uses SHAP (SHapley Additive exPlanations) https://github.com/slundberg/shap
    * Explains model output
    * Visualizes how much weight each input token contributes to an output token
* Another interpretation approach: https://github.com/kayoyin/interpret-lm
    * Explains why the model predicts this word rather than another
* https://nlptest.org/ for evaluating models

## Module 6: LLMOps

### Traditional MLOps

* MLOps = DevOps + DataOps + ModelOps
* Processes and automation for managing machine learning code, data, and models
* For details see the ebook: https://www.databricks.com/resources/ebook/the-big-book-of-mlops

![[Pasted image 20230702184134.png]]

### LLMOps

* What does bringing in LLMs change about MLOps?
    * Model training: since you no longer train the base model, it shifts to
        * Lighter-weight model fine-tuning
        * Pipeline tuning
        * Prompt engineering
    * User feedback becomes more important, from dev to prod
    * Automated testing becomes very difficult because it needs human evaluation
        * Unlike traditional ML, you cannot simply test offline on batch datasets first
        * LLMs go live with incremental rollouts: release to a subset of users first, then open access to more users once you are confident
    * Production tooling changes
        * More need for GPUs
        * Need for a vector database
    * Cost and performance
        * Things get more complicated, especially when using 3rd-party LLM APIs

### LLMOps Details

* What prompt engineering needs from Ops
    * Track: log queries and responses to drive development iterations
        * Tools such as MLflow https://mlflow.org/
    * Template: standardized prompt formats
        * Tools such as LangChain, LlamaIndex
            * https://python.langchain.com/en/latest/index.html
            * https://gpt-index.readthedocs.io/en/stable/
    * Automate: automated tuning
        * Tools such as DSP https://github.com/stanfordnlp/dsp
* Packaging models or pipelines for deployment
    * Tool: mlflow
    * Standardizes the deployment of models and pipelines
* Scaling out
    * Large data and models require distributed computation
    * Related tools: Distributed Tensorflow, Distributed Pytorch, DeepSpeed, Apache Spark, Ray, Delta Lake, etc.
* Managing cost/performance tradeoffs
    * Metrics you can optimize
        * Cost of queries and training
        * Development cost
        * ROI of the LLM product
        * Accuracy/metrics of model
        * Query latency
    * Tips
        * Start with prompt engineering on existing models; consider fine-tuning once you have data
        * Get human feedback as early as possible
* Human feedback, testing, and monitoring
    * Very important; build it into the application from the start
    * Feedback needs to be logged for analysis and tuning
* Deploying models vs. deploying code
    * Code uses version control and models use a model registry; both distinguish dev, staging, and prod
    * Deploying code means only code is deployed to staging and prod; the model is produced in staging and prod
    * Prompt engineering and pipeline tuning can use the deploy-models pattern, because mlflow makes it easy to wrap them as a model in the registry
* Service infrastructure can be used: for example, vector databases and the web service are deployed separately

### Module Resources

* General MLOps
    - [“The Big Book of MLOps”](https://www.databricks.com/resources/ebook/the-big-book-of-mlops) (eBook overviewing MLOps)
        - Blog post (short) version: [“Architecting MLOps on the Lakehouse”](https://www.databricks.com/blog/2022/06/22/architecting-mlops-on-the-lakehouse.html)
    - MLOps in the context of Databricks
documentation ([AWS](https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/mlops/mlops-workflow), [GCP](https://docs.gcp.databricks.com/machine-learning/mlops/mlops-workflow.html))
* LLMOps
    - Blog post: Chip Huyen on “[Building LLM applications for production](https://huyenchip.com/2023/04/11/llm-engineering.html)”
* [MLflow](https://mlflow.org/)
    - [Documentation](https://mlflow.org/docs/latest/index.html)
    - [Quickstart](https://mlflow.org/docs/latest/quickstart.html)
    - [Tutorials and examples](https://mlflow.org/docs/latest/tutorials-and-examples/index.html)
    - Overview in Databricks ([AWS](https://docs.databricks.com/mlflow/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/mlflow/), [GCP](https://docs.gcp.databricks.com/mlflow/index.html))
* [Apache Spark](https://spark.apache.org/)
    - [Documentation](https://spark.apache.org/docs/latest/index.html)
    - [Quickstart](https://spark.apache.org/docs/latest/quick-start.html)
    - Overview in Databricks ([AWS](https://docs.databricks.com/spark/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/spark/), [GCP](https://docs.gcp.databricks.com/spark/index.html))
* [Delta Lake](https://delta.io/)
    - [Documentation](https://docs.delta.io/latest/index.html)
    - Overview in Databricks ([AWS](https://docs.databricks.com/delta/index.html), [Azure](https://learn.microsoft.com/en-us/azure/databricks/delta/), [GCP](https://docs.gcp.databricks.com/delta/index.html))
    - [Lakehouse Architecture (CIDR paper)](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)

### Lab 06

Uses MLflow to demonstrate building an LLM pipeline

* Can be viewed as a data augmentation pipeline
* Uses the MLflow library
* Track LLM development
    * Uses the MLflow tracking server
    * Uses the MLflow registry
* Test the LLM pipeline
    * Promote to the staging stage
    * Promote to the production stage
* Production workflow for batch or streaming inference, or a serving endpoint
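
The track, register, staging, production flow that Lab 06 walks through can be illustrated without MLflow itself. The following is a minimal sketch using a hypothetical in-memory `PipelineRegistry`; none of these class or method names are MLflow APIs, the stage names are assumed for the example, and `fake_llm` is a placeholder so the code runs offline.

```python
from dataclasses import dataclass, field

# Promotion order assumed for this sketch (not read from any MLflow API).
STAGES = ["None", "Staging", "Production"]


@dataclass
class PipelineVersion:
    version: int
    prompt_template: str
    params: dict
    stage: str = "None"
    runs: list = field(default_factory=list)  # tracked query/response pairs


class PipelineRegistry:
    """Hypothetical stand-in for a tracking server plus a model registry."""

    def __init__(self):
        self.versions = []

    def register(self, prompt_template, params):
        """Store a new pipeline version (prompt template plus parameters)."""
        v = PipelineVersion(len(self.versions) + 1, prompt_template, dict(params))
        self.versions.append(v)
        return v

    def track(self, version, query, response):
        """Record one query/response pair for later analysis and tuning."""
        version.runs.append({"query": query, "response": response})

    def promote(self, version):
        """Move a version one stage forward; an older Production version is archived."""
        nxt = STAGES[min(STAGES.index(version.stage) + 1, len(STAGES) - 1)]
        if nxt == "Production":
            for other in self.versions:
                if other is not version and other.stage == "Production":
                    other.stage = "Archived"
        version.stage = nxt
        return nxt


def fake_llm(prompt):
    # Placeholder for a real model call.
    return f"[summary of {len(prompt)} chars]"


registry = PipelineRegistry()
v1 = registry.register("Summarize: {text}", {"temperature": 0.2})
query = v1.prompt_template.format(text="MLflow tracks experiments.")
registry.track(v1, query, fake_llm(query))
registry.promote(v1)  # None -> Staging, where the pipeline would be tested
registry.promote(v1)  # Staging -> Production
```

Treating the prompt template and its parameters as a versioned artifact is what lets prompt engineering and pipeline tuning follow the same dev, staging, prod path as a trained model, as described in the LLMOps Details section.
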