Full Stack LLM Bootcamp - ihower's Notes

> Subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse my [[Generative AI Engineer 知識庫|Generative AI Engineer knowledge base]].

LLM Bootcamp - Spring 2023: https://fullstackdeeplearning.com/llm-bootcamp/

I also recommend these write-ups by other viewers: https://zhuanlan.zhihu.com/p/629589593 and https://zhuanlan.zhihu.com/p/633033220

As of 2023/6/22, this is the broadest and most in-depth LLM technical video series I have seen. The in-person course was recorded as 11 videos of about 40 to 60 minutes each, and the slides are all released, which is very generous. The course is fairly difficult and information-dense, and it references even more papers. The course ran 4/20 to 4/22, and the recordings were released in May 2023.

## [Launch an LLM App in One Hour](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/launch-an-llm-app-in-one-hour/)

* Why now?
    * One tool can do it all: a large language model
    * To avoid an AI winter, we need to build products that people value.
* Prototyping & Iteration in a Playground
    * Demoed with ChatGPT
* Prototyping & Iteration in a Notebook
    * Demoed with Google Colab
    * Cloud tooling is recommended
    * Starting with a simple UI gets you user feedback quickly
* Deploying an MVP

## [LLM Foundations](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/llm-foundations/)

### Machine learning basics

* Unsupervised Learning, Supervised Learning, Reinforcement Learning, Deep Learning, etc.
* Many people share pre-trained models and datasets on Hugging Face

### Transformer architecture

* Before 2020, deep learning used a different NN architecture for each use case, e.g. CNN, RNN
* Today the mainstream is to use the Transformer architecture for everything
* paper: Attention Is All You Need (2017) https://arxiv.org/abs/1706.03762
* Learning resources
    * https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g13dd67c5ab8_0_3938
    * https://peterbloem.nl/blog/transformers
    * https://blog.nelhage.com/post/transformers-for-software-engineers/
    * The Transformer Family Version 2.0 https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/
    * Transformer models: an introduction and catalog https://arxiv.org/abs/2302.07730
    * [ ] [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)
    * [ ] [Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html)
        * The legendary Andrej Karpathy somehow found time to record a course!!
        * Let's build GPT: from scratch, in code, spelled out.
          https://www.youtube.com/watch?v=kCc8FmEb1nY
        * GPT-2 in under 400 lines of code!

### Notable LLMs

* BERT (2019)
* T5 (2020)
* GPT / GPT-2 (2019)
    * Started using Byte Pair Encoding (the source of what LLMs today call tokens)
* GPT-3 (2020)
* GPT-4 (2023): architecture not disclosed
* Chinchilla (2022): fewer parameters but more training data, and it outperforms Gopher, because most LLMs are actually undertrained
* LLaMA (2023)
    * "Chinchilla-optimal"
    * Interesting that the training data includes GitHub
* T5 and GPT-3 specifically excluded code, but for recently released models roughly 5% of the training data is code
    * OpenAI discovered the benefit with the Codex model (2021), the first model trained on code!!
    * Codex is GPT-3 fine-tuned on code; they found the result reasons better than GPT-3
    * In practice, training on code also improves performance on non-code tasks!!
* Instruction Tuning
    * GPT-3 (2020) era: the mindset was few-shot, e.g. text completion
    * ChatGPT (2022) era: the mindset shifted to zero-shot, e.g. instruction-following
    * The underlying technique is supervised fine-tuning
        * To improve zero-shot performance, the model is fine-tuned on a small, high-quality dataset of instruction-completion pairs
* InstructGPT/GPT-3.5
    * Humans rank GPT-3 outputs, and the model is fine-tuned with RL. It follows instructions much better than base GPT
    * Released as the text-davinci-002 model in the OpenAI API
* ChatGPT
    * Goes further with RLHF on conversations
    * Uses the ChatML format

![[Pasted image 20230616041826.png]]

- [ ] [Yao Fu's How does GPT Obtain its Ability?](https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1)

* The "Alignment Tax" phenomenon
    * Instruction tuning improves the model's zero-shot ability, but at a cost
        * The model's confidence calibration gets worse
            * The base model knows what it knows and completes text accordingly, but after fine-tuning it becomes somewhat confused about what it knows
        * Few-shot ability likely degrades

![[Pasted image 20230616042629.png]]

* Stanford Alpaca model
    * Based on LLaMA, fine-tuned on 52K instruction-following examples "stolen" from text-davinci-003
    * No contract signed: they just called the API for about US$600 (effectively borrowing someone else's RLHF?) and got decent results....
      XD That said, it still isn't as good as GPT-3.
* OpenAssistant Conversations Dataset (2023/4) for instruction tuning, especially chat
    * https://huggingface.co/datasets/OpenAssistant/oasst1
* Retrieval-enhanced Transformer (2021)
    * from DeepMind
    * Not yet as good as LLMs, but the speaker sees it as the future direction
* Training & Inference
    * In the slides but not in the recording; the speaker seems to have run out of time

## [Learn to Spell: Prompt Engineering](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/prompt-engineering/)

Prompt engineering is the art of designing the text that goes into an LM.

### Prompts are Magic Spells

This whole part explains what prompts are in a somewhat mystical way.

* paper: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
* paper: [The Capacity for Moral Self-Correction in Large Language Models](https://arxiv.org/abs/2302.07459)
* paper: [Language Models as Agent Models](https://arxiv.org/abs/2212.01681)
* LMs are statistical models of text, but that framing gives you poor intuitions
    * "Probabilistic programs" give you better intuitions: [Language Model Cascades](https://arxiv.org/abs/2207.10342)
* Conclusions
    * Pretrained models are multiverse document generators
    * With instruction models, you just need to word your wish precisely
        * paper: [Reframing Instructional Prompts to GPTk's Language](https://arxiv.org/abs/2109.07830)
            * Avoid jargon that requires specialist knowledge
            * Bullet-point form works better
            * Imagine you are describing the task to a data annotator
    * All models are agent simulators, but quality varies widely

### Prompting Techniques

* Many prompt engineering papers don't need 8 pages: the trick is one or two sentences, and most of the content is benchmarking
* Few-shot learning is not a great mental model
    * In the GPT-3 era, LMs were called "few-shot learners", but it turns out the model struggles to break away from its pre-training knowledge
    * [ ] [Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm](https://arxiv.org/abs/2102.07350)
    * [ ] [Larger language models do in-context learning differently](https://arxiv.org/abs/2303.03846)
* Tokenization issues
    * Models don't see characters; they see tokens
    * So GPT-3 fails at something as simple as reversing words, unless every letter is explicitly split apart with a separator
    * GPT-4 partially improves on this
* Structured text
    * LLMs like formatted text, which is easier to predict
    * So writing prompts in a more structured way, e.g. formatted as pseudocode, works well
* Decomposition and reasoning
    * Break the task into subtasks
        * [ ] [Decomposed Prompting: A Modular Approach for Solving Complex Tasks](https://arxiv.org/abs/2210.02406)
    * Self-Ask
        * [Measuring and Narrowing the Compositionality Gap in Language Models](https://arxiv.org/abs/2210.03350)
        * Have the model ask follow-up questions to improve its answer
        * Works like automated decomposition
    * CoT: [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
    * "Let's think step by step": [LLMs are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916)
* Self-criticism: "just ask" the model to fix its answer
    * [LMs can Solve Computer Tasks](https://arxiv.org/abs/2303.17491)
* Ensembling: wisdom of the crowd
    * Generate several results at once, then pick among them
    * [ ] [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://arxiv.org/abs/2203.11171)
* These tricks can be combined to get the best results
    * [Challenging BIG-Bench Tasks and Whether CoT Can Solve Them](https://arxiv.org/abs/2210.09261)
    * But keep cost in mind

![[Pasted image 20230617020613.png]]

In this table, ensembling has no effect on latency because you can send the API requests in parallel.

* Do LMs have minds?
    * In the slides but not in the talk; probably ran out of time again

## [Augmented Language Models](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/augmented-language-models/)

* Why?
    * LLMs are good at understanding language, following instructions, and reasoning
    * But they are not good at: up-to-date knowledge, knowledge of your specific data, more challenging reasoning, and interacting with the world
* Providing context is one way to give an LLM up-to-date knowledge, but only a limited amount of information fits
    * The context window is limited
        * GPT-3: 2k
        * GPT-3.5: up to 4k
        * GPT-4: up to 8k, 32k

      ![[Pasted image 20230617235413.png]]
    * Progress is fast, but the window will never fit all the information. Besides, more context costs more money
* This lecture is about how to make the most of a limited context to augment LLMs
* Augmented language models cover the following, and it is a big topic
    * Retrieval: use a bigger corpus
    * Chains: use more LLM calls
    * Tools: use outside sources

### Retrieval augmentation

* Why?
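    * Context-building is information retrieval, which is really a search problem
* Traditional information retrieval
    * Query, e.g. a search string
    * Object, e.g. a document
    * Relevance: how relevant an object is to the query
    * Ranking: ordering the results
    * Traditional search uses inverted indexes
    * Relevance uses boolean search, e.g. AND conditions
    * Ranking uses BM25, built from three factors: TF, IDF, and field length
    * https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
    * Limitation: it relies only on word frequencies and does not capture semantic information, dependencies, etc.
* AI-powered information retrieval via embeddings
    * Search and AI make each other better: AI gives search better representations of data (embeddings), so search finds better context
    * Embeddings are dense, fixed-size vectors that represent data, and they can represent any type of data
    * A good embedding:
        * Is useful for the downstream task: it fits your task
            * If you cannot benchmark on your own task, consult a general leaderboard: https://huggingface.co/spaces/mteb/leaderboard
        * Places similar things close together in vector space
    * Embedding models
        * The OG: Word2Vec
        * Baseline: Sentence transformers
        * Multimodal option: CLIP
        * The one to use: OpenAI (Good, fast and cheap!)
            * Use text-embedding-ada-002
        * State of the art: Instructor
            * [ ] paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741)
            * Prepend a task description to the text being embedded to get task-specific embeddings
            * Like instruction tuning for embeddings
* Embedding relevance and indexes
    * Embed the query, then look up the nearest embeddings in the index
    * Similarity metrics include
        * Cosine similarity, dot product, Euclidean, Hamming, etc.
        * OpenAI recommends cosine and says the choice does not matter much.
        * OpenAI embeddings are normalized to length 1, so cosine similarity and Euclidean distance produce identical rankings, and cosine similarity can be computed slightly faster as a plain dot product.
        * https://platform.openai.com/docs/guides/embeddings/limitations-risks
    * With fewer than about 100k vectors, you can run exact nearest-neighbor (NN) search yourself with NumPy. Beyond that you need an approximate nearest-neighbor (ANN) algorithm
        * Overview of ANN algorithms: https://www.pinecone.io/learn/vector-indexes/
        * ANN index tools include FAISS, Hnswlib, NMSLIB, Annoy, etc.; FAISS and HNSW are recommended
        * But choosing the IR system/database matters more than choosing the ANN algorithm
        * Beyond the algorithm, a database also handles hosting, storing data, sparse + dense retrieval, CRUD, and scaling
* Embedding database
    * Elasticsearch, Postgres, and Redis can also run NN/ANN search
    * But they do not support complicated queries or the highest scale
    * They may already be enough for you, with no need for a dedicated embedding database
    * Document splitting
        * Needed when a document exceeds the embedding model's length limit
        * Split into chunks on a separator (\n)
        * More advanced: make each chunk semantically meaningful and coherent
        * See the LangChain text splitters documentation
    * Query language
        * Finding the most similar document sounds simple
        * But you will have other requirements
            * Filtering on other metadata (recency, etc.?)
            * When the user does not give you a comparable document to match against, e.g. only a search string, or only says they want a summary, deciding what the query should be is hard
    * Search algorithm
        * How to design the hierarchy of the index
        * For example, a long document is split into many chunks for indexing
        * These chunks may be very similar, so when you run k-NN over chunks rather than documents, the results may all be chunks from the same document. Is that good or bad?
        * It depends on your task, which is why the design matters
* Overview of off-the-shelf embedding databases
    * "Embedding mgmt" means: who is responsible for calling the embedding function, you or the database
    * Recommendations
        * Pinecone is the fastest to get set up
        * More flexible queries: Vespa or Weaviate
        * More scale/reliability: Vespa or Milvus

      ![[Pasted image 20230616012233.png]]
* Beyond naive nearest neighbor
    * Problem: your queries and docs look different, for example
        * Queries are short questions
        * Documents are long
        * The embeddings are then not that comparable: your data is very different from what the pre-trained embedding expects
        * [ ] Some solutions: https://blog.reachsumit.com/posts/2023/03/llm-for-text-ranking/
            * Joint training
            * HyDE
            * Re-ranking
    * Problem: your data already has structure
        * See LlamaIndex, which combines the two approaches
* Copilot case study
    * https://thakkarparth007.github.io/copilot-explorer/posts/copilot-internals
* Retrieval augmentation question answering pattern (QA)
    * A very common pattern right now
    * Limitation: the context has to be just the top few embedded chunks because of the token limit
    * Solution 1: keep improving search quality, using more advanced methods to find the most relevant chunks
    * Solution 2: split the context across multiple calls in a loop. For example, instead of retrieving only the top 3, retrieve the top 30; since only 3 fit at a time, make 10 calls. Each call passes the LLM the contexts plus the answer from the previous call (which turns this into a kind of chain). The prompt looks like this:
      Answer the following question based on the document provided.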
      If the answer is not in the provided document, say “I don’t know”
      Document: {{doc}}
      Question: {{query}}

      This approach is brute force, but it can raise quality. The downside is that it is slow and expensive.

### Chains

* Chains help you build more complex reasoning and work around the token limit
* Building chains of LLM calls
    * Sometimes the best context does not come from your documents
    * It comes from the output of another LLM
* Example patterns
    * The QA pattern
        * Question ➡ embedding ➡ similar docs ➡ QA prompt
    * Hypothetical document embeddings (HyDE)
        * Question ➡ document generating prompt ➡ rest of QA chain
    * Summarization
        * Document corpus ➡ apply a summarization prompt to each ➡ pass all document summaries to another prompt ➡ get global summary back
        * In other words, summarize a long text by splitting it up
* For tooling you can use LangChain, though many people roll their own chains
    * Rolling your own is not hard; the hard part is coming up with the right chain for your problem
    * LangChain ships many ready-made chains, which is convenient for prototyping
    * But when going to production, rolling your own is not difficult either

### Tools

* Tools let the LLM access the outside world
* paper: [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761)
    * Trains the LLM to learn to call APIs
    * But it covers only a few tools and requires generating a dataset
* LangChain agents
    * querying sql chain
* ChatGPT plugins

![[Pasted image 20230618014614.png]]

### Other references

* [ ] Augmented Language Models: a Survey (Mialon et al, 2023): https://arxiv.org/abs/2302.07842
* [ ] A great course about information retrieval: https://github.com/sebastian-hofstaetter/teaching
* LangChain documentation: https://python.langchain.com/en/latest/index.html

## [Project Walkthrough: askFSDL](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/askfsdl-walkthrough/)

https://github.com/the-full-stack/ask-fsdl

Walks through how to build a Discord QA chatbot: askFSDL

![[Pasted image 20230615034136.png]]

* Tech Stack
    * [langchain](https://github.com/hwchase17/langchain)
    * MongoDB hosting: [Atlas](https://www.mongodb.com/atlas/database)
    * vector: [FAISS index](https://github.com/facebookresearch/faiss)
    * serverless backend: [Modal](https://modal.com/)
    * Python UI: [Gradio](https://gradio.app/)
    * Discord bot hosting: [`discord.py`](https://discordpy.readthedocs.io/en/stable/), free-tier [AWS EC2](https://aws.amazon.com/ec2/) configured with
      [Pulumi](https://www.pulumi.com/).
    * [Gantry](https://gantry.io) to monitor the model

## [UX for Language User Interfaces](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/ux-for-luis/)

### UI Principles

* History of interfaces: language can be considered the first digital interface
* What makes a good user interface
    * Don Norman's book The Design of Everyday Things
    * Steve Krug's book Don't Make Me Think
* AI should
    * Inform and educate the user (e.g. Grammarly).
    * Provide affordances for fixing mistakes (e.g. speech-to-text on phone).
    * Incentivize user to provide feedback (e.g. Midjourney image selection, where the user picks).

### LUI Patterns

* The following patterns are compared and analyzed:
    * Click-to-complete (Playground)
        * nat.dev lets you test several LLMs side by side
    * Auto-Complete (Copilot)
        * GitHub Copilot
    * Command Palette (Replit)
        * Notion AI works this way too: the user picks a command
    * One-on-one Chat (ChatGPT)
        * Some variations:
            * Suggested follow-ups
            * Citations
            * Enriching text (Markdown rendering)
            * Plugins
            * Access to work context (chat placed in the document's sidebar)
        * As the primary app interface?
            * ChatSpot = ChatGPT + HubSpot CRM

### Case Studies

* Copilot got it right
    * Autocomplete has a lower accuracy bar, because users already expect autocomplete not to be very accurate....
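    * In this scenario, latency matters more than quality
* Bing Chat got it wrong
    * Development was rushed, and it was not properly tested against (malicious) user questions
    * One analysis of the cause: Bing Sydney was not a properly RLHF'd version at all, just a version with some fine-tuning
        * https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned?commentId=AAC8jKeDp6xqsZK2K
    * Make sure your signifiers (cues) match your affordances (what the system can actually do)
        * e.g. name your bot like a machine, not like a person

      ![[Pasted image 20230620230615.png]]
* Conclusions
    * Do proper UX research, as Copilot did
    * Watch your feedback loops; don't be like Bing Chat
    * Signifiers and affordances must match, so users don't form wrong expectations

## [LLMOps](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/llmops/)

### Choosing your base model

* For most cases, just start with GPT-4
    * If GPT-4 cannot do your task, other models are even less likely to
    * If cost or latency matters, then consider GPT-3.5 or Claude
* The current reality is that proprietary models are better
    * Open-source models also come with commercial licensing issues
    * Unless you really need open source: easier customization and data security. This area is moving fast, though.
* Only a few of the best models are trained on all four kinds of data: internet data, code, instructions, human feedback

![[Pasted image 20230621020048.png]]

* gpt-4: highest quality
* gpt-3.5-turbo: if you want faster and cheaper
* claude: the best second choice
* command-xlarge: best for fine-tuning
* claude-instant: the fastest and cheapest option
* ada, babbage, curie, command-medium: for latency- or cost-sensitive scenarios

![[Pasted image 20230621020331.png]]

* Black marks base models; the blue entries after them are instruction-tuned models
* Google T5 and Flan-T5 are the best among open-source models
* Pythia and Dolly 2.0 are recent options and reportedly decent. Note that Dolly 2.0's license does allow commercial use
* Stable LM is another recent option
* Everything below these cannot be used commercially
    * LLaMA has a rich ecosystem
    * Anything further down is not worth considering

### Iteration and prompt management

* How do you save prompts and manage them during development?
* You need better ways to evaluate prompts: robust automatic evaluation
* Three levels of management
    * Level 1: do nothing. Fine for app v0, but not enough once you are seriously building apps
    * Level 2: track prompts in Git
    * Level 3: track prompts in a specialized tool
        * Needed for running parallel evals
        * Decouples developing prompts from deploying prompts
        * Lets non-technical people, e.g. PMs, take part
        * Vendors so far are the existing ML experiment tracking tool providers, but more will enter
            * W&B
            * comet
            * mlflow

### Testing

* How do you measure whether a new model or prompt is better than the old one?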
    * It is very common for a change to improve the case you tested while making other cases worse
    * Users expect to be able to trust you to maintain performance
* The old-school ML way
    * During training, split the data into a train set and an eval set (this is where overfitting shows up)
    * In production, compare the test set against prod data
* This approach does not work for LLMs
    * You don't have the training set
    * Your production data will certainly differ from the training data
    * Quantitative metrics are hard to define
    * When user inputs vary widely, accuracy is hard to compare
* So how should you think about testing LLMs?
    * Data
        * Build an evaluation dataset for your task
            * Start building it incrementally
            * You can use an LLM to help generate more diverse test cases!
                * https://github.com/rlancemartin/auto-evaluator
            * Keep adding more data
        * Test coverage?
            * Is there a way to measure the quality of the test set?
            * Use distribution shift to measure the distance from a reference distribution
    * Metric(s)

      ![[Pasted image 20230621203232.png]]
        * The key point is that you can use LLMs to evaluate LLMs
        * Regular eval metric: Accuracy
        * Reference matching metrics: Semantic similarity, Ask another LLM
        * "Which is better" metrics: Ask an LLM
        * "Is the feedback incorporated" metrics: Ask an LLM
        * Static metrics: (1) verify that the output has the correct structure; (2) ask a model to grade it

### Deployment

* Deploying OSS LLMs is another big topic
    * https://fullstackdeeplearning.com/course/2022/lecture-5-deployment/
    * https://blog.replit.com/llm-training
* How do you improve output in production?
    * Self-critique: ask the LLM "is this the correct answer?"
    * Sample multiple times and take the best answer, or ensemble the samples
    * https://shreyar.github.io/guardrails/
    * These techniques trade more API calls (cost and latency) for quality

### Monitoring

* How do you know whether you are really solving users' problems in production?
* Watch these signals
    * Outcomes and user feedback
    * Model performance metrics
    * Proxy metrics
    * Measuring what actually goes wrong
* https://gantry.io/blog/youre-probably-monitoring-your-models-wrong/

### Continual improvement and fine-tuning

* How do you improve based on monitoring signals? Two directions
    * Make the prompt better
        * Use user feedback to find which themes
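          are not being handled well
        * Adjust the prompt
    * Fine-tune the model: two directions
        * Supervised fine-tuning
            * When you need to adapt to your specific task and in-context learning does not work well
            * When you have a lot of data and retrieval does not work well
            * When you want to cut costs by moving to a smaller model
        * Fine-tuning from human feedback: fewer companies do this; it is complex and expensive
            * RLHF/RLAIF is still hard today
    * The talk skipped a large part here; the slides have the details
    * If you use GPT-4, you probably don't need fine-tuning much

![[Pasted image 20230621211229.png]]

## [What's Next?](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/whats-next/)

### Robotics

* Multimodal is almost here, and robots are the ultimate key app
* Not just text: Transformers can also do vision (ViTs)
    * ViTs need very large datasets for training
        * ImageNet-1k
        * JFT-300M
        * JFT-4B
* GPT-4 can already do OCR https://arxiv.org/abs/2303.08774
* Google's PaLM-E is worth watching
* Open-source project: https://github.com/Vision-CAIR/MiniGPT-4
* Robots with LUIs are coming, with the ability to do internal reasoning

### Scale

* What are the limits of big and small?
    * Models can get bigger, but smaller models are better
        * LLaMA: https://arxiv.org/abs/2302.13971
        * RWKV: https://github.com/BlinkDL/RWKV-LM Will the RNN architecture make a comeback?
* What limits building stronger models?
    * Money: not the bottleneck
        * The estimated training cost of GPT-4 is about US$50 million
        * 25,000 A100s x $1 per A100 per hour x 24 hours per day x 90 days = $54,000,000, i.e. on the order of $50 million
        * For large organizations this is not a problem, and retraining is not frequent anyway
    * Compute: not really the bottleneck either (if you are willing to wait)
        * GPUs (A100) are currently scarce, and you cannot buy this capacity on the cloud
        * But big companies training models reserve datacenter capacity well in advance
    * Data: possibly the bottleneck
        * paper: [Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning](https://arxiv.org/abs/2211.04325)
        * Both data and model parameters improve performance
        * But parameter counts have grown too fast, about 10x per year, which produces the Chinchilla effect (i.e. a smaller model can perform better)
        * So right now we should be scaling data
            * No model trained on 300B tokens can beat Chinchilla.
            * https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
        * How much data is needed?
            * https://twitter.com/srush_nlp/status/1633509903611437058
            * https://twitter.com/BlancheMinerva/status/1644177571628699649?s=20
            * paper: https://arxiv.org/abs/2203.15556
* Benefits of small models
    * https://www.harmdevries.com/post/model-size-vs-compute-overhead/
    * Inference also costs money, so when you compute total cost, a small model's inference cost is lower
    * Using tools can reduce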
      parameter requirements
        * The model does not need to store Wikipedia; it only needs to be able to query it
        * A reasoning engine can be smaller than an "internet-scale database"
    * Only then is it feasible to run locally on personal devices, e.g. a Raspberry Pi
        * https://simonwillison.net/2023/Mar/11/llama/

### AGI

* Are we there already? Most people don't think so yet, but.....
* It takes time to explore what a model can do
    * A single prompt, "Let's think step by step", brings a big improvement: [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916)
* Models are able to discover their own abilities
    * [Large Language Models Are Human-Level Prompt Engineers](https://arxiv.org/abs/2211.01910)
* Models may be able to improve themselves
    * [Teaching Large Language Models to Self-Debug](https://arxiv.org/abs/2304.05128)
* Projects such as BabyAGI and AutoGPT are cool demos for now, not production-ready
* A new kind of computer?
    * https://www.beren.io/2023-04-11-Scaffolded-LLMs-natural-language-computers/
    * https://twitter.com/karpathy/status/1644183721405464576

### Security & Alignment

* Security concerns include
    * Prompt Injection
        * There is currently no way to guarantee that user input will not override your prompt
        * You can only assume your prompt will leak
        * Neither gpt-3.5 nor gpt-4 solves this
        * Prompts can also be contaminated by external resources
            * With ChatGPT plugins, users may be led to malicious links
    * Jailbreaking
        * https://aiadventure.spiel.com/carpet LLM game
        * https://www.jailbreakchat.com/
    * Agentic Dangers
        * Users deliberately use eval to execute GPT-generated code.... (in a loop)
* Discussion of the risks of developing more powerful models
    * LLMs are very good at writing code, so some people use them to do harm
        * https://cdn.openai.com/papers/gpt-4-system-card.pdf
        * https://threatpost.com/stuxnets-first-five-victims-provided-path-to-natanz/109291/
    * The "The only way out is through" theory
        * Yes, superintelligent AGI is dangerous. That's why the good guys need to build it first.
        * OpenAI's approach: https://openai.com/blog/our-approach-to-alignment-research

## Invited Talks: [Harrison Chase: Agents](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/chase-agents/)

Slides from the author of LangChain: https://fsdl.me/2023-llmbc-slides-chase

* What are agents?
    * Use the LLM as a reasoning engine
    * A non-deterministic sequence of actions
    * Can handle errors and carry out multi-step tasks
* Basic implementation logic
    * Choose which tool to use
    * Observe the tool's output
    * Repeat until a condition is met
* Stopping conditions
    * Decided by the LLM
    * Hard-coded rules
* ReAct
* Discusses the challenges of agents
* AutoGPT, BabyAGI
* https://www.camel-ai.org/
* paper: [Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442)

## Invited Talks: [Reza Shabani: How to train your own LLM](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/shabani-train-your-own/)

Slides: https://docs.google.com/presentation/d/13yrWx4eSLkne8d0s9iIqTEBWr8wRgG9nBdmSHLIOu84/edit#slide=id.g12407b231b9_0_43

## Invited Talks: [Fireside Chat with Peter Welinder](https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/welinder-fireside-chat/)
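
Side note, going back to the Harrison Chase agents notes above: the basic loop (choose a tool, observe its output, repeat until a stopping condition set either by the LLM or by a hard-coded rule) can be sketched roughly as below. This is only a minimal illustration, not LangChain's implementation. `TOOLS`, `fake_llm`, and `run_agent` are hypothetical names made up for this sketch, and `fake_llm` is a scripted stand-in for a real model call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools: plain Python functions keyed by name (assumption for this sketch).
TOOLS: Dict[str, Callable[[str], str]] = {
    "upper": lambda text: text.upper(),
    "reverse": lambda text: text[::-1],
}

# One history entry: (tool name, tool input, observation)
Step = Tuple[str, str, str]


def fake_llm(task: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM call; returns (action, action_input).

    A real agent would prompt a model with the task and the history.
    This scripted version just walks through a fixed two-step plan.
    """
    plan = ["upper", "reverse"]
    if len(history) < len(plan):
        last_output = history[-1][2] if history else task
        return plan[len(history)], last_output
    return "finish", history[-1][2]


def run_agent(task: str, llm=fake_llm, max_steps: int = 5) -> str:
    history: List[Step] = []
    for _ in range(max_steps):  # hard-coded stopping rule: at most max_steps tool calls
        action, action_input = llm(task, history)  # 1. choose which tool to use
        if action == "finish":  # stopping condition decided by the "LLM"
            return action_input
        observation = TOOLS[action](action_input)  # 2. run the tool, observe its output
        history.append((action, action_input, observation))  # 3. repeat with more context
    # Step budget exhausted: fall back to the last observation (or the task itself)
    return history[-1][2] if history else task


print(run_agent("abc"))
```

Both stopping conditions from the notes appear here: the `"finish"` action models the LLM deciding it is done, and `max_steps` is the hard-coded rule that keeps a confused agent from looping forever.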