> Welcome to subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse the [[Generative AI Engineer 知識庫]]

Course URL: https://www.edx.org/professional-certificate/databricks-large-language-models

Course impressions:

* This is the second course, following [[edX LLM Application through Production]], and it covers the LLM models themselves
* The content is quite hard; the models are deep-learning material, and I finished the course with only a partial understanding in places
* Learning PEFT was very helpful and gave me a better grasp of fine-tuning techniques
* The MLLM part was also excellent and gave me a first look at that research area

> 2023/9 Verified Certificate earned: https://courses.edx.org/certificates/b6c5d185cd0d4aee94fb3b12cd0c23a5
> 2023/9 Professional Certificate earned: https://credentials.edx.org/credentials/89c341dce41244548a69aed8b4236e88

## Module 0 - Course Introduction

* Open-source LLM quality is improving rapidly, and the cost of fine-tuning is dropping just as fast
* The fundamentals of LLMs have not really changed since 2018, i.e. the modern Transformer architecture

## Module 1 - Transformer Architecture: Attention & Transformer Fundamentals

* In NLP, despite different training techniques, everyone now uses the Transformer architecture
* This module covers the Transformer block, the overall Transformer architecture, attention, and the encoder, decoder, and encoder-decoder variants
* GPT: Generative Pre-trained Transformer
  * a decoder-based transformer model

![[Pasted image 20230926161822.png]]

* Source: https://github.com/Mooler0410/LLMsPracticalGuide

### Lab 1

Build a Transformer architecture (encoder and decoder) with PyTorch:

* Encode natural language: word embeddings and positional encoding
* Build a decoder architecture
* Build a single-layer transformer
* Build a multi-layer version
* (The above only constructs the architecture; no training is run, so the predictions are essentially random)
* Load the pre-trained GPT-2 model and run predictions
* Build an encoder transformer
* Explore word embeddings and compare the results against BERT
* Practice Masked Language Modeling (MLM)
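To make the Lab 1 steps concrete, here is a minimal PyTorch sketch of the kind of untrained decoder stack the lab builds: sinusoidal positional encoding plus one masked self-attention block feeding a vocabulary head. This is not the lab's actual code; the dimensions, class names, and the toy vocabulary are illustrative only.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding added to token embeddings."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class DecoderBlock(nn.Module):
    """One decoder-style block: masked self-attention + feed-forward, each with residual + LayerNorm."""
    def __init__(self, d_model: int = 128, n_heads: int = 4, d_ff: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

# Untrained toy model: embeddings -> positional encoding -> one decoder block -> vocab logits
vocab_size, d_model = 1000, 128
embed = nn.Embedding(vocab_size, d_model)
pos = PositionalEncoding(d_model)
block = DecoderBlock(d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 10))    # fake token ids
logits = lm_head(block(pos(embed(tokens))))       # (1, 10, vocab_size)
print(logits.shape)  # predictions are random because nothing has been trained
```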
### Resources

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/)
- [A Mathematical View of Attention Models in Deep Learning](https://people.tamu.edu/~sji/classes/attn.pdf)
- [What Is a Transformer Model?](https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/)
- [Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
- [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- [BookCorpus](https://en.wikipedia.org/wiki/BookCorpus) and [Gpt-2-output-dataset](https://github.com/openai/gpt-2-output-dataset) (WebText)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Large Language Models and the Reverse Turing Test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10177005/)

## Module 2 - Efficient Fine-Tuning

https://learning.edx.org/course/course-v1:Databricks+LLM102x+2T2023/home

* Transfer learning is a slightly broader concept than fine-tuning, though the two terms are often used interchangeably
  * fine-tuning specifically means training a pre-trained model further and/or on different data
  * transfer learning means applying a pre-trained model to a new task
* Fine-tuning = adjusting the foundation model's weights
* Fine-tuning can update only a subset of the model's parameters instead of all the weights
  * you can add layers or adjust only the top layers, which makes fine-tuning more efficient
  * full fine-tuning (updating all weights) can indeed perform better, but it is resource-intensive
  * and it may make the model forget what it originally knew
* X-shot learning also counts as a kind of fine-tuning in the broad sense of adapting a foundation model
* Prompt design = prompt engineering, also called hard/discrete prompt tuning, i.e. in-context learning
  * no model weights need to be changed

![[Pasted image 20230926161221.png]]

* Example: the Goat model https://arxiv.org/abs/2305.14201 is currently the best fine-tuned model for arithmetic tasks, outperforming both PaLM and GPT-4

### PEFT (Parameter-efficient fine-tuning)

* Three categories
  * additive: add new layers while the original foundation model weights stay frozen
    * Prompt Tuning
  * selective (results are poor, so the course does not go into detail)
  * re-parameterization
    * LoRA
* Additive: Prompt Tuning (and prefix tuning)
  * a task-specific soft prompt made of virtual tokens (not human-readable text) is prepended to your hard prompt
  * https://arxiv.org/abs/2104.06599
  * it is called prompt tuning (rather than model tuning) because only the prompt weights are updated
  * different tasks can swap in different task prompts
  * https://aclanthology.org/2021.emnlp-main.243.pdf
    * on models larger than ~11B parameters, prompt tuning catches up with full model tuning in performance
    * ![[Pasted image 20230926172914.png]]
    * prompt length matters little once the model is large
    * ![[Pasted image 20230926173233.png]]
  * Advantages
    * unlike few-shot prompting, which fits only a handful of examples, this can use the whole training set
    * the prompt is learned automatically; no manual prompt writing
    * less prone to over-fitting a specific task, because the foundation model is untouched
    * https://blog.research.google/2022/02/guiding-frozen-language-models-with.html
  * Drawbacks
    * hard to interpret
    * performance can be unstable (the "Instability" in the figure above)
  * A related technique: prefix tuning
    * the prefix is added to every transformer block rather than only to the input embedding layer
    * https://arxiv.org/abs/2101.00190
    * https://lightning.ai/pages/community/article/understanding-llama-adapters/
* Low-Rank Adaptation (LoRA)
  * requires some linear algebra to understand
  * decomposes the weight update into lower-rank matrices
  * for example, a 100×100 matrix becomes a 100×2 matrix times a 2×100 matrix
  * the trainable parameter count drops from 100×100 = 10,000 to (100×2) + (2×100) = 400
  * that is a (10,000 − 400) / 10,000 = 96% reduction in trainable parameters
  * paper: https://arxiv.org/abs/2106.09685
    * compared with full fine-tuning's 175,255.8M parameters, LoRA tunes only 37.7M, about 0.02% of the parameters, and still performs better
  * Advantages
    * most weights stay unchanged
    * greatly improves training efficiency
    * can be combined with other PEFT methods
  * Limitations
    * adapters cannot easily be swapped for multi-task serving
    * newer PEFT techniques reduce the number of trainable parameters even further than LoRA: https://arxiv.org/abs/2205.05638
* Open issues
  * compared with full fine-tuning, the various PEFT methods are still unstable and do not always reach the desired performance
  * very sensitive to hyperparameters
  * variants that tune weights in different places: soft prompt transfer https://arxiv.org/abs/2110.07904
  * full fine-tuning with reduced memory usage: LOMO https://arxiv.org/abs/2306.09782
  * fine-tuning does not always make inference more efficient
    * it does not reduce the storage needed for the foundation model
    * it does not reduce the time complexity of training; full forward and backward passes are still required

### Data preparation best practices

* Good models come from good training data
* Datasets
  * C4 https://huggingface.co/datasets/c4
  * Pile https://pile.eleuther.ai/
  * the finance model BloombergGPT https://arxiv.org/abs/2303.17564
* How much fine-tuning data is needed?
  * https://arxiv.org/abs/2305.11206 suggests roughly 1,000 high-quality examples
  * beyond quantity, diversity also matters
  * https://platform.openai.com/docs/guides/fine-tuning OpenAI suggests at least a few hundred examples
  * synthetic data can be used to generate more examples
    * but training on LLM output mostly teaches style and instruction-following, not content knowledge
    * https://arxiv.org/abs/2305.15717
    * knowledge acquisition still happens mainly during pre-training

> A Zhihu post explaining these two papers: https://zhuanlan.zhihu.com/p/633171715

### Lab 2

Fine-tuning with Hugging Face's PEFT library:
https://huggingface.co/docs/peft/index
https://github.com/huggingface/peft

* A causal language model is also called an auto-regressive model
* The pre-trained model used for the demo is BloomZ-560m
* Fine-tune with Prompt Tuning
  * random initialization
  * text initialization (a human-written prompt is used to initialize the soft prompt)
  * experiments show little difference between the two initializations, so random is recommended
* Share the model to the Hugging Face Hub
* Fine-tune with LoRA
* The fine-tuning run in this lab takes about 15 minutes
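As a rough sketch of what the Lab 2 flow looks like with Hugging Face `peft` (the exact lab code and training arguments are not reproduced here; the config values below are illustrative, and `bigscience/bloomz-560m` is assumed to be the BloomZ-560m checkpoint the lab refers to):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import (LoraConfig, PromptTuningConfig, PromptTuningInit,
                  TaskType, get_peft_model)

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt tuning: learn a soft prompt of virtual tokens, keep the base model frozen.
pt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,   # or PromptTuningInit.TEXT with prompt_tuning_init_text=...
    num_virtual_tokens=8,
)
pt_model = get_peft_model(base_model, pt_config)
pt_model.print_trainable_parameters()             # only the virtual-token embeddings are trainable

# LoRA: inject low-rank update matrices, keep the original weights frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=32, lora_dropout=0.05,
)
lora_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_name), lora_config
)
lora_model.print_trainable_parameters()
```

Either wrapped model can then be trained with the usual `transformers` Trainer, and only the small adapter/prompt weights need to be saved or pushed to the Hub.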
### Resources

- [What’s in Colossal Clean Common Crawl (C4) dataset](https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/)
- [LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239)
  - LaMDA is a family of dialog models. The authors found that fine-tuning the model with a classifier on some crowdsourced annotated data can improve model safety.
- [Gorilla: Large Language Model Connected with Massive APIs](https://gorilla.cs.berkeley.edu/)
- [Interpretable Soft Prompts](https://learnprompting.org/docs/trainable/discretized)
  - By performing prompt tuning on initialized text – e.g. "classify this sentiment" – the resulting prompt embeddings might become nonsensical, yet this nonsensical prompt can give better performance on the task.
- [Continual Domain-Adaptive Pre-training](https://arxiv.org/pdf/2302.03241.pdf)
- [Foundation Models for Decision Making: Problems, Methods, and Opportunities](https://arxiv.org/pdf/2303.04129.pdf)
- [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/?utm_source=substack&utm_medium=email)
  - Using a simple compressor, like gzip with a KNN classifier, can outperform BERT on text classification. The method also performs well in few-shot settings.
- [The False Promise of Imitating Proprietary LLMs](https://arxiv.org/abs/2305.15717)
- [Ahead of AI: LLM Tuning & Dataset Perspectives](https://magazine.sebastianraschka.com/p/ahead-of-ai-9-llm-tuning-and-dataset)
- [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/pdf/2306.04751.pdf)
- [AlpaGasus: Training A Better Alpaca with Fewer Data](https://arxiv.org/abs/2307.08701)
  - More data for fine-tuning LLMs is not necessarily better. AlpaGasus used 9k high-quality examples out of the original 52k Alpaca dataset and outperformed the original Alpaca-7B model.

## Module 3 - Deployment and Hardware Considerations

* Improve model size and speed to reduce the compute cost and latency of training and inference
* As models get larger, accuracy, alignment, and abilities improve
  * but speed, memory usage, and updatability get worse
  * so the goal is to improve speed and footprint while preserving model quality
* The context-length problem
  * attention scores grow quadratically with context length
  * this is tied to positional encoding
  * ALiBi https://arxiv.org/abs/2108.12409
    * train with a short context length, run inference with a longer one
    * can scale to 32k, 64k, even 100k

> Reference: https://zhuanlan.zhihu.com/p/525552086

  * FlashAttention https://arxiv.org/abs/2205.14135
    * uses GPU SRAM to greatly improve efficiency
  * Grouped Query Attention https://arxiv.org/abs/2305.13245
    * improves inference efficiency
* Improving the model footprint
  * Google's BF16 is more efficient than the IEEE-standard FP16/FP32 https://cloud.google.com/tpu/docs/bfloat16
  * Quantization https://arxiv.org/abs/2208.07339
    * computing in integer form is faster
  * QLoRA https://arxiv.org/abs/2305.14314
    * i.e. quantized LoRA, used for fine-tuning
* Mixture-of-Experts
  * the MoE multi-task learning framework: the input goes to a router (gating function), which dispatches it to one of several trained expert networks
  * Switch Transformers: https://arxiv.org/abs/2101.03961
* LLM cascades
  * FrugalGPT https://arxiv.org/abs/2305.05176
  * the prompt goes to a small model first; if the small model's confidence is too low, it is forwarded to a larger model
  * the goal is to reduce inference cost
* Best practices
  * Numbers every LLM Developer should know https://github.com/ray-project/llm-numbers

![[Pasted image 20230928213624.png]]

* Guest talk by Abhinav Venigalla of MosaicML (since acquired by Databricks): How we built MPT-7B and MPT-30B

![[Pasted image 20230928214315.png]]

### Lab 3

* Write a simple quantization routine by hand: converting floats to integers
* Demonstrate PyTorch's quantization utilities QuantStub and DequantStub
* Comparing before and after, the quantized model is only about 27% of the original size, which is consistent with the weights going from 32-bit to 8-bit
* Quantization can be used at inference time as well as during training
* Implement a simple mixture-of-experts (MoE) LLM system
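A hand-rolled quantizer of the kind Lab 3 asks for might look like the following sketch: a per-tensor affine mapping from float32 to int8 and back. The function names and the toy tensor are illustrative, not the lab's code.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Map a float tensor onto int8 with a per-tensor scale and zero point (affine quantization)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover an approximate float tensor from the int8 representation."""
    return (q.float() - zero_point) * scale

w = torch.randn(4, 4)                     # pretend these are model weights
q, scale, zp = quantize_int8(w)
w_hat = dequantize_int8(q, scale, zp)

print("max abs error:", (w - w_hat).abs().max().item())
# Storage drops from 32 bits to 8 bits per weight, which is why the lab's
# quantized model ends up at roughly a quarter of the original size.
```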
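And a toy stand-in for the lab's simple MoE LLM system, assuming the `transformers` pipelines named below: a zero-shot classifier acts as the gating function and routes each prompt to one of two "expert" generators. In a real MoE such as Switch Transformers the experts are sub-networks selected per token inside one model, not separate models; this only illustrates the router-plus-experts idea.

```python
from transformers import pipeline

# Gating function: a zero-shot classifier decides which "expert" should handle the prompt.
router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Each "expert" here is just a small generation pipeline (illustrative choice of models).
experts = {
    "sports":  pipeline("text-generation", model="gpt2"),
    "science": pipeline("text-generation", model="distilgpt2"),
}

def moe_generate(prompt: str) -> str:
    result = router(prompt, candidate_labels=list(experts.keys()))
    best_label = result["labels"][0]              # highest-scoring expert
    expert = experts[best_label]
    return expert(prompt, max_new_tokens=30)[0]["generated_text"]

print(moe_generate("The home team scored in the final minute and"))
```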
### Resources

- [We’re getting a better idea of AI’s true carbon footprint](https://www.technologyreview.com/2022/11/14/1063192/were-getting-a-better-idea-of-ais-true-carbon-footprint/)
- [Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model](https://arxiv.org/pdf/2211.02001.pdf)
- [Mosaic LLMs (Part 2): GPT-3 quality for <$500k](https://www.mosaicml.com/blog/gpt-3-quality-for-500k) and [ChatGPT and generative AI are booming, but the costs can be extraordinary](https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html)
- [When AI’s Large Language Models Shrink](https://spectrum.ieee.org/large-language-models-size)

## Module 4 - Beyond Text-Based LLMs: Multi-Modality

* The Transformer architecture is general-purpose; it can also handle images and audio
* A Survey on Multimodal Large Language Models: https://arxiv.org/abs/2306.13549
* Examples: OpenAI's Whisper, DALL-E, CLIP
* Application: Video-LLaMA
  * https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
* Application: MiniGPT-4
  * https://minigpt-4.github.io/
* CoT (chain of thought)
  * frame-by-frame reasoning https://arxiv.org/abs/2305.13903
  * generate a textual reasoning trace about the image first, then the final answer https://arxiv.org/abs/2302.00923
* PandaGPT takes both images and audio
  * https://panda-gpt.github.io/
* HuggingGPT uses tools
  * https://huggingface.co/spaces/microsoft/HuggingGPT

### ViT (2021)

* The Transformer is a general-purpose sequence-processing tool
  * images, audio, music, video, game actions, proteins... and so on
  * cross-attention can bridge different modalities
  * Stable Diffusion bridges text and images https://stability.ai/blog/stable-diffusion-announcement
* Computer vision
  * previously dominated by CNNs
* Vision Transformer (ViT), 2021
  * paper: https://arxiv.org/abs/2010.11929
  * color images are 3-D tensors
    * grayscale images are 2-D; with large image collections, grayscale saves compute
    * https://towardsdatascience.com/what-is-a-tensor-in-deep-learning-6dedd95d6507
    * https://www.kdnuggets.com/2020/01/convert-picture-numbers.html
  * ViT splits the image into fixed-size patches, linearly projects each patch to a vector, adds positional embeddings, and feeds the sequence to a Transformer
    * a patch plays the role of a sub-word in an LLM
  * ViT only surpasses ResNets when trained on larger datasets
    * and ViT trains about four times faster than the CNN
  * follow-up papers (2021):
    * https://arxiv.org/abs/2103.14030
    * https://arxiv.org/abs/2105.01601 (neither a ViT nor a CNN)
  * ViT is not a revolution, just an evolution
    * the professor notes that CNNs can solve the same problems, but ViT has a compute-efficiency advantage because it maps well onto the hardware

### Audio

* Audio is frequency over time, which can be turned into embedding vectors and fed into a Transformer
  * the challenge is that the context length gets very long, because recordings are long
* Speech Transformer paper (2018): https://ieeexplore.ieee.org/document/8462506
* Review paper: https://arxiv.org/abs/2305.00359
  * Transformers were not used for speech before 2019
  * and earlier work only handled text-to-speech, speech-to-text, or speech-to-speech
* Compared with text and images, speech training data is even harder to obtain
* Meta AI's data2vec, the first high-performance self-supervised algorithm that works across speech, vision, and text, also covering aspects such as emotion, intonation, and speaker identification: https://ai.meta.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text/

![[Pasted image 20230929211425.png]]

### Training Data

* Training data for text-to-audio and text-to-video is hard to obtain; researchers often have to build it themselves
  * https://maxbain.com/webvid-dataset/
  * https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
  * https://arxiv.org/abs/2305.13903
    * Structured Scene Description (FAMOuS)
  * LLaVA
    * https://llava.hliu.cc/
    * https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/viewer/liuhaotian--LLaVA-CC3M-Pretrain-595K/train?row=0
    * uses model-generated data https://arxiv.org/abs/2304.08485
* LAION-5B is currently the best open multimodal dataset https://laion.ai/blog/laion-5b/
  * but it is restricted to research use
  * it contains copyrighted images

### X-shot learning: Computer vision

![[Pasted image 20230929210925.png]]

* Advantage: it mitigates the shortage of training data; far less labeled data is needed
* Zero-shot learning
  * CLIP (OpenAI) https://openai.com/research/clip
  * performs much better on non-ImageNet data
  * its biggest limitation is inflexibility: it cannot generate text, it can only score the probability of candidate captions
* Few-shot (2022)
  * Flamingo (DeepMind) https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
  * paper: https://arxiv.org/abs/2204.14198
  * can generate free-form text
  * usage: (image/video + text) + visual query

![[Pasted image 20230929224212.png]]

### X-shot learning: Audio

* OpenAI Whisper uses an encoder-decoder transformer
  * a CNN front-end reduces the dimensionality of the audio
  * audio is split into 30-second frames
* Whisper's error rate on English speech recognition is already close to human level
  * the test dataset is LibriSpeech http://www.openslr.org/12
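For reference, running the Whisper model described above locally takes only a few lines. This is a hedged sketch assuming the open-source `openai-whisper` package (the same checkpoints are also on Hugging Face); `audio.mp3` is a placeholder path, not course material.

```python
import whisper  # pip install openai-whisper

# Whisper is an encoder-decoder transformer; audio is split into 30-second
# chunks and turned into log-Mel spectrograms before reaching the encoder.
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")      # placeholder audio file
print(result["text"])

# Equivalent via Hugging Face transformers:
# from transformers import pipeline
# asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
# print(asr("audio.mp3")["text"])
```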
### Challenges

* MLLMs still share the limitations of LLMs
  * hallucination
  * prompt sensitivity, context limits
  * inference compute cost
  * bias, toxicity, etc.
  * copyright issues
  * lack of common sense
* Beyond attention, what other directions are worth watching?
  * RLHF
    * https://magazine.sebastianraschka.com/p/ahead-of-ai-6-train-differently
    * build a reward model
  * Hyena Hierarchy https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
  * Retentive Networks https://arxiv.org/abs/2307.08621

### New applications

* DreamFusion generates 3D objects https://dreamfusion3d.github.io/
* Make-A-Video generates video https://makeavideo.studio/
* PaLM-E for robotics https://palm-e.github.io/#demo
* AlphaCode writes code https://arxiv.org/abs/2203.07814
* Multi-lingual models: Bactrian-X https://github.com/mbzuai-nlp/Bactrian-X
* Textless NLP https://ai.meta.com/blog/textless-nlp-generating-expressive-speech-from-raw-audio/
  * generates speech directly from raw audio, without needing text
  * very useful for low-resource languages that have no written form
* AlphaFold for proteins
  * https://www.deepmind.com/blog/alphafold-reveals-the-structure-of-the-protein-universe
  * https://www.deepmind.com/research/highlighted-research/alphafold/timeline-of-a-breakthrough
* Gato: a generalist AI agent https://www.deepmind.com/blog/a-generalist-agent

### Lab 4

* Train an image captioning model (generate a text description from a photo)
  * the dataset is sbu_captions https://huggingface.co/datasets/sbu_captions
  * uses the Vision Encoder Decoder Model approach https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder#overview, which can pair any transformer-based vision model with any text model
  * the vision pre-trained model is https://huggingface.co/google/vit-base-patch16-224-in21k as the encoder
  * the text pre-trained model is gpt2 as the decoder
* Demonstrate the off-the-shelf BLIP model https://huggingface.co/docs/transformers/model_doc/blip
* Exercise: zero-shot video classification
  * based on https://github.com/NielsRogge/Transformers-Tutorials/blob/master/X-CLIP/Zero_shot_classify_a_YouTube_video_with_X_CLIP.ipynb
  * uses the X-CLIP model https://huggingface.co/docs/transformers/model_doc/xclip
    * https://huggingface.co/microsoft/xclip-base-patch16-zero-shot
  * videos are downloaded with pytube https://pytube.io/en/latest/index.html
  * videos are loaded and processed with https://github.com/dmlc/decord
* Run zero-shot image classification with https://huggingface.co/openai/clip-vit-base-patch32
  * given an image, produce probabilities over candidate text captions
* Use the OpenAI Whisper API
  * https://github.com/openai/whisper/discussions/categories/show-and-tell
  * https://huggingface.co/spaces/aadnk/whisper-webui
  * https://github.com/openai/whisper/discussions/264
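A minimal sketch of the Lab 4 captioning setup: pairing a ViT encoder with a GPT-2 decoder via `VisionEncoderDecoderModel`. The lab fine-tunes this combination on sbu_captions; without that training the generated caption is meaningless, so this only shows the wiring. The example image URL is an arbitrary COCO image, not the lab's data.

```python
import requests
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

# Tie a ViT encoder to a GPT-2 decoder; the new cross-attention weights are
# randomly initialized, so captions are garbage until the model is fine-tuned.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
caption_ids = model.generate(pixel_values, max_new_tokens=20)
print(tokenizer.decode(caption_ids[0], skip_special_tokens=True))
```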
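And the zero-shot image classification exercise with CLIP, which scores how well each candidate caption matches the image rather than generating text (the candidate labels below are made up for illustration):

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # example image
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # probability of each caption
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```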
### Resources

- [Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts](https://arxiv.org/pdf/2307.11661.pdf)
- [EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention](https://arxiv.org/abs/2305.07027)
- [Key-Locked Rank One Editing for Text-to-Image Personalization](https://arxiv.org/pdf/2305.01644.pdf)
  - This paper describes text-to-image generation being done with a model that is 100KB in size. Maybe size isn't everything.
- [AudioCraft by MetaAI](https://audiocraft.metademolab.com/)
  - Meta AI released this code base for generative audio needs in early August 2023. It can model audio sequences and capture the long-term dependencies in the audio.
- [X-ray images with LLMs and vision encoders](https://arxiv.org/abs/2308.01317)
- [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284)
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
  - This is the original paper on Stable Diffusion.
- [The Illustrated Stable Diffusion by Jay Alammar](https://jalammar.github.io/illustrated-stable-diffusion/)
  - This blog post illustrates how the Stable Diffusion model works.
- [All are Worth Words: A ViT Backbone for Diffusion Models](https://arxiv.org/abs/2209.12152)
  - This paper describes how to add diffusion models to a Vision Transformer.
- [RLHF: Reinforcement Learning from Human Feedback](https://huyenchip.com/2023/05/02/rlhf.html) by Huyen Chip