> Feel free to subscribe to my [AI Engineer newsletter](https://aihao.eo.page/6tcs9) and browse my [[Generative AI Engineer 知識庫]]
Course URL: https://www.edx.org/professional-certificate/databricks-large-language-models
Course takeaways:
* This is the second course, following [[edX LLM Application through Production]], and it covers the LLM models themselves
* The content is quite hard; the models are deep-learning material, and I struggled through the course with only a partial understanding
* Learning PEFT was still very helpful and gave me a better understanding of fine-tuning techniques
* The MLLM part was also excellent and gave me an initial picture of that research area
> 2023/9 Verified Certificate earned: https://courses.edx.org/certificates/b6c5d185cd0d4aee94fb3b12cd0c23a5
> 2023/9 Professional Certificate earned: https://credentials.edx.org/credentials/89c341dce41244548a69aed8b4236e88
## Module 0 - Course Introduction
* The quality of open-source LLMs is improving rapidly, and the cost of fine-tuning is dropping quickly
* The fundamentals of LLMs have not really changed since 2018, meaning the modern Transformer architecture
## Module 1 - Transformer Architecture: Attention & Transformer Fundamentals
* In NLP, although there are many different training techniques, everyone now uses the Transformer architecture
* This module covers the Transformer block, the overall Transformer architecture, attention, encoders, decoders, encoder-decoder models, etc. (see the attention sketch below)
* GPT: Generative pre-trained transformer
* decoder-based transformer model
![[Pasted image 20230926161822.png]]
* Source: https://github.com/Mooler0410/LLMsPracticalGuide
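To make the attention mechanism concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (my own illustration, not course material):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention weights sum to 1 per query
    return weights @ v                                   # weighted sum of the value vectors

# Toy example: batch of 1, sequence of 4 tokens, hidden size 8
q = k = v = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)       # torch.Size([1, 4, 8])
```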
### Lab 1
Build the Transformer architecture (encoder and decoder) with PyTorch
* Encode natural language: word embeddings and positional encoding (see the sketch after this list)
* Build the decoder architecture
* Build a single-layer Transformer
* Build a multi-layer Transformer
* (The above only builds the architecture without actually training it, so the prediction results are essentially random)
* Load the pre-trained GPT-2 model and run predictions
* Build an encoder Transformer
* Explore word embeddings, then compare the results with BERT
* Practice Masked Language Modeling (MLM)
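As a rough companion to the lab, here is a minimal sketch (my own simplification, not the lab notebook) of encoding natural language with word embeddings plus sinusoidal positional encoding:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                    # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token embeddings
        return x + self.pe[: x.size(1)]

embed = nn.Embedding(num_embeddings=10_000, embedding_dim=64)   # toy vocabulary
pos_enc = PositionalEncoding(d_model=64)
tokens = torch.randint(0, 10_000, (1, 16))                      # a batch of 16 token ids
x = pos_enc(embed(tokens))                                      # embeddings + positions
print(x.shape)  # torch.Size([1, 16, 64])
```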
### Resources
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/)
- [A Mathematical View of Attention Models in Deep Learning](https://people.tamu.edu/~sji/classes/attn.pdf)
- [What Is a Transformer Model?](https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/)
- [Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
- [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- [BookCorpus](https://en.wikipedia.org/wiki/BookCorpus) and [Gpt-2-output-dataset](https://github.com/openai/gpt-2-output-dataset) (webText)
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [Large Language Models and the Reverse Turing Test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10177005/)
## Module 2 - Efficient Fine Tuning
https://learning.edx.org/course/course-v1:Databricks+LLM102x+2T2023/home
* Transfer learning is a slightly broader concept than fine-tuning, though the two terms are often used interchangeably
* Fine-tuning specifically means further training a pre-trained model and/or training it on different data
* Transfer learning means applying a pre-trained model to a new task
* Fine-tuning = adjusting the foundation model weights
* Fine-tuning can update only a subset of the model parameters rather than all of the weights
* You can add new layers or adjust only the top layers, which makes fine-tuning more efficient (see the sketch after this list)
* Full fine-tuning, which updates all weights, can indeed perform better, but it is resource-intensive
* and it may even cause the model to forget things it already knew
* X-shot learning can also be considered a form of fine-tuning
* Prompt design = prompt engineering = also called hard/discrete prompt tuning = in-context learning
* No model weights need to be adjusted
![[Pasted image 20230926161221.png]]
* Example: the Goat model (https://arxiv.org/abs/2305.14201) is currently the best fine-tuned model for arithmetic tasks, stronger than both PaLM and GPT-4
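A minimal sketch of the "tune only a subset of the weights" idea mentioned above: freeze the pre-trained base and train only a newly added head. The model name and head here are illustrative assumptions, not course code.

```python
import torch.nn as nn
from transformers import AutoModel  # assumes the transformers library is installed

base = AutoModel.from_pretrained("bert-base-uncased")   # any pre-trained foundation model
for param in base.parameters():
    param.requires_grad = False                          # freeze all foundation weights

# A new task-specific head on top; only these weights receive gradients during fine-tuning
head = nn.Linear(base.config.hidden_size, 2)

trainable = sum(p.numel() for p in head.parameters())
frozen = sum(p.numel() for p in base.parameters())
print(f"trainable params: {trainable:,} vs frozen params: {frozen:,}")
```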
### PEFT (Parameter-Efficient Fine-Tuning)
* There are three categories
* Additive: add new layers; the original foundation model weights stay frozen
* Prompt Tuning
* Selective (does not work well, so this course does not cover it in detail)
* Re-parameterization
* LoRA
* Additive: Prompt Tuning (and prefix tuning)
* A task-specific soft prompt is a set of virtual tokens (not human-readable text) prepended to your hard prompt
* https://arxiv.org/abs/2104.06599
* It is called prompt tuning (rather than model tuning) because only the prompt weights are adjusted
* Different tasks can swap in different task prompts
* https://aclanthology.org/2021.emnlp-main.243.pdf
* On models larger than 11B parameters, prompt tuning can catch up with the performance of model tuning
* ![[Pasted image 20230926172914.png]]
* Prompt length matters little for large models
* ![[Pasted image 20230926173233.png]]
* Pros
* Unlike few-shot prompting, which can only use a handful of examples, this approach can use the entire training set
* The prompt is learned automatically; no manual prompt writing is needed
* Less prone to over-fitting to a specific task, because the foundation model is untouched
* https://blog.research.google/2022/02/guiding-frozen-language-models-with.html
* Cons
* Hard to interpret
* Performance can be unstable (the "Instability" shown in the figure above)
* A similar technique: prefix tuning
* Added to every Transformer block instead of only the input embedding layer
* https://arxiv.org/abs/2101.00190
* https://lightning.ai/pages/community/article/understanding-llama-adapters/
* Low-Rank Adaptation (LoRA)
* Requires some linear algebra to understand
* Decomposes the weight update into lower-rank matrices
* For example, a 100x100 matrix becomes a 100x2 matrix multiplied by a 2x100 matrix
* The parameter count drops from the original 100x100 = 10,000 down to (100x2) + (2x100) = 400
* That is a (10,000 - 400) / 10,000 = 96% reduction in the parameters to train (see the sketch after this list)
* paper: https://arxiv.org/abs/2106.09685
* Compared with full tuning, which updates 175,255.8M parameters, LoRA trains only 37.7M, about 0.02% of the parameters, and still performs better
* Pros
* Most weights stay unchanged
* Greatly improves training efficiency
* Can be combined with other PEFT methods
* Limitations
* Task-specific weights cannot easily be swapped in multi-task serving
* Newer PEFT techniques reduce the number of trainable parameters even further than LoRA: https://arxiv.org/abs/2205.05638
* Some open issues
* Compared with full fine-tuning, these methods are still unstable and do not always reach the desired performance
* They are very sensitive to hyperparameters
* Alternatives that adjust weights in different places: soft prompt transfer, https://arxiv.org/abs/2110.07904
* Full fine-tuning with reduced memory usage: LOMO, https://arxiv.org/abs/2306.09782
* Fine-tuning does not always make the inference stage more efficient
* It does not reduce the storage needed for the foundation model
* It does not reduce the time complexity of training; full forward and backward passes are still required
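A minimal sketch of the LoRA idea described above: the original weight matrix stays frozen while only the two small low-rank matrices are trained. This is my own illustration, not course code; the 100x100 example matches the numbers above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (x A^T B^T) * scaling, with W frozen and only A, B trainable."""
    def __init__(self, in_features: int, out_features: int, r: int = 2, alpha: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)  # frozen W
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # (r, in)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # (out, r), initialized to zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(100, 100, r=2)
frozen = layer.weight.numel()                                   # 100 x 100 = 10,000
trainable = layer.lora_A.numel() + layer.lora_B.numel()         # (2x100) + (100x2) = 400
print(frozen, trainable, f"-> {100 * (1 - trainable / frozen):.0f}% fewer trainable parameters")
```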
### Data Preparation Best Practices
* Good models come from good training data
* Datasets
* C4 https://huggingface.co/datasets/c4
* Pile https://pile.eleuther.ai/
* Finance model: BloombergGPT, https://arxiv.org/abs/2303.17564
* How much fine-tuning data is needed?
* https://arxiv.org/abs/2305.11206 suggests roughly 1,000 high-quality examples
* Besides quantity, diversity also matters
* OpenAI's guide (https://platform.openai.com/docs/guides/fine-tuning) says at least several hundred examples
* Synthetic data can be used to generate more training data
* But training on LLM output mostly teaches style and instruction-following rather than content knowledge
* https://arxiv.org/abs/2305.15717
* Learning knowledge still happens mainly during the pre-training stage
> A Zhihu post explaining these two papers: https://zhuanlan.zhihu.com/p/633171715
### Lab 2
Fine-tune using Hugging Face's PEFT library
https://huggingface.co/docs/peft/index
https://github.com/huggingface/peft
* A causal language model is also called an auto-regressive model
* The pre-trained model used for the demo is BLOOMZ-560m
* Fine-tune with Prompt Tuning (see the sketch after this list)
* Random initialization
* Text initialization (i.e. manually providing an initial prompt text for fine-tuning)
* Experiments show little difference between the two initialization methods, so the recommendation is to just use random initialization
* Share the model to the Hugging Face Hub
* Fine-tune with LoRA
* The fine-tuning training in this lab takes about 15 minutes to run
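A minimal sketch of how the Hugging Face PEFT library wires prompt tuning and LoRA onto a causal LM; the hyperparameters are illustrative guesses, not the lab's exact settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, LoraConfig, TaskType, get_peft_model

model_name = "bigscience/bloomz-560m"            # pre-trained causal LM used in the lab
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt tuning: learn a handful of virtual tokens prepended to every input
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # or PromptTuningInit.TEXT with an init text
    num_virtual_tokens=8,
)
prompt_model = get_peft_model(base_model, prompt_config)
prompt_model.print_trainable_parameters()        # only the soft prompt is trainable

# LoRA: inject low-rank matrices into the attention projections
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.05)
lora_model = get_peft_model(AutoModelForCausalLM.from_pretrained(model_name), lora_config)
lora_model.print_trainable_parameters()
```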
### Resources
- [What’s in Colossal Clean Common Crawl (C4) dataset](https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/)
- [LaMDA: Language Models for Dialog Applications](https://arxiv.org/abs/2201.08239)
- LaMDA is a family of dialog models. The authors found that fine-tuning the model with a classifier on some crowdsourced annotated data can improve model safety
- [Gorilla: Large Language Model Connected with Massive APIs](https://gorilla.cs.berkeley.edu/)
- [Interpretable Soft Prompts](https://learnprompting.org/docs/trainable/discretized)
- By performing prompt tuning on initialized text – e.g. “classify this sentiment” – the resulting prompt embeddings might become nonsensical. But this nonsensical prompt can give better performance on the task
- [Continual Domain-Adaptive Pre-training](https://arxiv.org/pdf/2302.03241.pdf)
- [Foundation Models for Decision Making: Problems, Methods, and Opportunities](https://arxiv.org/pdf/2303.04129.pdf)
- [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426/?utm_source=substack&utm_medium=email)
- Using a simple compressor, like gzip with a KNN classifier, can outperform BERT on text classification. The method also performs well in few-shot settings.
- [The False Promise of Imitating Proprietary LLMs](https://arxiv.org/abs/2305.15717)
- [Ahead of AI: LLM Tuning & Dataset Perspectives](https://magazine.sebastianraschka.com/p/ahead-of-ai-9-llm-tuning-and-dataset)
- [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/pdf/2306.04751.pdf)
- [AlpaGasus: Training A Better Alpaca with Fewer Data](https://arxiv.org/abs/2307.08701)
- More data for fine-tuning LLMs is not necessarily better. AlpaGasus used 9k high-quality examples out of the original 52k Alpaca dataset and outperformed the original Alpaca-7B model.
## Module 3 - Deployment and Hardware Considerations
* Improve model size and speed to reduce the computational cost and latency of training and inference
* As models get larger, accuracy, alignment, and abilities improve
* but speed, memory usage, and updatability get worse
* So the goal is to improve speed and footprint while preserving model quality
* The context length problem
* As the length grows, the attention score computation grows quadratically
* This is related to positional encoding
* ALiBi: https://arxiv.org/abs/2108.12409
* Training can use a short context length while inference uses a longer one
* Can scale up to 32k, 64k, or even 100k
> Reference: https://zhuanlan.zhihu.com/p/525552086
* FlashAttention https://arxiv.org/abs/2205.14135
* Uses GPU SRAM to greatly improve efficiency
* Grouped Query Attention https://arxiv.org/abs/2305.13245
* Improves inference efficiency
* Improving the model footprint
* Google's BF16 is more efficient than the IEEE-standard FP16 and FP32: https://cloud.google.com/tpu/docs/bfloat16
* Quantization https://arxiv.org/abs/2208.07339
* Converting to integer representations makes computation faster
* QLoRA https://arxiv.org/abs/2305.14314
* i.e. quantized LoRA, used for fine-tuning
* Mixture-of-Experts
* MoE multi-task learning framework: the input goes to a router (gating function), and multiple expert neural networks are trained
* Switch Transformers: https://arxiv.org/abs/2101.03961
* LLM Cascades
* FrugalGPT https://arxiv.org/abs/2305.05176
* The prompt is first sent to a small model; if the small model is not confident enough, it is forwarded to a larger model (see the cascade sketch just before Lab 3)
* The goal is to reduce inference cost
* Best practices
* Numbers every LLM Developer should know https://github.com/ray-project/llm-numbers
![[Pasted image 20230928213624.png]]
* Guest talk by Abhinav Venigalla of MosaicML (acquired by Databricks): How we built MPT-7B and MPT-30B
![[Pasted image 20230928214315.png]]
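A minimal sketch of the FrugalGPT-style LLM cascade mentioned above; the models and confidence scoring here are placeholder assumptions, purely for illustration:

```python
from typing import Callable, Tuple

def cascade(prompt: str,
            small_model: Callable[[str], Tuple[str, float]],
            large_model: Callable[[str], str],
            threshold: float = 0.8) -> str:
    """Try the cheap model first; escalate to the expensive one only if confidence is low."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer               # good enough, skip the expensive call
    return large_model(prompt)      # fall back to the larger model

# Toy stand-ins for real model endpoints
small = lambda p: ("42", 0.55)                  # returns (answer, self-reported confidence)
large = lambda p: "The answer is 42."
print(cascade("What is 6 x 7?", small, large))  # confidence too low -> large model answers
```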
### Lab 3
* Demonstrates writing your own quantization method: converting floating-point numbers to integers (a sketch follows this list)
* Demonstrates PyTorch's quantization features: QuantStub and DeQuantStub
* Comparing before and after, the quantized model is only about 27% of the original model size, which makes sense since the weights go from 32-bit to 8-bit
* Quantization can be used not only during training but also at inference time
* Implement a simple version of a mixture-of-experts (MoE) LLM system
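A rough sketch of the hand-rolled float-to-integer quantization idea from the lab; the affine scale/zero-point scheme below is my own simplification, not the lab's exact code:

```python
import numpy as np

def quantize(x: np.ndarray, n_bits: int = 8):
    """Affine (asymmetric) quantization of a float array to unsigned integers."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4).astype(np.float32)
q, scale, zp = quantize(weights)
print(weights)
print(dequantize(q, scale, zp))   # close to the original, with small rounding error
```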
### Resources
- [We’re getting a better idea of AI’s true carbon footprint](https://www.technologyreview.com/2022/11/14/1063192/were-getting-a-better-idea-of-ais-true-carbon-footprint/)
- [ESTIMATING THE CARBON FOOTPRINT OF BLOOM, A 176B PARAMETER LANGUAGE MODEL](https://arxiv.org/pdf/2211.02001.pdf)
- [Mosaic LLMs (Part 2): GPT-3 quality for <$500k](https://www.mosaicml.com/blog/gpt-3-quality-for-500k) and [ChatGPT and generative AI are booming, but the costs can be extraordinary](https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html)
- [When AI’s Large Language Models Shrink](https://spectrum.ieee.org/large-language-models-size)
## Module 4 - Beyond Text-Based LLMs: Multi-Modality
* The Transformer architecture is general-purpose and can also handle images and audio
* A Survey on Multimodal Large Language Models: https://arxiv.org/abs/2306.13549
* For example, OpenAI's Whisper, DALL-E, and CLIP
* Application: Video-LLaMA
* https://huggingface.co/spaces/DAMO-NLP-SG/Video-LLaMA
* Application: MiniGPT-4
* https://minigpt-4.github.io/
* CoT (chain of thought)
* Frame-by-frame reasoning: https://arxiv.org/abs/2305.13903
* Generate a textual reasoning process about the image first, then produce the final answer: https://arxiv.org/abs/2302.00923
* PandaGPT takes both images and audio as input
* https://panda-gpt.github.io/
* HuggingGPT uses tools
* https://huggingface.co/spaces/microsoft/HuggingGPT
### ViT (2021)
* The Transformer is simply a general-purpose sequence processing tool
* Images, audio, music, video, game actions, proteins, and so on
* Cross-attention can bridge different modalities
* Stable Diffusion bridges text and images: https://stability.ai/blog/stable-diffusion-announcement
* Computer vision
* Previously this was done with CNNs
* Vision Transformer (ViT) 2021
* paper: https://arxiv.org/abs/2010.11929
* Color images are 3-D tensors
* Grayscale images are 2-D; using grayscale for large image collections saves compute
* https://towardsdatascience.com/what-is-a-tensor-in-deep-learning-6dedd95d6507
* https://www.kdnuggets.com/2020/01/convert-picture-numbers.html
* ViT splits an image into fixed-size patches, linearly projects each patch into a vector, adds positional embeddings, and feeds the sequence into a Transformer (see the sketch after this list)
* A patch plays the same role as a sub-word token in an LLM
* ViT only outperforms ResNets when trained on larger datasets
* and ViT trains about four times faster than CNNs
* Follow-up papers (2021):
* https://arxiv.org/abs/2103.14030
* https://arxiv.org/abs/2105.01601 (neither a ViT nor a CNN)
* ViT is not a revolution, just an evolution
* The professor's take: CNNs can actually solve the same problems, but ViT has a computational efficiency advantage because it maps well onto the hardware
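A minimal sketch (my own, not course code) of the ViT patch-embedding step described above: split the image into fixed-size patches, linearly project each patch, and add positional embeddings. (The real ViT also prepends a learnable [CLS] token before the Transformer encoder.)

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each patch to a vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a conv with kernel = stride = patch_size is exactly a per-patch linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)          # (batch, 196, embed_dim) - a token sequence
        return x + self.pos_embed                 # add positional embeddings

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768]) - ready for a standard Transformer encoder
```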
### Audio
* Audio is frequency over time, so it can be turned into embedding vectors and fed into a Transformer
* The challenge is that the context length gets very long, because recordings are long
* Speech Transformer paper (2018): https://ieeexplore.ieee.org/document/8462506
* Review paper: https://arxiv.org/abs/2305.00359
* Before 2019, Transformers were not used in this area
* and work only covered text-to-speech, speech-to-text, or speech-to-speech
* Compared with text and images, speech training data is even harder to obtain
* Meta AI's data2vec is the first multi-modal model that also handles emotion, intonation, and speaker identification: https://ai.meta.com/blog/the-first-high-performance-self-supervised-algorithm-that-works-for-speech-vision-and-text/
![[Pasted image 20230929211425.png]]
### Training Data
* Training data for text-to-audio and text-to-video is hard to obtain, so researchers often have to build it themselves
* https://maxbain.com/webvid-dataset/
* https://github.com/OpenGVLab/InternVideo/tree/main/Data/instruction_data
* https://arxiv.org/abs/2305.13903
* Structured Scene Description (FAMOuS)
* LLaVA
* https://llava.hliu.cc/
* https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/viewer/liuhaotian--LLaVA-CC3M-Pretrain-595K/train?row=0
* Uses model-generated data: https://arxiv.org/abs/2304.08485
* LAION-5B is currently the best open-source multi-modal dataset: https://laion.ai/blog/laion-5b/
* But it is restricted to research use
* It contains copyrighted images
### X-shot learning: Computer vision
![[Pasted image 20230929210925.png]]
* Advantage: mitigates the lack of training data; much less training data is needed
* Zero-shot learning
* CLIP (OpenAI): https://openai.com/research/clip
* Performs much better on non-ImageNet data
* Biggest limitation is inflexibility: it cannot generate text, only probabilities over candidate caption texts
* Few-shot (2022)
* Flamingo (deepmind) https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
* paper: https://arxiv.org/abs/2204.14198
* Can generate free-form text
* Usage: (image/video + text) + visual query
![[Pasted image 20230929224212.png]]
### X-shot learning: Audio
* OpenAI Whisper uses an encoder-decoder Transformer (see the sketch after this list)
* Uses a CNN to reduce the dimensionality of the audio
* Splits audio into 30-second chunks
* Whisper's error rate on English speech recognition is already close to human level
* The test dataset is LibriSpeech: http://www.openslr.org/12
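A minimal sketch of transcribing audio with the open-source whisper package; the model size and file name are illustrative:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")           # encoder-decoder Transformer checkpoint
result = model.transcribe("recording.mp3")   # internally processes audio in 30-second chunks
print(result["text"])
```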
### Challenges
* MLLMs still share the limitations of LLMs
* Hallucination
* Prompt sensitivity, Context limit
* Inference compute cost
* Bias, toxicity, etc.
* Copyright issues
* Lack of common sense
* Besides attention, what other directions are worth watching?
* RLHF
* https://magazine.sebastianraschka.com/p/ahead-of-ai-6-train-differently
* Building a reward model
* Hyena Hierarchy https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
* Retentive Networks https://arxiv.org/abs/2307.08621
### New Applications
* DreamFusion generates 3D objects: https://dreamfusion3d.github.io/
* Make-A-Video generates videos: https://makeavideo.studio/
* PaLM-E robotics: https://palm-e.github.io/#demo
* AlphaCode writes code: https://arxiv.org/abs/2203.07814
* Multi-lingual models: Bactrian-X https://github.com/mbzuai-nlp/Bactrian-X
* Textless NLP https://ai.meta.com/blog/textless-nlp-generating-expressive-speech-from-raw-audio/
* Generates speech directly from raw audio, with no text needed
* Very useful for low-resource languages, including languages without a writing system
* AlphaFold for proteins
* https://www.deepmind.com/blog/alphafold-reveals-the-structure-of-the-protein-universe
* https://www.deepmind.com/research/highlighted-research/alphafold/timeline-of-a-breakthrough
* Gato, a generalist AI agent: https://www.deepmind.com/blog/a-generalist-agent
### Lab 4
* Demonstrates training an image captioning model (generating a text description from a photo)
* The dataset is sbu_captions: https://huggingface.co/datasets/sbu_captions
* Uses the Vision Encoder Decoder Models approach (https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder#overview), which can pair any Transformer-based vision model with any text model
* The pre-trained vision model is https://huggingface.co/google/vit-base-patch16-224-in21k, used as the encoder
* The pre-trained text model is GPT-2, used as the decoder
* Demonstrates using the off-the-shelf BLIP model: https://huggingface.co/docs/transformers/model_doc/blip
* Practice zero-shot video classification
* Adapted from https://github.com/NielsRogge/Transformers-Tutorials/blob/master/X-CLIP/Zero_shot_classify_a_YouTube_video_with_X_CLIP.ipynb
* Uses the X-CLIP model: https://huggingface.co/docs/transformers/model_doc/xclip
* https://huggingface.co/microsoft/xclip-base-patch16-zero-shot
* Videos are downloaded with pytube: https://pytube.io/en/latest/index.html
* Videos are loaded and processed with https://github.com/dmlc/decord
* Run zero-shot image classification with https://huggingface.co/openai/clip-vit-base-patch32 (see the sketch after this list)
* Given an image, it produces probabilities over candidate text captions
* Use the OpenAI Whisper API
* https://github.com/openai/whisper/discussions/categories/show-and-tell
* https://huggingface.co/spaces/aadnk/whisper-webui
* https://github.com/openai/whisper/discussions/264
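A minimal sketch of the zero-shot image classification step with CLIP via the transformers library; the image URL and candidate labels are illustrative:

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)    # probability of each candidate caption
print(dict(zip(labels, probs[0].tolist())))
```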
### Resources
- [Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts](https://arxiv.org/pdf/2307.11661.pdf)
- [EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention](https://arxiv.org/abs/2305.07027)
- [Key-Locked Rank One Editing for Text-to-Image Personalization](https://arxiv.org/pdf/2305.01644.pdf)
- This paper describes text-to-image generation being done with a model that is 100KB in size. Maybe size isn't everything.
- [AudioCraft by MetaAI](https://audiocraft.metademolab.com/)
- MetaAI just released this code base for generative audio needs in early August 2023. It can model audio sequences and capture the long-term dependencies in the audio.
- [X-ray images with LLMs and vision encoders](https://arxiv.org/abs/2308.01317)
- [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284)
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
- This is the original paper on Stable Diffusion.
- [The Illustrated Stable Diffusion by Jay Alammar](https://jalammar.github.io/illustrated-stable-diffusion/)
- This blog post illustrates how stable diffusion model works.
- [All are Worth Words: A ViT Backbone for Diffusion Models](https://arxiv.org/abs/2209.12152)
- This paper describes how to add diffusion models to Vision Transformer.
- [RLHF: Reinforcement Learning from Human Feedback](https://huyenchip.com/2023/05/02/rlhf.html) by Huyen Chip