* https://sebastianraschka.com/blog/2025/bpe-from-scratch.html

## Token-counting apps

* https://tiktokenizer.vercel.app/
* https://huggingface.co/spaces/Xenova/the-tokenizer-playground
* OpenAI: https://platform.openai.com/tokenizer

## Let's build the GPT Tokenizer

https://twitter.com/karpathy/status/1759996549109776702 (2024/2/21)

https://www.youtube.com/watch?v=zduSFxRajkE (watched roughly the first half; have not watched the second half yet)

code: https://github.com/karpathy/minbpe

## OpenAI uses cl100k_base

https://platform.openai.com/tokenizer

https://github.com/openai/tiktoken

## Mistral

https://github.com/mistralai/mistral-common

## Related papers

* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
  * https://x.com/karpathy/status/1789590397749957117 (2024/5/10)
  * https://arxiv.org/abs/2405.05417

## Comparing LLM tokenizer performance on Traditional Chinese

https://ihower.tw/blog/archives/11933

### Other related links

* Inspired by this post: https://www.linkedin.com/posts/peter-gostev_llm-providers-charge-you-per-token-but-their-activity-7177810257417523200-c4dp/ (2024/3)
  * However, he lumps all non-English text together, and it is unclear how the figures were calculated.
* Cohere's Command R results show a 50% difference for Chinese, which is very close to my own evaluation.
  - https://twitter.com/Prashant_1722/status/1776457732213870869
  - https://txt.cohere.com/command-r-plus-microsoft-azure/
* Breeze tokenization explanation: https://www.facebook.com/permalink.php?story_fbid=3130697160397454&id=100003716013282
* Tutorial video by the expert: https://www.youtube.com/watch?v=zduSFxRajkE
* Someone's version for code: https://twitter.com/amanrsanger/status/1771590523046051947 ??