* https://sebastianraschka.com/blog/2025/bpe-from-scratch.html
## Token-counting apps
* https://tiktokenizer.vercel.app/
* https://huggingface.co/spaces/Xenova/the-tokenizer-playground
* OpenAI: https://platform.openai.com/tokenizer
## Let's build the GPT Tokenizer
https://twitter.com/karpathy/status/1759996549109776702 (2024/2/21)
https://www.youtube.com/watch?v=zduSFxRajkE (watched roughly the first half; have not watched the second half yet)
code: https://github.com/karpathy/minbpe
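
For quick reference, here is a minimal sketch of byte-level byte pair encoding (BPE), the algorithm the video and repo are about. It assumes training starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair. The function names (`count_pairs`, `merge_pair`, `train_bpe`, `encode`, `decode`) are hypothetical and are not taken from minbpe or tiktoken; real tokenizers add regex pre-splitting and special tokens that are left out here.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in the order learned
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "aaabdaaabac"
    learned = train_bpe(sample, 3)
    tokens = encode(sample, learned)
    print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(decode(tokens, learned) == sample)
```

Each Traditional Chinese character in the Basic Multilingual Plane takes three UTF-8 bytes, so under this scheme a character costs three tokens until the training data contains enough Chinese for its bytes to be merged.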
## OpenAI uses cl100k_base
https://platform.openai.com/tokenizer
https://github.com/openai/tiktoken
## Mistral
https://github.com/mistralai/mistral-common
## Related papers
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
* https://x.com/karpathy/status/1789590397749957117 (2024/5/10)
* https://arxiv.org/abs/2405.05417
## Comparing LLM tokenizer performance on Traditional Chinese
https://ihower.tw/blog/archives/11933
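### Other related links
* Inspired by this post: https://www.linkedin.com/posts/peter-gostev_llm-providers-charge-you-per-token-but-their-activity-7177810257417523200-c4dp/ (2024/3)
* However, he lumps all non-English text together, and it is unclear how the figure was calculated
* Cohere's Command R results show a 50% difference for Chinese, which is close to my own evaluation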
- https://twitter.com/Prashant_1722/status/1776457732213870869
- https://txt.cohere.com/command-r-plus-microsoft-azure/
* Breeze tokenization explanation: https://www.facebook.com/permalink.php?story_fbid=3130697160397454&id=100003716013282
- The expert's tutorial video: https://www.youtube.com/watch?v=zduSFxRajkE
- Someone built one for code: https://twitter.com/amanrsanger/status/1771590523046051947 ??