
Alisa Liu
@alisawuffles
We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
This started with a curiosity💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. “by the way”) function as single units. Different languages can also express the same meaning in one or several words.
E.g. “math teacher” = “Mathelehrer” in German. At the extreme, Chinese *doesn’t use whitespace at all*, so its tokens can span many words — yet this has seemingly not hindered LMs like @deepseek_ai from learning it!
What can we gain from less restrictive tokenization? To find out, we developed SuperBPE🚀, which learns subword *and* superword tokens. SuperBPE dramatically improves encoding efficiency over BPE — at a fixed vocab size of 200k, SuperBPE reduces sequence length by 33% on average!
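To make the subword-vs-superword distinction concrete, here is a toy sketch (not our actual implementation). It assumes a simple two-stage setup: first learn BPE merges inside whitespace-delimited words, then lift that restriction so merges can bridge words. The function names, the regex, and the tiny corpus are all made up for illustration.

```python
import re
from collections import Counter

def merge_pair(seq, a, b):
    """Replace every adjacent (a, b) in seq with the single token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(seqs, max_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair.
    Merges never cross the boundary between two sequences in `seqs`."""
    merges = []
    for _ in range(max_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        merges.append((a, b))
        seqs = [merge_pair(s, a, b) for s in seqs]
    return merges, seqs

def train_two_stage(docs, max_merges=100):
    # Stage 1 (assumed): whitespace pretokenization, each word is its own chunk.
    words_per_doc = [re.findall(r" ?\S+", d) for d in docs]
    word_seqs = [list(w) for words in words_per_doc for w in words]
    _, word_seqs = learn_merges(word_seqs, max_merges)

    # Regroup the per-word tokens into one sequence per document.
    stage1, i = [], 0
    for words in words_per_doc:
        n = len(words)
        stage1.append([tok for w in word_seqs[i:i + n] for tok in w])
        i += n

    # Stage 2 (assumed): no pretokenization, merges may now span words.
    _, stage2 = learn_merges(stage1, max_merges)
    return stage1, stage2

docs = ["by the way it works", "by the way it helps", "by the way we agree"]
stage1, stage2 = train_two_stage(docs)
print("subword only:      ", stage1[0])
print("subword + superword:", stage2[0])
print(sum(map(len, stage1)), "tokens ->", sum(map(len, stage2)), "tokens")
```

On this toy corpus the second stage can learn a multi-word token for the repeated phrase, which is what shortens the sequences. Real vocabularies, byte-level handling, and merge budgets are far beyond what this sketch covers.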
Then we pretrain 8B models from scratch with BPE and SuperBPE🚀, fixing everything about the training setup except the tokenizer. The SuperBPE model gains +4% on avg📈 across 30 downstream tasks and wins on 25 of the 30 individual tasks, while also being 27% more efficient at inference time.
Why does SuperBPE🚀 work? We find that loss is distributed more uniformly over tokens in SuperBPE models. They are less overfit to high-frequency, easy-to-predict tokens (e.g. “way” after “By the”), and at the same time master a much broader set of language phenomena.
SuperBPE🚀 is a seamless replacement for BPE in modern LM development pipelines, requiring no changes to the model architecture or training framework. You can use it in HF right now!
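If you want to check the sequence-length savings on your own text, a rough sketch like this should do. The helper name and the two stand-in encoders are hypothetical and only there so the snippet runs offline. With Hugging Face tokenizers you would pass each tokenizer's `encode` method instead (repo IDs are deliberately left as placeholders).

```python
def sequence_length_reduction(baseline_encode, candidate_encode, texts):
    """Fraction by which the candidate encoder shortens the token sequences,
    relative to the baseline encoder, summed over all texts."""
    baseline_total = sum(len(baseline_encode(t)) for t in texts)
    candidate_total = sum(len(candidate_encode(t)) for t in texts)
    return 1 - candidate_total / baseline_total

# Stand-in encoders (illustration only, not real tokenizers).
def word_level(text):
    return text.split()

def phrase_level(text):
    # Pretends "by the way" is a single superword token.
    return text.replace("by the way", "by_the_way").split()

texts = ["by the way it works", "by the way we agree"]
reduction = sequence_length_reduction(word_level, phrase_level, texts)
print(f"{reduction:.0%} fewer tokens")

# With Hugging Face tokenizers, something along these lines (placeholders, untested):
#   from transformers import AutoTokenizer
#   bpe = AutoTokenizer.from_pretrained("<baseline-repo-id>")
#   superbpe = AutoTokenizer.from_pretrained("<superbpe-repo-id>")
#   sequence_length_reduction(bpe.encode, superbpe.encode, my_texts)
```

The number you get depends heavily on the text you feed in, so treat it as a quick sanity check rather than a reproduction of the reported averages.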
Play around with our tokenizers here! https://superbpe.github.io/ 🚀
Paper: https://arxiv.org/abs/2503.134...
HF models & tokenizers: https://tinyurl.com/superbpe
This work would not have been possible w/o co-1st 🌟@jonathanhayase🌟, and @vjhofmann @sewoong79 @nlpnoah @yejinchoinka