
Alisa Liu

@alisawuffles

Published: March 21, 2025

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words. When pretrained at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵


This started with a curiosity💡: why do all LLMs limit tokens to *parts* of whitespace-delimited words? After all, many word sequences (e.g. “by the way”) function as single units. Different languages can also express the same meaning in one or several words.

E.g. “math teacher” = “Mathelehrer” in German. At the extreme, Chinese *doesn’t use whitespace at all*, so its tokens can span many words — yet this has seemingly not hindered LMs like @deepseek_ai from learning it!

What can we gain from less restrictive tokenization? To find out, we developed SuperBPE🚀, which learns subword *and* superword tokens. SuperBPE dramatically improves encoding efficiency over BPE — at a fixed vocab size of 200k, SuperBPE reduces sequence length by 33% on average!
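A minimal sketch of how such a comparison could look with Hugging Face tokenizers (the SuperBPE repo ID below is a placeholder, not an official one; swap in a released tokenizer from the links at the end of the thread):

```python
# Sketch: compare encoded sequence lengths of a standard BPE tokenizer and a
# SuperBPE tokenizer. "PLACEHOLDER/superbpe-tokenizer" is a hypothetical repo
# ID -- substitute a real one from the released HF tokenizers.
from transformers import AutoTokenizer

text = "By the way, the math teacher said the exam is next week."

bpe = AutoTokenizer.from_pretrained("gpt2")  # ordinary subword BPE, as a stand-in baseline
superbpe = AutoTokenizer.from_pretrained("PLACEHOLDER/superbpe-tokenizer")

bpe_ids = bpe.encode(text)
super_ids = superbpe.encode(text)

print(f"BPE:      {len(bpe_ids)} tokens -> {bpe.convert_ids_to_tokens(bpe_ids)}")
print(f"SuperBPE: {len(super_ids)} tokens -> {superbpe.convert_ids_to_tokens(super_ids)}")
print(f"sequence-length reduction: {1 - len(super_ids) / len(bpe_ids):.0%}")
```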


Then we pretrain 8B models from scratch with BPE and SuperBPE🚀, fixing everything about the training setup except the tokenizer. We see +4% on avg📈 across 30 downstream tasks and win on 25 of the 30 individual tasks, while also being 27% more efficient at inference time.


Why does SuperBPE🚀 work? We find that loss is distributed more uniformly over tokens in SuperBPE models. They are less overfit to high-frequency, easy-to-predict tokens (e.g. “way” after “By the”), and at the same time master a much broader set of language phenomena.
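One way to look at this yourself is to collect per-token losses from a causal LM and inspect their spread. A rough sketch, using a small stand-in model rather than our 8B ones so it runs end-to-end:

```python
# Sketch: per-token loss distribution of a causal LM. A more uniform
# distribution (smaller spread) is the pattern described above for SuperBPE.
# gpt2 is only a stand-in model for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "By the way, the choice of tokenizer changes which tokens are easy to predict."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids=ids).logits  # (1, seq_len, vocab_size)

# NLL of each target token, predicted from the preceding context.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
nll = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

for token, loss in zip(tok.convert_ids_to_tokens(ids[0, 1:]), nll.tolist()):
    print(f"{token!r:>15}  {loss:.2f}")
print(f"mean={nll.mean().item():.2f}  std={nll.std().item():.2f}")
```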


SuperBPE🚀 is a seamless replacement for BPE in modern LM development pipelines, requiring no changes to the model architecture or training framework. You can use it in HF right now!
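For instance, loading it looks like any other tokenizer on the Hub (the repo ID below is a placeholder; pick one of the released SuperBPE tokenizers):

```python
# Sketch: SuperBPE as a drop-in tokenizer swap. No model-architecture or
# training-framework changes -- only the tokenizer object is replaced.
# "PLACEHOLDER/superbpe-tokenizer" is hypothetical; use a released repo ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PLACEHOLDER/superbpe-tokenizer")

enc = tokenizer("In the meantime, superword tokens can cover whole phrases.")
print(enc.input_ids)
print(tokenizer.convert_ids_to_tokens(enc.input_ids))
print(tokenizer.decode(enc.input_ids))
```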


Play around with our tokenizers here! https://superbpe.github.io/ 🚀
Paper: https://arxiv.org/abs/2503.134...
HF models & tokenizers: https://tinyurl.com/superbpe
This work would not have been possible w/o co-1st 🌟@jonathanhayase🌟, and @vjhofmann @sewoong79 @nlpnoah @yejinchoinka

