by Haytham ElFadeel - [email protected]
published 2024, updated 2025
<aside> ♻️
2025 Update:
I added a section on H-Net / Dynamic Chunking (2025), a newer end-to-end approach that learns segmentation jointly with the model.
</aside>
Tokenization is useful because it significantly decreases training and inference cost by shortening the effective sequence length seen by the main model. However, tokenization also comes with well-known drawbacks: sensitivity to noise, weaker character/number handling, representational biases, and extra system complexity.
This article surveys why people want to remove tokenization, and summarizes a few research directions toward tokenizer-free (or tokenizer-lite) language modeling.
Modern LLM pipelines typically look like: raw text → tokenizer (BPE / unigram) → token IDs → embedding lookup → Transformer → next-token logits over the token vocabulary → detokenizer back to text.
Tokenization is primarily a compression layer for sequence modeling: fewer “symbols” means fewer steps for the main network, which matters a lot when your backbone has quadratic cost (e.g. attention) or large per-token FFN cost.
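As a rough illustration (the numbers below are illustrative assumptions, not measurements from this article): if a tokenizer averages about 4 bytes per token, the sequence the backbone sees is ~4× shorter than the raw byte stream.

```python
# Back-of-the-envelope comparison of byte-level vs. token-level sequence cost.
# The 4 bytes/token ratio, document size, and model width are assumptions
# chosen only to illustrate the scaling, not figures from the article.

def attention_flops(seq_len: int, d_model: int) -> float:
    """Rough O(n^2 * d) cost of the self-attention score/value matmuls."""
    return 2 * seq_len**2 * d_model

def ffn_flops(seq_len: int, d_model: int, ffn_mult: int = 4) -> float:
    """Rough O(n * d^2) cost of the per-position feed-forward block."""
    return 2 * seq_len * d_model * (ffn_mult * d_model)

doc_bytes = 8_000          # an ~8 KB document
bytes_per_token = 4        # assumed BPE compression ratio
d_model = 4_096            # assumed model width

for name, n in [("byte-level", doc_bytes),
                ("token-level", doc_bytes // bytes_per_token)]:
    total = attention_flops(n, d_model) + ffn_flops(n, d_model)
    print(f"{name:>12}: seq_len={n:>5}, ~{total / 1e9:.0f} GFLOPs per layer")
```

The quadratic attention term shrinks by ~16× under a 4× length reduction, while the FFN term shrinks only linearly, which is why tokenization buys so much at long context lengths.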
Most modern tokenizers (BPE / unigram variants) build tokens from substring frequency statistics (e.g., "car" will most likely be one token, while "frequency" could be split into "freq", "ue", and "ncy").
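A quick way to see such splits is to run a few words through an off-the-shelf BPE vocabulary. The sketch below uses OpenAI's tiktoken library purely as an example (not something this article prescribes); the exact splits depend on the vocabulary, and very common words often survive as a single token.

```python
# Inspect how a frequency-trained BPE vocabulary segments a few words.
# tiktoken is an arbitrary choice for illustration; any subword tokenizer
# shows the same effect, with splits determined by its training corpus.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family BPE merges

for word in ["car", "frequency", "Beyoncé"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:12} -> {len(ids)} token(s): {pieces}")
```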
Subword tokenization is a heuristic compromise: it compresses text well, but it hard-codes a particular segmentation that may not align with semantics. Humans don't think in subwords; they think in concepts (e.g., Beyoncé, not "Bey", "once", "é"). Concepts are language- and modality-agnostic and often capture higher-level structure.
If a model could learn its own units of computation (and do so in a way that’s stable and efficient), it could allocate compute where it matters and potentially learn abstractions that generalize better across domains and languages.