by Haytham ElFadeel - [email protected]
published 2024, updated 2025
<aside> ♻️
2025 Update:
I added a section on H-Net / Dynamic Chunking (2025), a newer end-to-end approach that learns segmentation jointly with the model.
</aside>
Tokenization is useful because it significantly decreases training and inference cost by shortening the effective sequence length seen by the main model. However, tokenization also comes with well-known drawbacks: sensitivity to noise, weaker character/number handling, representational biases, and extra system complexity.
This article surveys why people want to remove tokenization, and summarizes a few research directions toward tokenizer-free (or tokenizer-lite) language modeling.
Modern LLM pipelines typically look like: raw text → tokenizer (BPE / unigram) → token IDs → embedding lookup → Transformer → next-token logits over the token vocabulary → detokenizer back to text.
Tokenization is primarily a compression layer for sequence modeling: fewer “symbols” means fewer steps for the main network, which matters a lot when your backbone has quadratic cost (e.g. attention) or large per-token FFN cost.
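As a rough illustration (the numbers below are illustrative assumptions, not measurements from this article): if a tokenizer averages about 4 bytes per token, the sequence the backbone sees is ~4× shorter than the raw byte stream.

```python
# Back-of-the-envelope comparison of byte-level vs. token-level sequence cost.
# The 4 bytes/token ratio, document size, and model width are assumptions
# chosen only to illustrate the scaling, not figures from the article.

def attention_flops(seq_len: int, d_model: int) -> float:
    """Rough O(n^2 * d) cost of the self-attention score/value matmuls."""
    return 2 * seq_len**2 * d_model

def ffn_flops(seq_len: int, d_model: int, ffn_mult: int = 4) -> float:
    """Rough O(n * d^2) cost of the per-position feed-forward block."""
    return 2 * seq_len * d_model * (ffn_mult * d_model)

doc_bytes = 8_000          # an ~8 KB document
bytes_per_token = 4        # assumed BPE compression ratio
d_model = 4_096            # assumed model width

for name, n in [("byte-level", doc_bytes),
                ("token-level", doc_bytes // bytes_per_token)]:
    total = attention_flops(n, d_model) + ffn_flops(n, d_model)
    print(f"{name:>12}: seq_len={n:>5}, ~{total / 1e9:.0f} GFLOPs per layer")
```

The quadratic attention term shrinks by ~16× under a 4× length reduction, while the FFN term shrinks only linearly, which is why tokenization buys so much at long context lengths.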
Most modern tokenizers (BPE / unigram variants) build tokens from substring frequency statistics (e.g., "car" will most likely be one token, while "frequency" could be split into "freq", "ue", and "ncy").
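A quick way to see such splits is to run a few words through an off-the-shelf BPE vocabulary. The sketch below uses OpenAI's tiktoken library purely as an example (not something this article prescribes); the exact splits depend on the vocabulary, and very common words often survive as a single token.

```python
# Inspect how a frequency-trained BPE vocabulary segments a few words.
# tiktoken is an arbitrary choice for illustration; any subword tokenizer
# shows the same effect, with splits determined by its training corpus.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family BPE merges

for word in ["car", "frequency", "Beyoncé"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:12} -> {len(ids)} token(s): {pieces}")
```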
Subword tokenization is a heuristic compromise: it compresses text well, but it hard-codes a particular segmentation that may not align with semantics. Humans don't think in subwords; they think in concepts (e.g., Beyoncé, not "Bey", "once", "é"). Concepts are language- and modality-agnostic and often capture higher-level structure.
If a model could learn its own units of computation (and do so in a way that’s stable and efficient), it could allocate compute where it matters and potentially learn abstractions that generalize better across domains and languages.