BPE Tokenizer (Byte-Pair Encoding Tokenizer)

An algorithm that splits text into subword units by repeatedly merging frequently occurring adjacent symbols. It directly affects the input/output cost and processing speed of LLMs; for low-resource languages, a lack of dedicated vocabulary entries leads to byte-level decomposition.

A BPE tokenizer (Byte-Pair Encoding tokenizer) splits text into subword units by merging frequently occurring characters and character sequences. It is a foundational technology that directly affects the input/output costs and processing speed of [LLMs (Large Language Models)](slug: llm).

How the Algorithm Works

BPE originally emerged as a data compression technique. Its application to the NLP field became the prototype for modern tokenizers. The operating principle is straightforward: all characters are first treated as individual units, and then the most frequently occurring pair among adjacent symbols is merged into a single new symbol. By repeating this operation until the vocabulary size limit is reached, a vocabulary table is produced in which frequent words remain as single tokens while rare words are decomposed into subwords or individual characters.

The specific process can be summarized as follows:

  • Corpus collection: Gather a large amount of training text and split it into individual characters
  • Frequency counting: Tally the occurrence count of adjacent pairs across the entire corpus
  • Merge operation: Add the most frequent pair to the vocabulary as a new token and replace all corresponding occurrences in the corpus
  • Iteration: Repeat merges until the configured vocabulary size is reached (e.g., 30,000–100,000 tokens)

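As a result, "running" may be split into, for example, run + ning, and "unhappiness" into un + happiness, so that even words never seen during training can be represented as a sequence of known fragments. The exact split points depend on the merges learned from the training corpus and do not always coincide with morpheme boundaries.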
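The four steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names (`train_bpe`, `merge_pair`, `encode`) and the toy corpus are assumptions made for this example, and real tokenizers add pre-tokenization, byte-level handling, and far more efficient pair counting.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Corpus collection: each distinct word becomes a tuple of characters with a count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Frequency counting: tally adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Merge operation: the most frequent pair becomes a new vocabulary entry.
        # Ties are broken by whichever pair was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical): word frequencies stand in for a large training text.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is encoded as two known fragments. Here the number of merges plays the role of the vocabulary size limit: each merge adds exactly one new token to the initial character set.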

Why Token Design Directly Affects Cost

[Tokens](slug: token) serve as the fundamental unit for all LLM billing, speed, and context length considerations. Even for the same text, the number of tokens can vary significantly depending on the quality of the vocabulary design, directly impacting [AI ROI (Return on Investment)](slug: ai-roi). When an English-centric vocabulary table is applied to Japanese text, it is not uncommon for a single kanji character to be decomposed into multiple tokens, which can cause processing costs to balloon several times over.

In the context of [multilingual NLP (Multilingual Natural Language Processing)](slug: multilingual-nlp), this problem is even more serious. Low-resource languages have inherently smaller training corpora, so their character pairs are less likely to be frequent enough to be merged, which means words tend to be decomposed into fine-grained subwords or individual characters. One approach to addressing these challenges is byte-level BPE, which builds the vocabulary over UTF-8 byte sequences rather than Unicode characters. Because any text can be expressed with the 256 possible byte values, byte-level BPE in principle eliminates unknown words, but it carries the trade-off of increasing the number of tokens per sentence for scripts with little dedicated vocabulary and making it harder for the model to learn meaningful semantic units.
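To make the byte-level fallback concrete, the short sketch below shows how many UTF-8 bytes a few sample strings occupy. The helper name `utf8_byte_tokens` is an assumption for this illustration; it returns only the base symbols a byte-level BPE would start from, before any merges are applied, so it represents the worst case for text that has no dedicated vocabulary entries.

```python
def utf8_byte_tokens(text):
    """Return the UTF-8 byte values of `text`, the base symbols of a byte-level BPE."""
    return list(text.encode("utf-8"))


# Sample strings in Latin, Japanese, and Arabic script.
for sample in ["token", "東京", "كتاب"]:
    byte_tokens = utf8_byte_tokens(sample)
    print(f"{sample!r}: {len(sample)} characters -> {len(byte_tokens)} bytes")

# The byte sequence is lossless: decoding it restores the original text.
assert bytes(utf8_byte_tokens("東京")).decode("utf-8") == "東京"
```

ASCII letters take one byte each, Arabic letters two, and most kanji three, so a string that falls all the way back to bytes can cost two to three tokens per character until merges recover longer units.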

Adoption in Major Models and Derived Methods

GPT-series models use the BPE-based "tiktoken" library, while Claude and Gemini also employ their own custom-tuned subword tokenizers. The Unigram Language Model, a probabilistic algorithm that is independent of BPE, has also come into widespread use, and SentencePiece, a toolkit implementing both the BPE and Unigram algorithms, has been adopted by many models. The choice of tokenizer at the design stage of a [foundation model](slug: foundation-model) has a significant impact on performance.

When customizing a model through [fine-tuning](slug: fine-tuning) or [PEFT](slug: peft), it is common practice to carry over the base model's tokenizer as-is. This is because adding to or modifying the vocabulary after the fact means the embedding layer must be resized and the new or changed entries trained, causing costs to spike.

Practical Considerations

When building [RAG (Retrieval-Augmented Generation)](slug: rag) pipelines, [chunk size](slug: chunk-size) is often configured based on token count. Overlooking the premise that "character count ≠ token count" can lead to context window overflow and degraded retrieval accuracy. In particular, for non-Latin-script languages such as Japanese, Chinese, and Arabic, the same number of characters can consume 2–4 times as many tokens as English, making it advisable to understand the token conversion factor for each language.
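The following sketch illustrates both points: measuring a tokens-per-character factor and chunking by token count rather than character count. The function names (`tokens_per_char`, `chunk_tokens`) are assumptions for this example, and UTF-8 bytes are used only as a stand-in tokenizer so the code runs without external libraries. In practice, the `tokenize` argument would be the encode function of the tokenizer that matches the target model.

```python
def tokens_per_char(text, tokenize):
    """Return the average number of tokens per character for `text`."""
    if not text:
        return 0.0
    return len(tokenize(text)) / len(text)


def chunk_tokens(token_ids, max_tokens, overlap=0):
    """Split a token sequence into chunks of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # the last chunk already reaches the end of the sequence
    return chunks


def byte_tokenize(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


print(tokens_per_char("hello", byte_tokenize))  # 1.0
print(tokens_per_char("東京", byte_tokenize))    # 3.0
print(chunk_tokens(list(range(10)), max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Chunking on the token sequence guarantees that each chunk fits the budget regardless of language. With a byte-level tokenizer, a chunk boundary can fall inside a multi-byte character, so chunks may need to be realigned to character boundaries before being decoded back to text.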

It has also been pointed out that a mismatch between token segmentation granularity and semantic boundaries is one contributing factor to [hallucination](slug: hallucination). When proper nouns or technical terms are split unnaturally, the risk increases that the model will reconstruct words in an incorrect context. At the practical level, considering whether to add domain-specific vocabulary at the vocabulary design stage or to standardize notation through [prompt engineering](slug: prompt-engineering) represents a realistic approach to improving accuracy.