An algorithm that splits text into subword units by iteratively merging frequent character patterns. It directly affects the input/output cost and processing speed of LLMs; for low-resource languages, an insufficient dedicated vocabulary leads to character- or byte-level decomposition.
BPE Tokenizer (Byte-Pair Encoding Tokenizer) is an algorithm that splits text into subword units by merging frequently occurring character and character-string patterns, and serves as a foundational technology that directly affects the input/output costs and processing speed of LLMs (Large Language Models).
BPE originally emerged as a data compression technique. Its application to the NLP field became the prototype for modern tokenizers. The operating principle is straightforward: all characters are first treated as individual units, and then the most frequently occurring pair among adjacent symbols is merged into a single new symbol. By repeating this operation until the vocabulary size limit is reached, a vocabulary table is produced in which frequent words remain as single tokens while rare words are decomposed into subwords or individual characters.
The specific process can be summarized as follows:

1. Split the training text into individual characters, each treated as an initial symbol.
2. Count every pair of adjacent symbols across the corpus.
3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
4. Repeat steps 2–3 until the vocabulary size limit is reached.
As a result, "running" is split into run + ning, and "unhappiness" into un + happiness, enabling even unknown words to be handled as meaningful fragments.
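The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any production tokenizer; the function name and the word-frequency input format are our own choices for the example.

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Learn BPE merge rules from a dict mapping word -> frequency.

    Each word starts as a tuple of characters; each iteration merges the
    most frequent adjacent symbol pair across the whole corpus.
    Returns (list of learned merges, final symbol-sequence vocabulary).
    """
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges, vocab
```

Run on a toy corpus such as `{"low": 5, "lower": 2, "newest": 6, "widest": 3}`, the first merges learned are high-frequency pairs like ("e", "s") and ("es", "t"), so suffixes such as "est" quickly become single symbols while rare fragments stay decomposed.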
Tokens serve as the fundamental unit for all LLM billing, speed, and context length considerations. Even for the same text, the number of tokens can vary significantly depending on the quality of the vocabulary design, directly impacting AI ROI (Return on Investment). When an English-centric vocabulary table is applied to Japanese text, it is not uncommon for a single kanji character to be decomposed into multiple tokens, which can cause processing costs to balloon several times over.
In the context of multilingual NLP (Multilingual Natural Language Processing), this problem is even more serious. Low-resource languages have inherently smaller training corpora, so frequent pairs are harder to form and words tend to be decomposed into fine-grained subwords or individual characters. One approach to these challenges is byte-level BPE, which builds its vocabulary over raw UTF-8 byte sequences. Byte-level BPE offers the versatility of eliminating unknown words entirely, but it carries the trade-off of increasing the number of tokens per sentence and making it harder for the model to learn meaningful semantic units.
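The sequence-length trade-off is easy to see at the byte level: every character is representable, but multi-byte scripts consume more base units per character before any merges are learned.

```python
# Byte-level BPE starts from UTF-8 bytes, so no character is ever "unknown" --
# but non-Latin scripts pay for that coverage in raw sequence length.
ascii_word = "cat"
kanji_word = "漢字"  # two kanji characters

print(len(ascii_word.encode("utf-8")))  # 3 bytes for 3 characters
print(len(kanji_word.encode("utf-8")))  # 6 bytes for 2 characters (3 each)
```

Unless the vocabulary contains merges covering these byte sequences, each kanji begins as three tokens, which is one source of the token-count inflation described above.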
GPT-series models use the "tiktoken" library based on BPE, while Claude and Gemini also employ their own custom-tuned subword tokenizers. In recent years, the Unigram Language Model—a probability model-based algorithm independent of BPE—has also come into widespread use, and SentencePiece, a toolkit implementing both BPE and Unigram algorithms, has been adopted by many models. The choice of tokenizer at the design stage of a foundation model has a significant impact on performance.
When customizing a model through fine-tuning or PEFT, it is common practice to carry over the base model's tokenizer as-is. This is because adding or modifying the vocabulary after the fact requires retraining the embedding layer, causing costs to spike.
When building RAG (Retrieval-Augmented Generation) pipelines, chunk size is often configured based on token count. Overlooking the premise that "character count ≠ token count" can lead to context window overflow and degraded retrieval accuracy. In particular, for non-Latin-script languages such as Japanese, Chinese, and Arabic, the same number of characters can consume 2–4 times as many tokens as English, making it advisable to understand the token conversion factor for each language.
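Chunking by token count rather than character count can be sketched as below. `count_tokens` is a placeholder for a real tokenizer's length function (for a production pipeline you would plug in the actual model tokenizer); the greedy packing strategy is one simple choice among many.

```python
def chunk_by_tokens(sentences, max_tokens, count_tokens):
    """Greedily pack sentences into chunks that stay under a token budget.

    count_tokens: callable returning the token count of a string -- a
    stand-in here for a real tokenizer's length function.
    """
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk when adding this sentence would overflow.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Crude whitespace proxy for demonstration only -- it badly undercounts
# tokens for Japanese, Chinese, or Arabic, which is exactly why a real
# tokenizer should supply count_tokens in practice.
word_count = lambda s: len(s.split())
```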
It has also been pointed out that a mismatch between token segmentation granularity and semantic boundaries is one contributing factor to hallucination. When proper nouns or technical terms are split unnaturally, the risk increases that the model will reconstruct words in an incorrect context. At the practical level, considering whether to add domain-specific vocabulary at the vocabulary design stage or to standardize notation through prompt engineering represents a realistic approach to improving accuracy.



A token is the smallest unit used by an LLM when processing text. It is not necessarily a whole word; it can include parts of words, symbols, and spaces — essentially the fragments resulting from splitting text based on the model's vocabulary.

Embedding is a technique that transforms unstructured data such as text, images, and audio into fixed-length numerical vectors while preserving semantic relationships.

BPO refers to a form of outsourcing in which a company delegates specific business processes to an external specialized vendor. AI Hybrid BPO, which combines BPO with automation leveraging AI, has been attracting significant attention in recent years.

PEFT (Parameter-Efficient Fine-Tuning) is a collective term for fine-tuning methods that adapt a large language model to a specific task with minimal computational resources and data, by updating only a subset of the model's parameters rather than all of them.