Chunk size refers to the size (in tokens or characters) of the units into which documents are split when stored in a vector store within a RAG pipeline. It is a critical parameter that directly affects retrieval accuracy and answer quality.
LLMs have an upper limit on their context window. Since hundreds of pages of internal manuals cannot be passed in as-is, documents must be split into appropriately sized units (chunking), vectorized, and only the sections relevant to a query retrieved. "How large to make each cut" is the question of chunk size.
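As a minimal sketch of the splitting step described above (the function name and the character-based unit are illustrative; production pipelines typically split on tokens and respect sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into fixed-size character chunks.
    Token-based splitting works the same way on a list of token IDs."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

doc = "A" * 1200  # stand-in for a long internal manual
chunks = chunk_text(doc, chunk_size=500)
print([len(c) for c in chunks])  # → [500, 500, 200]
```

Each chunk would then be embedded and stored in the vector store, so the chunk boundaries chosen here determine what a single retrieval hit can contain.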
If chunks are too small, a single chunk lacks sufficient context, meaning that even when retrieved, it may not contain the information the LLM needs to construct an answer. Conversely, if chunks are too large, irrelevant information enters as noise, degrading answer accuracy while also increasing token costs.
Generally, 256–1,024 tokens is a common starting point, but the optimal value depends on the domain and the nature of the queries: for short Q&A content such as FAQs, a smaller size works well, while for documents where surrounding context matters, such as technical specifications, a larger size is standard practice.
To mitigate the problem of context being cut off at chunk boundaries, "overlap"—partially duplicating adjacent chunks—is commonly used. For example, with a chunk size of 512 tokens and an overlap of 64 tokens, the last 64 tokens of the previous chunk are also included at the beginning of the next chunk. This contributes to improved accuracy in BM25 and vector search, though storage and index size increase as a result.
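The 512-token chunk / 64-token overlap scheme described above can be sketched as a sliding window (the function name is illustrative; placeholder strings stand in for real tokens):

```python
def chunk_with_overlap(tokens: list[str], size: int = 512,
                       overlap: int = 64) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap`,
    so each chunk repeats the last `overlap` tokens of the previous one."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, len(tokens), step)
            if tokens[i:i + size]]

tokens = [f"t{i}" for i in range(1000)]  # placeholder token list
chunks = chunk_with_overlap(tokens, size=512, overlap=64)
# The first chunk covers tokens 0..511; the second starts at 448 (512 - 64),
# so the boundary region appears in both chunks.
print(len(chunks), chunks[1][0])  # → 3 t448
```

Because the step is `size - overlap`, the total number of stored chunks (and thus index size) grows as the overlap grows, which is the storage trade-off noted above.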


A token is the smallest unit an LLM uses when processing text. It is not necessarily a whole word: it can be part of a word, a symbol, or whitespace, since tokens are the fragments produced by splitting text against the model's vocabulary.
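The idea that tokens are vocabulary-driven fragments rather than words can be illustrated with a toy greedy longest-match splitter (the vocabulary here is hypothetical and tiny; real models learn tens of thousands of entries):

```python
# Hypothetical toy vocabulary -- real tokenizer vocabularies are learned from data.
VOCAB = {"token", "ization", "iz", "ation"}

def greedy_tokenize(word: str) -> list[str]:
    """Greedy longest-match split against the vocabulary --
    a simplification of how real subword tokenizers behave."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown text: fall back to a single character
            i += 1
    return pieces

print(greedy_tokenize("tokenization"))  # → ['token', 'ization']
print(greedy_tokenize("tokens"))        # → ['token', 's']
```

Note how "tokens" splits into a known subword plus a single-character fragment: exactly the "parts of words" behavior described above.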

A tokenization algorithm, typified by byte-pair encoding (BPE), that merges frequent character patterns to split text into subword units. It directly affects the input/output cost and processing speed of LLMs; for low-resource languages, an insufficient dedicated vocabulary forces decomposition down to the byte level.

Quantization is an optimization technique that compresses model size by reducing parameter precision from 16-bit to 4-bit or similar, enabling inference with limited computational resources.
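The precision reduction described above can be sketched as symmetric linear quantization to 4-bit signed integers (a minimal illustration in pure Python; real implementations quantize per-group tensors, not flat lists, and the function names are mine):

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric linear quantization to 4-bit signed integers (-8..7).
    Stores small integers plus one float scale instead of full-precision floats."""
    scale = max(abs(w) for w in weights) / 7  # map the largest magnitude to 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.7, -0.2]
q, scale = quantize_int4(weights)
approx = dequantize(q, scale)
print(q)  # → [1, -5, 7, -2]
```

Each weight now occupies 4 bits instead of 16, at the cost of a small reconstruction error, which is the accuracy/size trade-off quantization makes.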
