
RAG (Retrieval-Augmented Generation) is a technical architecture that retrieves external documents in real time and incorporates their content into an LLM's response generation.
While applications in internal knowledge search and customer support chatbots are rapidly expanding, the complaint that "the prototype worked but we can't ship it to production" is heard again and again in the field. Frameworks such as LangChain and LlamaIndex have lowered the barrier to implementation, but the pitfalls encountered across the design, implementation, and operations phases have multiplied just as quickly.
This article is intended for the following readers:
By the end of this article, you will have a systematic understanding of 10 failure patterns that recur in the field along with concrete countermeasures for each, giving you knowledge you can immediately apply as a design review checklist or implementation reference.
RAG is increasingly being adopted in production by companies seeking to equip LLMs with up-to-date and specialized knowledge. However, systems that worked fine at the PoC stage continue to suffer unexpected drops in answer accuracy when deployed to production environments.
Behind this lies the inherently complex processing pipeline of RAG. Failure can lurk at many points: document preprocessing, chunk splitting, embedding generation, vector search, prompt assembly, and LLM inference.
The following H3 sections dig into the structural factors that make failures likely, as well as the pitfalls that arise in each of the design, implementation, and operations phases.
RAG (Retrieval-Augmented Generation) has been widely adopted as an architecture that simultaneously mitigates the "knowledge freshness problem" and the "hallucination problem" of LLMs. Models such as GPT and Claude have no information beyond their training data cutoff, making them ill-suited on their own for scenarios that require referencing internal documents or the latest product specifications. RAG is a powerful means of compensating for that shortcoming.
However, it comes with structural complexity. A RAG pipeline is composed primarily of the following elements:
These elements share a vulnerability: a design error in even one of them causes a cascading degradation in final answer quality. For example, when chunk boundaries cut off mid-context, the resulting embedding vectors cannot accurately represent meaning, and cases have been reported where highly relevant documents fail to appear at the top of search results.
Furthermore, as applied forms such as Agentic RAG and multimodal RAG proliferate, pipeline complexity tends to increase. The fact that a "seemingly working but low-accuracy" state can go unnoticed for extended periods is the fundamental challenge of building RAG systems.
What makes RAG failures troublesome is that they do not stem from a single bad component — they occur across all three phases of design, implementation, and operations. Because the nature of problems differs by phase, identifying the root cause is often delayed.
In the design phase, lax requirements definition cascades into downstream problems.
These lead to the classic symptom of "it runs at first but accuracy never materializes." It is also worth noting that the spread of evaluation frameworks such as RAGAS and TruLens is making metric definition at the design stage an industry standard.
In the implementation phase, technology selection mismatches occur frequently.
Implementation mistakes tend to go undetected because they produce a "seemingly working but low-accuracy" state.
In the operations phase, document freshness management is the biggest blind spot.
Developing the habit of reviewing all three phases holistically is the first step toward preventing RAG failures before they occur.

RAG failures tend to concentrate in the process of elevating a system from "vaguely working" to "production quality." The problems that commonly arise across the three phases of design, implementation, and operations have been organized into 10 patterns. Start by grasping the overall picture and comparing it against the current state of your own project. Detailed root-cause analysis and countermeasures for each pattern are explained step by step in the sections that follow.
Design Phase (Patterns 1–3)
Implementation Phase (Patterns 4–7)
Operations Phase (Patterns 8–10)
Mistakes made during the design phase ripple out and negatively impact every subsequent stage. Since the cost of rework is high once implementation has begun, it is important to verify the following three patterns before starting.
Pattern 1: Inappropriate Chunk Size Design
If chunks are too large, irrelevant information gets mixed in during retrieval, degrading the LLM's answer accuracy. Conversely, if they are too small, context gets cut off, and chunks that are meaningless on their own tend to rank highly.
Pattern 2: Failing to Anticipate the Diversity of User Queries
If query variations are not identified during the design phase, unexpected questions in production will continue to return irrelevant search results.
Pattern 3: Skipping Data Source Quality Assessment
Vectorizing documents while their quality remains low will not improve retrieval accuracy. The principle of "Garbage In, Garbage Out" is no exception in RAG.
Even during the implementation phase after the design is finalized, easy-to-overlook pitfalls appear in succession. Patterns 4–7 are typical examples that produce a state where the system "appears to be working but fails to deliver accuracy."
Pattern 4: Language Mismatch Between the Embedding Model and the LLM
When Japanese documents are vectorized using an embedding model specialized for English, the semantic space of the vocabulary becomes misaligned, and similarity scores tend to become unreliable. Since the range of multilingual model options—such as multilingual-e5-large and the text-embedding-3 series—has expanded, it is important to verify in advance that the languages supported by the LLM and the embedding model are aligned.
Pattern 5: Incorrect Chunk Overlap Configuration
Setting overlap to zero causes context to break off, while setting it too high makes duplicate information a source of noise. Since the optimal value differs depending on the document structure, empirical evaluation across multiple configurations is required.
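To make the trade-off concrete, here is a minimal sliding-window splitter in plain Python. The word-based splitting and the sizes are purely illustrative; production pipelines typically split on tokens or sentences via a library such as LangChain's text splitters.

```python
def chunk_words(words, size, overlap):
    """Split a word list into chunks of `size` words, each sharing
    `overlap` words with its predecessor. overlap=0 severs context at
    every boundary; overlap >= size would never advance, so it is rejected."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)
            if words[i:i + size]]

words = [f"w{i}" for i in range(10)]
no_overlap = chunk_words(words, size=4, overlap=0)    # boundaries share nothing
with_overlap = chunk_words(words, size=4, overlap=1)  # neighbours share 1 word
```

Raising the overlap increases the number of chunks (and therefore embedding and storage cost), which is exactly why the value needs to be tuned empirically per document type rather than copied from a tutorial.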
Pattern 6: Flat Search Without Leveraging Metadata
Relying solely on vector similarity carries the risk that an older version of a manual will rank higher than a newer one. Combining creation date, category, version number, and similar fields as filter conditions can improve retrieval accuracy.
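The following sketch shows the idea in plain Python with hard-coded similarity scores standing in for a real vector store's output; the corpus, fields, and scores are all hypothetical.

```python
# Toy corpus: each chunk carries a similarity score plus metadata. The scores
# are hard-coded to show an outdated document winning on similarity alone.
chunks = [
    {"text": "Manual v1: setup steps", "version": 1, "category": "manual", "score": 0.91},
    {"text": "Manual v3: setup steps", "version": 3, "category": "manual", "score": 0.88},
    {"text": "Release notes",          "version": 3, "category": "notes",  "score": 0.80},
]

def search(chunks, category=None, min_version=None, top_k=5):
    """Apply metadata filters first, then rank the survivors by similarity."""
    hits = [c for c in chunks
            if (category is None or c["category"] == category)
            and (min_version is None or c["version"] >= min_version)]
    return sorted(hits, key=lambda c: c["score"], reverse=True)[:top_k]

flat = search(chunks)[0]                                        # stale v1 wins
filtered = search(chunks, category="manual", min_version=3)[0]  # current v3 wins
```

Most vector stores (Weaviate, Qdrant, pgvector, etc.) support this pre-filtering natively; the point is to design the metadata schema so that such filters are possible at all.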
Pattern 7: Passing the Top-k Results Directly Without Re-ranking
The top results from a vector search are "semantically close," but are not necessarily "most useful for answering the question." Cases have been reported where inserting a Cross-Encoder-based re-ranking model improves the quality of the context passed to the LLM.
Please verify the following as an implementation checklist:
If these points are overlooked before going to production, the cost of corrections during the operations phase tends to increase significantly.
Problems discovered after going live tend to require correction costs several times greater than those in the design phase. The complacency of thinking "it's working, so it must be fine" is what leads to failures unique to the operations phase.
Pattern 8: Neglecting to Rebuild the Index When Documents Are Updated
Internal regulations and product specifications are revised frequently. If operations continue with an outdated vector index in place, there is a risk that responses will be generated based on deprecated rules or old-version specifications. It is important to automate a differential update pipeline and explicitly design the update triggers.
Pattern 9: Having No Evaluation or Monitoring Mechanism
A mechanism that periodically measures RAGAS scores in batch and triggers an alert when scores fall below a threshold is effective.
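The shape of such a monitoring job can be sketched in a few lines. The scorer below is a placeholder for an actual RAGAS metric call, and the threshold and alert channel are assumptions to be replaced with your own values.

```python
THRESHOLD = 0.80  # illustrative alert threshold; tune against your baseline

def evaluate_batch(samples, score_fn):
    """Run the scoring function (e.g. a RAGAS metric) over a batch of
    (question, answer, contexts) samples and return the mean score."""
    scores = [score_fn(s) for s in samples]
    return sum(scores) / len(scores)

def check_quality(samples, score_fn, alert):
    """Fire the alert callback when the batch average drops below threshold."""
    avg = evaluate_batch(samples, score_fn)
    if avg < THRESHOLD:
        alert(f"RAG quality degraded: mean score {avg:.2f} < {THRESHOLD}")
    return avg

alerts = []
# Placeholder scorer standing in for a real RAGAS evaluation call.
fake_scores = iter([0.9, 0.7, 0.6])
avg = check_quality([{}, {}, {}], lambda s: next(fake_scores), alerts.append)
```

Run on a schedule (cron, Airflow, etc.) against a fixed regression query set, this catches silent degradation that user complaints would otherwise surface weeks later.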
Pattern 10: Failing to Account for Behavioral Changes Caused by LLM Model Version Updates
When an API provider updates a model, the output tendencies can change even with the same prompt. If prompt templates or post-processing logic depend on an older version, this can lead to sudden quality degradation.
Building continuous measurement and update management into the design phase from the outset is the key to maintaining quality over the long term.

Once you have grasped the overall picture with the checklist, the next step is to structurally understand "why those failures occur" and "how to avoid them."
Each pattern has causes specific to its phase: design, implementation, or operations. Fixing only the surface-level symptom lets the same problem recur as long as the root cause remains. As you read, check which phase of your own system each pattern applies to.
Incorrect chunk size configuration is one of the most frequently reported failures in RAG development. It is common to see the judgment that "cutting larger chunks prevents information loss" backfire and significantly degrade retrieval accuracy instead.
Why Overly Large Chunks Are Problematic
In vector search, the meaning of an entire chunk is compressed into a single embedding vector. The larger the chunk, the more likely that vector becomes a "vague representation averaging multiple topics," making similarity calculations with queries inaccurate. As a result, "retrieval misses"—where the sections that should match do not rank highly—increase.
Common Symptoms
Recommended Approach: Small-to-Big Retrieval (Parent-Child Chunk Strategy)
The currently mainstream approach is to perform high-precision matching at retrieval time using small chunks (approximately 128–256 tokens), while passing parent chunks (512–1,024 tokens) as input to the LLM. This preserves context while maintaining retrieval accuracy. LlamaIndex and LangChain provide functionality to implement this strategy as a standard feature.
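The mechanics can be shown framework-free. The sketch below uses a toy word-overlap score in place of embedding similarity, and word counts in place of token counts; LlamaIndex and LangChain offer this pattern out of the box (e.g. LangChain's ParentDocumentRetriever), so treat this as a conceptual illustration only.

```python
def build_store(parents, child_size=5):
    """Index small child chunks, each remembering its full parent chunk."""
    store = []  # (child_word_set, parent_text)
    for parent in parents:
        words = parent.split()
        for i in range(0, len(words), child_size):
            store.append((set(words[i:i + child_size]), parent))
    return store

def retrieve_parent(store, query):
    """Match on the small chunk for precision, but return the parent so the
    LLM sees the surrounding context. Word overlap stands in for similarity."""
    q = set(query.split())
    best = max(store, key=lambda entry: len(entry[0] & q))
    return best[1]

parents = [
    "chunk overlap keeps neighbouring context intact when splitting documents",
    "reciprocal rank fusion merges bm25 and vector rankings without normalization",
]
context = retrieve_parent(build_store(parents), "how does rank fusion merge rankings")
```

The retrieval hit lands on a five-word child chunk, but the LLM receives the whole parent sentence, which is the essence of small-to-big retrieval.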
Implementation Guidelines (Reference Values)
Chunk size is not something you "set once and forget." Regular measurement of retrieval accuracy using evaluation frameworks such as RAGAS, combined with ongoing adjustment, is important for maintaining production quality. Since the optimal values differ depending on the nature of the documents and the use case, the figures above are provided as reference only, and validation in your own environment is recommended.
A mismatch between the languages of the embedding model and the LLM quietly erodes the overall accuracy of a RAG system. When search results are returned but the answers feel somehow off—that is one of the root causes worth suspecting.
The Structure of the Problem
There is a division of roles: the embedding model "calculates the semantic distance between a query and documents," while the LLM "reads the retrieved context and generates an answer." When these two components operate in different language spaces, relevant chunks may rank highly in search results, yet the LLM may still fail to interpret those chunks correctly.
Typical mismatch patterns are as follows:
Mitigation Strategies
Model selection is costly to change once finalized. Making it a habit to A/B test multiple models in the early stages is key to preventing rework downstream.
Are you passing the top-k chunks retrieved by vector search directly to the LLM? While easy to implement, this is a classic pitfall that can significantly degrade answer quality.
Vector search measures semantic proximity using metrics such as cosine similarity, but relevance to a query and usefulness for answering it do not necessarily align: a chunk with a high similarity score is not always the one that best supports an accurate answer.
Why Passing the Top-k Chunks Directly Causes Problems
Mitigation Strategy: Introducing Re-ranking
It is effective to insert a re-ranking step using a cross-encoder model between the retrieval phase and the generation phase. Because a cross-encoder takes both the query and the chunk as simultaneous inputs to precisely evaluate relevance, it can be expected to produce more accurate rankings than the bi-encoder approach used in vector search.
Cohere's Rerank endpoint, the BAAI/bge-reranker series, and lightweight local models such as FlashRank are widely used. Combining these allows you to narrow down the chunks passed to the LLM and improve the quality of the context.
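The pipeline shape is simple regardless of which re-ranker you choose. In the sketch below, `score_pair` is a placeholder for a real cross-encoder call (e.g. `CrossEncoder.predict` from sentence-transformers, or the Cohere Rerank API); the toy word-overlap scorer merely illustrates that the scorer sees query and chunk together.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-order retrieval candidates with a scorer that conditions on the
    query and chunk jointly, then keep only the strongest top_n for the LLM."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def toy_score(query, text):
    """Placeholder for a cross-encoder: here, simple shared-word count."""
    return len(set(query.lower().split()) & set(text.lower().split()))

candidates = [
    "pricing history of the product line",
    "how the cooling fan speed is controlled",
    "fan noise complaints and warranty policy",
]
top = rerank("how is the cooling fan controlled", candidates, toy_score, top_n=2)
```

Because the re-ranker runs only on the k candidates the vector search already returned, its extra latency is bounded and usually acceptable in exchange for cleaner context.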
Key Implementation Points
A RAG system is not "done once it's built." Leaving the vector index untouched after documents have been updated is one of the frequently reported failure patterns in production deployments.
Why This Becomes a Problem
What is stored in the vector store is nothing more than a snapshot taken at the time of indexing. When handling frequently updated documents—such as internal policies, product specifications, or API documentation—keeping a stale index leads to the following issues:
The Often-Overlooked Lack of Incremental Update Design
Many teams are mindful of "full index rebuilds" but tend to go live without designing a mechanism for incremental updates. Full rebuilds become increasingly costly and time-consuming as the number of documents grows, making them impractical in environments with high update frequency.
Document loaders in LlamaIndex and LangChain are progressively incorporating incremental change detection capabilities. A design that manages document hash values and timestamps, and re-embeds only the chunks that have changed, is recommended (refer to the official documentation of each framework for the latest specifications).
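The hash-manifest idea can be sketched in plain Python; the document store, IDs, and texts below are hypothetical, and in practice the re-embedding step would call your embedding model for the `changed` IDs only.

```python
import hashlib

def fingerprint(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_documents(docs, manifest):
    """Compare current documents against a stored hash manifest.
    `docs` maps doc_id -> text; `manifest` maps doc_id -> previous hash.
    Returns ids to re-embed, stale ids to delete, and the new manifest."""
    changed = [d for d, text in docs.items()
               if manifest.get(d) != fingerprint(text)]
    removed = [d for d in manifest if d not in docs]
    new_manifest = {d: fingerprint(t) for d, t in docs.items()}
    return changed, removed, new_manifest

manifest = {"policy.md": fingerprint("v1 text"), "old.md": fingerprint("gone")}
docs = {"policy.md": "v2 text", "spec.md": "new doc"}
changed, removed, manifest = diff_documents(docs, manifest)
# Only the modified policy.md and the new spec.md are re-embedded;
# old.md is purged from the index.
```

Persist the manifest alongside the index, run the diff on every sync, and the cost of an update scales with the number of changed documents instead of the corpus size.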
Key Mitigation Points
Index freshness is directly tied to RAG answer quality. Once a system enters the operational phase, automating the update workflow and establishing a monitoring framework should be the top priorities.

Stopping at "we built something that works" is a reported cause of unexpected problems surfacing in production environments. In RAG implementations, approaches that appear correct at first glance can significantly undermine retrieval accuracy and cost efficiency. This section covers two NG (no-good) implementation patterns that are especially common in the field, and explains what makes them problematic.
Vectorizing PDFs directly is one of the most representative NG examples in RAG development. While it may seem straightforward, it is important to be aware that this approach tends to significantly degrade retrieval accuracy.
Why It's an NG: The Structural Problems PDFs Carry
PDF is a format that prioritizes print layout, and various types of noise are introduced during the text extraction stage. The most common issues are as follows:
When this noise is vectorized as-is, the quality of the embedding vectors degrades, making it harder for relevant chunks to surface in search results.
Mitigation Strategy: Establish a Preprocessing Pipeline
Preprocessing using libraries such as PyMuPDF, pdfplumber, and Unstructured is widely adopted. The basic steps are as follows:
For scanned or image-based PDFs, OCR tools such as Tesseract or Azure Document Intelligence are commonly combined with these steps. PDF preprocessing is the foundation that determines the overall quality of a RAG system. Allocating sufficient effort to the stage before vectorization is essential for ensuring production-level quality.
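As one concrete cleanup step, the sketch below strips running headers and page numbers and rejoins words hyphenated across line breaks. It assumes per-page text already extracted by a tool such as PyMuPDF or pdfplumber; the sample pages and the 60% repetition threshold are illustrative.

```python
import re
from collections import Counter

def clean_pages(pages, header_ratio=0.6):
    """Remove lines repeated on most pages (running headers/footers) and
    bare page numbers, then rejoin words hyphenated across line breaks.
    `pages` is a list of per-page strings from a PDF extractor."""
    line_counts = Counter(line.strip() for page in pages
                          for line in page.splitlines() if line.strip())
    threshold = max(2, int(len(pages) * header_ratio))
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip() and line_counts[l.strip()] < threshold
                and not re.fullmatch(r"\d+", l.strip())]  # bare page numbers
        cleaned.append("\n".join(kept))
    text = "\n".join(cleaned)
    # "imple-\nmentation" -> "implementation"; note this can wrongly merge
    # genuine hyphenated compounds, so a dictionary check is safer in production.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

pages = [
    "ACME Internal Manual\nChunking is a key imple-\nmentation concern in RAG.\n3",
    "ACME Internal Manual\nMaintenance is scheduled monthly.\n4",
]
text = clean_pages(pages)
```

Running cleanup like this before chunking keeps headers and page numbers from polluting embedding vectors, which directly improves retrieval precision.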
One common anti-pattern frequently seen in practice is the assumption that "passing everything is safe" — stuffing large amounts of text into the context field of a prompt by passing all retrieved chunks without filtering.
While it may seem that more information would improve answer accuracy, this approach tends to be counterproductive. As the context grows longer, LLMs find it increasingly difficult to determine which information is important, and degraded response quality has been widely reported. This phenomenon is known as the "Lost in the Middle" problem, with research showing that information placed near the middle of a prompt is less likely to be referenced by the model.
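Two lightweight defences follow directly from this: cap the context at a budget instead of passing everything, and place the strongest chunks at the edges of the prompt rather than the middle. The sketch below uses word counts as a stand-in for token counts, and its ordering heuristic is one simple option, not a canonical algorithm.

```python
def pack_context(ranked_chunks, budget_words):
    """Keep only the top-ranked chunks that fit the word budget instead of
    stuffing every retrieved chunk into the prompt. `ranked_chunks` is
    best-first from the retriever; oversized chunks are skipped, not truncated."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget_words:
            continue
        packed.append(chunk)
        used += cost
    return packed

def order_for_llm(packed):
    """Mitigate 'Lost in the Middle': interleave so the strongest chunks sit
    at the start and end of the context, the weakest in the middle."""
    head, tail = packed[0::2], packed[1::2]
    return head + tail[::-1]

packed = pack_context(["a b c", "d e", "f g h i", "j"], budget_words=6)
ordered = order_for_llm(["c0", "c1", "c2", "c3"])  # best-first input
```

In production the budget would be measured in tokens via the model's tokenizer, but the principle is the same: a deliberately bounded, deliberately ordered context beats a maximal one.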
A Typical Failure Sequence
Factors That Compound the Problem
Key Mitigation Strategies
Even now that the context windows of GPT and Claude have been significantly expanded, "being able to pass more" does not mean "you should pass more." To maintain the right balance of accuracy, cost, and latency, it is essential to design with the principle of always keeping the amount of context to the necessary minimum.

While considerable effort is often invested in chunk design and the selection of retrieval algorithms, there are pitfalls that tend to be overlooked. Releasing systems without evaluation metrics in place, and accuracy degradation in environments where multilingual documents are mixed together, are challenges that are repeatedly reported in practice. The following subsections will dig into these two pitfalls and outline concrete directions for addressing them.
A RAG system that is "running" and one that is "answering accurately" are two entirely different things. Releasing to production without evaluation metrics in place risks delayed detection of quality degradation, leaving user complaints as the only feedback mechanism.
Key Risks of Releasing Without Evaluation
RAGAS is a widely referenced framework for RAG evaluation. It enables multi-dimensional measurement of system quality through the following key metrics:
The minimum recommended evaluation workflow before going to production is as follows:
Releasing to production based solely on the subjective assessment that "accuracy seems good" represents a structural deficiency from a quality assurance perspective. Building a solid evaluation foundation is the first step toward ensuring the reliability of a RAG system.
A knowledge base where Japanese and English are mixed is a breeding ground for retrieval accuracy degradation in RAG — one that tends to be overlooked. While multilingual embedding models continue to advance, issues stemming from mixed-language content are still widely reported in real-world deployments.
Why Accuracy Degrades
Embedding models tend to have different vector space distributions for different languages. Even when an English document is semantically close to a Japanese query, the vector distance can widen, causing that document to be missed in the search results.
The main causes of degradation are as follows:
A Concrete Scenario
Consider a system where product manuals are managed in Japanese and technical specifications in English. For a Japanese query such as "冷却ファンの回転数制御" (cooling fan RPM control), even if the English specification document containing "cooling fan RPM control" is the most semantically appropriate source, it is likely to fail to surface in the top search results.
Mitigation Strategies
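One common mitigation is to maintain per-language indexes and route each query to the index matching its language, with a fallback to the other. The sketch below uses a crude Unicode-range check for Japanese and placeholder search functions; real deployments would use a proper language detector and their vector store's search API.

```python
def is_japanese(text):
    """Crude language check: any hiragana, katakana, or CJK ideograph."""
    return any(
        "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
        for ch in text
    )

def route_query(query, search_ja, search_en, top_k=5):
    """Search the index matching the query language first, then fall back to
    the other index so relevant cross-language documents are not lost entirely."""
    primary, fallback = ((search_ja, search_en) if is_japanese(query)
                         else (search_en, search_ja))
    hits = primary(query, top_k)
    if len(hits) < top_k:
        hits += fallback(query, top_k - len(hits))
    return hits

# Placeholder search functions standing in for two per-language vector indexes.
ja_hits = route_query("冷却ファンの回転数制御",
                      search_ja=lambda q, k: ["ja-doc"][:k],
                      search_en=lambda q, k: ["en-doc"][:k])
```

An alternative is query translation (translate the query into each index's language before searching), which trades an extra LLM or MT call for better cross-language recall.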
Mixed-language issues tend to surface only once the system is in use. Auditing the language composition of documents at the design stage and incorporating countermeasures early is the most direct path to a successful production release.

Reflecting on the failure patterns covered so far, a key principle in RAG design is not to "aim for perfection from the start," but rather to "choose structures that are resistant to failure." If any one of retrieval strategy, chunk design, or evaluation metrics is missing, problems will inevitably surface at some point in the process. This section takes a deeper look at the design philosophy behind hybrid search — widely regarded as particularly effective — and at the Agentic RAG pattern, which handles complex queries with greater flexibility.
Queries that vector search alone cannot capture occur frequently in real-world deployments. Hybrid search is one of the most proven approaches available today for addressing that weakness.
Weaknesses of Vector Search and BM25
Combining both approaches allows coverage of both semantic approximation and lexical matching.
Score Integration: RRF as the Standard
Reciprocal Rank Fusion (RRF) has been widely adopted for score integration. It is a simple method that weights each result's rank by its reciprocal and sums the values, with the key advantage of combining two systems with different score scales without normalization. Major search platforms such as Elasticsearch, OpenSearch, and Weaviate are advancing native support for RRF-based hybrid search, reducing implementation costs compared to before.
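The method fits in a few lines, which is part of its appeal. The sketch below implements the standard formula, summing 1/(k + rank) per system with the conventional k=60; the document IDs are illustrative.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each system contributes 1/(k + rank) for every
    document it returns (rank is 1-based). Because only ranks are used, the
    BM25 and vector score scales never need to be normalized."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["spec-sheet", "faq", "manual"]    # lexical ranking
vector_hits = ["manual", "spec-sheet", "intro"]  # semantic ranking
fused = rrf([bm25_hits, vector_hits])
```

Documents that appear near the top of both lists ("spec-sheet" here) rise above documents favoured by only one system, which is exactly the behaviour hybrid search is after.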
Cases Where It Tends to Be Effective
Multiple cases have reported improvements in search accuracy for queries containing proper nouns. A practical approach is to start with a low BM25 weight so that vector search takes the lead, then adjust the ratio while monitoring accuracy logs. When combining with Agentic RAG in the next section, establishing hybrid search as the foundation of the retrieval layer further enhances the ability to handle complex queries.
Standard RAG assumes a simple pipeline of "1 query → 1 retrieval → 1 answer." In practice, however, queries that require multi-step reasoning or the integration of multiple sources arise frequently. Agentic RAG is the design pattern built to handle such complex queries.
Agentic RAG refers to an architecture in which the LLM functions as an agent, autonomously repeating a loop of retrieval, reasoning, and re-retrieval. The main design patterns can be organized into the following three types.
For example, a query such as "Compare and explain the technical reasons behind the price difference between Product A and Product B" is difficult to handle with a single retrieval. With the Plan-and-Execute pattern, it can be decomposed into three steps: retrieving Product A's specifications, retrieving Product B's specifications, and performing comparative reasoning.
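The control flow of that decomposition can be sketched as follows. All three callables are placeholders: `plan_fn` stands in for an LLM planning call, `retrieve` for the retrieval layer, and `synthesize` for the final generation step.

```python
def plan_and_execute(query, plan_fn, retrieve, synthesize):
    """Plan-and-Execute sketch: a planner splits the query into sub-queries,
    each sub-query is retrieved independently, and the collected evidence
    is synthesized into a single answer."""
    steps = plan_fn(query)
    evidence = {step: retrieve(step) for step in steps}
    return synthesize(query, evidence)

answer = plan_and_execute(
    "Compare the price difference between Product A and Product B",
    plan_fn=lambda q: ["specs of Product A", "specs of Product B", "compare"],
    retrieve=lambda step: f"<docs for: {step}>",
    synthesize=lambda q, ev: f"answer built from {len(ev)} retrieval steps",
)
```

Each extra step is also an extra failure point and an extra LLM call, which is why the loop needs step limits and timeouts in any real deployment.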
The following points should be kept in mind during implementation:
Agentic RAG is powerful, but the added complexity introduces more potential failure points. A hybrid configuration—using standard RAG for simple queries and activating the agent only for complex ones—tends to be effective from an operational stability standpoint.

We have selected two questions frequently raised by developers and engineers working on RAG construction and operation. "Does RAG make sense even for a small number of documents?" and "Which LLM should I choose?" are questions that continue to come up regularly in practice. Because they directly inform design decisions, each is explained below with concrete perspectives.
To state the conclusion upfront: there are many cases where RAG functions effectively even with a small number of documents. However, the smaller the scale, the more the design considerations change.
In small-scale environments (on the order of tens to hundreds of documents), the vector search index size is small, so retrieval latency tends to be low. On the other hand, because the absolute number of candidate chunks is limited, rough chunk design is more likely to directly impact answer quality—a point that warrants attention.
Examples of use cases where it tends to be effective:
Conversely, one risk to be aware of is that fewer documents means more cases where "retrieval returns no hits." When no chunk corresponding to a query exists, the LLM tends to fill in the gap with its own knowledge, making hallucinations more likely. Explicitly incorporating an instruction such as "do not answer if the information is not found in the documents" into the prompt becomes especially important.
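Beyond the prompt instruction, the same guardrail can be enforced in code by refusing before the LLM is ever called. The threshold and the retrieval/generation callables below are placeholders to be tuned and wired to your own stack.

```python
def answer_or_refuse(query, retrieve, generate, min_score=0.75):
    """Guardrail for small corpora: if the best retrieval score is below the
    threshold, return a refusal instead of letting the LLM improvise an
    answer from its own training data."""
    hits = retrieve(query)  # list of (score, chunk), best first
    if not hits or hits[0][0] < min_score:
        return "I could not find this in the provided documents."
    context = [chunk for _, chunk in hits]
    return generate(query, context)

reply = answer_or_refuse(
    "vacation policy?",
    retrieve=lambda q: [(0.42, "unrelated chunk")],
    generate=lambda q, ctx: "grounded answer",
)
```

Enforcing the refusal in code is more reliable than relying on the prompt alone, since prompt instructions can be ignored under model updates or adversarial queries.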
Key takeaways:
When building RAG, the basic approach to selecting an LLM is not "which is the most powerful" but "which fits my use case." GPT, Claude, and Gemini all deliver high performance, and it is difficult to rank them definitively.
Characteristics and suitable use cases for each model:
Points to verify when selecting a model:
In practice, it is effective to evaluate multiple candidate models rather than committing to a single one from the outset. Run an evaluation framework such as RAGAS against your own documents and query sets, measure actual scores, and only then decide on a production model. Choosing a model simply because it is well-known can lead to unexpected issues in accuracy, cost, or latency.

The 10 failure patterns covered so far are scattered across the design, implementation, and operations phases. Rather than trying to resolve everything at once, a more practical approach is to work through them incrementally using a phase-specific checklist.
Quantitative evaluation combining multiple metrics — Retrieval accuracy, Answer Relevancy, and Faithfulness — using evaluation frameworks such as RAGAS and TruLens is becoming the standard. Measuring a baseline before going to production makes it easier to iterate through improvement cycles.
Key points for operating the checklist are as follows:
One area that is particularly easy to overlook is continuous monitoring during the operations phase. Each time documents are updated, the index tends to become stale, and answer quality silently degrades. Building quality checks in as a systematic mechanism is what leads to stable long-term operation.
RAG is not a system you build once and leave alone — it is one that must be continuously cultivated in response to data and user queries. Establishing the checklist not as a "design-time ritual" but as an "operational habit" is arguably the most direct path to a successful production deployment.

Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).