10 RAG Implementation Failure Patterns and How to Avoid Them — Preventing Production Issues Before They Happen

Struggling with "Off-Target Answers" or "Not Ready for Production" After Building RAG? 10 Recurring Failure Patterns from the Field—Each with Concrete Workarounds
RAG (Retrieval-Augmented Generation) is a technical architecture that retrieves external documents in real time and incorporates their content into an LLM's response generation.
While applications in internal knowledge search and customer support chatbots are rapidly expanding, the complaint that "the prototype worked but we can't ship it to production" is repeatedly heard in the field. Frameworks such as LangChain and LlamaIndex have lowered the barrier to implementation, but they have not removed the pitfalls that arise across the design, implementation, and operations phases.
This article is intended for the following readers:
- Engineers who have completed a RAG PoC (proof of concept) but are hitting a wall when trying to move to production
- ML engineers and development leads struggling with retrieval accuracy or hallucination issues and looking for ways to improve
- Architects who are about to design and introduce RAG and want to understand common failure patterns in advance
By the end of this article, you will have a systematic understanding of 10 failure patterns that recur in the field along with concrete countermeasures for each, giving you knowledge you can immediately apply as a design review checklist or implementation reference.
RAG is increasingly being adopted in production by companies seeking to equip LLMs with up-to-date and specialized knowledge. However, systems that worked fine at the PoC stage continue to suffer unexpected drops in answer accuracy when deployed to production environments.
Behind this lies the inherently complex processing pipeline of RAG. Failure can lurk at many points: document preprocessing, chunk splitting, embedding generation, vector search, prompt assembly, and LLM inference.
The following sections dig into the structural factors that make failures likely, as well as the pitfalls that arise in each of the design, implementation, and operations phases.
The Problem RAG Aims to Solve and Its Structural Challenges
RAG (Retrieval-Augmented Generation) has been widely adopted as an architecture that simultaneously mitigates the "knowledge freshness problem" and the "hallucination problem" of LLMs. Models such as GPT and Claude have no information beyond their training data cutoff, making them ill-suited on their own for scenarios that require referencing internal documents or the latest product specifications. RAG is a powerful means of compensating for that shortcoming.
However, it comes with structural complexity. A RAG pipeline is composed primarily of the following elements:
- Document preprocessing: Normalizing diverse formats such as PDF, HTML, and Markdown
- Chunk splitting: Designing the size and boundaries for dividing text into searchable units
- Embedding: Selecting a model to convert text into vectors
- Vector store: Configuring index construction and approximate nearest-neighbor search
- Retrieval: The logic for fetching relevant chunks in response to a query
- Generation: Incorporating retrieved results into a prompt so the LLM can produce an answer
These elements share a vulnerability: a design error in even one of them causes a cascading degradation in final answer quality. For example, when chunk boundaries cut off mid-context, the resulting embedding vectors cannot accurately represent meaning, and cases have been reported where highly relevant documents fail to appear at the top of search results.
Furthermore, as applied forms such as Agentic RAG and multimodal RAG proliferate, pipeline complexity tends to increase. The fact that a "seemingly working but low-accuracy" state can go unnoticed for extended periods is the fundamental challenge of building RAG systems.
Failure-Prone Phases: Design, Implementation, or Operation?
What makes RAG failures troublesome is that they do not stem from a single bad component — they occur across all three phases of design, implementation, and operations. Because the nature of problems differs by phase, identifying the root cause is often delayed.
In the design phase, lax requirements definition cascades into downstream problems.
- Deciding on a chunk strategy without researching users' query patterns
- Starting work without cataloguing the languages and formats of target documents (PDF, HTML, internal wikis, etc.)
- Deferring evaluation metrics with a "we'll figure that out later" attitude
These lead to the classic symptom of "it runs at first but accuracy never materializes." It is also worth noting that, with the spread of evaluation frameworks such as RAGAS and TruLens, defining metrics at the design stage is becoming an increasingly common practice.
In the implementation phase, technology selection mismatches occur frequently.
- The embedding model and LLM do not support the same language (e.g., processing Japanese documents with an English model)
- Relying solely on vector search and missing queries that require keyword matching
- Skipping re-ranking, so low-relevance chunks are passed directly to the LLM
Implementation mistakes tend to go undetected because they produce a "seemingly working but low-accuracy" state.
In the operations phase, document freshness management is the biggest blind spot.
- Indexes are not rebuilt even when source documents are updated
- There is no mechanism to collect and analyze production query logs, making accuracy degradation invisible
- "Just pass everything" designs are widespread, and cases have been reported where both cost and accuracy problems become apparent as a result
Developing the habit of reviewing all three phases holistically is the first step toward preventing RAG failures before they occur.
Top 10 Failure Patterns: A Checklist Overview

RAG failures tend to concentrate in the process of elevating a system from "vaguely working" to "production quality." The problems that commonly arise across the three phases of design, implementation, and operations have been organized into 10 patterns. Start by grasping the overall picture and comparing it against the current state of your own project. Detailed root-cause analysis and countermeasures for each pattern are explained step by step in the sections that follow.
Design Phase (Patterns 1–3)
- Pattern 1: Inappropriate chunk size design
- Pattern 2: Failing to anticipate the diversity of user queries
- Pattern 3: Skipping data source quality assessment
Implementation Phase (Patterns 4–7)
- Pattern 4: Language mismatch between the embedding model and the LLM
- Pattern 5: Incorrect chunk overlap configuration
- Pattern 6: Flat search without leveraging metadata
- Pattern 7: Passing the top-k results directly to the LLM without re-ranking
Operations Phase (Patterns 8–10)
- Pattern 8: Neglecting to rebuild the index when documents are updated
- Pattern 9: Having no evaluation or monitoring mechanism
- Pattern 10: Failing to account for behavioral changes caused by LLM model version updates
Design Phase Failures (Patterns 1–3)
Mistakes made during the design phase ripple out and negatively impact every subsequent stage. Since the cost of rework is high once implementation has begun, it is important to verify the following three patterns before starting.
Pattern 1: Inappropriate Chunk Size Design
If chunks are too large, irrelevant information gets mixed in during retrieval, degrading the LLM's answer accuracy. Conversely, if they are too small, context gets cut off, and chunks that are meaningless on their own tend to rank highly.
- For general use cases, 256–512 tokens is commonly cited as a starting point
- For documents with complex structures, such as legal texts or technical specifications, variable-length chunking at the section level has been reported to be effective
- LlamaIndex and LangChain both offer semantic chunking components, so splitting that accounts for the semantic coherence of sentences is available as an option (check each framework's official documentation for the current package and API)
Pattern 2: Failing to Anticipate the Diversity of User Queries
If query variations are not identified during the design phase, unexpected questions in production will continue to return irrelevant search results.
- When internal users who query with technical terminology and general users who ask in plain language coexist, a single embedding model may not be able to handle both adequately
- It is advisable to consider preprocessing techniques such as Query Expansion and HyDE (Hypothetical Document Embeddings) during the design phase
Pattern 3: Skipping Data Source Quality Assessment
Vectorizing documents while their quality remains low will not improve retrieval accuracy. The principle of "Garbage In, Garbage Out" is no exception in RAG.
- Duplicate documents, outdated information, and inconsistent formatting tend to be the primary causes of index contamination
- If a data cleansing pipeline is not defined during the design phase, quality issues are likely to surface during the operations phase
- When mixing multiple sources such as PDFs and internal wikis, it is important to simultaneously define the metadata schema (creation date, source type, reliability score, etc.)
Implementation Phase Failures (Patterns 4–7)
Even during the implementation phase after the design is finalized, easy-to-overlook pitfalls appear in succession. Patterns 4–7 are typical examples that produce a state where the system "appears to be working but fails to deliver accuracy."
Pattern 4: Language Mismatch Between the Embedding Model and the LLM
When Japanese documents are vectorized using an embedding model specialized for English, the semantic space of the vocabulary becomes misaligned, and similarity scores tend to become unreliable. Since the range of multilingual model options—such as multilingual-e5-large and the text-embedding-3 series—has expanded, it is important to verify in advance that the languages supported by the LLM and the embedding model are aligned.
Pattern 5: Incorrect Chunk Overlap Configuration
Setting overlap to zero causes context to break off, while setting it too high makes duplicate information a source of noise. Since the optimal value differs depending on the document structure, empirical evaluation across multiple configurations is required.
Pattern 6: Flat Search Without Leveraging Metadata
Relying solely on vector similarity carries the risk that an older version of a manual will rank higher than a newer one. Combining creation date, category, version number, and similar fields as filter conditions can improve retrieval accuracy.
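The idea of filtering on metadata before ranking by similarity can be sketched as follows. This is a minimal in-memory illustration, not a production implementation: the `Chunk` fields, the `filtered_search` function, and the rule "keep only the highest version within each category" are all assumptions made for the example. Most vector stores expose their own metadata filters, which would replace this logic in practice.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass
class Chunk:
    text: str
    embedding: list
    category: str
    version: int
    updated: date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def latest_versions_only(chunks):
    """Keep only chunks that carry the highest version number within their category."""
    newest = {}
    for c in chunks:
        newest[c.category] = max(newest.get(c.category, c.version), c.version)
    return [c for c in chunks if c.version == newest[c.category]]


def filtered_search(query_vec, chunks, category=None, updated_after=None, k=3):
    """Apply metadata filters first, then rank the survivors by cosine similarity."""
    candidates = latest_versions_only(chunks)
    if category is not None:
        candidates = [c for c in candidates if c.category == category]
    if updated_after is not None:
        candidates = [c for c in candidates if c.updated >= updated_after]
    return sorted(candidates, key=lambda c: cosine(query_vec, c.embedding), reverse=True)[:k]
```

With this arrangement, an old manual whose embedding happens to sit closest to the query is excluded before similarity is even computed, so it cannot outrank the current version.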
Pattern 7: Passing the Top-k Results Directly Without Re-ranking
The top results from a vector search are "semantically close," but are not necessarily "most useful for answering the question." Cases have been reported where inserting a Cross-Encoder-based re-ranking model improves the quality of the context passed to the LLM.
Please verify the following as an implementation checklist:
- Have you verified that the embedding model's supported languages match the document language?
- Have you evaluated the overlap rate across multiple configurations?
- Has the metadata filtering design been completed?
- Has a decision been made on whether to include a re-ranking step?
If these points are overlooked before going to production, the cost of corrections during the operations phase tends to increase significantly.
Operation Phase Failures (Patterns 8–10)
Problems discovered after going live tend to cost considerably more to correct than those caught in the design phase. The complacency of thinking "it's working, so it must be fine" is what leads to failures unique to the operations phase.
Pattern 8: Neglecting to Rebuild the Index When Documents Are Updated
Internal regulations and product specifications are revised frequently. If operations continue with an outdated vector index in place, there is a risk that responses will be generated based on deprecated rules or old-version specifications. It is important to automate a differential update pipeline and explicitly design the update triggers.
Pattern 9: Having No Evaluation or Monitoring Mechanism
- Many systems are operated without setting quantitative metrics such as Faithfulness and Answer Relevancy
- Without a feedback loop, detection of accuracy degradation is delayed
- Incorporating evaluation frameworks such as RAGAS and TruLens before going to production is becoming a standard practice
A mechanism that periodically measures RAGAS scores in batch and triggers an alert when scores fall below a threshold is effective.
Pattern 10: Failing to Account for Behavioral Changes Caused by LLM Model Version Updates
When an API provider updates a model, the output tendencies can change even with the same prompt. If prompt templates or post-processing logic depend on an older version, this can lead to sudden quality degradation.
- Explicitly pin the model version and run regression tests when updating
- Manage prompts in Git the same way as code
- Separate the production environment from the validation environment and perform gradual rollouts
Building continuous measurement and update management into the design phase from the outset is the key to maintaining quality over the long term.
Detailed Breakdown and Workarounds for Each Failure Pattern

Once you have grasped the overall picture with the checklist, the next step is to structurally understand "why those failures occur" and "how to avoid them."
Each pattern has causes specific to its respective phase—design, implementation, or operations. Even if you fix only the surface-level symptoms, the same problems tend to recur if the root cause remains. Read on while comparing each pattern against which phase of your own system it applies to.
Pattern 1: Chunk Size Too Large, Degrading Retrieval Accuracy
Incorrect chunk size configuration is one of the most frequently reported failures in RAG development. A common case is the assumption that "cutting larger chunks prevents information loss," which in practice tends to degrade retrieval accuracy significantly.
Why Overly Large Chunks Are Problematic
In vector search, the meaning of an entire chunk is compressed into a single embedding vector. The larger the chunk, the more likely that vector becomes a "vague representation averaging multiple topics," making similarity calculations with queries inaccurate. As a result, "retrieval misses"—where the sections that should match do not rank highly—increase.
Common Symptoms
- Paragraphs unrelated to the query's intent are more likely to be mixed in
- The context passed to the LLM contains a mix of "relevant information" and "irrelevant information," degrading answer accuracy
- Even when retrieving the top-k results, the effective information density is low
Recommended Approach: Small-to-Big Retrieval (Parent-Child Chunk Strategy)
The currently mainstream approach is to perform high-precision matching at retrieval time using small chunks (approximately 128–256 tokens), while passing parent chunks (512–1,024 tokens) as input to the LLM. This preserves context while maintaining retrieval accuracy. LlamaIndex and LangChain provide functionality to implement this strategy as a standard feature.
Implementation Guidelines (Reference Values)
- Retrieval chunks: 128–256 tokens (accuracy-focused)
- LLM input chunks: 512–1,024 tokens (context preservation)
- Inter-chunk overlap: approximately 20–50 tokens (to prevent context breaks)
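The reference values above can be sketched as a small parent-child splitter. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for real tokenizer tokens, and the names `ChildChunk`, `split_tokens`, `build_parent_child`, and `expand_to_parents` are hypothetical helpers, not part of LlamaIndex or LangChain.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def split_tokens(tokens, size, overlap=0):
    """Split a token list into windows of `size`; each window shares `overlap` tokens with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def build_parent_child(text, parent_size=512, child_size=128, child_overlap=20):
    """Return (parents, children): children are indexed for search, parents are passed to the LLM."""
    tokens = text.split()  # assumption: whitespace "tokens" stand in for a real tokenizer
    parents = [" ".join(p) for p in split_tokens(tokens, parent_size)]
    children = []
    for pid, parent in enumerate(parents):
        for piece in split_tokens(parent.split(), child_size, child_overlap):
            children.append(ChildChunk(" ".join(piece), pid))
    return parents, children


def expand_to_parents(hit_children, parents):
    """Map matched child chunks back to their parents, de-duplicated and in hit order."""
    seen, result = set(), []
    for child in hit_children:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result
```

In use, only the child chunks would be embedded and searched; the top child hits are then passed through `expand_to_parents` so the LLM receives the wider parent context once, even when several children from the same parent match.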
Chunk size is not something you "set once and forget." Regular measurement of retrieval accuracy using evaluation frameworks such as RAGAS, combined with ongoing adjustment, is important for maintaining production quality. Since the optimal values differ depending on the nature of the documents and the use case, the figures above are provided as reference only, and validation in your own environment is recommended.
Pattern 4: Language Mismatch Between Embedding Model and LLM
A mismatch between the languages of the embedding model and the LLM quietly erodes the overall accuracy of a RAG system. When search results are returned but the answers feel somehow off—that is one of the root causes worth suspecting.
The Structure of the Problem
There is a division of roles: the embedding model "calculates the semantic distance between a query and documents," while the LLM "reads the retrieved context and generates an answer." When these two components operate in different language spaces, relevant chunks may rank highly in search results, yet the LLM may still fail to interpret those chunks correctly.
Typical mismatch patterns are as follows:
- English-specialized embedding model × Japanese documents: Models trained primarily on English corpora tend to produce coarser semantic representations for Japanese text.
- Multilingual embedding model × English-only LLM: Even if the embeddings support multiple languages, answer quality degrades if the LLM cannot accurately process Japanese context.
- Domain-specific vocabulary mismatch: In fields with dense specialized terminology—such as law, medicine, or finance—general-purpose embedding models have been reported to fail at accurately vectorizing the meaning of technical terms.
Mitigation Strategies
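- Confirm that both the embedding model and the LLM officially support the document language. Choosing both from the same provider can simplify this check, but it does not by itself guarantee that the two handle the language equally well.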
- Include multilingual models as candidates for comparative evaluation (rely on official benchmark scores only as a reference; real-world measurement in your own environment is recommended).
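- Use evaluation frameworks such as RAGAS to measure retrieval-side metrics (such as Context Precision and Context Recall) separately from generation-side metrics (such as Faithfulness), so that retrieval accuracy and LLM interpretation accuracy can be assessed independently.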
Model selection is costly to change once finalized. Making it a habit to A/B test multiple models in the early stages is key to preventing rework downstream.
Pattern 7: Passing Top-k Results Directly Without Re-Ranking
Are you passing the top-k chunks retrieved by vector search directly to the LLM? While easy to implement, this is a classic pitfall that can significantly degrade answer quality.
Vector search measures semantic proximity using metrics such as cosine similarity, but semantic closeness to a query and usefulness for answering it do not necessarily align. A chunk with a high similarity score is not always the chunk that contains the information needed to answer the actual question.
Why Passing the Top-k Chunks Directly Causes Problems
- Chunks that are "related but not the answer" tend to get mixed in.
- The context window is consumed wastefully, diluting the information that is truly needed.
- The more noise there is, the higher the risk that the LLM picks up incorrect information, increasing the likelihood of hallucination.
Mitigation Strategy: Introducing Re-ranking
It is effective to insert a re-ranking step using a cross-encoder model between the retrieval phase and the generation phase. Because a cross-encoder takes both the query and the chunk as simultaneous inputs to precisely evaluate relevance, it can be expected to produce more accurate rankings than the bi-encoder approach used in vector search.
Cohere's Rerank endpoint, the BAAI/bge-reranker series, and lightweight local models such as FlashRank are widely used. Adding one of these as a re-ranking step allows you to narrow down the chunks passed to the LLM and improve the quality of the context.
Key Implementation Points
- Retrieve a broader set in the initial search (top-20 to 50), then narrow it down to top-3 to 5 after re-ranking.
- Always check the official documentation of the re-ranking model to confirm whether it supports Japanese.
- Measure the impact on latency and compensate with asynchronous processing or caching.
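The "retrieve broadly, then narrow" flow above can be sketched with a pluggable scoring function. This is only an illustration: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is a stand-in so the example runs without a model. In a real pipeline the scorer would wrap a cross-encoder or a hosted re-ranking API.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score every candidate against the query and keep the `top_n` best.

    `score_fn(query, chunk)` must return a number where higher means more relevant.
    Ties keep the original retrieval order because Python's sort is stable.
    """
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_overlap_score(query, chunk):
    """Stand-in scorer (assumption): fraction of query words that appear in the chunk.

    A real cross-encoder reads the query and chunk together; this toy only
    counts shared words so the example stays self-contained.
    """
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Usage: pass the broad first-stage results (for example top-20 to 50) as `candidates`.
first_stage = [
    "how to reset the router password",
    "router product history",
    "reset password steps for the router admin page",
]
context = rerank("reset router password", first_stage, toy_overlap_score, top_n=2)
```

Keeping the scorer as a parameter makes it straightforward to measure latency with and without the re-ranking step, or to swap models during evaluation.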
Pattern 8: Neglecting Index Rebuilding When Documents Are Updated
A RAG system is not "done once it's built." Leaving the vector index untouched after documents have been updated is one of the frequently reported failure patterns in production deployments.
Why This Becomes a Problem
What is stored in the vector store is nothing more than a snapshot taken at the time of indexing. When handling frequently updated documents—such as internal policies, product specifications, or API documentation—keeping a stale index leads to the following issues:
- Answers are generated based on deprecated rules or outdated version specifications.
- The latest information is excluded from search targets, increasing the risk of providing users with incorrect information.
- Deleted chunks still appear in search results, generating references to content that no longer exists.
The Often-Overlooked Lack of Incremental Update Design
Many teams are mindful of "full index rebuilds" but tend to go live without designing a mechanism for incremental updates. Full rebuilds become increasingly costly and time-consuming as the number of documents grows, making them impractical in environments with high update frequency.
Document loaders in LlamaIndex and LangChain are progressively incorporating incremental change detection capabilities. A design that manages document hash values and timestamps, and re-embeds only the chunks that have changed, is recommended (refer to the official documentation of each framework for the latest specifications).
Key Mitigation Points
- Attach metadata to each document (last updated timestamp, version number) in the vector store.
- Incorporate an index update job into the CI/CD pipeline and trigger it automatically on document changes.
- Establish periodic index consistency checks to verify that referenced documents still exist.
- Manage a "soft-delete flag" in metadata to handle deleted documents.
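A hash-based incremental update with soft deletion, as described above, might look like the following. This is a sketch under assumptions: `index_state` is a hypothetical side table kept next to the vector store, and `embed_fn` is a placeholder for whatever re-embeds a document and upserts it into the store.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(current_docs, index_state):
    """Compare the source documents with what was recorded at the last indexing run.

    current_docs: {doc_id: text} as found in the source system now.
    index_state:  {doc_id: {"hash": str, "deleted": bool}} from the last run.
    Returns (to_embed, to_soft_delete) as lists of doc ids.
    """
    to_embed = []
    for doc_id, text in current_docs.items():
        entry = index_state.get(doc_id)
        # New, previously deleted, or changed documents need (re-)embedding.
        if entry is None or entry["deleted"] or entry["hash"] != content_hash(text):
            to_embed.append(doc_id)
    to_soft_delete = [
        doc_id for doc_id, entry in index_state.items()
        if doc_id not in current_docs and not entry["deleted"]
    ]
    return to_embed, to_soft_delete


def apply_plan(current_docs, index_state, to_embed, to_soft_delete, embed_fn):
    """Re-embed changed documents and flag removed ones instead of dropping them."""
    for doc_id in to_embed:
        embed_fn(doc_id, current_docs[doc_id])  # hypothetical: re-embed and upsert
        index_state[doc_id] = {"hash": content_hash(current_docs[doc_id]), "deleted": False}
    for doc_id in to_soft_delete:
        index_state[doc_id]["deleted"] = True  # retrieval should filter on this flag
```

Running `plan_index_update` from a scheduled job or a CI/CD step keeps the cost proportional to the number of changed documents rather than to the size of the whole corpus.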
Index freshness is directly tied to RAG answer quality. Once a system enters the operational phase, automating the update workflow and establishing a monitoring framework should be the top priorities.
Common NG Implementation Examples: What's the Problem?

Stopping at "we built something that works" is a reported cause of unexpected problems surfacing in production environments. In RAG implementations, approaches that appear correct at first glance can significantly undermine retrieval accuracy and cost efficiency. This section covers two NG (no-good) implementation patterns that are especially common in the field, and explains what makes them problematic.
NG Example: Vectorizing PDFs As-Is
Vectorizing PDFs directly is one of the most representative NG examples in RAG development. While it may seem straightforward, it is important to be aware that this approach tends to significantly degrade retrieval accuracy.
Why It's an NG: The Structural Problems PDFs Carry
PDF is a format that prioritizes print layout, and various types of noise are introduced during the text extraction stage. The most common issues are as follows:
- Header and footer contamination: Page numbers and company names get merged with the body text, generating meaningless chunks.
- Misreading of multi-column layouts: In two-column PDFs, content from the left and right columns may be extracted in a mixed order.
- Failure to convert tables and figures to text: Cell data gets merged or lost, making it impossible to accurately retrieve numerical information.
- Garbled characters and symbol contamination: Due to font embedding issues, special characters and Japanese text may become corrupted.
When this noise is vectorized as-is, the quality of the embedding vectors degrades, making it harder for relevant chunks to surface in search results.
Mitigation Strategy: Establish a Preprocessing Pipeline
Preprocessing using libraries such as PyMuPDF, pdfplumber, and Unstructured is widely adopted. The basic steps are as follows:
- Trim headers and footers using regular expressions or layout analysis.
- Convert tables to Markdown format or CSV before chunking.
- Visually sample the extracted text after extraction to check for noise.
- For scanned PDFs, consider Tesseract or Azure Document Intelligence.
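The header and footer trimming step can be sketched as follows. It is a minimal heuristic, assuming the page texts have already been extracted as one string per page by one of the libraries above; `strip_repeated_lines` and its `min_ratio` parameter are hypothetical names chosen for the example.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_ratio=0.6):
    """Drop lines that repeat across pages (likely headers/footers) and bare page numbers.

    pages: list of strings, one per extracted page.
    A line counts as repeated when it appears on at least `min_ratio` of the pages
    (and on at least two pages).
    """
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]

    # Count on how many pages each distinct non-empty line occurs.
    counts = Counter()
    for lines in page_lines:
        counts.update({line for line in lines if line})

    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {line for line, n in counts.items() if n >= threshold}

    # Matches "3", "3/10", "Page 3", "page 3 / 10".
    page_number = re.compile(r"^(page\s*)?\d+(\s*/\s*\d+)?$", re.IGNORECASE)

    cleaned = []
    for lines in page_lines:
        kept = [
            line for line in lines
            if line and line not in repeated and not page_number.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Because the page-number pattern also removes body lines that consist only of a number, it is worth sampling the output by eye, as the list above suggests, before chunking.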
PDF preprocessing is the foundation that determines the overall quality of a RAG system. Allocating sufficient effort to the stage before vectorization is essential for ensuring production-level quality.
NG Example: Stuffing Too Much Context into the Prompt
An anti-pattern frequently seen in practice is the assumption that "passing everything is safe": all retrieved chunks are passed without filtering, stuffing large amounts of text into the context portion of the prompt.
While it may seem that more information would improve answer accuracy, this approach tends to be counterproductive. As the context grows longer, LLMs find it increasingly difficult to determine which information is important, and degraded response quality has been widely reported. This phenomenon is known as the "Lost in the Middle" problem, with research showing that information placed near the middle of a prompt is less likely to be referenced by the model.
A Typical Failure Sequence
- The top 20 retrieved chunks are concatenated directly into the system prompt
- The total token count exceeds 10,000–20,000
- The model tends to overlook information placed in the middle of the context, generating responses mainly from content near the beginning and end
- As a result, "off-target" or "incomplete answers" occur frequently
Factors That Compound the Problem
- Duplicate content across chunks makes it easier for the model to become confused
- Low-relevance chunks appear among the top results and function as noise
- Longer contexts increase API inference costs and response latency
Key Mitigation Strategies
- Insert a re-ranking step (e.g., a Cross-Encoder) to narrow the top-k results down to approximately 3–5 chunks
- Set a relevance score threshold for chunks and exclude those that fall below it
- Implement logic that caps the total token count of the passed context and truncates any excess
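The score threshold and token cap above can be combined in one small selection function. This is a sketch under assumptions: `select_context` is a hypothetical helper, the default values are placeholders rather than recommendations, and the whitespace word count is only a rough stand-in for the tokenizer of the model actually in use.

```python
def select_context(scored_chunks, min_score=0.5, max_tokens=2000, count_tokens=None):
    """Pick the chunks to pass to the LLM.

    scored_chunks: list of (score, text) pairs, e.g. re-ranker output.
    Chunks below `min_score` are dropped; the rest are added best-first
    until adding the next one would exceed `max_tokens`.
    """
    if count_tokens is None:
        # Assumption: whitespace word count as a rough token estimate.
        count_tokens = lambda text: len(text.split())

    selected, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if score < min_score:
            break  # everything after this point scores lower still
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # cap reached: discard the remaining lower-ranked chunks
        selected.append(text)
        used += cost
    return selected
```

Dropping whole chunks at the cap, rather than cutting one off mid-sentence, is a deliberate choice here so that the LLM never sees a truncated passage.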
Even now that the context windows of GPT and Claude have been significantly expanded, "being able to pass more" does not mean "you should pass more." To maintain the right balance of accuracy, cost, and latency, it is essential to design with the principle of always keeping the amount of context to the necessary minimum.
Easily Overlooked Pitfalls: Where Are They?

While considerable effort is often invested in chunk design and the selection of retrieval algorithms, there are pitfalls that tend to be overlooked. Releasing systems without evaluation metrics in place, and accuracy degradation in environments where multilingual documents are mixed together, are challenges that are repeatedly reported in practice. The following subsections will dig into these two pitfalls and outline concrete directions for addressing them.
The Risk of Going Live Without Setting Evaluation Metrics (e.g., RAGAS)
A RAG system that is "running" and one that is "answering accurately" are two entirely different things. Releasing to production without evaluation metrics in place risks delayed detection of quality degradation, leaving user complaints as the only feedback mechanism.
Key Risks of Releasing Without Evaluation
- Drops in retrieval recall cannot be tracked numerically, making it impossible to prioritize improvements
- The frequency of hallucinations remains unknown, creating a risk that misinformation accumulates and spreads
- The impact of changes to chunk design or prompts cannot be compared quantitatively, leaving improvements dependent on individual judgment
- Regressions (quality degradation caused by updates) go undetected, introducing quality risk with every release
RAGAS is a widely referenced framework for RAG evaluation. It enables multi-dimensional measurement of system quality through the following key metrics:
- Faithfulness: Whether the generated answer is grounded in the retrieved context
- Answer Relevancy: Whether the answer appropriately addresses the question
- Context Precision / Recall: Whether the retrieved chunks are appropriate in terms of both precision and coverage
The minimum recommended evaluation workflow before going to production is as follows:
- Prepare a test set of 50–100 representative queries
- Measure baseline scores for each metric using RAGAS or a similar framework
- Define thresholds (e.g., Faithfulness ≥ 0.8) and establish a rule to hold the release if scores fall below them
- After going live, re-evaluate regularly using the same test set and monitor score trends over time
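The release rule in the workflow above can be expressed as a small gate function. This sketch assumes the per-metric scores have already been computed and averaged over the test set by RAGAS or a similar framework; `release_gate` is a hypothetical helper, and every threshold other than the Faithfulness example from the list is a placeholder.

```python
def release_gate(scores, thresholds):
    """Check averaged evaluation scores against minimum thresholds.

    scores:     {metric_name: float} measured on the fixed test set.
    thresholds: {metric_name: float} minimum acceptable value per metric.
    Returns (ok, failures), where failures maps metric -> (score, threshold).
    A metric that was not measured at all counts as a failure.
    """
    failures = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures[metric] = (value, minimum)
    return (not failures), failures


# Usage: hold the release when any metric falls below its threshold.
thresholds = {"faithfulness": 0.8, "answer_relevancy": 0.75}  # second value is a placeholder
ok, failures = release_gate({"faithfulness": 0.85, "answer_relevancy": 0.7}, thresholds)
```

Running the same function on a schedule after release, against the same test set, gives a simple basis for the alerting and trend monitoring described above.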
Releasing to production based solely on the subjective assessment that "accuracy seems good" represents a structural deficiency from a quality assurance perspective. Building a solid evaluation foundation is the first step toward ensuring the reliability of a RAG system.
Retrieval Accuracy Degradation with Mixed Multilingual Documents
A knowledge base where Japanese and English are mixed is a breeding ground for retrieval accuracy degradation in RAG — one that tends to be overlooked. While multilingual embedding models continue to advance, issues stemming from mixed-language content are still widely reported in real-world deployments.
Why Accuracy Degrades
Embedding models tend to have different vector space distributions for different languages. Even when an English document is semantically close to a Japanese query, the vector distance can widen, causing that document to be missed in the search results.
The main causes of degradation are as follows:
- Variation in multilingual coverage across models: Even models that claim multilingual support have been reported to achieve lower accuracy on East Asian languages than on English
- Tokenizer effects: Japanese text often consumes more tokens than English text of comparable content, which inflates token counts and makes context truncation at chunk boundaries more likely
- Asymmetric scoring: Even for identical semantic content, similarity scores can differ across languages, causing bias in the top-k results
A Concrete Scenario
Consider a system where product manuals are managed in Japanese and technical specifications in English. For a Japanese query such as "冷却ファンの回転数制御" (cooling fan RPM control), even if the English specification document containing "cooling fan RPM control" is the most semantically appropriate source, it is likely to fail to surface in the top search results.
Mitigation Strategies
- Adopt a "query expansion" approach that automatically translates queries into multiple languages before retrieval
- Consider an architecture that splits indexes by language and normalizes scores at merge time
- Combine multilingual-specialized models with BM25 to compensate for retrieval accuracy
- Regularly measure retrieval accuracy by language using evaluation metrics, and establish a cycle for re-evaluating model selection
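The "split indexes by language and normalize scores at merge time" idea can be sketched as follows. It is a minimal illustration assuming each per-language index returns `(doc_id, score)` pairs; `min_max_normalize` and `merge_language_results` are hypothetical names, and min-max scaling is just one possible normalization.

```python
def min_max_normalize(hits):
    """Rescale the scores of one index to the range [0, 1].

    hits: list of (doc_id, score) from a single language index.
    """
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in hits]


def merge_language_results(results_by_lang, k=5):
    """Merge per-language hits after normalizing each index separately.

    results_by_lang: {language_code: [(doc_id, score), ...]}
    Returns the top `k` items as (doc_id, normalized_score, language_code).
    """
    merged = []
    for lang, hits in results_by_lang.items():
        merged.extend((doc_id, score, lang) for doc_id, score in min_max_normalize(hits))
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:k]


# Usage: raw scores would rank every Japanese hit above every English one.
results = {
    "ja": [("ja1", 0.82), ("ja2", 0.80), ("ja3", 0.70)],
    "en": [("en1", 0.55), ("en2", 0.40)],
}
top = merge_language_results(results, k=2)
```

One limitation to keep in mind: min-max scaling always maps the weakest hit in each index to 0 and the strongest to 1, even when an index contains nothing relevant, so per-language accuracy should still be measured as recommended above.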
Mixed-language issues tend to surface only once the system is in use. Auditing the language composition of documents at the design stage and incorporating countermeasures early is the most direct path to a successful production release.
Core Design Principles for Failure-Proof RAG

Reflecting on the failure patterns covered so far, a key principle in RAG design is not to "aim for perfection from the start," but rather to "choose structures that are resistant to failure." If any one of retrieval strategy, chunk design, or evaluation metrics is missing, problems will inevitably surface at some point in the process. This section takes a deeper look at the design philosophy behind hybrid search — widely regarded as particularly effective — and at the Agentic RAG pattern, which handles complex queries with greater flexibility.
Why Combine Hybrid Search (Vector Search + BM25)
Queries that vector search alone cannot capture occur frequently in real-world deployments. Hybrid search is one of the most proven approaches available today for addressing that weakness.
Weaknesses of Vector Search and BM25
- Weaknesses of vector search: It handles terms that require exact matching poorly, such as proper nouns, model numbers, and command names. Identifiers such as "AWS Lambda" and "CVE-2024-XXXX" carry little semantic signal of their own, so embeddings tend not to distinguish them reliably from similar-looking strings.
- Weaknesses of BM25: Because it relies on surface-level word matching, it handles paraphrases and synonyms poorly. Expressions like "cost reduction" and "expense compression" are treated as entirely unrelated terms.
Combining both approaches allows coverage of both semantic approximation and lexical matching.
Score Integration: RRF as the Standard
Reciprocal Rank Fusion (RRF) has been widely adopted for score integration. It is a simple method that scores each document by summing the reciprocal of its rank in each result list, usually as 1/(k + rank) with a smoothing constant k. Because it uses only ranks, its key advantage is that it combines two systems with different score scales without normalization. Major search platforms such as Elasticsearch, OpenSearch, and Weaviate are advancing native support for RRF-based hybrid search, which lowers implementation costs.
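The rank-based fusion described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: each retriever is assumed to return an ordered list of document IDs, `k = 60` is a commonly used default rather than a required value, and the optional `weights` parameter is a weighted variant added here for tuning, not part of plain RRF. The function name and document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    k: int = 60,
    weights: Optional[List[float]] = None,
) -> List[str]:
    """Fuse several ranked lists of doc IDs by summing weight / (k + rank)."""
    if weights is None:
        weights = [1.0] * len(rankings)  # plain RRF: every list counts equally
    if len(weights) != len(rankings):
        raise ValueError("weights must have one entry per ranking")

    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector search output
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 output
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Only ranks enter the computation, which is why the cosine similarities and BM25 scores never need to be brought onto a common scale. Where the search platform provides RRF natively, its built-in implementation is the better choice.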
Cases Where It Tends to Be Effective
- Documents such as product manuals and technical specifications where model numbers and commands appear frequently
- Internal regulations and legal documents where article numbers and proprietary terms serve as search keys
- Chatbot applications where users mix colloquial expressions with technical terminology
Multiple cases have reported improvements in search accuracy for queries containing proper nouns. Where the platform exposes a fusion weight, a practical approach is to start with a low BM25 weight so that vector search takes the lead, then adjust the ratio while monitoring accuracy logs. When combining with Agentic RAG, covered in the next section, establishing hybrid search as the foundation of the retrieval layer further enhances the ability to handle complex queries.
Design Patterns for Handling Complex Queries with Agentic RAG
Standard RAG assumes a simple pipeline of "1 query → 1 retrieval → 1 answer." In practice, however, queries that require multi-step reasoning or the integration of multiple sources arise frequently. Agentic RAG is the design pattern built to handle such complex queries.
Agentic RAG refers to an architecture in which the LLM functions as an agent, autonomously repeating a loop of retrieval, reasoning, and re-retrieval. The main design patterns can be organized into the following three types.
- Plan-and-Execute: Upon receiving a query, the agent first generates a list of subtasks, then independently performs retrieval and answering for each task, and finally produces an integrated answer.
- ReAct (Reasoning + Acting): The agent repeats a cycle of thought → action → observation, and if retrieval results are insufficient, it automatically rephrases the query and searches again.
- Self-RAG: The model evaluates its own generated answer against the question "Is this answer supported by the documents?" and, if uncertain, redoes the retrieval.
For example, a query such as "Compare and explain the technical reasons behind the price difference between Product A and Product B" is difficult to handle with a single retrieval. With the Plan-and-Execute pattern, it can be decomposed into three steps: retrieving Product A's specifications, retrieving Product B's specifications, and performing comparative reasoning.
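The three-step decomposition above can be sketched as a small Plan-and-Execute skeleton. This is an illustrative sketch only: `plan`, `retrieve`, and `synthesize` are hypothetical callables standing in for an LLM planner, a retrieval layer, and an LLM answer step, and no specific framework API is implied.

```python
from typing import Callable, List, Tuple

Finding = Tuple[str, List[str]]  # (subtask, chunks retrieved for that subtask)


def plan_and_execute(
    query: str,
    plan: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[Finding]], str],
) -> str:
    """Decompose the query, retrieve per subtask, then produce one integrated answer."""
    subtasks = plan(query)  # e.g. an LLM call that returns a list of subtasks
    findings: List[Finding] = []
    for subtask in subtasks:
        # Each subtask gets its own independent retrieval.
        findings.append((subtask, retrieve(subtask)))
    return synthesize(query, findings)


# Stub components, for illustration only.
def stub_plan(query: str) -> List[str]:
    return ["Product A specifications", "Product B specifications"]


def stub_retrieve(subtask: str) -> List[str]:
    return [f"chunk about {subtask}"]


def stub_synthesize(query: str, findings: List[Finding]) -> str:
    used = sum(len(chunks) for _, chunks in findings)
    return f"Answer to {query!r} based on {used} chunks"


answer = plan_and_execute(
    "Compare Product A and Product B", stub_plan, stub_retrieve, stub_synthesize
)
```

Keeping the planner, retriever, and synthesizer as separate callables makes it possible to log and test each stage in isolation.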
The following points should be kept in mind during implementation:
- Set a maximum iteration count (max_iterations) to prevent infinite loops.
- Log each step so that it is possible to trace which retrieval failed.
- Since tool calls accumulate, monitoring latency and token consumption is essential.
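The points above can be sketched as a bounded retrieve-and-retry loop. This is a minimal sketch, assuming hypothetical callables `retrieve`, `is_sufficient`, and `rewrite_query` that wrap the retrieval layer and LLM calls. Token accounting is omitted because it depends on the LLM client in use.

```python
import logging
import time
from typing import Callable, List, Tuple

logger = logging.getLogger("agentic_rag")  # hypothetical logger name


def run_retrieval_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite_query: Callable[[str, List[str]], str],
    max_iterations: int = 3,
) -> Tuple[List[str], int]:
    """Retrieve, check, and rephrase until the context suffices or the cap is hit.

    Returns the last retrieved chunks and the number of iterations used.
    """
    current_query = query
    docs: List[str] = []
    for step in range(1, max_iterations + 1):
        started = time.perf_counter()
        docs = retrieve(current_query)
        latency_ms = (time.perf_counter() - started) * 1000
        # One log line per step makes it possible to trace which retrieval failed.
        logger.info(
            "step=%d query=%r hits=%d latency_ms=%.1f",
            step, current_query, len(docs), latency_ms,
        )
        if is_sufficient(query, docs):
            return docs, step
        current_query = rewrite_query(current_query, docs)
    # The hard cap prevents an infinite loop when no rephrasing ever succeeds.
    logger.warning("max_iterations=%d reached without sufficient context", max_iterations)
    return docs, max_iterations
```

When the cap is reached, the caller can decide whether to answer with the partial context or return a "not found" response.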
Agentic RAG is powerful, but the added complexity introduces more potential failure points. A tiered configuration, which uses standard RAG for simple queries and activates the agent only for complex ones, tends to be effective from an operational stability standpoint.
Frequently Asked Questions

We have selected two questions frequently raised by developers and engineers working on RAG construction and operation. "Does RAG make sense even for a small number of documents?" and "Which LLM should I choose?" are questions that continue to come up regularly in practice. Because they directly inform design decisions, each is explained below with concrete perspectives.
Is RAG Effective Even with a Small Number of Documents?
To state the conclusion upfront: there are many cases where RAG functions effectively even with a small number of documents. However, the smaller the scale, the more the design considerations change.
In small-scale environments (on the order of tens to hundreds of documents), the vector search index size is small, so retrieval latency tends to be low. On the other hand, because the absolute number of candidate chunks is limited, rough chunk design is more likely to directly impact answer quality—a point that warrants attention.
Examples of use cases where it tends to be effective:
- Document sets such as internal regulations and manuals that are updated infrequently and carry high authority
- Cases centered on structured text, such as product specifications and FAQs
- Internal knowledge bases with many specialized terms limited to a specific domain
Conversely, one risk to be aware of is that fewer documents means more cases where "retrieval returns no hits." When no chunk corresponding to a query exists, the LLM tends to fill in the gap with its own knowledge, making hallucinations more likely. Explicitly incorporating an instruction such as "do not answer if the information is not found in the documents" into the prompt becomes especially important.
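One way to make that instruction explicit is shown below. This is a sketch under assumptions: the prompt wording, the `NO_ANSWER` message, and the `build_prompt` helper are all illustrative, and the right phrasing should be tuned against the model actually in use.

```python
from typing import List, Optional

# Hypothetical fixed refusal message; also usable directly when retrieval returns no hits.
NO_ANSWER = "The provided documents do not contain this information."


def build_prompt(question: str, chunks: List[str]) -> Optional[str]:
    """Build a grounded prompt, or return None when retrieval found nothing.

    Returning None lets the caller reply with NO_ANSWER without calling the LLM,
    so the model never gets a chance to fill the gap from its own knowledge.
    """
    if not chunks:
        return None
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n"
        f'If the context does not contain the answer, reply exactly: "{NO_ANSWER}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Handling the zero-hit case in code, rather than relying on the prompt alone, tends to be more predictable because the refusal does not depend on the model following the instruction.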
Key takeaways:
- RAG is effective even at small scale, but the precision of chunk design and prompt design becomes more critical.
- Consider introducing Query Expansion to improve hit rates.
- With fewer documents, evaluating against the full document set with tools such as RAGAS is relatively easy to carry out.
Which Should You Use: GPT, Claude, or Gemini?
When building RAG, the basic approach to selecting an LLM is not "which is the most powerful" but "which fits my use case." GPT, Claude, and Gemini all deliver high performance, and it is difficult to rank them definitively.
Characteristics and suitable use cases for each model:
- GPT (OpenAI): Has an extensive track record with tool integration and function calling, with many documented integration cases with LangChain and LlamaIndex. Accumulated know-how for building RAG pipelines makes its ecosystem highly mature.
- Claude (Anthropic): Tends to excel in scenarios that handle long documents in bulk, owing to its very large context window. Its design is reported to prioritize faithfulness to instructions and safety.
- Gemini (Google): Has strong affinity with Google Workspace and search infrastructure, with robust multimodal support. Japanese language processing quality is also on an improving trend.
Points to verify when selecting a model:
- Whether the target documents are long or short (for long documents, a model with a larger context length is advantageous)
- Integration costs with existing infrastructure and frameworks
- The proportion of Japanese-language documents (Japanese language quality varies across models)
- Cost structure (pricing changes frequently, so check each vendor's official pricing page for current values)
In practice, it is more effective to evaluate several candidate models before making a selection than to commit to one up front. Use an evaluation framework such as RAGAS with your own documents and query sets to measure actual scores before deciding on a production model. Choosing a model simply because it is well-known can lead to unexpected issues in accuracy, cost, or latency.
Conclusion: How to Use the Checklist to Successfully Deploy RAG in Production

The 10 failure patterns covered so far are scattered across the design, implementation, and operations phases. Rather than trying to resolve everything at once, a more practical approach is to work through them incrementally using a phase-specific checklist.
Quantitative evaluation combining multiple metrics — Retrieval accuracy, Answer Relevancy, and Faithfulness — using evaluation frameworks such as RAGAS and TruLens is becoming the standard. Measuring a baseline before going to production makes it easier to iterate through improvement cycles.
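For the retrieval side of the baseline, two simple metrics can be computed without any framework. The sketch below is not the RAGAS or TruLens API. It is a framework-independent illustration that assumes a hand-labeled set mapping each query ID to the document IDs considered relevant, and all names are hypothetical.

```python
from typing import Dict, List, Set


def hit_rate_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for query_id, docs in retrieved.items()
        if set(docs[:k]) & relevant.get(query_id, set())
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], relevant: Dict[str, Set[str]]) -> float:
    """Average of 1 / rank of the first relevant document (0 if none is retrieved)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for query_id, docs in retrieved.items():
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Hypothetical evaluation set: ranked retrieval output and labeled relevant docs.
retrieved = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"b"}, "q2": {"w"}}
baseline = {
    "hit_rate@3": hit_rate_at_k(retrieved, relevant, k=3),  # 0.5
    "mrr": mean_reciprocal_rank(retrieved, relevant),       # 0.25
}
```

Recording these numbers before release gives a fixed reference point, and grouping the query set by language allows the same functions to report accuracy per language.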
Key points for operating the checklist are as follows:
- Design phase: Verify chunk size, overlap width, and language compatibility of the embedding model
- Implementation phase: Validate whether hybrid search can be introduced, whether re-ranking is in place, and the maximum prompt length
- Operations phase: Periodically review index update frequency, the monitoring framework for evaluation metrics, and the mixed presence of multilingual documents
One area that is particularly easy to overlook is continuous monitoring during the operations phase. Each time documents are updated without a corresponding re-index, the index becomes stale, and answer quality silently degrades. Building quality checks in as a systematic mechanism is what leads to stable long-term operation.
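One possible form of such a check is to compare content hashes of the current documents against the hashes recorded at indexing time. This is a sketch under assumptions: it presumes the indexing job stores a `doc_id -> hash` mapping somewhere, and the function names and data shapes are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(
    current_docs: Dict[str, str],    # doc_id -> current text in the source system
    indexed_hashes: Dict[str, str],  # doc_id -> hash recorded when it was indexed
) -> Dict[str, List[str]]:
    """Report which documents need re-indexing, adding, or removal from the index."""
    changed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if doc_id in indexed_hashes and content_hash(text) != indexed_hashes[doc_id]
    )
    added = sorted(doc_id for doc_id in current_docs if doc_id not in indexed_hashes)
    removed = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"changed": changed, "added": added, "removed": removed}
```

Running a check like this on a schedule, and alerting when any of the three lists is non-empty, turns index freshness into something that is monitored rather than assumed.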
RAG is not a system you build once and leave alone — it is one that must be continuously cultivated in response to data and user queries. Establishing the checklist not as a "design-time ritual" but as an "operational habit" is arguably the most direct path to a successful production deployment.
Author & Supervisor
Chi
Majored in Information Science at the National University of Laos, where he contributed to the development of statistical software, building a practical foundation in data analysis and programming. He began his career in web and application development in 2021, and from 2023 onward gained extensive hands-on experience across both frontend and backend domains. At our company, he is responsible for the design and development of AI-powered web services, and is involved in projects that integrate natural language processing (NLP), machine learning, and generative AI and large language models (LLMs) into business systems. He has a voracious appetite for keeping up with the latest technologies and places great value on moving swiftly from technical validation to production implementation.
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).


