
Hybrid search is a search architecture that combines vector search (Dense Model) and full-text search (Sparse Model such as BM25) to compensate for the weaknesses of each approach.
The accuracy of a RAG (Retrieval-Augmented Generation) system is directly tied to the quality of the retrieval phase. While vector search excels at semantic similarity, it tends to miss documents in situations that require exact keyword matching—such as model numbers, proper nouns, and code snippets.
This article provides a systematic explanation of practical knowledge you can immediately apply in production: the mechanics of hybrid search, score integration via RRF (Reciprocal Rank Fusion), implementation examples in Weaviate, Qdrant, Elasticsearch, and PostgreSQL, and methods for evaluating accuracy. It is intended for engineers looking to improve retrieval accuracy in RAG systems and for teams preparing to deploy such systems in production environments.
Hybrid search is a retrieval method that combines vector search (Dense Model) and full-text search (Sparse Model such as BM25). Vector search excels at semantic similarity, while BM25 excels at keyword matching—yet each has queries it cannot fully cover on its own. By using both together, it becomes possible to capture both semantic similarity and keyword matches simultaneously. This is currently one of the most practical approaches for improving retrieval accuracy in RAG systems.
Vector search (Dense Model) represents semantic proximity through embeddings and can handle synonyms and paraphrases. However, it struggles with queries where lexical matching is essential—such as model numbers, proper nouns, and code. For example, searching for "PS-3200A" may cause the vector space to prioritize semantically similar documents, burying the document that contains the exact model number.
Key limitations of vector search

- Queries requiring exact lexical matches (model numbers, proper nouns, code) can be buried beneath semantically similar neighbors
- Low-frequency terms tend to occupy unstable positions in the embedding space
Full-text search (BM25) calculates scores based on term frequency and inverse document frequency, directly evaluating whether keywords appear in a document—making it strong for searching model numbers and proper nouns. However, if the wording differs despite having the same meaning—such as "purchase" vs. "buy"—it may fail to retrieve relevant documents entirely.
Key limitations of full-text search (BM25)

- Different wording with the same meaning ("purchase" vs. "buy") can cause relevant documents to be missed entirely
- Zero-hit results are more likely when a natural-language query does not reuse the document's vocabulary
By combining the two, it becomes possible to capture both semantic similarity and keyword matches simultaneously. Integrating scores with RRF makes it easier for relevant documents that would not have ranked highly with either method alone to surface, and also reduces the zero-hit rate.
RAG accuracy is heavily influenced by the quality of the retrieval phase. No matter how capable the LLM is, retrieval quality will suffer if relevant documents are missed. Relying solely on semantic search makes it easy to miss queries involving model numbers, proper nouns, and other cases where matching should be based on the string itself rather than its meaning.
The areas where semantic search struggles are clear. Model numbers such as "ABC-1234-X" are difficult to distinguish from similar codes in the embedding space, and proper nouns like personal names and place names tend to have unstable meaning vectors due to low frequency of occurrence. `git rebase --onto` and `git rebase` are semantically close but behave very differently.
For example, if a user queries an internal knowledge base for "specifications for part number XR-990," vector search may return the spec sheets for XR-991 or XR-880 at the top of the results. In manufacturing or medical device fields where a single-character difference in a model number means entirely different specifications, this can be fatal.
BM25 calculates scores based on token frequency and IDF (Inverse Document Frequency). Model numbers and proper nouns appear infrequently across the corpus, resulting in high IDF values and a significant boost to BM25 scores. Introducing hybrid search enables a division of roles: "BM25 catches model numbers, vector search supplements with context." Particularly for systems where accurate string matching is critical to quality—such as product manual search, legal databases, and API documentation—incorporating full-text search should be a top priority.
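The IDF boost for rare terms can be made concrete with the standard BM25 IDF formula. The corpus size and document frequencies below are made-up numbers for illustration only.

```python
import math

def bm25_idf(total_docs: int, doc_freq: int) -> float:
    """BM25 IDF: rare terms (low doc_freq) receive high weights."""
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

# Hypothetical corpus of 100,000 documents.
# A model number like "XR-990" appears in only 3 documents,
# while a common word like "specification" appears in 40,000.
rare_idf = bm25_idf(100_000, 3)         # large weight, strongly boosts matches
common_idf = bm25_idf(100_000, 40_000)  # small weight, contributes little
```

Because the model number's IDF is an order of magnitude larger, a document containing the exact string "XR-990" reliably outranks documents that merely share common vocabulary.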
RAG systems that rely solely on semantic search are prone to hallucinations in specific patterns. The root cause is a structural problem: "even when semantically similar documents are retrieved, they may not contain the exact information needed."
BM25 directly scores lexical matches, making it more accurate than vector search for model numbers, version numbers, and proper nouns. With semantic search alone, there is a risk that the LLM generates plausible misinformation based on documents that are "related but not precise." Hybrid search is an effective means of structurally reducing this risk.
How to make a vector database and a BM25 index coexist, and how to design chunk sizes and embedding models — if these two points are not settled in advance, large-scale refactoring tends to occur at integration time. Each point is explained in detail below.
The first decision point is infrastructure selection. To run both vector search and keyword search on the same system, you can either choose a platform that natively supports both capabilities, or combine dedicated tools.
Key points for coexistence design

- Use a platform that natively supports both, such as Weaviate, Qdrant, Elasticsearch, or PostgreSQL with pgvector
- When combining dedicated tools instead, plan for keeping the two indexes synchronized as documents are updated
Because changing chunk size and embedding model choices after the fact incurs significant re-indexing costs, it is important to establish a policy in advance.
Chunk size design guidelines
For technical manuals and regulatory documents, 256–512 tokens is commonly adopted. However, since the optimal value varies depending on the nature of the documents and the use case, it is practical to prepare 2–3 candidate patterns and compare them using Recall@K.
Embedding model selection points
Since chunk size and embedding model interact with each other, they must be evaluated together as a set.
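To compare 2–3 chunk-size candidates, you first need a chunker. The sketch below uses whitespace tokenization purely for illustration; a production system should count tokens with the embedding model's own tokenizer, and the sizes shown are just the candidates mentioned above.

```python
def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` tokens.

    Whitespace tokenization is used here for illustration; production
    systems should use the embedding model's own tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# A 600-token document with 256-token chunks and 32-token overlap
doc = " ".join(f"tok{i}" for i in range(600))
chunks = chunk_tokens(doc, chunk_size=256, overlap=32)
```

Re-running the same evaluation (e.g. Recall@K on a golden set) over indexes built from each candidate chunk size is what allows the comparison described above.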
Because vector search and BM25 scores use different units, simply adding them together does not yield meaningful results. RRF (Reciprocal Rank Fusion) solves this problem. Rather than using absolute score values, RRF uses rankings to perform the fusion, enabling stable merging of results from different retrieval methods.
RRF calculates each document's score as Σ 1/(k + rank_i), where rank_i is the rank in retrieval method i and k is a smoothing constant. The default value of k varies by product: 60 in Elasticsearch, 2 in Qdrant, and 50 in Supabase's official samples.
For example, with k=60, if a document ranks 1st in vector search and 3rd in BM25, its score is 1/61 + 1/63 ≈ 0.0323. If another document ranks 5th in vector search and 2nd in BM25, its score is 1/65 + 1/62 ≈ 0.0315. The first document appears higher in the results.
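The worked example above can be reproduced with a few lines of Python. The document IDs and ranked lists are invented for illustration.

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists with RRF: score = sum of 1/(k + rank), ranks 1-based."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["A", "C", "D", "E", "B"]  # A is 1st, B is 5th
bm25_results   = ["F", "B", "A", "G", "H"]  # B is 2nd, A is 3rd
fused = rrf_fuse([vector_results, bm25_results])
# A: 1/61 + 1/63 ≈ 0.0323; B: 1/65 + 1/62 ≈ 0.0315 → A ranks above B
```

Note that a document appearing in only one list still receives a score, which is how RRF surfaces method-specific hits while rewarding agreement between methods.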
Guidelines for Tuning k

- A smaller k amplifies the contribution of top-ranked documents; a larger k flattens the differences across ranks
- Start from your platform's default (60 in Elasticsearch, 2 in Qdrant) and adjust only after offline evaluation on a golden set
Extension to Weighted RRF
It is also possible to extend RRF by multiplying a weight per retrieval method: α × 1/(k + rank_vector) + β × 1/(k + rank_BM25). This allows you to increase β for technical documents with many part numbers, and increase α for conceptual FAQs. Normalizing α and β so that they sum to 1 stabilizes threshold settings. It is recommended to determine specific values through offline evaluation using a golden set.
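A minimal sketch of the weighted variant, with α and β normalized to sum to 1 as described above. The example ranks are invented for illustration.

```python
def weighted_rrf(rank_vector: int, rank_bm25: int,
                 alpha: float = 0.5, k: int = 60) -> float:
    """Weighted RRF: alpha weights the vector side, (1 - alpha) the BM25 side."""
    beta = 1.0 - alpha  # normalizing the weights to sum to 1 stabilizes thresholds
    return alpha / (k + rank_vector) + beta / (k + rank_bm25)

# Same document: 4th in vector search, 1st in BM25.
# Weighting BM25 more heavily (low alpha) rewards the strong keyword rank.
keyword_heavy = weighted_rrf(rank_vector=4, rank_bm25=1, alpha=0.3)
semantic_heavy = weighted_rrf(rank_vector=4, rank_bm25=1, alpha=0.7)
```

With a better BM25 rank than vector rank, the keyword-weighted score comes out higher, which is the intended behavior for part-number-heavy corpora.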
Major platforms each support hybrid search through different approaches.
Weaviate — The hybrid query executes keyword search (BM25F) and vector search from a single API. The balance is adjusted via alpha (0–1, where 0 = pure keyword and 1 = pure vector). Two fusion methods are available: Relative Score Fusion (the default as of v1.24 and later) and Ranked Fusion (rank-based fusion using 1/(rank+60)). For Japanese, CJK-oriented tokenization such as gse or kagome_ja must be configured.
Qdrant — Hybrid search combining dense vectors and sparse vectors is implemented using prefetch + fusion. The sparse side supports everything from classic BM25-based retrieval to learned sparse retrieval such as SPLADE, with RRF and Distribution-Based Score Fusion available for selection server-side. The default value of the RRF constant k is 2 (unlike Elasticsearch's 60).
Elasticsearch — Supports native hybrid search. The retriever API provides both RRF and linear combination as standard features (added in 8.14, GA in 8.16). A key strength is the ability to leverage BM25 tuning knowledge accumulated on existing infrastructure. The kuromoji analyzer is available for Japanese, and the thai analyzer or ICU tokenizer for Thai.
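As an illustration of the retriever API's shape, the request body below combines a BM25 `standard` retriever with a `knn` retriever under `rrf`. The index name, field names, and query vector are placeholders; check the documentation for your Elasticsearch version before relying on this structure.

```python
# Request body for POST /manuals/_search (index and field names are placeholders)
query_vector = [0.1, 0.2, 0.3]  # normally produced by an embedding model

body = {
    "retriever": {
        "rrf": {
            "retrievers": [
                # Keyword side: classic BM25 match query
                {"standard": {"query": {"match": {"content": "XR-990 specifications"}}}},
                # Vector side: approximate kNN over the embedding field
                {"knn": {
                    "field": "content_vector",
                    "query_vector": query_vector,
                    "k": 50,
                    "num_candidates": 100,
                }},
            ],
            "rank_constant": 60,      # the k in 1/(k + rank)
            "rank_window_size": 100,  # how many ranks each retriever contributes
        }
    }
}
```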
PostgreSQL / Supabase — Can be implemented with pgvector + tsvector/tsquery. Supabase officially documents hybrid search and provides a sample that combines a GIN index (full-text search) with HNSW (vector search). RRF is implemented as a SQL function.
Since specifications change frequently across all of these platforms, be sure to consult the latest documentation before implementing.
RRF is a "fusion of ranks" and does not directly measure the semantic relevance between a query and a document. For this reason, a configuration that adds a reranking model as a downstream stage is widely adopted.
Typical Pipeline Configuration

1. Hybrid search (vector + BM25) retrieves a broad candidate set
2. RRF fuses the two ranked lists into a single candidate pool
3. A Cross-Encoder reranker re-scores the top candidates before they are passed to the LLM
A Cross-Encoder encodes the query and document simultaneously to output a relevance score. While computationally more expensive than Bi-Encoder-based vector search, it offers superior accuracy, and using it in a downstream stage after narrowing down candidates keeps latency within an acceptable range.
Representative options include the Cohere Rerank API (multilingual, cloud-based), bge-reranker-v2-m3 (multilingual, OSS), and cross-encoder/ms-marco-MiniLM (English, lightweight). For multilingual RAG, it is recommended to pre-evaluate accuracy on the target language using a golden set. When adding reranking, be mindful of increased latency and ensure throughput with asynchronous processing and caching.
Rushing into implementation makes it easy to fall into unexpected pitfalls. Errors in how scores are integrated or how language processing is configured can result in lower accuracy than using a single retrieval method alone. Understanding these issues at the design stage can significantly reduce rework.
The most common implementation mistake is integrating scores without aligning their scales.
The scores from vector search vary in meaning depending on the engine and distance function—cosine similarity, cosine distance, internally converted scores, and so on (mathematical cosine similarity ranges from -1 to 1). BM25 scores also fluctuate widely in range depending on the size of the document collection and text length, sometimes reaching values in the tens to hundreds. Simply adding these scores together as-is causes the BM25 side to dominate the results, rendering the benefits of semantic search nearly zero.
The main approaches for aligning score scales before combining them:
| Method | Overview | Caveats |
|---|---|---|
| Min-Max Normalization | Transforms each score to a 0–1 range | Susceptible to outliers |
| Z-Score Normalization | Transforms to mean 0, standard deviation 1 | Effective when distribution is approximately normal |
| RRF | Rank-based; no normalization required | Information about score magnitude is lost |
Adopting RRF eliminates the need for normalization, but when you want to leverage the magnitude of scores, a linear combination with explicit normalization is the practical choice. Making it a habit to log and visualize both score distributions during development allows you to detect discrepancies early.
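A minimal sketch of the normalize-then-combine pattern. The raw scores below are invented but typical of the scale mismatch described above: cosine similarities cluster near 1 while BM25 scores range into the tens.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Raw scores on different scales for the same three documents
cosine_scores = [0.82, 0.78, 0.75]
bm25_scores = [24.1, 8.3, 2.0]

# Naive addition would let the BM25 side dominate entirely;
# normalizing first makes the linear combination meaningful.
alpha = 0.5
combined = [
    alpha * v + (1 - alpha) * b
    for v, b in zip(min_max(cosine_scores), min_max(bm25_scores))
]
```

Logging both raw distributions during development, as suggested above, is what tells you whether min-max is adequate or whether outliers call for z-score normalization instead.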
There are cases where search accuracy drops significantly due to the analyzer/tokenizer on the keyword search side. BM25 implementations designed for English split tokens by whitespace, so they cannot correctly segment languages like Japanese or Thai, where there are no spaces between words.
Common Problems

- English-oriented, whitespace-based tokenizers index Japanese or Thai sentences as single long tokens
- Keyword queries then fail to match, and hybrid results silently degrade to vector-only quality
Platform-Specific Solutions

- Elasticsearch: the kuromoji analyzer for Japanese; the thai analyzer or ICU tokenizer for Thai
- Weaviate: CJK-oriented tokenization such as gse or kagome_ja
- PostgreSQL: the default text search parser is not CJK-aware, so an extension such as PGroonga is typically required for Japanese
When documents in multiple languages are mixed together, insert a language detection step and apply a different analyzer per language. The vector search side can absorb language differences using multilingual embedding models, but on the keyword search side, the analyzer/tokenizer settings must be verified and configured for each product to ensure accuracy.
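The language detection step can be sketched as a simple router. The codepoint-range detection below is deliberately naive and for illustration only; production systems should use a proper language detection library, and the analyzer names are examples tied to Elasticsearch conventions.

```python
import re

# Codepoint ranges: hiragana/katakana + common CJK ideographs, and Thai
CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
THAI_PATTERN = re.compile(r"[\u0e00-\u0e7f]")

def pick_analyzer(text: str) -> str:
    """Route a document to a language-appropriate analyzer (simplified detection)."""
    if THAI_PATTERN.search(text):
        return "thai"      # Thai analyzer
    if CJK_PATTERN.search(text):
        return "kuromoji"  # Japanese morphological analyzer
    return "standard"      # whitespace/punctuation-based default
```

At indexing time, each document is then written to the index (or field) configured with the analyzer this function returns.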
Even after introducing hybrid search, the improvement cycle cannot function without quantitatively verifying whether accuracy has truly improved. A subjective sense of "it feels somewhat better" is not sufficient as a basis for production decisions. Regression testing using Recall@K, MRR, NDCG, and a golden set forms the foundation.
Recall@K — Measures how many correct documents are included in the top K results. In RAG, it is used to check for "missed retrievals," and K=5 or K=10 are common in practice.
MRR (Mean Reciprocal Rank) — The average of the reciprocal ranks at which the correct answer first appears. A rank of 1 yields 1.0; a rank of 3 yields 0.33. It is effective when you want to evaluate the quality of the first result seen, and pairs well with chatbot-style RAG.
NDCG — A metric that allows graded relevance scores (exact match / partial match / irrelevant) to be assigned. The higher the relevance of documents appearing near the top, the higher the score. Since labeling costs are high, the recommended approach is to first evaluate quickly with Recall@K and MRR, then use NDCG for deeper analysis when scores are closely matched.
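Recall@K and MRR are simple enough to implement directly. The toy queries below are invented; `relevant` sets would come from your golden set labels.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """Mean Reciprocal Rank over a set of queries (a query with no hit scores 0)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two toy queries: correct answer at rank 1 and at rank 3 → MRR = (1 + 1/3) / 2
score = mrr([["d1", "d2"], ["x", "y", "d9"]], [{"d1"}, {"d9"}])
```

Running these two functions over the same test set under each condition (vector only, BM25 only, hybrid) produces the numeric comparison described below.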
In practice, three conditions — vector search only, BM25 only, and hybrid — are compared on the same test set, and the degree of improvement is confirmed numerically. The key is to interpret each metric according to its role: Recall@K for detecting "missed retrievals" and MRR for verifying "top-result accuracy."
A golden set is a test dataset in which pairs of "queries" and "expected correct documents" are manually defined. A starting point of at least 50–100 entries is recommended.
Key Points for Creation

- Cover the query patterns the system must handle: model numbers, proper nouns, paraphrases, and conceptual questions
- Define the expected correct documents for each query, not just the queries themselves
Integration into CI/CD
The golden set should be version-controlled just like code and incorporated into the test step of the CI/CD pipeline. Each time chunk sizes are changed or models are updated, Recall@K and MRR are automatically calculated and compared against the previous version, and merges are blocked if the metrics drop below a defined threshold.
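The merge-blocking check can be a small pure function in the CI test step. The metric names, baseline values, and threshold below are illustrative assumptions, not recommendations.

```python
def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    max_drop: float = 0.02) -> bool:
    """Pass (True) only if no metric dropped more than `max_drop` vs the baseline."""
    return all(
        current[name] >= baseline[name] - max_drop
        for name in baseline
    )

baseline = {"recall@5": 0.81, "mrr": 0.64}
# A small recall dip within tolerance passes; a large dip blocks the merge
passing = regression_gate({"recall@5": 0.80, "mrr": 0.65}, baseline)
failing = regression_gate({"recall@5": 0.74, "mrr": 0.66}, baseline)
```

In CI, the baseline dict would be loaded from the previous version's stored results, and a `False` return would fail the pipeline.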
Since correct labels can become stale as documents are updated, it is also recommended to build a feedback loop that automatically collects search failure cases from production logs and adds them to the golden set.
Hybrid search is a foundational technology for the retrieval layer, but there are complex queries that flat document retrieval alone cannot fully handle.
GraphRAG is an approach that combines hybrid search with a knowledge graph. Named entities are extracted from chunks retrieved by hybrid search and linked to nodes in the graph, enabling cross-retrieval of information multiple hops away — for example, "Product X → Related Standard → Applicable Region." A practical design places Neo4j or Amazon Neptune at the graph layer and Qdrant or Weaviate at the vector layer, calling them in parallel.
Agentic RAG is an approach that incorporates hybrid search into a multi-step reasoning agent. The agent decomposes a question, dynamically switching the alpha value so that sub-queries containing named entities are handled primarily by keyword search, while conceptual sub-queries are handled primarily by vector search. Defining the hybrid search node as an independent state in LangGraph or LlamaIndex Workflows makes retry and branching logic straightforward.
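The alpha-switching decision can be sketched as a routing heuristic. The regex and alpha values below are illustrative assumptions; a real agent would typically use an LLM or NER model to classify sub-queries.

```python
import re

# Heuristic: part-number-like tokens (e.g. "XR-990", "PS-3200A") suggest
# keyword-dominant retrieval. This regex is illustrative, not exhaustive.
PART_NUMBER = re.compile(r"\b[A-Z]{2,}-?\d{2,}[A-Z]?\b")

def choose_alpha(sub_query: str) -> float:
    """Pick the vector/keyword balance for a sub-query (0 = keyword, 1 = vector)."""
    if PART_NUMBER.search(sub_query):
        return 0.2  # lean on BM25 for exact identifiers
    return 0.8      # lean on vector search for conceptual questions

choose_alpha("specifications for part number XR-990")  # → 0.2
choose_alpha("how does warranty coverage work")        # → 0.8
```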
However, the greater the complexity, the higher the operational cost. In most cases, the approach adopted is to first confirm the accuracy ceiling with simple hybrid search, then expand incrementally.
Does Hybrid Search Increase Costs?
Depending on the design, incremental costs can generally be kept limited. The main cost drivers are dual index management (increased storage) and parallel execution of two search pipelines (compute resources), but BM25 is lightweight and computationally cheaper than vector search, and it can often piggyback on existing infrastructure. On the other hand, improved search accuracy can reduce wasted context passed to the LLM, potentially cutting unnecessary token consumption. Caching frequently occurring queries and optimizing chunk sizes are also effective measures. Cost increases tend to become significant in cases involving millions of documents or more with frequent real-time updates.
Does Implementation Differ Between Cloud and On-Premises?
Yes, it does. In the cloud, services such as Azure AI Search and Amazon OpenSearch Service provide hybrid search including RRF at the API level, reducing the burden of infrastructure management. Scaling out is also handled by the service provider. For on-premises deployments, self-hosting Qdrant or Elasticsearch is the common approach; both offer server-side score fusion capabilities, so there is no need to implement everything at the application layer. When regulatory requirements prohibit sending data to external parties—such as in finance, healthcare, or government—on-premises deployment becomes mandatory.
Hybrid search is a practical approach that combines vector search and BM25 to cover queries that either method alone would miss. It handles both scenarios where keyword matching is required—such as part numbers, proper nouns, and code snippets—and scenarios where documents need to be retrieved based on semantic similarity.
Here is a summary of the key points covered in this article:

- RRF fuses results by rank, avoiding the score-scale mismatch between vector search and BM25
- Weaviate, Qdrant, Elasticsearch, and PostgreSQL/Supabase all support hybrid search, each with different defaults and fusion options
- Analyzer/tokenizer configuration is critical for languages without word boundaries, such as Japanese and Thai
- Accuracy should be verified quantitatively with Recall@K, MRR, NDCG, and a golden set integrated into CI/CD
Looking ahead, extensions toward GraphRAG and Agentic RAG are worth considering, but the practical approach is to first confirm accuracy improvements with a simple hybrid search before advancing incrementally. Reducing hallucinations and improving answer quality are difficult to achieve without improving the retrieval layer. Treat the introduction of hybrid search as the starting point for an overall RAG quality improvement cycle.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).