
Hybrid search is a search architecture that combines vector search (Dense Model) and full-text search (Sparse Model such as BM25) to compensate for the weaknesses of each approach.
The accuracy of a RAG (Retrieval-Augmented Generation) system is directly tied to the quality of the retrieval phase. While vector search excels at semantic similarity, it tends to miss documents in situations that require exact keyword matching—such as model numbers, proper nouns, and code snippets.
This article provides a systematic explanation of practical knowledge you can immediately apply in production: the mechanics of hybrid search, score integration via RRF (Reciprocal Rank Fusion), implementation examples in Weaviate, Qdrant, Elasticsearch, and PostgreSQL, and methods for evaluating accuracy. It is intended for engineers looking to improve retrieval accuracy in RAG systems and for teams preparing to deploy such systems in production environments.
Hybrid search is a retrieval method that combines vector search (Dense Model) and full-text search (Sparse Model such as BM25). Vector search excels at semantic similarity, while BM25 excels at keyword matching—yet each has queries it cannot fully cover on its own. By using both together, it becomes possible to capture both semantic similarity and keyword matches simultaneously. This is currently one of the most practical approaches for improving retrieval accuracy in RAG systems.
Vector search (Dense Model) represents semantic proximity through embeddings and can handle synonyms and paraphrases. However, it struggles with queries where lexical matching is essential—such as model numbers, proper nouns, and code. For example, searching for "PS-3200A" may cause the vector space to prioritize semantically similar documents, burying the document that contains the exact model number.
Key limitations of vector search

- Queries requiring exact lexical matches (model numbers, proper nouns, code) can be buried beneath semantically similar neighbors
- Low-frequency terms tend to occupy unstable positions in the embedding space
Full-text search (BM25) calculates scores based on term frequency and inverse document frequency, directly evaluating whether keywords appear in a document—making it strong for searching model numbers and proper nouns. However, if the wording differs despite having the same meaning—such as "purchase" vs. "buy"—it may fail to retrieve relevant documents entirely.
Key limitations of full-text search (BM25)

- Different wording with the same meaning ("purchase" vs. "buy") can cause relevant documents to be missed entirely
- Zero-hit results are more likely when a natural-language query does not reuse the document's vocabulary
By combining the two, it becomes possible to capture both semantic similarity and keyword matches simultaneously. Integrating scores with RRF makes it easier for relevant documents that would not have ranked highly with either method alone to surface, and also reduces the zero-hit rate.
RAG accuracy is heavily influenced by the quality of the retrieval phase. No matter how capable the LLM is, retrieval quality will suffer if relevant documents are missed. Relying solely on semantic search makes it easy to miss queries involving model numbers, proper nouns, and other cases where matching should be based on the string itself rather than its meaning.
The areas where semantic search struggles are clear. Model numbers such as "ABC-1234-X" are difficult to distinguish from similar codes in the embedding space, and proper nouns like personal names and place names tend to have unstable meaning vectors due to low frequency of occurrence. `git rebase --onto` and `git rebase` are semantically close but behave very differently.
For example, if a user queries an internal knowledge base for "specifications for part number XR-990," vector search may return the spec sheets for XR-991 or XR-880 at the top of the results. In manufacturing or medical device fields where a single-character difference in a model number means entirely different specifications, this can be fatal.
BM25 calculates scores based on token frequency and IDF (Inverse Document Frequency). Model numbers and proper nouns appear infrequently across the corpus, resulting in high IDF values and a significant boost to BM25 scores. Introducing hybrid search enables a division of roles: "BM25 catches model numbers, vector search supplements with context." Particularly for systems where accurate string matching is critical to quality—such as product manual search, legal databases, and API documentation—incorporating full-text search should be a top priority.
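The IDF boost for rare terms can be made concrete with the standard BM25 IDF formula. The corpus size and document frequencies below are made-up numbers for illustration only.

```python
import math

def bm25_idf(total_docs: int, doc_freq: int) -> float:
    """BM25 IDF: rare terms (low doc_freq) receive high weights."""
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

# Hypothetical corpus of 100,000 documents.
# A model number like "XR-990" appears in only 3 documents,
# while a common word like "specification" appears in 40,000.
rare_idf = bm25_idf(100_000, 3)         # large weight, strongly boosts matches
common_idf = bm25_idf(100_000, 40_000)  # small weight, contributes little
```

Because the model number's IDF is an order of magnitude larger, a document containing the exact string "XR-990" reliably outranks documents that merely share common vocabulary.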
RAG systems that rely solely on semantic search are prone to hallucinations in specific patterns. The root cause is a structural problem: "even when semantically similar documents are retrieved, they may not contain the exact information needed."
BM25 directly scores lexical matches, making it more accurate than vector search for model numbers, version numbers, and proper nouns. With semantic search alone, there is a risk that the LLM generates plausible misinformation based on documents that are "related but not precise." Hybrid search is an effective means of structurally reducing this risk.
How to make a vector database and a BM25 index coexist, and how to design chunk sizes and embedding models — if these two points are not settled in advance, large-scale refactoring tends to occur at integration time. Each point is explained in detail below.
The first decision point is infrastructure selection. To run both vector search and keyword search on the same system, you can either choose a platform that natively supports both capabilities, or combine dedicated tools.
Key points for coexistence design

- Use a platform that natively supports both, such as Weaviate, Qdrant, Elasticsearch, or PostgreSQL with pgvector
- When combining dedicated tools instead, plan for keeping the two indexes synchronized as documents are updated
Because changing chunk size and embedding model choices after the fact incurs significant re-indexing costs, it is important to establish a policy in advance.
Chunk size design guidelines
For technical manuals and regulatory documents, 256–512 tokens is commonly adopted. However, since the optimal value varies depending on the nature of the documents and the use case, it is practical to prepare 2–3 candidate patterns and compare them using Recall@K.
Embedding model selection points
Since chunk size and embedding model interact with each other, they must be evaluated together as a set.
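To compare 2–3 chunk-size candidates, you first need a chunker. The sketch below uses whitespace tokenization purely for illustration; a production system should count tokens with the embedding model's own tokenizer, and the sizes shown are just the candidates mentioned above.

```python
def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` tokens.

    Whitespace tokenization is used here for illustration; production
    systems should use the embedding model's own tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# A 600-token document with 256-token chunks and 32-token overlap
doc = " ".join(f"tok{i}" for i in range(600))
chunks = chunk_tokens(doc, chunk_size=256, overlap=32)
```

Re-running the same evaluation (e.g. Recall@K on a golden set) over indexes built from each candidate chunk size is what allows the comparison described above.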
Because vector search and BM25 scores use different units, simply adding them together does not yield meaningful results. RRF (Reciprocal Rank Fusion) solves this problem. Rather than using absolute score values, RRF uses rankings to perform the fusion, enabling stable merging of results from different retrieval methods.
RRF calculates each document's score as Σ 1/(k + rank_i), where rank_i is the rank in retrieval method i and k is a smoothing constant. The default value of k varies by product: 60 in Elasticsearch, 2 in Qdrant, and 50 in Supabase's official samples.
For example, with k=60, if a document ranks 1st in vector search and 3rd in BM25, its score is 1/61 + 1/63 ≈ 0.0323. If another document ranks 5th in vector search and 2nd in BM25, its score is 1/65 + 1/62 ≈ 0.0315. The first document appears higher in the results.
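The worked example above can be reproduced with a few lines of Python. The document IDs and ranked lists are invented for illustration.

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists with RRF: score = sum of 1/(k + rank), ranks 1-based."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["A", "C", "D", "E", "B"]  # A is 1st, B is 5th
bm25_results   = ["F", "B", "A", "G", "H"]  # B is 2nd, A is 3rd
fused = rrf_fuse([vector_results, bm25_results])
# A: 1/61 + 1/63 ≈ 0.0323; B: 1/65 + 1/62 ≈ 0.0315 → A ranks above B
```

Note that a document appearing in only one list still receives a score, which is how RRF surfaces method-specific hits while rewarding agreement between methods.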
Guidelines for Tuning k

- A smaller k amplifies the contribution of top-ranked documents; a larger k flattens the differences across ranks
- Start from your platform's default (60 in Elasticsearch, 2 in Qdrant) and adjust only after offline evaluation on a golden set
Extension to Weighted RRF
It is also possible to extend RRF by multiplying a weight per retrieval method: α × 1/(k + rank_vector) + β × 1/(k + rank_BM25). This allows you to increase β for technical documents with many part numbers, and increase α for conceptual FAQs. Normalizing α and β so that they sum to 1 stabilizes threshold settings. It is recommended to determine specific values through offline evaluation using a golden set.
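A minimal sketch of the weighted variant, with α and β normalized to sum to 1 as described above. The example ranks are invented for illustration.

```python
def weighted_rrf(rank_vector: int, rank_bm25: int,
                 alpha: float = 0.5, k: int = 60) -> float:
    """Weighted RRF: alpha weights the vector side, (1 - alpha) the BM25 side."""
    beta = 1.0 - alpha  # normalizing the weights to sum to 1 stabilizes thresholds
    return alpha / (k + rank_vector) + beta / (k + rank_bm25)

# Same document: 4th in vector search, 1st in BM25.
# Weighting BM25 more heavily (low alpha) rewards the strong keyword rank.
keyword_heavy = weighted_rrf(rank_vector=4, rank_bm25=1, alpha=0.3)
semantic_heavy = weighted_rrf(rank_vector=4, rank_bm25=1, alpha=0.7)
```

With a better BM25 rank than vector rank, the keyword-weighted score comes out higher, which is the intended behavior for part-number-heavy corpora.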
Major platforms each support hybrid search through different approaches.
Weaviate — The hybrid query executes keyword search (BM25F) and vector search from a single API. The balance is adjusted via alpha (0–1, where 0 = pure keyword and 1 = pure vector). Two fusion methods are available: Relative Score Fusion (the default as of v1.24 and later) and Ranked Fusion (rank-based fusion using 1/(rank+60)). For Japanese, CJK-oriented tokenization such as gse or kagome_ja must be configured.
Qdrant — Hybrid search combining dense vectors and sparse vectors is implemented using prefetch + fusion. The sparse side supports everything from classic BM25-based retrieval to learned sparse retrieval such as SPLADE, with RRF and Distribution-Based Score Fusion available for selection server-side. The default value of the RRF constant k is 2 (unlike Elasticsearch's 60).
Elasticsearch — Supports native hybrid search. The retriever API provides both RRF and linear combination as standard features (added in 8.14, GA in 8.16). A key strength is the ability to leverage BM25 tuning knowledge accumulated on existing infrastructure. The kuromoji analyzer is available for Japanese, and the thai analyzer or ICU tokenizer for Thai.
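As an illustration of the retriever API's shape, the request body below combines a BM25 `standard` retriever with a `knn` retriever under `rrf`. The index name, field names, and query vector are placeholders; check the documentation for your Elasticsearch version before relying on this structure.

```python
# Request body for POST /manuals/_search (index and field names are placeholders)
query_vector = [0.1, 0.2, 0.3]  # normally produced by an embedding model

body = {
    "retriever": {
        "rrf": {
            "retrievers": [
                # Keyword side: classic BM25 match query
                {"standard": {"query": {"match": {"content": "XR-990 specifications"}}}},
                # Vector side: approximate kNN over the embedding field
                {"knn": {
                    "field": "content_vector",
                    "query_vector": query_vector,
                    "k": 50,
                    "num_candidates": 100,
                }},
            ],
            "rank_constant": 60,      # the k in 1/(k + rank)
            "rank_window_size": 100,  # how many ranks each retriever contributes
        }
    }
}
```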
PostgreSQL / Supabase — Can be implemented with pgvector + tsvector/tsquery. Supabase officially documents hybrid search and provides a sample that combines a GIN index (full-text search) with HNSW (vector search). RRF is implemented as a SQL function.
Since specifications change frequently across all of these platforms, be sure to consult the latest documentation before implementing.
RRF is a "fusion of ranks" and does not directly measure the semantic relevance between a query and a document. For this reason, a configuration that adds a reranking model as a downstream stage is widely adopted.
Typical Pipeline Configuration

1. Hybrid search (vector + BM25) retrieves a broad candidate set
2. RRF fuses the two ranked lists into a single candidate pool
3. A Cross-Encoder reranker re-scores the top candidates before they are passed to the LLM
A Cross-Encoder encodes the query and document simultaneously to output a relevance score. While computationally more expensive than Bi-Encoder-based vector search, it offers superior accuracy, and using it in a downstream stage after narrowing down candidates keeps latency within an acceptable range.
Representative options include the Cohere Rerank API (multilingual, cloud-based), bge-reranker-v2-m3 (multilingual, OSS), and cross-encoder/ms-marco-MiniLM (English, lightweight). For multilingual RAG, it is recommended to pre-evaluate accuracy on the target language using a golden set. When adding reranking, be mindful of increased latency and ensure throughput with asynchronous processing and caching.
Rushing into implementation makes it easy to fall into unexpected pitfalls. Errors in how scores are integrated or how language processing is configured can result in lower accuracy than using a single retrieval method alone. Understanding these issues at the design stage can significantly reduce rework.
The most common implementation mistake is integrating scores without aligning their scales.
The scores from vector search vary in meaning depending on the engine and distance function—cosine similarity, cosine distance, internally converted scores, and so on (mathematical cosine similarity ranges from -1 to 1). BM25 scores also fluctuate widely in range depending on the size of the document collection and text length, sometimes reaching values in the tens to hundreds. Simply adding these scores together as-is causes the BM25 side to dominate the results, rendering the benefits of semantic search nearly zero.
The main approaches for aligning score scales before combining them:
| Method | Overview | Caveats |
|---|---|---|
| Min-Max Normalization | Transforms each score to a 0–1 range | Susceptible to outliers |
| Z-Score Normalization | Transforms to mean 0, standard deviation 1 | Effective when distribution is approximately normal |
| RRF | Rank-based; no normalization required | Information about score magnitude is lost |
Adopting RRF eliminates the need for normalization, but when you want to leverage the magnitude of scores, a linear combination with explicit normalization is the practical choice. Making it a habit to log and visualize both score distributions during development allows you to detect discrepancies early.
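A minimal sketch of the normalize-then-combine pattern. The raw scores below are invented but typical of the scale mismatch described above: cosine similarities cluster near 1 while BM25 scores range into the tens.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Raw scores on different scales for the same three documents
cosine_scores = [0.82, 0.78, 0.75]
bm25_scores = [24.1, 8.3, 2.0]

# Naive addition would let the BM25 side dominate entirely;
# normalizing first makes the linear combination meaningful.
alpha = 0.5
combined = [
    alpha * v + (1 - alpha) * b
    for v, b in zip(min_max(cosine_scores), min_max(bm25_scores))
]
```

Logging both raw distributions during development, as suggested above, is what tells you whether min-max is adequate or whether outliers call for z-score normalization instead.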
There are cases where search accuracy drops significantly due to the analyzer/tokenizer on the keyword search side. BM25 implementations designed for English split tokens by whitespace, so they cannot correctly segment languages like Japanese or Thai, where there are no spaces between words.
Common Problems

- English-oriented, whitespace-based tokenizers index Japanese or Thai sentences as single long tokens
- Keyword queries then fail to match, and hybrid results silently degrade to vector-only quality
Platform-Specific Solutions

- Elasticsearch: the kuromoji analyzer for Japanese; the thai analyzer or ICU tokenizer for Thai
- Weaviate: CJK-oriented tokenization such as gse or kagome_ja
- PostgreSQL: the default text search parser is not CJK-aware, so an extension such as PGroonga is typically required for Japanese
When documents in multiple languages are mixed together, insert a language detection step and apply a different analyzer per language. The vector search side can absorb language differences using multilingual embedding models, but on the keyword search side, the analyzer/tokenizer settings must be verified and configured for each product to ensure accuracy.
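The language detection step can be sketched as a simple router. The codepoint-range detection below is deliberately naive and for illustration only; production systems should use a proper language detection library, and the analyzer names are examples tied to Elasticsearch conventions.

```python
import re

# Codepoint ranges: hiragana/katakana + common CJK ideographs, and Thai
CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
THAI_PATTERN = re.compile(r"[\u0e00-\u0e7f]")

def pick_analyzer(text: str) -> str:
    """Route a document to a language-appropriate analyzer (simplified detection)."""
    if THAI_PATTERN.search(text):
        return "thai"      # Thai analyzer
    if CJK_PATTERN.search(text):
        return "kuromoji"  # Japanese morphological analyzer
    return "standard"      # whitespace/punctuation-based default
```

At indexing time, each document is then written to the index (or field) configured with the analyzer this function returns.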
Even after introducing hybrid search, the improvement cycle cannot function without quantitatively verifying whether accuracy has truly improved. A subjective sense of "it feels somewhat better" is not sufficient as a basis for production decisions. Regression testing using Recall@K, MRR, NDCG, and a golden set forms the foundation.
Recall@K — Measures how many correct documents are included in the top K results. In RAG, it is used to check for "missed retrievals," and K=5 or K=10 are common in practice.
MRR (Mean Reciprocal Rank) — The average of the reciprocal ranks at which the correct answer first appears. A rank of 1 yields 1.0; a rank of 3 yields 0.33. It is effective when you want to evaluate the quality of the first result seen, and pairs well with chatbot-style RAG.
NDCG — A metric that allows graded relevance scores (exact match / partial match / irrelevant) to be assigned. The higher the relevance of documents appearing near the top, the higher the score. Since labeling costs are high, the recommended approach is to first evaluate quickly with Recall@K and MRR, then use NDCG for deeper analysis when scores are closely matched.
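Recall@K and MRR are simple enough to implement directly. The toy queries below are invented; `relevant` sets would come from your golden set labels.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(ranked_lists: list[list[str]], relevant_sets: list[set[str]]) -> float:
    """Mean Reciprocal Rank over a set of queries (a query with no hit scores 0)."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two toy queries: correct answer at rank 1 and at rank 3 → MRR = (1 + 1/3) / 2
score = mrr([["d1", "d2"], ["x", "y", "d9"]], [{"d1"}, {"d9"}])
```

Running these two functions over the same test set under each condition (vector only, BM25 only, hybrid) produces the numeric comparison described below.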
In practice, three conditions — vector search only, BM25 only, and hybrid — are compared on the same test set, and the degree of improvement is confirmed numerically. The key is to interpret each metric according to its role: Recall@K for detecting "missed retrievals" and MRR for verifying "top-result accuracy."
A golden set is a test dataset in which pairs of "queries" and "expected correct documents" are manually defined. A starting point of at least 50–100 entries is recommended.
Key Points for Creation

- Cover the query patterns the system must handle: model numbers, proper nouns, paraphrases, and conceptual questions
- Define the expected correct documents for each query, not just the queries themselves
Integration into CI/CD
The golden set should be version-controlled just like code and incorporated into the test step of the CI/CD pipeline. Each time chunk sizes are changed or models are updated, Recall@K and MRR are automatically calculated and compared against the previous version, and merges are blocked if the metrics drop below a defined threshold.
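The merge-blocking check can be a small pure function in the CI test step. The metric names, baseline values, and threshold below are illustrative assumptions, not recommendations.

```python
def regression_gate(current: dict[str, float], baseline: dict[str, float],
                    max_drop: float = 0.02) -> bool:
    """Pass (True) only if no metric dropped more than `max_drop` vs the baseline."""
    return all(
        current[name] >= baseline[name] - max_drop
        for name in baseline
    )

baseline = {"recall@5": 0.81, "mrr": 0.64}
# A small recall dip within tolerance passes; a large dip blocks the merge
passing = regression_gate({"recall@5": 0.80, "mrr": 0.65}, baseline)
failing = regression_gate({"recall@5": 0.74, "mrr": 0.66}, baseline)
```

In CI, the baseline dict would be loaded from the previous version's stored results, and a `False` return would fail the pipeline.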
Since correct labels can become stale as documents are updated, it is also recommended to build a feedback loop that automatically collects search failure cases from production logs and adds them to the golden set.
Hybrid search is a foundational technology for the retrieval layer, but there are complex queries that flat document retrieval alone cannot fully handle.
GraphRAG is an approach that combines hybrid search with a knowledge graph. Named entities are extracted from chunks retrieved by hybrid search and linked to nodes in the graph, enabling cross-retrieval of information multiple hops away — for example, "Product X → Related Standard → Applicable Region." A practical design places Neo4j or Amazon Neptune at the graph layer and Qdrant or Weaviate at the vector layer, calling them in parallel.
Agentic RAG is an approach that incorporates hybrid search into a multi-step reasoning agent. The agent decomposes a question, dynamically switching the alpha value so that sub-queries containing named entities are handled primarily by keyword search, while conceptual sub-queries are handled primarily by vector search. Defining the hybrid search node as an independent state in LangGraph or LlamaIndex Workflows makes retry and branching logic straightforward.
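The alpha-switching decision can be sketched as a routing heuristic. The regex and alpha values below are illustrative assumptions; a real agent would typically use an LLM or NER model to classify sub-queries.

```python
import re

# Heuristic: part-number-like tokens (e.g. "XR-990", "PS-3200A") suggest
# keyword-dominant retrieval. This regex is illustrative, not exhaustive.
PART_NUMBER = re.compile(r"\b[A-Z]{2,}-?\d{2,}[A-Z]?\b")

def choose_alpha(sub_query: str) -> float:
    """Pick the vector/keyword balance for a sub-query (0 = keyword, 1 = vector)."""
    if PART_NUMBER.search(sub_query):
        return 0.2  # lean on BM25 for exact identifiers
    return 0.8      # lean on vector search for conceptual questions

choose_alpha("specifications for part number XR-990")  # → 0.2
choose_alpha("how does warranty coverage work")        # → 0.8
```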
However, the greater the complexity, the higher the operational cost. In most cases, the approach adopted is to first confirm the accuracy ceiling with simple hybrid search, then expand incrementally.
Does Hybrid Search Increase Costs?
Depending on the design, incremental costs can generally be kept limited. The main cost drivers are dual index management (increased storage) and parallel execution of two search pipelines (compute resources), but BM25 is lightweight and computationally cheaper than vector search, and it can often piggyback on existing infrastructure. On the other hand, improved search accuracy can reduce wasted context passed to the LLM, potentially cutting unnecessary token consumption. Caching frequently occurring queries and optimizing chunk sizes are also effective measures. Cost increases tend to become significant in cases involving millions of documents or more with frequent real-time updates.
Does Implementation Differ Between Cloud and On-Premises?
Yes, it does. In the cloud, services such as Azure AI Search and Amazon OpenSearch Service provide hybrid search including RRF at the API level, reducing the burden of infrastructure management. Scaling out is also handled by the service provider. For on-premises deployments, self-hosting Qdrant or Elasticsearch is the common approach; both offer server-side score fusion capabilities, so there is no need to implement everything at the application layer. When regulatory requirements prohibit sending data to external parties—such as in finance, healthcare, or government—on-premises deployment becomes mandatory.
Hybrid search is a practical approach that combines vector search and BM25 to cover queries that either method alone would miss. It handles both scenarios where keyword matching is required—such as part numbers, proper nouns, and code snippets—and scenarios where documents need to be retrieved based on semantic similarity.
Here is a summary of the key points covered in this article:

- RRF fuses results by rank, avoiding the score-scale mismatch between vector search and BM25
- Weaviate, Qdrant, Elasticsearch, and PostgreSQL/Supabase all support hybrid search, each with different defaults and fusion options
- Analyzer/tokenizer configuration is critical for languages without word boundaries, such as Japanese and Thai
- Accuracy should be verified quantitatively with Recall@K, MRR, NDCG, and a golden set integrated into CI/CD
Looking ahead, extensions toward GraphRAG and Agentic RAG are worth considering, but the practical approach is to first confirm accuracy improvements with a simple hybrid search before advancing incrementally. Reducing hallucinations and improving answer quality are difficult to achieve without improving the retrieval layer. Treat the introduction of hybrid search as the starting point for an overall RAG quality improvement cycle.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).