What is Semantic Caching? How AI Gateways Reduce LLM Costs and Implementation Guide

Semantic caching is a technique that reduces API calls by reusing past LLM responses for semantically similar queries. This article explains step-by-step how engineers and technical practitioners can implement semantic caching in an AI gateway to significantly cut LLM API costs.

A semantic cache is a mechanism that determines cache hits based on "semantic similarity" rather than exact matches of query strings. Since past responses can be reused for questions with the same intent even if phrased differently, it captures reuse opportunities that conventional exact-match caching would miss. The sections below cover how it differs from regular caching on three points: the hit determination method, applicable domains, and impact on cost reduction.

The Limits of Exact-Match Caching and the Problems Semantic Caching Solves

Exact-match caching stores responses using the prompt string as a key (typically its hash value) and reuses them only when the exact same string appears again. This works effectively in scenarios where applications query using fixed templates.

The problem is that natural language queries entered freely by end users have virtually infinite surface variations. "Tell me how to return an item," "How do I make a return?" and "What's the return process?" all share the same intent, but as strings they are distinct—so exact-match caching treats each as a separate key, causing the hit rate to drop to nearly zero.
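To make the problem concrete, here is a minimal sketch of exact-match keying. The `exact_match_key` helper is a hypothetical name, and SHA-256 is only one common choice of hash; the three queries are the ones from the paragraph above.

```python
import hashlib


def exact_match_key(prompt: str) -> str:
    """Build an exact-match cache key by hashing the raw prompt string."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


queries = [
    "Tell me how to return an item",
    "How do I make a return?",
    "What's the return process?",
]

# Same intent, but three different strings produce three different keys,
# so none of these requests can reuse another's cached response.
keys = {exact_match_key(q) for q in queries}
print(len(keys))
```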

A semantic cache aims to map a group of queries that share meaning but differ in wording to a single response. By converting queries into semantic vectors and measuring proximity via distance in vector space rather than string comparison, it absorbs differences in phrasing when determining cache hits.

How Cache Hit Detection Works Using Vector Similarity

Hit determination in a semantic cache begins by using an embedding model to convert a query into a fixed-length vector. Because queries with similar meanings are positioned close to each other in vector space, measuring the distance between a new query's vector and the vectors of previously stored queries allows the degree of semantic similarity to be quantified.

Cosine similarity is commonly used as the similarity metric. When a new query arrives, the closest match (nearest neighbor) is found among the stored vectors, and if its similarity meets or exceeds a predefined threshold, the query is judged a "hit" and the stored response is returned as-is. If it falls below the threshold, it is treated as a "miss" and the LLM is called as usual.
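The hit decision described above can be sketched in plain Python. This is a brute-force illustration under simplifying assumptions: the two-dimensional vectors are toy values rather than real embeddings, and `lookup` is a hypothetical helper name.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    query_vec: list[float],
    entries: list[tuple[list[float], str]],
    threshold: float,
) -> Optional[str]:
    """Return the nearest stored response on a hit, or None on a miss."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Toy vectors standing in for real embeddings.
entries = [
    ([1.0, 0.0], "Answer about returns"),
    ([0.0, 1.0], "Answer about shipping"),
]
print(lookup([0.9, 0.1], entries, threshold=0.9))  # close to the first entry
print(lookup([0.7, 0.7], entries, threshold=0.9))  # not close enough to either
```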

As the number of stored entries grows, brute-force comparison becomes costly, so in production it is common practice to delegate the search to a vector database equipped with approximate nearest neighbor (ANN) search.

Why Semantic Caching Is Effective at Reducing LLM Costs

LLM API pricing is generally proportional to the volume of input and output tokens. When a cache hit occurs, the entire LLM call for that request can be skipped, saving both the token charges and the time spent waiting on the provider round trip and response generation.

The magnitude of the benefit depends on how much repetition exists in the query distribution. Use cases where similar questions recur at high frequency—such as FAQ responses, internal knowledge search, and routine support inquiries—tend to achieve higher hit rates and therefore greater cost reductions. Conversely, the benefit is limited in use cases where the majority of queries are unique and one-off.

On the cost trade-off side, every hit determination incurs an embedding model call and a vector search. However, the unit cost of embedding APIs is often much lower than that of generative LLM calls, and as long as the token charges saved on a hit exceed this additional cost, the overall expense decreases. The actual break-even point varies with the unit pricing of the models used and the hit rate, making post-deployment measurement essential.
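The break-even reasoning can be written out as a small calculation. This sketch assumes a simplified model in which every request pays a fixed lookup cost (embedding plus vector search) and only misses pay the LLM cost; the function names and the example figures are illustrative, expressed in arbitrary cost units rather than real prices.

```python
def expected_cost_per_request(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Average cost with the cache: every request pays the lookup, misses also pay the LLM."""
    return lookup_cost + (1.0 - hit_rate) * llm_cost


def break_even_hit_rate(llm_cost: float, lookup_cost: float) -> float:
    """Hit rate at which the cached path costs the same as always calling the LLM."""
    return lookup_cost / llm_cost


# Illustrative values in arbitrary units, not real prices.
LLM_COST = 100.0
LOOKUP_COST = 5.0

print(break_even_hit_rate(LLM_COST, LOOKUP_COST))
print(expected_cost_per_request(0.30, LLM_COST, LOOKUP_COST))
```

Under this model the cache pays for itself only while the measured hit rate stays above the break-even value, which is why the hit rate has to be measured rather than assumed.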

Prerequisites and Required Components to Verify Before Implementation

Implementing a semantic cache requires at minimum three components: an AI gateway, a vector database, and an embedding model. Because the choice of each affects latency, operational overhead, and accuracy, it is advisable to clarify the role of each component and establish policies for the threshold and target queries before beginning implementation. The following organizes the selection considerations into three discussion points.

Selecting an AI Gateway, Vector Database, and Embedding Model

An AI gateway is a proxy that sits between your application and LLM providers, serving as the insertion point for a caching layer. Using one that includes caching and routing capabilities, such as LiteLLM, Portkey, or Cloudflare AI Gateway, reduces implementation overhead. Building the caching layer into your own proxy is also an option.

Vector databases store query vectors and responses, and handle nearest-neighbor search. Candidates include pgvector (a PostgreSQL extension), Redis's vector capabilities, Qdrant, Weaviate, Milvus, and Pinecone. If PostgreSQL or Redis is already in your stack, starting with their extensions avoids adding new infrastructure to operate.

Embedding models vectorize queries. Cloud-based embedding APIs are easy to adopt and tend to deliver stable accuracy, but they add per-call costs and latency. Self-hosted lightweight models can reduce costs, but increase operational and GPU burden. When selecting a model, consider search accuracy, per-call latency, unit cost, and data sovereignty—whether it is acceptable to send data to an external service.

Similarity Threshold Concepts and the Trade-off with Acceptable Answer Accuracy

The similarity threshold is the most critical parameter governing the behavior of semantic caching. Setting the threshold high means only queries that are truly close in meaning will result in a cache hit, reducing false hits (incorrect cached responses), but also lowering the hit rate and diminishing cost savings.

Conversely, lowering the threshold increases the hit rate, but raises the risk of returning past responses to queries with subtly different meanings. Queries involving negation or reversed conditions are particularly problematic: they may be close in vector space yet have completely opposite answers, making a low threshold a potential source of errors.

The appropriate value varies by domain and embedding model, so there is no universal answer. In domains where answer accuracy is critical, it is safest to start high, then gradually lower the threshold while reviewing logs, seeking the maximum hit rate within an acceptable error rate.

Classification Criteria for Cacheable vs. Non-Cacheable Queries

Not every query should be eligible for caching. Queries well-suited to caching are those with stable answers that would be the same regardless of who asks—such as product specifications, term definitions, internal policies, and FAQs, where the content does not change over a given period.

On the other hand, certain queries are clearly unsuitable for caching. Personalized responses that depend on a user's account information, highly time-sensitive data such as inventory levels, prices, or exchange rates, conversational continuations that depend on the immediately preceding context, and high-stakes decisions where incorrect answers cannot be tolerated should, as a rule, be excluded. Mistakenly caching these can result in returning stale information or responses intended for someone else.

In your implementation, set up a routing mechanism that examines query type and metadata—such as whether it is user-specific or time-dependent—to determine whether caching should be applied.
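One possible shape for such a routing check is sketched below. The `QueryMeta` fields are hypothetical flags that mirror the exclusion criteria listed above; how they get populated (request metadata, a classifier, route configuration) is left open.

```python
from dataclasses import dataclass


@dataclass
class QueryMeta:
    """Hypothetical per-request flags; each maps to one exclusion criterion."""
    user_specific: bool = False             # depends on the user's account data
    time_sensitive: bool = False            # inventory, prices, exchange rates
    has_conversation_context: bool = False  # continuation of a prior turn
    high_stakes: bool = False               # incorrect answers cannot be tolerated


def is_cacheable(meta: QueryMeta) -> bool:
    """A query is cacheable only if none of the exclusion flags is set."""
    return not (
        meta.user_specific
        or meta.time_sensitive
        or meta.has_conversation_context
        or meta.high_stakes
    )


print(is_cacheable(QueryMeta()))                     # stable FAQ-style query
print(is_cacheable(QueryMeta(time_sensitive=True)))  # e.g. a price lookup
```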

Steps to Implement Semantic Caching in an AI Gateway

Implementing semantic caching proceeds in three steps: vectorizing the query (Step 1), registering and searching the vector database (Step 2), and integrating with the AI Gateway and setting up routing (Step 3). Below, we walk through what to do at each step and highlight common pitfalls. The code examples are illustrative pseudocode intended to convey the approach; adapt them as needed for the libraries you are using.

Step 1: Configuring the Embedding Model and Vectorizing Queries

The first step is to configure the embedding model and convert the input query into a vector. Normalizing the query before conversion helps stabilize the hit rate. Remove leading and trailing whitespace, unify full-width and half-width characters, strip boilerplate greetings, and smooth out surface-level differences that are irrelevant to meaning.

Pass the normalized query to the embedding API to obtain a fixed-length vector.

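```python
def embed(query: str) -> list[float]:
    # `normalize` and `embedding_client` are placeholders for your own
    # normalization function and embedding API client.
    normalized = normalize(query)  # strip whitespace, unify notation, etc.
    return embedding_client.create(input=normalized).vector
```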

The critical point here is to always use the same embedding model and the same normalization process both when registering entries in the cache and when searching it. If models or versions are mixed, the same query will produce different vectors, causing hit detection to fail. When swapping out an embedding model, operate under the assumption that the cache must be rebuilt from scratch.

Step 2: Registering Cache Entries in the Vector Database and Performing Searches

Next, we implement registration and search in the vector database. The flow is simple: "search, and on a miss, call the LLM, then register the result."

For search, we retrieve the single nearest neighbor using the new query's vector, and if the similarity score meets or exceeds the threshold, we treat it as a hit and return the stored response. On a miss, we call the LLM, then register the resulting response along with the vector, original query, metadata, and TTL.

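```python
def get_or_generate(query: str) -> str:
    vec = embed(query)
    # Assumes search() returns the single best match, or None if the store is empty.
    hit = vector_db.search(vec, top_k=1)
    if hit and hit.score >= THRESHOLD:
        return hit.response  # Cache hit: reuse the stored response
    answer = llm.generate(query)  # Miss: call the LLM
    # Register the new entry so that later similar queries can hit it.
    vector_db.upsert(vec, query=query, response=answer, ttl=TTL)
    return answer
```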

Storing the original query string and a timestamp at registration time, and logging the similarity score of each lookup alongside the threshold in effect, is useful for auditing hit contents and tuning the threshold later.

Step 3: Integrating the Cache Layer into the AI Gateway and Configuring Routing

Finally, we integrate this caching logic into the request path of the AI gateway. The basic flow is: receive request → embed query → search cache → if a hit, return immediately; if a miss, forward to the LLM, register the response, then return it.

The advantage of inserting this at the gateway layer is that caching can be enabled without modifying application-side code. With gateways that have built-in cache configuration—such as LiteLLM or Portkey—it may be sufficient to simply pass the threshold and TTL as configuration values.

It is also worth preparing for cases where the cache should be bypassed. For personalized or time-sensitive queries, make it possible to mark requests as non-cacheable via request headers or routes, and have the gateway route them accordingly. This allows the policy established in the earlier step—"only cache queries that are suited for caching"—to be enforced directly in the actual request path.
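The request flow, including the bypass path, might look like the sketch below. The header name `x-cache-bypass` is an assumption made for illustration, not a header defined by any particular gateway, and the cache and LLM are passed in as plain callables so the example stays self-contained. The dictionary-backed stand-ins at the bottom do exact-match lookups only; in a real deployment they would be replaced by the semantic lookup.

```python
from typing import Callable, Optional

BYPASS_HEADER = "x-cache-bypass"  # hypothetical header name


def handle_request(
    query: str,
    headers: dict[str, str],
    cache_lookup: Callable[[str], Optional[str]],
    cache_store: Callable[[str, str], None],
    call_llm: Callable[[str], str],
) -> tuple[str, str]:
    """Return (response, cache_status); status is 'bypass', 'hit', or 'miss'.

    Assumes header names have already been lower-cased by the caller.
    """
    if headers.get(BYPASS_HEADER, "").lower() == "true":
        return call_llm(query), "bypass"  # never read from or write to the cache
    cached = cache_lookup(query)
    if cached is not None:
        return cached, "hit"
    answer = call_llm(query)
    cache_store(query, answer)
    return answer, "miss"


# Stand-ins for the cache and the LLM, for demonstration only.
store: dict[str, str] = {}
llm_calls: list[str] = []


def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"


first = handle_request("return policy", {}, store.get, store.__setitem__, fake_llm)
second = handle_request("return policy", {}, store.get, store.__setitem__, fake_llm)
third = handle_request("my order status", {BYPASS_HEADER: "true"}, store.get, store.__setitem__, fake_llm)
print(first[1], second[1], third[1])
```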

How to Tune for a Higher Cache Hit Rate

The hit rate can be improved by continuously tuning three factors: threshold, TTL, and query normalization. Rather than fixing any one of these to a static value, the basic approach is to adjust them incrementally while monitoring logs, maximizing reuse without increasing false hits. The following sections walk through each of these three tuning points in turn.

Methods for Tuning Similarity Thresholds and Evaluation Metrics

Threshold tuning is most reliably done using actual query logs. Start by collecting queries over a given period and labeling them—either manually or with rules—as semantically equivalent or not. Then, vary the threshold while measuring the "correct hit rate" and the "false hit rate" (the proportion of incorrectly returned responses).

Key metrics to track include the hit rate (the proportion of all queries served from cache), the false hit rate (the proportion of incorrect responses returned), and the quality of responses actually delivered to users. Tracking hit rate alone risks overlooking false hits, so both must always be evaluated together.

In practice, a safe approach is to first set an upper bound on the false hit rate (i.e., define an acceptable error rate), and then select the threshold that maximizes the hit rate within that constraint. The threshold should not be set once and left unchanged—it should be reviewed periodically as the query distribution evolves.
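The selection procedure can be sketched as a threshold sweep over labeled log samples. Everything here is illustrative: the sample scores and labels are made up, the function names are hypothetical, and the false hit rate is computed over all queries (one of several reasonable denominators).

```python
from typing import Optional

# Each sample: (similarity to nearest cached query, whether the two are truly equivalent)
Sample = tuple[float, bool]


def sweep(samples: list[Sample], thresholds: list[float]) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, hit_rate, false_hit_rate)."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    total = len(samples)
    rows = []
    for t in thresholds:
        hits = [(score, ok) for score, ok in samples if score >= t]
        false_hits = sum(1 for _, ok in hits if not ok)
        rows.append((t, len(hits) / total, false_hits / total))
    return rows


def pick_threshold(
    samples: list[Sample], thresholds: list[float], max_false_hit_rate: float
) -> Optional[float]:
    """Pick the threshold with the highest hit rate whose false hit rate is acceptable."""
    ok_rows = [r for r in sweep(samples, thresholds) if r[2] <= max_false_hit_rate]
    if not ok_rows:
        return None
    # On equal hit rates, prefer the stricter (higher) threshold.
    return max(ok_rows, key=lambda r: (r[1], r[0]))[0]


# Made-up labeled samples for illustration.
samples = [(0.97, True), (0.93, True), (0.91, False), (0.88, True), (0.80, False)]
thresholds = [0.80, 0.85, 0.90, 0.95]
for row in sweep(samples, thresholds):
    print(row)
print(pick_threshold(samples, thresholds, max_false_hit_rate=0.0))
```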

TTL (Expiration) Design and Cache Invalidation Timing

TTL (time-to-live) is the setting that determines how long a cached response may be reused. It is best designed separately for each type of query. Information that remains stable over long periods—such as product definitions or terminology—warrants a longer TTL, while information that may change—such as prices or inventory—should have a shorter TTL or be excluded from caching altogether.

In addition to time-based expiration via TTL, it is worth having a mechanism to explicitly invalidate the cache when the underlying data is updated. For example, if a source document such as an internal policy or FAQ is revised, the corresponding cache entries should be deleted. One effective approach is to store the document version in the metadata and treat any entry whose stored version is older than the current document version as expired.

If the TTL is too long, stale information will continue to be returned; if it is too short, the hit rate will drop. Set values on a per-query-type basis, balancing freshness requirements against hit rate.
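Combining time-based expiry with version-based invalidation could look like the following sketch. The `CacheEntry` fields and the `is_fresh` helper are hypothetical names, and the check assumes the gateway can look up the current version of the source document at read time.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    response: str
    created_at: float   # epoch seconds when the entry was registered
    ttl_seconds: float  # chosen per query type
    doc_version: int    # version of the source document at registration time


def is_fresh(entry: CacheEntry, current_doc_version: int, now: Optional[float] = None) -> bool:
    """An entry is usable only if its TTL has not elapsed and its source is unchanged."""
    now = time.time() if now is None else now
    if now - entry.created_at > entry.ttl_seconds:
        return False  # expired by time
    return entry.doc_version >= current_doc_version  # stale if the document moved on


entry = CacheEntry("Returns are accepted within 30 days.", created_at=1000.0,
                   ttl_seconds=60.0, doc_version=3)
print(is_fresh(entry, current_doc_version=3, now=1030.0))
print(is_fresh(entry, current_doc_version=4, now=1030.0))
```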

Query Normalization Techniques Through Prompt Engineering

Query normalization, performed as preprocessing before embedding, makes it easier to cluster semantically identical queries near the same vector. This raises the overall hit rate.

Concretely, the basics include removing leading/trailing whitespace and symbols, standardizing full-width/half-width and uppercase/lowercase characters, and stripping greetings and honorific expressions—such as "please tell me" or "I'd appreciate it"—that contribute nothing to meaning. A further approach involves extracting the intent from a query and converting it into a normalized, concise question.

However, over-normalizing can strip away nuances that should actually be distinguished, causing false hits. Elements that affect meaning—such as negations ("cannot," "other than"), quantities, and conditions—must not be removed. Normalization rules should explicitly state what to keep and what to discard, and their effectiveness should be verified in conjunction with threshold settings.
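A conservative normalizer along these lines might look like the sketch below. The filler phrase list is a small hypothetical example that would need to be built from your own logs; the function deliberately leaves negations, numbers, and conditions untouched.

```python
import re
import unicodedata

# Hypothetical filler phrases; longer phrases come first so they match before "please".
FILLER_PATTERNS = [
    r"\bplease tell me\b",
    r"\bi'd appreciate it\b",
    r"\bplease\b",
    r"\bhello\b",
]


def normalize(query: str) -> str:
    """Smooth surface differences while keeping negations, numbers, and conditions."""
    text = unicodedata.normalize("NFKC", query)  # unify full-width and half-width forms
    text = text.casefold()                       # unify upper and lower case
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.strip(" ,.!?")                   # drop leading/trailing punctuation


print(normalize("  Please tell me how to RETURN an item!  "))
print(normalize("I cannot return 2 items"))
```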

Common Failure Patterns and How to Avoid Them

The three most common failures in semantic caching are misconfigured thresholds, reuse of stale responses, and bloat in the vector database. None of these tend to surface immediately after deployment; they are more likely to become problems as operation matures. By understanding the countermeasures in advance, you can prevent incidents where cost savings come at the expense of accuracy or speed.

Cases Where a Threshold Set Too Low Returns Incorrect Cached Responses

Setting the threshold too low causes false hits—returning past responses even for queries with different meanings. Particularly dangerous are queries where a negation or condition is reversed. "Is A cheaper than B?" and "Is B cheaper than A?" are very close in vector space, yet the answers they require are exact opposites. A low threshold will confuse the two.

As a countermeasure, first raise the threshold so that only genuinely similar queries count as hits. In addition, after narrowing candidates via nearest-neighbor search, inserting a re-ranker that accounts for negations and numeric conditions makes it easier to reject cases where surface form is similar but intent differs.
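As a very rough stand-in for such a re-ranking step, the sketch below rejects a candidate when the new query and the cached query disagree on the presence of a negation or on the numbers they mention. The token list and function names are hypothetical, and this rule-based guard does not detect swapped word order such as the A/B example above; catching that would require a model-based re-ranker that compares the two queries directly.

```python
import re

# Hypothetical, deliberately small list of negation markers.
NEGATION_TOKENS = {"not", "no", "never", "cannot", "without", "except"}


def _signature(text: str) -> tuple[bool, list[str]]:
    """Extract (contains a negation, sorted list of numbers) from a query."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    has_negation = any(t in NEGATION_TOKENS or t.endswith("n't") for t in tokens)
    numbers = sorted(t for t in tokens if t.isdigit())
    return has_negation, numbers


def passes_guard(new_query: str, cached_query: str) -> bool:
    """Accept a nearest-neighbor candidate only if negation and numbers agree."""
    return _signature(new_query) == _signature(cached_query)


print(passes_guard("Can I return opened items?", "Can I not return opened items?"))
print(passes_guard("How do I make a return?", "What's the return process?"))
```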

During the initial rollout, it is safest to set the threshold high, confirm through logs that no false hits are occurring, and then lower it gradually. Because false hits can make the hit rate appear to improve, it is important not to treat hit rate alone as the success metric.

The Risk of Stale Cached Responses Containing Hallucinations Being Reused

LLM responses can contain content that sounds plausible but is incorrect (hallucinations). Caching such responses as-is locks in the wrong answer, causing it to be returned repeatedly for queries with the same intent. Because the scope of impact is broader than with exact-match caching, the damage tends to be greater as well.

Countermeasures should be considered in layers. First, establish a policy of not caching responses with low confidence or responses that have not been verified. Second, set a TTL so that entries are always regenerated after a fixed period, preventing errors from persisting indefinitely. Third, put in place a mechanism to expire or delete the relevant entry when user feedback (low ratings or corrections) is received.

Where possible, insert a lightweight validity check on a response before registering it in the cache. At a minimum, build in an operational practice of periodically auditing cache contents for high-priority query groups to confirm that incorrect answers have not crept in.
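Two of these layers, skipping unverified responses and deleting entries on negative feedback, can be sketched as follows. The class and its `verified` flag are hypothetical; what counts as "verified" depends on whatever validity check you put in front of the cache, and TTL handling is omitted here for brevity.

```python
from typing import Optional


class FeedbackAwareCache:
    """In-memory sketch: only verified responses are stored, and negative
    feedback removes an entry once it reaches a configurable count."""

    def __init__(self, max_negative: int = 1) -> None:
        self._entries: dict[str, str] = {}
        self._negative: dict[str, int] = {}
        self._max_negative = max_negative

    def store(self, key: str, response: str, verified: bool) -> bool:
        if not verified:
            return False  # policy: never cache unverified responses
        self._entries[key] = response
        self._negative[key] = 0
        return True

    def get(self, key: str) -> Optional[str]:
        return self._entries.get(key)

    def report_negative(self, key: str) -> None:
        if key not in self._entries:
            return
        self._negative[key] += 1
        if self._negative[key] >= self._max_negative:
            # Drop the entry so the next request regenerates the answer.
            del self._entries[key]
            del self._negative[key]


cache = FeedbackAwareCache(max_negative=2)
cache.store("return policy", "Returns are accepted within 30 days.", verified=True)
cache.report_negative("return policy")
cache.report_negative("return policy")
print(cache.get("return policy"))
```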

Addressing Latency Increases Caused by Vector Database Bloat

As cache entries continue to grow, memory usage in the vector database expands and nearest-neighbor search latency degrades. The speed gained from caching ends up being offset by slower searches.

The fundamental approach is to avoid accumulating unnecessary entries and to remove them promptly. Combine LRU-style eviction that deletes entries not accessed for a certain period, deduplication (dedup) of vectors that are too similar—arising from nearly identical queries—and automatic expiration via TTL.
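The eviction idea can be illustrated with a small in-memory LRU structure. Note the simplification: this sketch evicts by capacity (the least recently used entry goes first once a size limit is reached) rather than by an idle-time window, and it leaves out deduplication and TTL. In a real vector database, eviction would be driven by stored access timestamps instead.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Capacity-bounded cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int) -> None:
        self._max = max_entries
        self._data: OrderedDict[str, str] = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


lru = LRUCache(max_entries=2)
lru.put("a", "answer a")
lru.put("b", "answer b")
lru.get("a")              # "a" is now more recently used than "b"
lru.put("c", "answer c")  # exceeds capacity, so "b" is evicted
print(lru.get("b"), len(lru))
```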

Index tuning is also effective. Approximate nearest-neighbor indexes such as HNSW have parameters that govern the trade-off between search accuracy and speed; adjust these to match the scale of your data. If the entry count grows even larger, consider sharding or a design that separates indexes by use case. Monitor capacity and latency regularly so you can act before they reach your limits.

How to Measure the Cost Reduction Impact on LLM API Expenses

The cost reduction effect can be understood by measuring the cache hit rate and the volume of tokens and costs of LLM calls that were avoided as a result. Rather than relying on intuition or estimates, it is important to compare pre- and post-implementation using the same metrics. This section organizes the specific figures to measure and how to visualize them.

Methods for Measuring Token Consumption, API Call Count, and Cost

Measuring cost reduction begins with collecting baseline metrics. What needs to be measured is the cache hit rate, the number of LLM calls avoided, and the volume of tokens avoided. A rough estimate of the savings can be calculated as: "number of calls avoided × average tokens per call × unit price."

For an accurate evaluation, record a baseline before introducing the cache (number of calls, token volume, and cost over a fixed period), then compare it against a period of the same length after introduction. Subtract the costs of embedding and vector search, which are incurred on every lookup regardless of hit or miss, to arrive at the net savings.
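A minimal sketch of this calculation over gateway log records follows. The record fields, the function name, and the figures are hypothetical, and costs are in arbitrary units. It also assumes a single blended price per token, whereas providers typically price input and output tokens separately; for hits, `tokens` is an estimate of what the avoided call would have consumed.

```python
def summarize(records: list[dict], price_per_token: float, lookup_cost: float) -> dict:
    """Compute hit rate, avoided calls and tokens, and net savings from request logs."""
    if not records:
        raise ValueError("no records to summarize")
    hits = [r for r in records if r["cache_hit"]]
    avoided_tokens = sum(r["tokens"] for r in hits)
    gross_savings = avoided_tokens * price_per_token
    overhead = len(records) * lookup_cost  # every lookup pays for embedding + search
    return {
        "hit_rate": len(hits) / len(records),
        "avoided_calls": len(hits),
        "avoided_tokens": avoided_tokens,
        "net_savings": gross_savings - overhead,
    }


# Made-up log records in arbitrary cost units.
records = [
    {"cache_hit": True, "tokens": 100},
    {"cache_hit": False, "tokens": 250},
    {"cache_hit": True, "tokens": 300},
    {"cache_hit": False, "tokens": 120},
]
summary = summarize(records, price_per_token=2, lookup_cost=10)
print(summary)
```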

Note that the unit prices for LLMs and embedding models are subject to change, so any figures cited or estimated should be treated as reference values at the time of writing; always verify against the latest pricing pages before using them in production estimates.

Visualizing Cache Effectiveness Using AI Observability Tools

To continuously track the measured metrics, using an AI observability tool is the most efficient approach. Tools such as Langfuse and Helicone record token consumption, cost, and latency per request, and can visualize cache hits and misses separately.

By monitoring the trend in hit rate on a dashboard, you can review the effect of changes to thresholds or TTL over time, providing a basis for tuning decisions. Viewing this alongside cost trends makes it immediately clear how much has been saved since introducing the cache. Signs of false hits—such as quality degradation or complaints on specific queries—can also be traced back through the logs to the offending cache entry.

Recording the hit/miss status, the threshold used, and the similarity score in the gateway access logs allows you to combine this data with observability tools for continuous monitoring of both cost reduction and response quality. Implementation is not the end goal; stable cost savings are only achieved once a cycle of measurement and adjustment becomes an established operational practice.
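One way to emit such a record is as a JSON line per request, sketched below. The field names are assumptions chosen for illustration; align them with whatever schema your observability tool expects.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def cache_log_line(
    request_id: str, cache_status: str, threshold: float, similarity: Optional[float]
) -> str:
    """Serialize one cache decision as a JSON line for the gateway access log."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "cache_status": cache_status,  # "hit", "miss", or "bypass"
        "threshold": threshold,        # threshold in effect for this lookup
        "similarity": similarity,      # nearest-neighbor score, None if no candidate
    }
    return json.dumps(record, sort_keys=True)


print(cache_log_line("req-1", "hit", threshold=0.9, similarity=0.95))
```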

Author & Supervisor

Yusuke Ishihara

Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).