What is Prompt Caching for Multi-Tenant? A Design for Reducing Inference Costs in B2B SaaS

Lead
Prompt Caching for Multi-Tenant refers to a design approach that reduces inference costs and latency in LLM (Large Language Model) inference infrastructure shared by multiple tenants, by caching each tenant's system prompts and context windows.
In B2B SaaS, as the number of tenants grows, API calls to the LLM increase rapidly, and the cost of input tokens tends to put pressure on revenue. While properly designed prompt caching can significantly reduce costs at cache read time, it also carries the risk of data leakage between tenants.
This article is intended for engineers and architects, and explains the following topics in order.
Prompt caching is a mechanism that retains the first portion of a long prompt—sent with every LLM request—on the server side, allowing subsequent processing to be skipped. In B2B SaaS, where system prompts and fixed contexts are used repeatedly, this mechanism directly impacts inference costs.
However, in a multi-tenant environment, it is not that straightforward. When attempting to cache system prompts and permission information that differ per tenant, the problem of isolation arises: "whose cache should be stored where?" Prioritizing cost reduction by sharing caches creates the risk of information leakage, while fully isolating caches per tenant lowers the cache hit rate and diminishes the cost-saving effect.
Here, we will work through the basic operating principles of prompt caching, the cost structure specific to B2B SaaS, and the reasons why cache design becomes complex in multi-tenant environments.
How Prompt Caching Works and the Principles Behind Token Reduction
Prompt caching is a mechanism in which the "unchanging leading portion (prefix)" of an API request to an LLM is held in memory on the inference server and reused in subsequent requests. Instead of reprocessing the same system prompt or large context every time, the cached Key-Value tensors are read out, which can dramatically reduce the processing cost of input tokens.
It is easy to initially think of caching as storing the entire response, but the more accurate understanding is that only the input-side prefix of the prompt is cached. Since the output is generated anew each time, the reduction target is limited strictly to the processing cost of input tokens.
The mechanism can be summarized in the following flow:
- Write: On the first request, the prefix is saved as Key-Value tensors
- Read: The cache is reused for subsequent requests that share the same prefix
- Cost difference: With Anthropic's Claude, the cache read cost is 0.1x the standard input price, while cache write costs 1.25x (for a 5-minute TTL)
The minimum number of tokens required for caching to take effect varies by provider. OpenAI requires 1,024 tokens or more, while Claude on AWS Bedrock Opus requires 4,096 tokens or more.
TTL (Time to Live) is also an important design variable.
Breakdown of Inference Costs and Challenges in B2B SaaS
A key characteristic of inference costs in B2B SaaS is that they scale as the product of number of tenants × number of requests × number of tokens, rather than by user count alone. Unlike services targeting individual end users, system prompts and business rules unique to each enterprise customer are sent with every request, causing input token counts to be chronically high.
Breaking down the cost structure reveals three overlapping layers. First is the system prompt layer, which bundles each tenant's business rules, constraints, and persona definitions—depending on the content, this can reach hundreds to thousands of tokens. Next is the context layer, consisting of conversation history and document fragments retrieved via RAG, which is also re-sent with every request. Finally, there is the user input layer—the actual queries typed by end users—which is relatively small in token count.
When the number of tenants is in the tens, monthly costs can often still be kept within an acceptable range, but the situation changes at the scale of hundreds to thousands of tenants. The cost of continuously sending full prompts of the same structure to every tenant becomes impossible to ignore as you scale.
Particularly serious is the problem of "redundant transmission of shared content." Many B2B SaaS products embed content that is common across tenants—such as product specifications, FAQs, and compliance rules—into their system prompts. Even though the content is identical regardless of which tenant is involved, it is counted as billable tokens on every request. This is an area that tends to become a blind spot in cost structure.
Why Caching Is Difficult in Multi-Tenant Environments
"If I cache Tenant A's system prompt, could Tenant B's request end up hitting that cache?"—This is the concern many engineers encounter first when they begin considering prompt caching in a multi-tenant environment.
This concern is not merely unfounded worry. The main reasons why caching is difficult in multi-tenant environments can be summarized in the following three points:
- Data isolation requirements between tenants: Each tenant's system prompt contains proprietary information such as business rules, pricing structures, and confidential policies. If cache key design is flawed, there is a risk that a different tenant's request could reference the same cache.
- Diversity of prompt structures: Because the length and composition of system prompts differ per tenant, shared prefixes tend to be short, lowering the cache hit rate. The fact that prompt caching has little effect without a shared prefix of 1,024 tokens or more further increases the difficulty of design.
- Mismatch between TTL and tenant lifecycle: When a tenant makes a contract change or plan update, if the cache's TTL has not yet expired, there is a risk that outdated settings will continue to be used. Anthropic's Claude has a standard TTL of 5 minutes, extendable up to 1 hour as an option, but a separate mechanism is needed to balance the frequency of tenant configuration changes with TTL.
Furthermore, as the number of tenants increases, cache entries also increase, and management costs tend to grow at a superlinear rate.
What to Verify Before Design: Prerequisites and Architecture Choices
Conclusion: Before beginning design, confirming three points — the cache specifications of the LLM provider you will use, the tenant isolation model, and the structure of the system prompt — is a prerequisite for maximizing cost reduction.
Mistakes in selecting each element can lead to significant rework in later stages, so we will work through them in order.
Checking Supported LLM Providers and Cache-Compatible APIs
Before starting cache design, you need to accurately understand "under what conditions caching becomes effective" for the provider you are using. It is easy to assume at first that "any model can be used the same way," but in practice, the minimum token count for caching, TTL, and API call methods differ significantly between providers.
The specifications of major providers are summarized below.
- OpenAI: Caching is enabled for prompts of 1,024 tokens or more. The number of cache hits can be checked via
usage.prompt_tokens_details.cached_tokensin the response. In-memory cache expires after approximately 5–10 minutes of inactivity and is retained for up to 1 hour. With Extended, retention of up to 24 hours is possible. It is recommended to keepprompt_cache_keyrequests for the same prefix below 15 requests/minute. - Anthropic (Claude): The default TTL is 5 minutes. The 1-hour option incurs an additional cost; the cache write cost is 1.25× the base input price (5 minutes) or 2× (1 hour), while reads are 0.1×.
- AWS Bedrock (Claude Sonnet): The minimum cache token count is 1,024 tokens, and the default TTL is 5 minutes. For Claude Opus, a minimum of 4,096 tokens is required, and up to 4 cache breakpoints can be configured.
Choosing a Tenant Isolation Model (Silo, Pool, and Hybrid)
The choice of tenant isolation model forms the foundation of cache design and must therefore be made at the very earliest stage of architectural decisions. The characteristics of the three primary models are summarized below.
Silo A model in which dedicated LLM endpoints and cache areas are allocated per tenant. It carries the lowest risk of data leakage and is suited for financial and healthcare SaaS with strict compliance requirements. However, since the scope of cache reuse is limited to within each tenant, infrastructure costs tend to increase linearly as the number of tenants grows.
Pool A model in which multiple tenants share a common cache area and system prompt. Cache hit rates for shared prefixes are high, enabling significant reductions in inference costs. On the other hand, if cache key design is insufficient, there is a risk of data leakage between tenants, making the cache key design discussed later especially important.
Hybrid A compromise model in which shared components (product-wide system prompts) are managed using the Pool approach, while tenant-specific context is isolated using the Silo approach. It is easier to achieve both cost efficiency and security, making it suitable for many B2B SaaS products.
The decision criteria for selection are as follows: if data sovereignty or compliance requirements differ per tenant, Silo or Hybrid is appropriate; if requirements are uniform across tenants and cost optimization is the priority, Pool is suitable.
Context Window and System Prompt Design Guidelines
"How much of the system prompt should be shared, and from what point should it become tenant-specific?" — many teams struggle with drawing this line. Because the design policy for the context window and system prompt directly affects cache efficiency, it must be clarified before implementation.
The fundamental design principle is to "consolidate invariant content at the top." LLM (large language model) prompt caching caches matching prefixes from the beginning of the prompt. By placing common context that does not change at the top and tenant-specific information toward the end, the cache hit rate improves significantly.
It is effective to design the system prompt in the following 3 layers:
- Common layer (shared across all tenants): Basic product behavior rules, response formats, prohibited actions, etc. Aggregate the least frequently changed and longest content here.
- Plan layer (shared by plan or industry): Context that tenants on the same plan can share, such as descriptions of enterprise features and industry-specific guidelines.
- Tenant-specific layer: Tenant name, custom settings, individual policies, etc. Design this to be as short as possible.
In Anthropic's Claude, the minimum cache size is 1,024 tokens. Designing the common layer to exceed this threshold enables cache writes to take effect. For Claude Opus on AWS Bedrock, a minimum of 4,096 tokens is required, so adjust the volume of the common layer accordingly for each provider.
How to Design: Cache Key Design Patterns per Tenant
Conclusion: Because mistakes in cache key design directly lead to data leakage between tenants, it is essential to finalize naming conventions and the isolation structure in advance.
In cache key design per tenant, the core issues are how to combine shared prefixes with tenant-specific suffixes, and how to manage TTL.
How to Incorporate Tenant ID into Cache Keys
The first mistake people tend to make in cache key design is the idea that "you can just use the user ID or session ID directly as the key." In practice, however, always placing the tenant ID at the top as the highest-level namespace improves both isolation reliability and cache efficiency.
Why must it come first? LLM provider prompt caches determine hits by matching from the beginning of the prompt string (prefix matching), so the structure of the cache key must correspond to the ordering of the prompt. Following this principle, the recommended key structure is as follows:
{tenant_id}:{model_id}:{prompt_version}:{context_hash}
The leading tenant_id should be a UUID or similar value that uniquely identifies the tenant. By using this as the starting point, there is simply no room for caches from different tenants to intermingle. The subsequent model_id prevents collisions when the same tenant uses multiple models, prompt_version is used to automatically invalidate old caches when the system prompt changes, and context_hash is a value such as a SHA-256 hash of the user-specific context.
Note that you can reuse an identifier managed on the application side as the tenant ID, but using an internal UUID that is not exposed externally reduces security risks.
Separation Structure of Shared Prefixes and Tenant-Specific Suffixes
Separating prompts into a "shared prefix" and a "tenant-specific suffix" is the fundamental pattern for maximizing cache efficiency in multi-tenant environments.
Basic Structure of the Separation
The entire prompt is assembled in the following 3 layers:
- Shared prefix layer: System prompts common to all tenants (product descriptions, safety policies, output format instructions, etc.)
- Tenant-specific layer: Per-tenant configuration (company name, industry, custom rules, etc.)
- User request layer: User input for each request
The primary target for caching is the "shared prefix layer." Because this layer is common across all requests, it achieves the highest cache hit rate.
Design Branching Based on Conditions
When the common portions of the system prompt are large across tenants, making the shared prefix longer increases cache efficiency. However, when policies or language settings differ significantly between tenants, a design that places the tenant-specific layer first and partitions the cache per tenant is more appropriate.
Implementation Example (Conceptual)
[Shared prefix: 2,000 tokens] You are an AI assistant for a SaaS platform. Output in Japanese.
Cache Expiration and TTL Management Concepts
"How long should the cache TTL be set?"—this is typically the first question that comes up during implementation.
TTL configuration is determined by the trade-off between cost reduction and data freshness. Summarizing based on the specifications of major providers, the following applies:
Default TTL for Major Providers
- Anthropic (Claude): Default 5 minutes, optionally 1 hour (at additional cost)
- AWS Bedrock (Claude Sonnet): Default 5 minutes
- AWS Bedrock (Claude Opus): Choice of 5 minutes or 1 hour
- OpenAI: Expires after approximately 5–10 minutes of inactivity, maximum 1 hour (Extended: maximum 24 hours)
Decision Criteria for TTL Selection
In practice, TTL is best determined along two axes: "system prompt update frequency" and "request density."
- When update frequency is low and requests are concentrated during certain periods → A 1-hour TTL makes it easier to recoup cache write costs
- When system prompts change frequently per tenant → A 5-minute TTL maintains freshness while targeting cost reduction during short-term bursts
Under Anthropic's pricing model, the write cost for a 1-hour cache is 2× the base input price, while reads are 0.1×. To recoup the cost of a single write, the same cache must be hit multiple times within the TTL.
How to Implement: Step-by-Step Build Instructions
Conclusion: Once the design policy is established, proceed with implementation in three steps: templatization, cache layer construction, and measurement.
Start with templatizing the system prompt, then move on to introducing the cache layer, and finally build out hit rate measurement. The details of each step are explained in the H3 sections below.
Step 1: Templatizing System Prompts per Tenant
It is tempting at first to think "just prepare a completely different system prompt for each tenant," but in practice a template structure that clearly separates common parts from differences is more effective in terms of both cache hit rate and maintainability.
The fundamental approach to templatization is to divide the prompt into a two-tier structure consisting of a "shared prefix layer" and a "tenant-specific layer."
Shared Prefix Layer (Cache Target)
- Immutable rules such as AI role definitions, output format instructions, and prohibited actions
- Business logic and guidelines common to all tenants
- For OpenAI, a minimum of 1,024 tokens is required; similarly, for Anthropic (Claude), a minimum of 1,024 tokens is required—design the common portion to meet or exceed this threshold
Tenant-Specific Layer (Not Cached)
- Dynamic parameters such as tenant name, subscription plan, and enabled feature flags
- Tenant-specific tone specifications and brand guidelines
During implementation, it is common practice to use a template engine such as Jinja2 to manage the shared prefix as a static string. Tenant-specific parameters are appended using placeholders such as {{ tenant_name }}, and the design ensures that the byte sequence of the prefix portion does not change after rendering.
One important note: if trailing newlines or whitespace are mixed into the end of the shared prefix, the provider may treat the cache key as a different entry.
Step 2: Introducing a Cache Layer and Managing Prompt Hashes
Simply passing a templated system prompt directly to the API is not enough to reliably benefit from caching. Maintaining a separate cache layer with centralized management via "prompt hashes" is essential for stable operation in multi-tenant environments.
Basic Cache Layer Architecture
The common approach is to insert a thin cache proxy between the application layer and the LLM API. The processing flow is as follows:
- Combine the tenant ID and system prompt version number, then generate a hash value using SHA-256 or similar
- Register a cache entry in an in-memory store such as Redis, using the hash value as the key
- If a request with the same hash arrives, return the response from cache and skip the call to the LLM API
Decision Criteria for Prompt Hash Management
For high-request-volume tenants, managing hashes in-memory (Redis) is appropriate to minimize latency, while for low-frequency tenants, a cost-first design that persists hashes on the database side is more suitable.
Integration with LLM Providers
Since Anthropic's cache read cost is 0.1x the base input price, a hash match that also hits the provider-side cache yields a double cost reduction. The cache write cost with a 5-minute TTL is 1.
Step 3: Measuring Cache Hit Rate and Configuring AI Observability
A common situation in practice is: "We implemented caching, but we have no way to confirm whether it's actually hitting." Without measurement, there is no visibility into optimization opportunities, so configuring AI observability should proceed in parallel with implementation.
The starting point for measurement is the usage.prompt_tokens_details.cached_tokens field in the API response. Use this field to retrieve cache hit counts and continuously record the following metrics:
- Cache hit rate: Calculated as
cached_tokens ÷ total_prompt_tokens - Per-tenant hit rate: Aggregated by tenant ID to identify tenants with low hit rates
- TTL expiration count: Record the time difference between cache writes and hits to inform decisions on whether a 5-minute or 1-hour TTL is more appropriate
These metrics are typically sent to an observability platform such as Datadog or Grafana and visualized on a dashboard. As explained in How to Measure the Impact of AI Agent Adoption | From KPI Design to Continuous Improvement, defining KPIs in advance makes it easier to drive improvement cycles.
Setting up alerts is also essential. Below are examples of recommended threshold values.
How to Prevent Data Leakage Between Tenants: Applying Privacy by Isolation
The more caching is utilized, the greater the risk of information leakage between tenants. This is not a problem that can be addressed retroactively in the design — it must be factored in at the stage when the cache structure is determined. The following sections outline the specific risks of prompt leaking and how to apply Privacy by Isolation to cache design.
Information Leakage Risks via Cache and Prompt Leaking Countermeasures
When caching is viewed solely as a "cost reduction mechanism," it is easy to overlook the serious risk of information leakage between tenants. In practice, there are reported cases where flawed cache design has become a breeding ground for prompt leaking (system prompt disclosure).
How to Incorporate Privacy by Isolation into Cache Design
When incorporating Privacy by Isolation into cache design, the starting point is to clearly define upfront "what is being isolated."
Specifically, isolation is designed across the following three layers:
- Cache storage layer: Always assign the tenant ID as a namespace in the cache key, ensuring that keys in Redis or Memcached cannot collide with those of other tenants
- Prompt structure layer: Physically separate and manage the system prompt shared across all tenants (shared prefix) from tenant-specific context (suffix)
- API call layer: Before evaluating a cache hit, always verify that the tenant ID of the requesting party matches the namespace of the cache key being retrieved
The implementation approach varies depending on the tenant isolation model. In a silo model (independent infrastructure per tenant), the cache storage itself can be isolated, simplifying management. In a pool model (shared infrastructure), key namespace management at the application layer and ownership verification prior to retrieval are indispensable.
Careful attention must also be paid to the content written to the cache. As a general principle, avoid including tenant-specific personal information or confidential data in cached prompts, and limit what is placed in the cache to static or semi-static context such as tenant settings, role definitions, and common knowledge.
What Are Common Failure Patterns and How to Avoid Them?
Conclusion: Identifying failure patterns that are easily overlooked during the design and implementation phases is a prerequisite for stable cache operations.
Cache key collisions and cache invalidation timing mismatches are issues that occur particularly often in multi-tenant environments. Below, we outline the causes and mitigation strategies for each.
Cross-Tenant Data Contamination Due to Cache Key Collisions
Cache key collisions are like mail intended for different residents being delivered to the same mailbox. If Tenant A's system prompt is returned in response to Tenant B's request, it directly leads to a confidential information leak.
The typical cause of this problem is insufficient cache key design. For example, if only the prompt content is hashed to generate the key, two different tenants that happen to use the identical system prompt string will produce the same key. As a result, the context cached by the first tenant may be returned to the other tenant.
There are three major collision patterns. The first is keying on prompt content alone — because the tenant ID is not included in the key, identical strings cause collisions. This can be prevented by standardizing the key format as {tenant_id}:{prompt_hash}. The second is sharing a global cache namespace across tenants. When using Redis or similar systems, if the cache store itself is not isolated, cross-contamination can occur at the namespace level; it is therefore necessary to separate tenants by key prefix or DB number. The third is computing the hash before template variable expansion — using a shared template as-is for the key generates the same key even when the expanded content differs. Always compute the hash against the prompt string after variable expansion.
Cost Increases Caused by Misaligned Cache Invalidation Timing
It is tempting to think that cache invalidation simply means "purge immediately whenever there is an update," but mistiming invalidation can easily backfire by causing a sharp increase in new write costs.
Under Anthropic's pricing structure, writing to a 5-minute cache costs 1.25× the base input price, while a 1-hour cache costs 2×. When prompts change frequently, the cache is regenerated repeatedly, creating a cycle in which the TTL expires before the benefits of cache reads (0.1×) can be realized. With providers whose default TTL is 5 minutes, cache hits will be nearly zero unless requests from the same tenant are concentrated within that 5-minute window. Furthermore, bulk-invalidating all tenants' caches every time a setting is changed in the admin panel causes the hit rate to drop sharply and costs to spike temporarily.
A practical mitigation for these issues is Lazy Invalidation. By version-controlling prompt changes and switching to the new version at the time of the next request rather than purging immediately, you can suppress the cascading increase in write costs.
FAQ
Q1. Is Prompt Caching available with all LLM providers?
Support is growing among major providers, but the conditions vary by model and subscription plan. OpenAI requires at least 1,024 tokens for cache activation, and in-memory caches expire in approximately 5–10 minutes. Anthropic (Claude) has a default TTL of 5 minutes, with a 1-hour option also available. On AWS Bedrock, thresholds differ by model — for example, Claude Opus 4.5 requires a minimum of 4,096 tokens. We recommend checking each provider's official documentation for supported models and minimum token requirements before adoption.
Q2. When using caching in a multi-tenant environment, is there a risk of data leaking between tenants?
If cache key design is flawed, there is a risk that different tenants' contexts become associated with the same cache entry. As a countermeasure, it is effective to always include the tenant ID in the cache key and to clearly separate tenant-specific system prompts from shared prefixes. Applying the principle of Privacy by Isolation — designing the system so that cross-tenant cache references cannot structurally occur — is essential. In addition to measuring cache hit rates, we recommend periodically reviewing audit logs of tenant boundaries.
Q3. How should cache TTL be configured to maximize cost efficiency?
The optimal TTL depends on request frequency and how often system prompts are updated. Under Anthropic's pricing structure, cache write costs are 1.25× the base input price (5-minute TTL) or 2× (1-hour TTL), while cache reads drop to 0.1×. A 1-hour TTL is advantageous for tenants with frequent requests, while a 5-minute TTL is often more cost-appropriate for low-frequency tenants. It is effective to periodically revisit TTL choices based on actual measured cache hit rates.
Q4. How should caching be handled when a tenant frequently updates their system prompt?
When a system prompt changes, the corresponding cache entry must be invalidated. Designing the cache key to incorporate the hash of the prompt means that a prompt change automatically generates a different cache entry, preventing references to stale entries. Tenants with high update frequency tend to have lower cache hit rates, so we recommend monitoring per-tenant hit rates on an AI observability dashboard and adjusting TTL settings and caching strategies on an individual basis.
Q5. How can the effectiveness of Prompt Caching be evaluated quantitatively?
With OpenAI, you can check the number of cache-hit tokens via the usage.prompt_tokens_details.cached_tokens field in the API response. Accumulating this data allows you to calculate the cache hit rate and the number of tokens saved per tenant. Combining three metrics — "cache hit rate," "tokens saved," and "latency improvement" — is a practical approach to evaluation. Referencing resources such as How to Measure the Impact of AI Agent Adoption | From KPI Design to Continuous Improvement and establishing a regular measurement cycle will enable ongoing cost optimization.
Author & Supervisor
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).


