LLM Cost Optimization Guide — Token Reduction, Model Selection, and Cache Implementation

Lead
LLM cost optimization is the ongoing effort to reduce API expenses and inference costs across three axes—token consumption, model selection, and cache utilization—while maintaining the accuracy and quality of generative AI systems.
Cases have been reported where monthly costs balloon to several times the initial estimate the moment a system goes into production. This is due to the accumulation of token waste that was invisible during the PoC stage, over-specified model selection, and duplicate requests caused by unused caching.
This article is aimed at engineers, architects, and LLM FinOps practitioners operating LLMs in production, and systematically covers four pillars: token reduction, model selection, prompt caching, and RAG design. Readers can learn step-by-step implementation patterns for cutting costs in half without sacrificing accuracy.
As production adoption of generative AI accelerates, the operational costs of LLMs (large language models) tend to grow at an unexpected pace. Cases have been reported where token consumption—difficult to observe during the proof-of-concept (PoC) stage—drives up monthly costs the moment it is exposed to real user traffic. Deferring cost management causes AI ROI (return on investment) projections to break down and can affect business decisions. This section organizes the structure behind cost growth and the approach to advancing optimization without sacrificing accuracy.
The Structure of Escalating Monthly Costs in Enterprise Operations
Cases have been reported where monthly costs balloon to several times the initial estimate immediately after an LLM (large language model) goes into production. Continuing operations without understanding this structure allows costs that were invisible during the PoC (proof-of-concept) stage to keep accumulating.
The main drivers of cost growth can be classified into the following three layers.
① Token consumption bloat
- Redundant explanations and unnecessary examples accumulate in the system prompt, consuming hundreds to thousands of tokens per request
- RAG (Retrieval-Augmented Generation) search results are stuffed directly into the context window, sending even low-relevance documents to the model
- In multi-step reasoning and agent orchestration, the same information is repeatedly transmitted across multiple turns
② Inefficient model selection
- High-performance dense models are applied uniformly even to lightweight tasks such as classification and summarization
- Reasoning models significantly increase output tokens due to CoT (chain-of-thought), making them cost-ineffective for simple tasks
③ Unused caching
- Even when identical or similar prompts are sent repeatedly, prompt caching is not configured, resulting in full charges on every request
When these factors compound, monthly costs tend to grow faster than the rate of increase in request volume. The first step toward optimization is identifying which layer is generating costs in your own environment.
Defining the Cost Optimization vs. Accuracy Trade-off
Before pursuing cost reduction, it is essential to clearly define the acceptable threshold for accuracy degradation. Skipping this definition often leads to optimization initiatives being derailed by pushback from the field.
Three axes for structuring the trade-off
- Accuracy requirements: Whether incorrect answers are acceptable, and if so, within what percentage
- Latency requirements: The upper limit on response time that does not impact user experience
- Cost budget: The maximum value that can be set as a monthly or per-request cost ceiling
Proceeding with optimization without first reaching agreement on these three axes causes evaluation criteria to shift from one initiative to the next.
Examples of acceptable thresholds by task
The acceptable trade-offs tend to vary significantly depending on the nature of the task.
- For low-risk tasks such as internal FAQ search and routine classification, prioritizing cost reduction even at the expense of a few percentage points of accuracy is rational
- For high-risk tasks such as contract review and medical information delivery, maintaining accuracy is the top priority, and the room for cost reduction is limited
- For intermediate tasks such as summarization and translation, adjustments can be made incrementally based on evaluation metrics (BLEU scores or human evaluation)
The concept of a "regression budget"
Defining the acceptable margin of accuracy degradation as a numerical value is referred to as a "regression budget." For example, setting a threshold such as "accuracy may not drop by more than 2 percentage points" enables quantitative evaluation of the impact of model changes or prompt compression. Since this budget is also used in subsequent evaluation phases, it is important to reach agreement with stakeholders at this stage.
Cost and accuracy are not necessarily a zero-sum trade-off. With the right measurement infrastructure in place, initiatives that improve both can sometimes be found. The next section explains how to build that measurement infrastructure.
Prerequisites — Cost Visibility and Baseline Measurement
Before implementing cost reduction measures, the indispensable first step is to "accurately understand the current state." Without visibility into what needs to be reduced, it is impossible to prioritize initiatives or measure their effectiveness.
This section explains how to build a measurement infrastructure that visualizes token consumption and costs, and how to establish the baseline that serves as the starting point for optimization. The necessary components for implementation—from selecting observability tools to designing FinOps tags—are organized in sequence.
Building a Token-Based Cost Measurement Foundation
The first step in cost optimization is accurately understanding "what you are spending money on and how much." Since LLM (Large Language Model) billing occurs on a per-token (Token) basis, tracking only the number of requests does not reveal the true picture.
Minimum Metrics to Measure
- Input token count: The total of the entire prompt (system prompt + user message + context)
- Output token count: The length of the text generated by the model
- Cache hit count: The number of times prompt caching was applied and the amount of tokens saved
- Model identifier: When multiple models are used within the same application, aggregate by model
Each API response includes a usage object, and simply recording prompt_tokens and completion_tokens for every request provides the foundational data. Inserting a thin middleware layer early on that writes these values to a data store will make subsequent tuning significantly easier.
Understanding the Characteristics of BPE Tokenizers
BPE tokenizers (Byte-Pair Encoding Tokenizers) tend to convert multi-byte characters such as Japanese and Chinese into more tokens than English. Since cost varies by language even for the same amount of information, per-language aggregation is essential for multilingual products.
Implementation Priorities
- Save the
usagefield from all API responses to logs (no DROP allowed) - Attach tags for endpoint, user ID, and feature name
- Automatically aggregate daily and weekly cost calculations using token unit price × usage
Only once a measurement foundation is in place can you quantitatively determine which endpoints have a "poor cost-to-value ratio." The observability stack introduced in the next section functions on top of this foundation.
Observability Stack (Langfuse / LangSmith / Helicone / OTel GenAI semconv) and FinOps Tag Design
Once the cost measurement foundation is in place, the next step is to introduce an observability stack that visualizes the details of each request. The following are guidelines for selecting tools.
- Langfuse: Primarily open-source with self-hosting capability. Records token counts, latency, and cost at the trace level, and excels at cross-team cost comparison.
- LangSmith: Highly compatible with the LangChain ecosystem and can visualize intermediate steps of agents.
- Helicone: Proxy-based, requiring minimal changes to existing code. Features a simple dashboard suited for small teams.
- OTel GenAI semconv: OpenTelemetry's semantic conventions for generative AI. Standardizes vendor-neutral span attributes (e.g.,
gen_ai.usage.input_tokens) and integrates easily with existing observability platforms (Grafana, Datadog, etc.).
One aspect that tends to be overlooked after selecting a tool is FinOps tag design. Without tags, it becomes impossible to later isolate costs by "which team, which use case, and which model." It is recommended to attach at minimum the following 4 dimensions as tags.
| Tag Key | Example |
|---|---|
team | search, support, analytics |
use_case | summarization, rag, code_review |
model | gpt, claude, gemini |
env | prod, staging |
Tags are embedded as metadata at request time and filtered on the observability tool side. Adding tags retroactively breaks log continuity, so it is important to design them at the start of a project. Only when both visualization and tagging are in place does it become possible to measure the effectiveness of the token reduction measures described in the next step.
Step 1: Token Reduction — Prompt Design and Compression
The first step in reducing costs is to decrease the number of tokens (Tokens) sent to the LLM itself. Before changing models or caching strategies, there are many cases where simply revisiting prompt design can significantly compress both input and output costs.
In this step, we cover two approaches: structuring the system prompt (System Prompt) and compressing long-form context. Both have low implementation costs and offer the advantage of allowing you to measure their effects on the same day they are applied.
Structuring Redundant System Prompts
The system prompt (System Prompt) is billed as tokens (Tokens) on every request to the LLM (Large Language Model). Leaving a lengthy prompt unattended tends to have a non-negligible impact on monthly costs.
To first understand the current state, measure the token count of your system prompt. Using a BPE tokenizer (Byte-Pair Encoding Tokenizer), Japanese text often consumes roughly 1–2 tokens per character. Cases have been reported where a 500-character prompt exceeds 700 tokens in practice, so optimization without measurement is inadvisable.
4 Steps for Structuring
- Remove duplication: Consolidate instructions with overlapping meaning—such as "Please answer politely" and "Please be attentive to the user"—into a single line.
- Convert negatives to positives: Rewriting expressions like "Please do not ~" into positive form tends to reduce token count.
- Use Markdown bullet points: Bullet points are often more token-efficient than long paragraphs.
- Simplify role definitions: Lengthy role definitions such as "You are an expert in ○○ with a background in ~~…" should be shortened to retain only the essential points.
Before / After Benchmarks
Cases have been reported where system prompts that previously exceeded 500 tokens were reduced to the 200–300 token range through the above refinements. As the proportion of the overall context window (Context Window) occupied by the system prompt decreases, there is also the secondary benefit of being able to allocate more space to user turns.
Note that when utilizing prompt caching (detailed in the next step), it is important to fix the leading portion of the system prompt in order to maximize the cache hit rate. Being mindful of cache design at the same time as structuring allows the optimization effects to compound.
Context Compression and Criteria for Applying LLMLingua / LongLLMLingua
As the context window (Context Window) grows longer, input token counts tend to increase super-linearly, causing costs to spike sharply. For use cases where long inputs are unavoidable—such as document summarization, long-form QA, and multi-turn conversations—context compression is an effective measure.
LLMLingua and LongLLMLingua are representative OSS libraries. The following serves as a guideline for choosing between them.
- LLMLingua: Targets medium-length prompts of a few thousand tokens and removes tokens with low importance scores. A certain degree of compression effectiveness has been reported, making it suited for short-to-medium summarization and classification tasks.
- LongLLMLingua: Targets long texts on the scale of tens of thousands of tokens and selectively retains important chunks based on their relevance to the query. Particularly effective when retrieval results are numerous in RAG (Retrieval-Augmented Generation) scenarios.
However, applying these tools requires clear judgment criteria. Consider adoption when the following conditions are met:
- A significant portion of input tokens consists of boilerplate phrases and redundant background explanations.
- An acceptable threshold for degradation in response accuracy has been defined in advance (linked to the regression budget described later).
- The latency of the compression process itself is within an acceptable range.
On the other hand, there are also situations where application should be avoided. In domains such as legal documents and medical records, where missing context can be fatal, the risk of misinformation increases. Additionally, due to the characteristics of BPE tokenizers (Byte-Pair Encoding Tokenizers), Japanese may yield lower compression efficiency than English, making per-language validation essential.
It is recommended to always measure the three metrics of token count, accuracy, and latency before and after implementation to quantitatively confirm the cost reduction effect.
Step 2: Model Selection — Tiered Design by Task
Alongside token reduction, model selection is another lever that directly drives down costs. Applying the highest-performing LLM (Large Language Model) to every request causes costs to balloon without limit. On the other hand, tiering models according to task difficulty can yield significant cost savings while maintaining quality.
This section covers model selection along two axes: routing strategies and the use of local LLMs.
- Routing to lightweight models: Automatically dispatching requests to models based on task complexity
- TCO decisions for hybrid configurations: Identifying the break-even point between cloud and on-premises deployments
Routing Strategies to Lightweight Models (Selection Criteria for RouteLLM / Martian / OpenRouter)
Continuously sending all requests to high-performance models leads to unbounded costs. A "routing strategy" — distributing requests across models based on task complexity — is the core of LLM cost optimization.
The fundamental idea behind routing
- Simple classification and summarization → Handled by SLMs or lightweight models
- Complex reasoning and multi-step tasks → Escalated to high-performance models
- The router dynamically selects a model on a per-request basis
Criteria for selecting key tools
RouteLLM is an OSS framework that calculates a difficulty score in real time and forwards requests to a higher-tier model only when the score exceeds a threshold. Its strength lies in the ability to numerically tune the trade-off between cost reduction and quality degradation. Note that calibration to your own traffic patterns is required, so budget for initial setup costs.
Martian is a router provided as a cloud API that automatically classifies task characteristics to select a model. While it requires minimal implementation effort, vendor lock-in and additional API costs must be taken into account.
OpenRouter is a proxy that aggregates models from multiple providers under a single endpoint. It makes price comparison and automatic fallback straightforward, making it a useful starting point for multi-model experimentation.
Implementation considerations
Never forget that the router itself introduces latency and cost. Using a high-performance model for routing decisions defeats the purpose. Additionally, low routing accuracy increases the risk of quality degradation, so regular accuracy validation using an evaluation dataset is essential. Combining routing with local LLMs — covered in the next section — can yield further cost reductions.
TCO Decision Criteria for Local LLMs and Hybrid Configurations
Local LLMs (self-hosting open-weight models) eliminate API billing, but hidden costs accumulate in the form of GPU procurement, infrastructure operations, and model updates. Without an accurate TCO (Total Cost of Ownership) estimate before deployment, it is not uncommon to end up paying more than you would with a cloud API.
Key cost items to examine in a TCO comparison
- Cloud API side: Input/output token unit price × monthly request volume; retry costs when latency SLAs are exceeded
- Local LLM side: GPU/instance costs (on-premises or cloud GPU); engineering hours for building and maintaining the model serving infrastructure; costs for tuning via quantization or LoRA
- Common to both: Security audits, monitoring and logging infrastructure, personnel costs for the engineers responsible
Scenarios where a hybrid configuration is effective
A hybrid configuration combining local LLMs with cloud APIs tends to be cost-effective when the following conditions overlap:
- A large proportion of requests contain internal documents or personal information that cannot be sent to the cloud
- The same prompt patterns repeat frequently and throughput is consistently high (enabling sustained GPU utilization)
- Lightweight tasks are processed locally with an SLM (Small Language Model), while only complex reasoning is routed to the cloud API
Decision guidelines
When monthly token volume exceeds a certain threshold and there is a reasonable expectation of keeping GPUs running at high utilization, a local configuration is more likely to offer a cost advantage. Conversely, when requests are sporadic, the cost of idle GPUs becomes a burden. A practical approach is to first measure actual cloud API costs at PoC scale, then compare that figure against the TCO of a local configuration before making a migration decision.
Step 3: Prompt Caching and Result Reuse
Once the foundation has been established through token reduction and model selection, the next area to address is cost compression through caching. Running full inference on every request for the same input is a direct waste of computational resources.
There are broadly two approaches to prompt caching and result reuse: native caching features provided by the model provider, and semantic caching implemented at the application layer. Because the two differ in both mechanism and applicable scenarios, it is important to understand the criteria for choosing between them.
The H3 sections that follow provide a detailed explanation of the differences in caching specifications across Anthropic, OpenAI, and Google, as well as how to handle the risk of false cache hits.
Spec Differences Across Anthropic / OpenAI / Google and an Application Decision Flowchart
Prompt caching specifications differ by provider, so it is important to understand those differences before implementation. Flawed design has been reported to result in caching that does not function as intended, reducing the cost-saving effect to nearly zero.
Specification comparison across the three major providers
| Item | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Scope | System prompts, long-form context | System prompts | System prompts, long-form context |
| Minimum cache unit | 1,024 tokens or more | 1,024 tokens or more | Refer to official documentation |
| Cache retention period | Approximately 5 minutes (TTL-based) | Per session | Minimum 1 hour and above (configurable) |
| Pricing model | Surcharge on cache write; discount on cache read | Discounted read cost | Separate storage cost applies |
※ The above values are reference figures at the time of writing. Always check the latest pricing pages.
Decision flow for applicability
- Is the system prompt fewer than 1,024 tokens? → If yes, no caching benefit can be obtained. Prioritize token reduction first.
- Is the request frequency sufficient? → Not suitable for low-frequency batch processing where reuse within the TTL is unlikely.
- Is the beginning of the prompt fixed? → Caching is predicated on prefix matching. A design that places dynamic variables at the end is essential.
- When using Google Gemini → The
cachedContentAPI must be called explicitly, and note that storage costs are added.
Practical points for selection
- Anthropic is well-suited to high-frequency use cases with long-form context
- OpenAI operates on a per-session basis, so it tends to be effective for chatbot-type use cases
- Google allows longer cache retention periods, but requires careful calculation of the balance against storage costs
Keep in mind that the design philosophy differs from the semantic caching covered in the next section.
Semantic Cache Implementation Patterns and False Positive Risks
Semantic caching works by converting input prompts into embeddings, searching a vector database for similar queries, and reusing past responses. Because it eliminates the API call itself, it can deliver greater cost savings than prompt caching.
Basic Implementation Flow
- Vectorize user input using an embedding model
- Calculate cosine similarity in a vector database (Pinecone, Qdrant, Redis VSS, etc.)
- If similarity exceeds a threshold (e.g., 0.92 or above), return the cached response
- If below the threshold, call the LLM and add the result to the cache
The threshold setting determines operational quality. Set it too low and false positives increase, creating the risk of returning incorrect answers to semantically different questions. Set it too high and the cache hit rate drops, diminishing the cost-saving effect.
Patterns Prone to False Positives
- Queries that share a similar structure but require different answers, such as "What's the weather in Tokyo?" vs. "What's the weather in Osaka?"
- Cases where numbers or proper nouns differ but context is similar (e.g., "2023 sales" vs. "2024 sales")
- Questions that require personalization per user
Implementation Strategies to Mitigate Risk
- Metadata filtering: Add user ID, tenant ID, and date as filter conditions to narrow the scope
- TTL settings for entries: Set short expiration times for time-sensitive information to prevent stale responses from being reused
- Cache layer log collection: Send input/output pairs at cache hit time to an AI observability platform and conduct regular quality reviews
Semantic caching is powerful, but it carries an inherent risk of accuracy degradation. It is essential to design operations that monitor both hit rate and quality in combination with the evaluation datasets covered in the next section.
Operations — Accuracy Evaluation and Guardrails
The moment a cost-reduction measure goes into production, the first question asked is: "Has accuracy degraded?" Token reduction, model switching, and cache introduction all carry the risk of quality degradation. This section organizes the design of evaluation datasets for quantitatively comparing quality before and after changes, along with the concept of guardrail operations for safely advancing cost reduction.
Evaluation Dataset and Regression Budget Design
The more cost optimization measures are layered on, the more quietly the risk of accuracy degradation accumulates. It is not uncommon for user complaints to surge the month after celebrating a successful reduction. Managing that risk quantitatively is the purpose of evaluation datasets and regression budget design.
Principles for Structuring an Evaluation Dataset
- Golden set: Representative input/output pairs extracted from production logs. Aim to collect a minimum of 100–300 examples
- Edge case set: Intentionally include cases where incorrect answers were reported and boundary conditions
- Task-based splits: Separate evaluation metrics by task type—summarization, classification, generation, etc. (e.g., F1 for classification, BERTScore for summarization)
Do not treat the dataset as something built once and forgotten. It is important to operate with the practice of adding new error patterns to the set whenever they are observed in production.
How to Design a Regression Budget
A regression budget is a pre-defined allowance for how much accuracy degradation is acceptable for a given optimization measure.
- Example: Keep accuracy on primary tasks within ±2%, do not allow the hallucination rate to worsen relative to the current baseline, etc.
- It is best managed as a concept where each measure consumes part of the budget, with an automated rollback trigger when the allowance is exceeded
Integration into CI/CD
Evaluation should be embedded in the deployment pipeline and run automatically with each measure. Integrating with AI observability tools (such as Langfuse) enables continuous monitoring by linking production traces to evaluation scores. Having cost-reduction impact and accuracy changes visible on the same dashboard is the ideal state for LLM FinOps.
FAQ (Common Failures and Anti-Patterns)
In LLM cost optimization practice, teams repeatedly fall into the same pitfalls. Here is a summary of the most common anti-patterns.
Q1. "Switching to a cheaper model degraded accuracy"
A common scenario is switching to a lightweight model without preparing an evaluation dataset and relying on intuition alone. The mitigation is the regression budget approach described in the previous section—define acceptable error tolerances by task before migrating.
Q2. "I enabled prompt caching but costs didn't go down"
The three most common causes are:
- The cached prefix varies from conversation to conversation
- Timestamps or dynamic variables are mixed into the end of the system prompt
- Prompts are too short to meet the minimum token count (1,024 tokens for Anthropic)
The basic approach is to consolidate dynamic elements at the end of the prompt and keep the static portion at the beginning fixed.
Q3. "Semantic caching returned an incorrect response"
Setting the similarity threshold too low causes false hits, where an old response is returned for a question with a different meaning. It is recommended to start with a threshold around 0.95 and adjust based on the vocabulary characteristics of the domain.
Q4. "After introducing cost-reduction measures, the numbers don't match the monthly report"
This is a case where FinOps tag design is insufficient, causing costs across different models and features to become mixed together. If a tagging taxonomy is not established at project launch, separating costs after the fact tends to be difficult.
Common lesson: Optimization without measurement easily becomes counter-optimization. It is important to make changes one at a time and build the habit of always verifying impact with A/B testing.
Conclusion
LLM cost optimization is not a set-it-and-forget-it measure. What matters is a cycle of continuous improvement that combines the four pillars of token reduction, model selection, prompt caching, and RAG design.
Reviewing the approaches covered in this article, the following order of priority emerges:
- Visibility first: Optimization cannot begin without a cost measurement foundation. The starting point is understanding the baseline of token consumption using AI observability tools
- Token reduction next: Structuring system prompts and compressing context requires no additional infrastructure and delivers quick results
- Design a model tier hierarchy: Rather than routing all requests to a high-performance model, route to SLMs or local LLMs based on task complexity
- Eliminate duplication with caching: Combining prompt caching and semantic caching can dramatically compress the cost of repetitive processing
Accuracy and cost are not a trade-off—they can coexist with the right design. Establishing evaluation datasets and a regression budget is essential for building a system that quantitatively monitors whether cost reduction is causing quality degradation.
LLM FinOps is a field that will continue to evolve, and it requires a mindset of revisiting routing strategies in response to specification changes from providers and the emergence of new models. Use the framework in this article as a foundation to map out an optimization roadmap suited to your own use cases.
Author & Supervisor
Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).


