
LLM cost optimization is the ongoing effort to reduce API expenses and inference costs across three axes—token consumption, model selection, and cache utilization—while maintaining the accuracy and quality of generative AI systems.
Cases have been reported where monthly costs balloon to several times the initial estimate the moment a system goes into production. This is due to the accumulation of token waste that was invisible during the PoC stage, over-specified model selection, and duplicate requests caused by unused caching.
This article is aimed at engineers, architects, and LLM FinOps practitioners operating LLMs in production, and systematically covers four pillars: token reduction, model selection, prompt caching, and RAG design. Readers can learn step-by-step implementation patterns for cutting costs in half without sacrificing accuracy.
As production adoption of generative AI accelerates, the operational costs of LLMs (large language models) tend to grow at an unexpected pace. Cases have been reported where token consumption—difficult to observe during the proof-of-concept (PoC) stage—drives up monthly costs the moment it is exposed to real user traffic. Deferring cost management causes AI ROI (return on investment) projections to break down and can affect business decisions. This section organizes the structure behind cost growth and the approach to advancing optimization without sacrificing accuracy.
Monthly costs that balloon to several times the initial estimate immediately after an LLM goes into production are a commonly reported pattern. Continuing operations without understanding this structure allows costs that were invisible during the PoC (proof-of-concept) stage to keep accumulating.
The main drivers of cost growth can be classified into the following three layers.
① Token consumption bloat
② Inefficient model selection
③ Unused caching
When these factors compound, monthly costs tend to grow faster than the rate of increase in request volume. The first step toward optimization is identifying which layer is generating costs in your own environment.
Before pursuing cost reduction, it is essential to clearly define the acceptable threshold for accuracy degradation. Skipping this definition often leads to optimization initiatives being derailed by pushback from the field.
Three axes for structuring the trade-off
Proceeding with optimization without first reaching agreement on these three axes causes evaluation criteria to shift from one initiative to the next.
Examples of acceptable thresholds by task
The acceptable trade-offs tend to vary significantly depending on the nature of the task.
The concept of a "regression budget"
Defining the acceptable margin of accuracy degradation as a numerical value is referred to as a "regression budget." For example, setting a threshold such as "accuracy may not drop by more than 2 percentage points" enables quantitative evaluation of the impact of model changes or prompt compression. Since this budget is also used in subsequent evaluation phases, it is important to reach agreement with stakeholders at this stage.
Cost and accuracy are not necessarily a zero-sum trade-off. With the right measurement infrastructure in place, initiatives that improve both can sometimes be found. The next section explains how to build that measurement infrastructure.
Before implementing cost reduction measures, the indispensable first step is to "accurately understand the current state." Without visibility into what needs to be reduced, it is impossible to prioritize initiatives or measure their effectiveness.
This section explains how to build a measurement infrastructure that visualizes token consumption and costs, and how to establish the baseline that serves as the starting point for optimization. The necessary components for implementation—from selecting observability tools to designing FinOps tags—are organized in sequence.
The first step in cost optimization is accurately understanding "what you are spending money on and how much." Since LLM billing is per token, tracking only the number of requests does not reveal the true picture.
Minimum Metrics to Measure
Each API response includes a usage object, and simply recording prompt_tokens and completion_tokens for every request provides the foundational data. Inserting a thin middleware layer early on that writes these values to a data store will make subsequent tuning significantly easier.
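As a minimal sketch of that middleware layer, the following records the usage object from each response to a JSONL log. The response shape mirrors common chat-completion payloads; the log path and helper name are illustrative assumptions, not a specific SDK's API.

```python
import json
import os
import tempfile
import time

# Illustrative log destination; in production this would be a proper data store
LOG_PATH = os.path.join(tempfile.gettempdir(), "llm_usage.jsonl")

def log_usage(response: dict, log_path: str = LOG_PATH) -> dict:
    """Append prompt/completion token counts from one API response to a JSONL log."""
    usage = response.get("usage", {})
    record = {
        "ts": time.time(),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Mocked response shaped like a typical chat-completion payload
resp = {"usage": {"prompt_tokens": 420, "completion_tokens": 85}}
rec = log_usage(resp)
```

Wrapping every outbound API call with a helper like this is the cheapest way to guarantee the usage field is never dropped.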
Understanding the Characteristics of BPE Tokenizers
BPE tokenizers (Byte-Pair Encoding Tokenizers) tend to convert multi-byte characters such as Japanese and Chinese into more tokens than English. Since cost varies by language even for the same amount of information, per-language aggregation is essential for multilingual products.
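Per-language aggregation needs nothing more than a language tag recorded alongside each request's token counts. A minimal rollup sketch follows; the record shape is an assumption, not tied to any SDK.

```python
from collections import defaultdict

def aggregate_by_language(records):
    """Sum prompt/completion tokens per language tag."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    for r in records:
        lang = r.get("lang", "unknown")
        totals[lang]["prompt"] += r["prompt_tokens"]
        totals[lang]["completion"] += r["completion_tokens"]
    return dict(totals)

# Hypothetical usage records carrying a language tag
records = [
    {"lang": "ja", "prompt_tokens": 900, "completion_tokens": 300},
    {"lang": "en", "prompt_tokens": 500, "completion_tokens": 250},
    {"lang": "ja", "prompt_tokens": 700, "completion_tokens": 200},
]
totals = aggregate_by_language(records)
```

Comparing per-language totals against per-language request counts makes tokenizer-driven cost differences visible at a glance.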
Implementation Priorities
At minimum, record the usage field from every API response in logs (never drop it).

Only once a measurement foundation is in place can you quantitatively determine which endpoints have a "poor cost-to-value ratio." The observability stack introduced in the next section functions on top of this foundation.
Once the cost measurement foundation is in place, the next step is to introduce an observability stack that visualizes the details of each request. The following are guidelines for selecting tools.
Prefer tools that emit the OpenTelemetry GenAI semantic-convention attributes (such as gen_ai.usage.input_tokens) and integrate easily with existing observability platforms (Grafana, Datadog, etc.).

One aspect that tends to be overlooked after selecting a tool is FinOps tag design. Without tags, it becomes impossible to later isolate costs by "which team, which use case, and which model." It is recommended to attach at minimum the following four dimensions as tags.
| Tag Key | Example |
|---|---|
| team | search, support, analytics |
| use_case | summarization, rag, code_review |
| model | gpt, claude, gemini |
| env | prod, staging |
Tags are embedded as metadata at request time and filtered on the observability tool side. Adding tags retroactively breaks log continuity, so it is important to design them at the start of a project. Only when both visualization and tagging are in place does it become possible to measure the effectiveness of the token reduction measures described in the next step.
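A sketch of embedding the four dimensions as request-time metadata. The payload shape and metadata key are assumptions; how tags are actually passed varies by provider and observability tool.

```python
def tag_request(payload: dict, *, team: str, use_case: str,
                model: str, env: str = "prod") -> dict:
    """Attach the four minimum FinOps dimensions as request metadata."""
    tagged = dict(payload)  # copy so the original payload is untouched
    tagged["metadata"] = {
        "team": team,
        "use_case": use_case,
        "model": model,
        "env": env,
    }
    return tagged

req = tag_request(
    {"messages": [{"role": "user", "content": "Summarize this."}]},
    team="support", use_case="summarization", model="claude",
)
```

Centralizing this in one helper guarantees no request leaves the system untagged, which is what makes retroactive cost isolation possible.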
The first step in reducing costs is to decrease the number of tokens sent to the LLM itself. Before changing models or caching strategies, simply revisiting prompt design can often compress both input and output costs significantly.
In this step, we cover two approaches: structuring the system prompt and compressing long-form context. Both have low implementation costs, and their effects can be measured the same day they are applied.
The system prompt is billed as tokens on every request to the LLM. Leaving a lengthy prompt unattended tends to have a non-negligible impact on monthly costs.
First, measure the token count of your system prompt to understand the current state. With a BPE tokenizer (Byte-Pair Encoding), Japanese text often consumes roughly 1–2 tokens per character; cases have been reported where a 500-character prompt exceeds 700 tokens in practice, so optimization without measurement is inadvisable.
Four Steps for Structuring
Before / After Benchmarks
Cases have been reported where system prompts that previously exceeded 500 tokens were reduced to the 200–300 token range through the above refinements. As the system prompt occupies a smaller share of the overall context window, there is also the secondary benefit of allocating more space to user turns.
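The cost impact of such a reduction is easy to estimate. In this sketch the per-token price and request volume are placeholder assumptions, not any provider's actual figures.

```python
# Rough monthly savings from trimming a system prompt (illustrative figures only)
price_per_mtok = 3.00            # assumed $ per million input tokens
requests_per_month = 1_000_000   # assumed traffic
before_tokens, after_tokens = 520, 250

def monthly_input_cost(tokens_per_request: int) -> float:
    """Monthly input cost contributed by the system prompt alone."""
    return tokens_per_request * requests_per_month * price_per_mtok / 1_000_000

savings = monthly_input_cost(before_tokens) - monthly_input_cost(after_tokens)
```

Because the system prompt is resent on every request, even a few hundred tokens saved compounds linearly with traffic.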
Note that when utilizing prompt caching (detailed in the next step), it is important to fix the leading portion of the system prompt in order to maximize the cache hit rate. Being mindful of cache design at the same time as structuring allows the optimization effects to compound.
As contexts grow longer, input token counts rise, and in multi-turn conversations the entire history is re-sent on every turn, so costs can climb far faster than request volume. For use cases where long inputs are unavoidable, such as document summarization, long-form QA, and multi-turn conversations, context compression is an effective measure.
LLMLingua and LongLLMLingua are representative OSS libraries; LongLLMLingua extends the approach for long-context settings such as RAG and long-document QA.
However, applying these tools requires clear judgment criteria. Adoption is worth considering when inputs are long and redundant, and when losing some peripheral detail is tolerable for the task.
On the other hand, there are also situations where application should be avoided. In domains such as legal documents and medical records, where missing context can be fatal, the risk of misinformation increases. Additionally, due to the characteristics of BPE tokenizers (Byte-Pair Encoding Tokenizers), Japanese may yield lower compression efficiency than English, making per-language validation essential.
It is recommended to always measure the three metrics of token count, accuracy, and latency before and after implementation to quantitatively confirm the cost reduction effect.
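A minimal before/after harness for two of those three metrics, token count and latency, might look like the following. The call signature is a stand-in, and accuracy scoring would come from an evaluation dataset rather than this harness.

```python
import time

def measure(call, prompt: str) -> dict:
    """Run one request and capture token count and latency."""
    t0 = time.perf_counter()
    result = call(prompt)
    latency = time.perf_counter() - t0
    return {"tokens": result["total_tokens"], "latency_s": latency,
            "output": result["text"]}

# Stub standing in for an LLM call; word count fakes the token count
def fake_call(prompt: str) -> dict:
    return {"total_tokens": len(prompt.split()), "text": "ok"}

before = measure(fake_call, "a long uncompressed prompt with many redundant words")
after = measure(fake_call, "compressed prompt")
reduction = 1 - after["tokens"] / before["tokens"]
```

Running the same harness over a fixed prompt set before and after compression gives the quantitative comparison the text recommends.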
Alongside token reduction, model selection is another lever that directly drives down costs. Applying the highest-performing LLM (Large Language Model) to every request causes costs to balloon without limit. On the other hand, tiering models according to task difficulty can yield significant cost savings while maintaining quality.
This section covers model selection along two axes: routing strategies and the use of local LLMs.
Continuously sending all requests to high-performance models leads to unbounded costs. A "routing strategy" — distributing requests across models based on task complexity — is the core of LLM cost optimization.
The fundamental idea behind routing
Criteria for selecting key tools
RouteLLM is an OSS framework that calculates a difficulty score in real time and forwards requests to a higher-tier model only when the score exceeds a threshold. Its strength lies in the ability to numerically tune the trade-off between cost reduction and quality degradation. Note that calibration to your own traffic patterns is required, so budget for initial setup costs.
Martian is a router provided as a cloud API that automatically classifies task characteristics to select a model. While it requires minimal implementation effort, vendor lock-in and additional API costs must be taken into account.
OpenRouter is a proxy that aggregates models from multiple providers under a single endpoint. It makes price comparison and automatic fallback straightforward, making it a useful starting point for multi-model experimentation.
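To make the routing idea concrete, here is a toy threshold router. The difficulty heuristic is deliberately simplistic and is not RouteLLM's scorer; a production router would use a trained classifier calibrated on real traffic.

```python
def difficulty_score(prompt: str) -> float:
    """Toy heuristic: longer prompts and reasoning-style keywords score higher."""
    score = min(len(prompt) / 500, 1.0)
    if any(k in prompt.lower() for k in ("why", "prove", "compare", "design")):
        score = min(score + 0.4, 1.0)
    return score

def route(prompt: str, threshold: float = 0.5) -> str:
    """Send only above-threshold requests to the expensive tier."""
    return "large-model" if difficulty_score(prompt) >= threshold else "small-model"

easy = route("What time is it?")
hard = route("Design a sharded database schema and compare the consistency trade-offs")
```

The threshold is the knob that trades cost against quality: lowering it routes more traffic to the large model, raising it saves more but risks under-serving hard requests.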
Implementation considerations
Never forget that the router itself introduces latency and cost. Using a high-performance model for routing decisions defeats the purpose. Additionally, low routing accuracy increases the risk of quality degradation, so regular accuracy validation using an evaluation dataset is essential. Combining routing with local LLMs — covered in the next section — can yield further cost reductions.
Local LLMs (self-hosting open-weight models) eliminate API billing, but hidden costs accumulate in the form of GPU procurement, infrastructure operations, and model updates. Without an accurate TCO (Total Cost of Ownership) estimate before deployment, it is not uncommon to end up paying more than you would with a cloud API.
Key cost items to examine in a TCO comparison
Scenarios where a hybrid configuration is effective
A hybrid configuration combining local LLMs with cloud APIs tends to be cost-effective when high-volume, steady traffic that keeps GPUs well utilized coexists with sporadic or spiky workloads better left on a cloud API.
Decision guidelines
When monthly token volume exceeds a certain threshold and there is a reasonable expectation of keeping GPUs running at high utilization, a local configuration is more likely to offer a cost advantage. Conversely, when requests are sporadic, the cost of idle GPUs becomes a burden. A practical approach is to first measure actual cloud API costs at PoC scale, then compare that figure against the TCO of a local configuration before making a migration decision.
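The break-even logic reduces to simple arithmetic. Both prices below are placeholder assumptions to be replaced with your measured PoC figures and a full TCO estimate.

```python
# Break-even sketch: cloud API billing vs. a self-hosted GPU node.
api_cost_per_mtok = 3.00    # assumed $ per million tokens, blended in/out
gpu_monthly_tco = 4_500.0   # assumed $/month: amortized hardware + power + ops

def breakeven_mtok() -> float:
    """Monthly token volume (millions) above which local hosting wins on cost."""
    return gpu_monthly_tco / api_cost_per_mtok

threshold = breakeven_mtok()
```

Note that the comparison only holds if the GPU actually stays utilized; at low or bursty volume the fixed monthly TCO is paid regardless, which is exactly the idle-GPU burden the text describes.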
Once the foundation has been established through token reduction and model selection, the next area to address is cost compression through caching. Running full inference on every request for the same input is a direct waste of computational resources.
There are broadly two approaches to prompt caching and result reuse: native caching features provided by the model provider, and semantic caching implemented at the application layer. Because the two differ in both mechanism and applicable scenarios, it is important to understand the criteria for choosing between them.
The H3 sections that follow provide a detailed explanation of the differences in caching specifications across Anthropic, OpenAI, and Google, as well as how to handle the risk of false cache hits.
Prompt caching specifications differ by provider, so it is important to understand those differences before implementation. Flawed design has been reported to result in caching that does not function as intended, reducing the cost-saving effect to nearly zero.
Specification comparison across the three major providers
| Item | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Scope | System prompts, long-form context | System prompts | System prompts, long-form context |
| Minimum cache unit | 1,024 tokens or more | 1,024 tokens or more | Refer to official documentation |
| Cache retention period | Approximately 5 minutes (TTL-based) | Per session | 1 hour or longer (configurable) |
| Pricing model | Surcharge on cache write; discount on cache read | Discounted read cost | Separate storage cost applies |
※ The above values are reference figures at the time of writing. Always check the latest pricing pages.
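To see why a write surcharge is usually worth paying, consider this arithmetic sketch. The price and the surcharge/discount multipliers are illustrative assumptions, not any provider's actual rates.

```python
# Effect of prompt caching on input cost (illustrative multipliers only).
base_price = 3.00                     # assumed $ per million input tokens
write_mult, read_mult = 1.25, 0.10    # assumed cache-write surcharge / read discount
prefix_mtok = 2.0                     # cached prefix volume, millions of tokens

def cached_cost(hits: int) -> float:
    """One cache write plus `hits` discounted reads of the same prefix."""
    return prefix_mtok * base_price * (write_mult + hits * read_mult)

def uncached_cost(hits: int) -> float:
    """Re-sending the full prefix at the base price every time."""
    return prefix_mtok * base_price * (1 + hits)
```

The crossover comes quickly: the write premium is recovered after only a few cache hits, after which every reuse is nearly free relative to resending the prefix.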
Decision flow for applicability
For Gemini, the cachedContent API must be called explicitly, and note that storage costs are added.

Practical points for selection
Keep in mind that the design philosophy differs from the semantic caching covered in the next section.
Semantic caching works by converting input prompts into embeddings, searching a vector database for similar queries, and reusing past responses. Because it eliminates the API call itself, it can deliver greater cost savings than prompt caching.
Basic Implementation Flow
The threshold setting determines operational quality. Set it too low and false positives increase, creating the risk of returning incorrect answers to semantically different questions. Set it too high and the cache hit rate drops, diminishing the cost-saving effect.
Patterns Prone to False Positives
Implementation Strategies to Mitigate Risk
Semantic caching is powerful, but it carries an inherent risk of accuracy degradation. It is essential to design operations that monitor both hit rate and quality in combination with the evaluation datasets covered in the next section.
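The embed, search, threshold-check, reuse flow can be sketched in a few dozen lines. The bag-of-characters embedding is a toy stand-in for a real embedding model, and the linear scan stands in for a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Embed the query, find the nearest cached entry, reuse it above a threshold."""
    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold, self.entries = embed, threshold, []

    def get(self, query: str):
        q = self.embed(query)
        best, best_sim = None, -1.0
        for vec, response in self.entries:
            sim = cosine(q, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None  # miss → call the LLM

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))

# Toy embedding: character-frequency vector (a real system would use a model)
def toy_embed(text: str):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz "]

cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("what is prompt caching", "Prompt caching reuses a fixed prefix.")
hit = cache.get("what is prompt caching")                # identical query → hit
miss = cache.get("refund policy for enterprise plans")   # unrelated → None
```

The threshold parameter is exactly the dial discussed above: 0.95 here is a conservative starting point, to be tuned per domain against measured false-hit rates.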
The moment a cost-reduction measure goes into production, the first question asked is: "Has accuracy degraded?" Token reduction, model switching, and cache introduction all carry the risk of quality degradation. This section organizes the design of evaluation datasets for quantitatively comparing quality before and after changes, along with the concept of guardrail operations for safely advancing cost reduction.
The more cost optimization measures are layered on, the more quietly the risk of accuracy degradation accumulates. It is not uncommon for user complaints to surge the month after celebrating a successful reduction. Managing that risk quantitatively is the purpose of evaluation datasets and regression budget design.
Principles for Structuring an Evaluation Dataset
Do not treat the dataset as something built once and forgotten. It is important to operate with the practice of adding new error patterns to the set whenever they are observed in production.
How to Design a Regression Budget
A regression budget is a pre-defined allowance for how much accuracy degradation is acceptable for a given optimization measure.
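A regression budget turns into a one-function deploy gate. The 2-percentage-point budget and the accuracy scores below are illustrative.

```python
def within_regression_budget(baseline_acc: float, candidate_acc: float,
                             budget_pp: float = 2.0) -> bool:
    """Gate a change: fail if accuracy drops more than `budget_pp` percentage points."""
    drop_pp = (baseline_acc - candidate_acc) * 100
    return drop_pp <= budget_pp

# Baseline scores 91.0%; a cheaper candidate at 89.5% is a 1.5pp drop → passes
ok = within_regression_budget(0.910, 0.895)
# A candidate at 88.0% is a 3.0pp drop → exceeds the budget
bad = within_regression_budget(0.910, 0.880)
```

Wiring this check into CI means a cost-reduction change that blows the budget never reaches production silently.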
Integration into CI/CD
Evaluation should be embedded in the deployment pipeline and run automatically with each measure. Integrating with AI observability tools (such as Langfuse) enables continuous monitoring by linking production traces to evaluation scores. Having cost-reduction impact and accuracy changes visible on the same dashboard is the ideal state for LLM FinOps.
In LLM cost optimization practice, teams repeatedly fall into the same pitfalls. Here is a summary of the most common anti-patterns.
Q1. "Switching to a cheaper model degraded accuracy"
A common scenario is switching to a lightweight model without preparing an evaluation dataset and relying on intuition alone. The mitigation is the regression budget approach described in the previous section—define acceptable error tolerances by task before migrating.
Q2. "I enabled prompt caching but costs didn't go down"
The three most common causes are: the cached portion falls short of the provider's minimum cacheable length (often 1,024 tokens), requests arrive further apart than the cache TTL, and dynamic elements at the head of the prompt break prefix matching.
The basic approach is to consolidate dynamic elements at the end of the prompt and keep the static portion at the beginning fixed.
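A sketch of that structure follows; the system text, date field, and message shape are invented for illustration.

```python
# Keep the static portion as a stable prefix so prefix matching can succeed.
STATIC_SYSTEM = (
    "You are a support assistant. Follow the policy below.\n"
    "<policy text>"
)

def build_messages(user_question: str, todays_date: str) -> list:
    # BAD: f"{todays_date}\n{STATIC_SYSTEM}" would change the prefix every day
    # and defeat the cache. Dynamic values belong at the end instead.
    user = f"Today's date: {todays_date}\n\n{user_question}"
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # fixed prefix
        {"role": "user", "content": user},             # dynamic suffix
    ]

msgs = build_messages("How do I reset my password?", "2025-01-15")
```

Any value that changes per request or per day, dates, user IDs, retrieved context, should sit after the static block, never inside or before it.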
Q3. "Semantic caching returned an incorrect response"
Setting the similarity threshold too low causes false hits, where an old response is returned for a question with a different meaning. It is recommended to start with a threshold around 0.95 and adjust based on the vocabulary characteristics of the domain.
Q4. "After introducing cost-reduction measures, the numbers don't match the monthly report"
This is a case where FinOps tag design is insufficient, causing costs across different models and features to become mixed together. If a tagging taxonomy is not established at project launch, separating costs after the fact tends to be difficult.
Common lesson: Optimization without measurement easily becomes counter-optimization. It is important to make changes one at a time and build the habit of always verifying impact with A/B testing.
LLM cost optimization is not a set-it-and-forget-it measure. What matters is a cycle of continuous improvement that combines the four pillars of token reduction, model selection, prompt caching, and RAG design.
Reviewing the approaches covered in this article, the order of priority is: establish measurement first, then proceed through token reduction, model selection, and caching, verifying each change against the evaluation framework.
Accuracy and cost are not a trade-off—they can coexist with the right design. Establishing evaluation datasets and a regression budget is essential for building a system that quantitatively monitors whether cost reduction is causing quality degradation.
LLM FinOps is a field that will continue to evolve, and it requires a mindset of revisiting routing strategies in response to specification changes from providers and the emergence of new models. Use the framework in this article as a foundation to map out an optimization roadmap suited to your own use cases.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).