LLM Cost Optimization Guide — Token Reduction, Model Selection, and Cache Implementation

Updated:April 22, 2026Published:April 22, 2026

Lead

LLM cost optimization is the ongoing effort to reduce API expenses and inference costs across three axes—token consumption, model selection, and cache utilization—while maintaining the accuracy and quality of generative AI systems.

Cases have been reported where monthly costs balloon to several times the initial estimate the moment a system goes into production. This is due to the accumulation of token waste that was invisible during the PoC stage, over-specified model selection, and duplicate requests caused by unused caching.

This article is aimed at engineers, architects, and LLM FinOps practitioners operating LLMs in production, and systematically covers four pillars: token reduction, model selection, prompt caching, and RAG design. Readers can learn step-by-step implementation patterns for cutting costs in half without sacrificing accuracy.

As production adoption of generative AI accelerates, the operational costs of LLMs (large language models) tend to grow at an unexpected pace. Cases have been reported where token consumption—difficult to observe during the proof-of-concept (PoC) stage—drives up monthly costs the moment it is exposed to real user traffic. Deferring cost management causes AI ROI (return on investment) projections to break down and can affect business decisions. This section organizes the structure behind cost growth and the approach to advancing optimization without sacrificing accuracy.

The Structure of Escalating Monthly Costs in Enterprise Operations

Cases have been reported where monthly costs balloon to several times the initial estimate immediately after an LLM (large language model) goes into production. Continuing operations without understanding this structure allows costs that were invisible during the PoC (proof-of-concept) stage to keep accumulating.

The main drivers of cost growth can be classified into the following three layers.

① Token consumption bloat

Redundant explanations and unnecessary examples accumulate in the system prompt, consuming hundreds to thousands of tokens per request
RAG (Retrieval-Augmented Generation) search results are stuffed directly into the context window, sending even low-relevance documents to the model
In multi-step reasoning and agent orchestration, the same information is repeatedly transmitted across multiple turns

② Inefficient model selection

High-performance dense models are applied uniformly even to lightweight tasks such as classification and summarization
Reasoning models significantly increase output tokens due to CoT (chain-of-thought), making them cost-ineffective for simple tasks

③ Unused caching

Even when identical or similar prompts are sent repeatedly, prompt caching is not configured, resulting in full charges on every request

When these factors compound, monthly costs tend to grow faster than the rate of increase in request volume. The first step toward optimization is identifying which layer is generating costs in your own environment.

Defining the Cost Optimization vs. Accuracy Trade-off

Before pursuing cost reduction, it is essential to clearly define the acceptable threshold for accuracy degradation. Skipping this definition often leads to optimization initiatives being derailed by pushback from the field.

Three axes for structuring the trade-off

Accuracy requirements: Whether incorrect answers are acceptable, and if so, within what percentage
Latency requirements: The upper limit on response time that does not impact user experience
Cost budget: The maximum value that can be set as a monthly or per-request cost ceiling

Proceeding with optimization without first reaching agreement on these three axes causes evaluation criteria to shift from one initiative to the next.

Examples of acceptable thresholds by task

The acceptable trade-offs tend to vary significantly depending on the nature of the task.

For low-risk tasks such as internal FAQ search and routine classification, prioritizing cost reduction even at the expense of a few percentage points of accuracy is rational
For high-risk tasks such as contract review and medical information delivery, maintaining accuracy is the top priority, and the room for cost reduction is limited
For intermediate tasks such as summarization and translation, adjustments can be made incrementally based on evaluation metrics (BLEU scores or human evaluation)

The concept of a "regression budget"

Defining the acceptable margin of accuracy degradation as a numerical value is referred to as a "regression budget." For example, setting a threshold such as "accuracy may not drop by more than 2 percentage points" enables quantitative evaluation of the impact of model changes or prompt compression. Since this budget is also used in subsequent evaluation phases, it is important to reach agreement with stakeholders at this stage.

Cost and accuracy are not necessarily a zero-sum trade-off. With the right measurement infrastructure in place, initiatives that improve both can sometimes be found. The next section explains how to build that measurement infrastructure.

Prerequisites — Cost Visibility and Baseline Measurement

Before implementing cost reduction measures, the indispensable first step is to "accurately understand the current state." Without visibility into what needs to be reduced, it is impossible to prioritize initiatives or measure their effectiveness.

This section explains how to build a measurement infrastructure that visualizes token consumption and costs, and how to establish the baseline that serves as the starting point for optimization. The necessary components for implementation—from selecting observability tools to designing FinOps tags—are organized in sequence.

Building a Token-Based Cost Measurement Foundation

The first step in cost optimization is accurately understanding "what you are spending money on and how much." Since LLM (Large Language Model) billing occurs on a per-token (Token) basis, tracking only the number of requests does not reveal the true picture.

Minimum Metrics to Measure

Input token count: The total of the entire prompt (system prompt + user message + context)
Output token count: The length of the text generated by the model
Cache hit count: The number of times prompt caching was applied and the amount of tokens saved
Model identifier: When multiple models are used within the same application, aggregate by model

Each API response includes a usage object, and simply recording prompt_tokens and completion_tokens for every request provides the foundational data. Inserting a thin middleware layer early on that writes these values to a data store will make subsequent tuning significantly easier.

Understanding the Characteristics of BPE Tokenizers

BPE tokenizers (Byte-Pair Encoding Tokenizers) tend to convert multi-byte characters such as Japanese and Chinese into more tokens than English. Since cost varies by language even for the same amount of information, per-language aggregation is essential for multilingual products.

Implementation Priorities

Save the usage field from all API responses to logs (no DROP allowed)
Attach tags for endpoint, user ID, and feature name
Automatically aggregate daily and weekly cost calculations using token unit price × usage

Only once a measurement foundation is in place can you quantitatively determine which endpoints have a "poor cost-to-value ratio." The observability stack introduced in the next section functions on top of this foundation.

Observability Stack (Langfuse / LangSmith / Helicone / OTel GenAI semconv) and FinOps Tag Design

Once the cost measurement foundation is in place, the next step is to introduce an observability stack that visualizes the details of each request. The following are guidelines for selecting tools.

Langfuse: Primarily open-source with self-hosting capability. Records token counts, latency, and cost at the trace level, and excels at cross-team cost comparison.
LangSmith: Highly compatible with the LangChain ecosystem and can visualize intermediate steps of agents.
Helicone: Proxy-based, requiring minimal changes to existing code. Features a simple dashboard suited for small teams.
OTel GenAI semconv: OpenTelemetry's semantic conventions for generative AI. Standardizes vendor-neutral span attributes (e.g., gen_ai.usage.input_tokens) and integrates easily with existing observability platforms (Grafana, Datadog, etc.).

One aspect that tends to be overlooked after selecting a tool is FinOps tag design. Without tags, it becomes impossible to later isolate costs by "which team, which use case, and which model." It is recommended to attach at minimum the following 4 dimensions as tags.

Tag Key	Example
`team`	search, support, analytics
`use_case`	summarization, rag, code_review
`model`	gpt, claude, gemini
`env`	prod, staging

Tags are embedded as metadata at request time and filtered on the observability tool side. Adding tags retroactively breaks log continuity, so it is important to design them at the start of a project. Only when both visualization and tagging are in place does it become possible to measure the effectiveness of the token reduction measures described in the next step.

Step 1: Token Reduction — Prompt Design and Compression

The first step in reducing costs is to decrease the number of tokens (Tokens) sent to the LLM itself. Before changing models or caching strategies, there are many cases where simply revisiting prompt design can significantly compress both input and output costs.

In this step, we cover two approaches: structuring the system prompt (System Prompt) and compressing long-form context. Both have low implementation costs and offer the advantage of allowing you to measure their effects on the same day they are applied.

Structuring Redundant System Prompts

The system prompt (System Prompt) is billed as tokens (Tokens) on every request to the LLM (Large Language Model). Leaving a lengthy prompt unattended tends to have a non-negligible impact on monthly costs.

To first understand the current state, measure the token count of your system prompt. Using a BPE tokenizer (Byte-Pair Encoding Tokenizer), Japanese text often consumes roughly 1–2 tokens per character. Cases have been reported where a 500-character prompt exceeds 700 tokens in practice, so optimization without measurement is inadvisable.

4 Steps for Structuring

Remove duplication: Consolidate instructions with overlapping meaning—such as "Please answer politely" and "Please be attentive to the user"—into a single line.
Convert negatives to positives: Rewriting expressions like "Please do not ~" into positive form tends to reduce token count.
Use Markdown bullet points: Bullet points are often more token-efficient than long paragraphs.
Simplify role definitions: Lengthy role definitions such as "You are an expert in ○○ with a background in ~~…" should be shortened to retain only the essential points.

Before / After Benchmarks

Cases have been reported where system prompts that previously exceeded 500 tokens were reduced to the 200–300 token range through the above refinements. As the proportion of the overall context window (Context Window) occupied by the system prompt decreases, there is also the secondary benefit of being able to allocate more space to user turns.

Note that when utilizing prompt caching (detailed in the next step), it is important to fix the leading portion of the system prompt in order to maximize the cache hit rate. Being mindful of cache design at the same time as structuring allows the optimization effects to compound.

Context Compression and Criteria for Applying LLMLingua / LongLLMLingua

As the context window (Context Window) grows longer, input token counts tend to increase super-linearly, causing costs to spike sharply. For use cases where long inputs are unavoidable—such as document summarization, long-form QA, and multi-turn conversations—context compression is an effective measure.

LLMLingua and LongLLMLingua are representative OSS libraries. The following serves as a guideline for choosing between them.

LLMLingua: Targets medium-length prompts of a few thousand tokens and removes tokens with low importance scores. A certain degree of compression effectiveness has been reported, making it suited for short-to-medium summarization and classification tasks.
LongLLMLingua: Targets long texts on the scale of tens of thousands of tokens and selectively retains important chunks based on their relevance to the query. Particularly effective when retrieval results are numerous in RAG (Retrieval-Augmented Generation) scenarios.

However, applying these tools requires clear judgment criteria. Consider adoption when the following conditions are met:

A significant portion of input tokens consists of boilerplate phrases and redundant background explanations.
An acceptable threshold for degradation in response accuracy has been defined in advance (linked to the regression budget described later).
The latency of the compression process itself is within an acceptable range.

On the other hand, there are also situations where application should be avoided. In domains such as legal documents and medical records, where missing context can be fatal, the risk of misinformation increases. Additionally, due to the characteristics of BPE tokenizers (Byte-Pair Encoding Tokenizers), Japanese may yield lower compression efficiency than English, making per-language validation essential.

It is recommended to always measure the three metrics of token count, accuracy, and latency before and after implementation to quantitatively confirm the cost reduction effect.

Step 2: Model Selection — Tiered Design by Task

Alongside token reduction, model selection is another lever that directly drives down costs. Applying the highest-performing LLM (Large Language Model) to every request causes costs to balloon without limit. On the other hand, tiering models according to task difficulty can yield significant cost savings while maintaining quality.

This section covers model selection along two axes: routing strategies and the use of local LLMs.

Routing to lightweight models: Automatically dispatching requests to models based on task complexity
TCO decisions for hybrid configurations: Identifying the break-even point between cloud and on-premises deployments

Routing Strategies to Lightweight Models (Selection Criteria for RouteLLM / Martian / OpenRouter)

Continuously sending all requests to high-performance models leads to unbounded costs. A "routing strategy" — distributing requests across models based on task complexity — is the core of LLM cost optimization.

The fundamental idea behind routing

Simple classification and summarization → Handled by SLMs or lightweight models
Complex reasoning and multi-step tasks → Escalated to high-performance models
The router dynamically selects a model on a per-request basis

Criteria for selecting key tools

RouteLLM is an OSS framework that calculates a difficulty score in real time and forwards requests to a higher-tier model only when the score exceeds a threshold. Its strength lies in the ability to numerically tune the trade-off between cost reduction and quality degradation. Note that calibration to your own traffic patterns is required, so budget for initial setup costs.

Martian is a router provided as a cloud API that automatically classifies task characteristics to select a model. While it requires minimal implementation effort, vendor lock-in and additional API costs must be taken into account.

OpenRouter is a proxy that aggregates models from multiple providers under a single endpoint. It makes price comparison and automatic fallback straightforward, making it a useful starting point for multi-model experimentation.

Implementation considerations

Never forget that the router itself introduces latency and cost. Using a high-performance model for routing decisions defeats the purpose. Additionally, low routing accuracy increases the risk of quality degradation, so regular accuracy validation using an evaluation dataset is essential. Combining routing with local LLMs — covered in the next section — can yield further cost reductions.

TCO Decision Criteria for Local LLMs and Hybrid Configurations

Local LLMs (self-hosting open-weight models) eliminate API billing, but hidden costs accumulate in the form of GPU procurement, infrastructure operations, and model updates. Without an accurate TCO (Total Cost of Ownership) estimate before deployment, it is not uncommon to end up paying more than you would with a cloud API.

Key cost items to examine in a TCO comparison

Cloud API side: Input/output token unit price × monthly request volume; retry costs when latency SLAs are exceeded
Local LLM side: GPU/instance costs (on-premises or cloud GPU); engineering hours for building and maintaining the model serving infrastructure; costs for tuning via quantization or LoRA
Common to both: Security audits, monitoring and logging infrastructure, personnel costs for the engineers responsible

Scenarios where a hybrid configuration is effective

A hybrid configuration combining local LLMs with cloud APIs tends to be cost-effective when the following conditions overlap:

A large proportion of requests contain internal documents or personal information that cannot be sent to the cloud
The same prompt patterns repeat frequently and throughput is consistently high (enabling sustained GPU utilization)
Lightweight tasks are processed locally with an SLM (Small Language Model), while only complex reasoning is routed to the cloud API

Decision guidelines

When monthly token volume exceeds a certain threshold and there is a reasonable expectation of keeping GPUs running at high utilization, a local configuration is more likely to offer a cost advantage. Conversely, when requests are sporadic, the cost of idle GPUs becomes a burden. A practical approach is to first measure actual cloud API costs at PoC scale, then compare that figure against the TCO of a local configuration before making a migration decision.

Step 3: Prompt Caching and Result Reuse

Once the foundation has been established through token reduction and model selection, the next area to address is cost compression through caching. Running full inference on every request for the same input is a direct waste of computational resources.

There are broadly two approaches to prompt caching and result reuse: native caching features provided by the model provider, and semantic caching implemented at the application layer. Because the two differ in both mechanism and applicable scenarios, it is important to understand the criteria for choosing between them.

The H3 sections that follow provide a detailed explanation of the differences in caching specifications across Anthropic, OpenAI, and Google, as well as how to handle the risk of false cache hits.

Spec Differences Across Anthropic / OpenAI / Google and an Application Decision Flowchart

Prompt caching specifications differ by provider, so it is important to understand those differences before implementation. Flawed design has been reported to result in caching that does not function as intended, reducing the cost-saving effect to nearly zero.

Specification comparison across the three major providers

Item	Anthropic (Claude)	OpenAI (GPT)	Google (Gemini)
Scope	System prompts, long-form context	System prompts	System prompts, long-form context
Minimum cache unit	1,024 tokens or more	1,024 tokens or more	Refer to official documentation
Cache retention period	Approximately 5 minutes (TTL-based)	Per session	Minimum 1 hour and above (configurable)
Pricing model	Surcharge on cache write; discount on cache read	Discounted read cost	Separate storage cost applies

※ The above values are reference figures at the time of writing. Always check the latest pricing pages.

Decision flow for applicability

Is the system prompt fewer than 1,024 tokens? → If yes, no caching benefit can be obtained. Prioritize token reduction first.
Is the request frequency sufficient? → Not suitable for low-frequency batch processing where reuse within the TTL is unlikely.
Is the beginning of the prompt fixed? → Caching is predicated on prefix matching. A design that places dynamic variables at the end is essential.
When using Google Gemini → The cachedContent API must be called explicitly, and note that storage costs are added.

Practical points for selection

Anthropic is well-suited to high-frequency use cases with long-form context
OpenAI operates on a per-session basis, so it tends to be effective for chatbot-type use cases
Google allows longer cache retention periods, but requires careful calculation of the balance against storage costs

Keep in mind that the design philosophy differs from the semantic caching covered in the next section.

Semantic Cache Implementation Patterns and False Positive Risks

Semantic caching works by converting input prompts into embeddings, searching a vector database for similar queries, and reusing past responses. Because it eliminates the API call itself, it can deliver greater cost savings than prompt caching.

Basic Implementation Flow

Vectorize user input using an embedding model
Calculate cosine similarity in a vector database (Pinecone, Qdrant, Redis VSS, etc.)
If similarity exceeds a threshold (e.g., 0.92 or above), return the cached response
If below the threshold, call the LLM and add the result to the cache

The threshold setting determines operational quality. Set it too low and false positives increase, creating the risk of returning incorrect answers to semantically different questions. Set it too high and the cache hit rate drops, diminishing the cost-saving effect.

Patterns Prone to False Positives

Queries that share a similar structure but require different answers, such as "What's the weather in Tokyo?" vs. "What's the weather in Osaka?"
Cases where numbers or proper nouns differ but context is similar (e.g., "2023 sales" vs. "2024 sales")
Questions that require personalization per user

Implementation Strategies to Mitigate Risk

Metadata filtering: Add user ID, tenant ID, and date as filter conditions to narrow the scope
TTL settings for entries: Set short expiration times for time-sensitive information to prevent stale responses from being reused
Cache layer log collection: Send input/output pairs at cache hit time to an AI observability platform and conduct regular quality reviews

Semantic caching is powerful, but it carries an inherent risk of accuracy degradation. It is essential to design operations that monitor both hit rate and quality in combination with the evaluation datasets covered in the next section.

Operations — Accuracy Evaluation and Guardrails

The moment a cost-reduction measure goes into production, the first question asked is: "Has accuracy degraded?" Token reduction, model switching, and cache introduction all carry the risk of quality degradation. This section organizes the design of evaluation datasets for quantitatively comparing quality before and after changes, along with the concept of guardrail operations for safely advancing cost reduction.

Evaluation Dataset and Regression Budget Design

The more cost optimization measures are layered on, the more quietly the risk of accuracy degradation accumulates. It is not uncommon for user complaints to surge the month after celebrating a successful reduction. Managing that risk quantitatively is the purpose of evaluation datasets and regression budget design.

Principles for Structuring an Evaluation Dataset

Golden set: Representative input/output pairs extracted from production logs. Aim to collect a minimum of 100–300 examples
Edge case set: Intentionally include cases where incorrect answers were reported and boundary conditions
Task-based splits: Separate evaluation metrics by task type—summarization, classification, generation, etc. (e.g., F1 for classification, BERTScore for summarization)

Do not treat the dataset as something built once and forgotten. It is important to operate with the practice of adding new error patterns to the set whenever they are observed in production.

How to Design a Regression Budget

A regression budget is a pre-defined allowance for how much accuracy degradation is acceptable for a given optimization measure.

Example: Keep accuracy on primary tasks within ±2%, do not allow the hallucination rate to worsen relative to the current baseline, etc.
It is best managed as a concept where each measure consumes part of the budget, with an automated rollback trigger when the allowance is exceeded

Integration into CI/CD

Evaluation should be embedded in the deployment pipeline and run automatically with each measure. Integrating with AI observability tools (such as Langfuse) enables continuous monitoring by linking production traces to evaluation scores. Having cost-reduction impact and accuracy changes visible on the same dashboard is the ideal state for LLM FinOps.

FAQ (Common Failures and Anti-Patterns)

In LLM cost optimization practice, teams repeatedly fall into the same pitfalls. Here is a summary of the most common anti-patterns.

Q1. "Switching to a cheaper model degraded accuracy"

A common scenario is switching to a lightweight model without preparing an evaluation dataset and relying on intuition alone. The mitigation is the regression budget approach described in the previous section—define acceptable error tolerances by task before migrating.

Q2. "I enabled prompt caching but costs didn't go down"

The three most common causes are:

The cached prefix varies from conversation to conversation
Timestamps or dynamic variables are mixed into the end of the system prompt
Prompts are too short to meet the minimum token count (1,024 tokens for Anthropic)

The basic approach is to consolidate dynamic elements at the end of the prompt and keep the static portion at the beginning fixed.

Q3. "Semantic caching returned an incorrect response"

Setting the similarity threshold too low causes false hits, where an old response is returned for a question with a different meaning. It is recommended to start with a threshold around 0.95 and adjust based on the vocabulary characteristics of the domain.

Q4. "After introducing cost-reduction measures, the numbers don't match the monthly report"

This is a case where FinOps tag design is insufficient, causing costs across different models and features to become mixed together. If a tagging taxonomy is not established at project launch, separating costs after the fact tends to be difficult.

Common lesson: Optimization without measurement easily becomes counter-optimization. It is important to make changes one at a time and build the habit of always verifying impact with A/B testing.

Conclusion

LLM cost optimization is not a set-it-and-forget-it measure. What matters is a cycle of continuous improvement that combines the four pillars of token reduction, model selection, prompt caching, and RAG design.

Reviewing the approaches covered in this article, the following order of priority emerges:

Visibility first: Optimization cannot begin without a cost measurement foundation. The starting point is understanding the baseline of token consumption using AI observability tools
Token reduction next: Structuring system prompts and compressing context requires no additional infrastructure and delivers quick results
Design a model tier hierarchy: Rather than routing all requests to a high-performance model, route to SLMs or local LLMs based on task complexity
Eliminate duplication with caching: Combining prompt caching and semantic caching can dramatically compress the cost of repetitive processing

Accuracy and cost are not a trade-off—they can coexist with the right design. Establishing evaluation datasets and a regression budget is essential for building a system that quantitatively monitors whether cost reduction is causing quality degradation.

LLM FinOps is a field that will continue to evolve, and it requires a mindset of revisiting routing strategies in response to specification changes from providers and the emergence of new models. Use the framework in this article as a foundation to map out an optimization roadmap suited to your own use cases.

Author & Supervisor

Yusuke Ishihara

Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).