
Context engineering is a design methodology for optimizing the tokens input to an LLM at inference time — including system prompts, instructions, knowledge, history, and tool definitions. Positioned as a superset of "prompt engineering," which focuses solely on refining prompt text, the concept has been advocated by Anthropic since mid-2025 and is recognized as a core technology in OpenAI's Agents SDK and Vercel's production agents.
This article is intended for teams developing LLM applications in B2B contexts, and covers the differences from prompt engineering, the components of the context window, design patterns, and operational pitfalls. After reading, you should come away with a perspective that reframes your own RAG systems and agents as "a problem of designing the entire context."
Context engineering is the practice of designing the entire information environment that an LLM "sees" at inference time. While prompt engineering is confined to refining instruction text, context engineering treats memory, retrieval results, tool definitions, and conversation history as a single information architecture.
It has become widely recognized in production environments that the quality of LLM applications depends less on how prompts are written and more on "what is included in the context window, in what order, and in what quantity." Anthropic's engineering blog defines context as "the set of tokens sampled at inference time" and frames optimizing its utility as an engineering problem of managing a finite resource (source: Anthropic Engineering – Effective context engineering for AI agents).
Prompt engineering focuses on how to craft the instructions (prompts) given to an LLM in order to elicit better task performance. Context engineering, by contrast, extends the scope of design to the elements surrounding those instructions — namely retrieved documents (RAG), past conversation history, definitions of available tools, long-term memory, and few-shot examples.
The relationship between the two is akin to "part and whole." Prompt engineering is one component of context engineering, and on its own it cannot sustain the quality of a production agent. This is because agents operating across multiple turns accumulate tool outputs and intermediate reasoning with each turn, making it impossible to manage the quality of the growing context simply by refining the prompt itself.
As a concrete example, Vercel's engineering blog reported that reducing the number of tools in their agent from 15 to 2 resulted in accuracy improving from 80% to 100% on a 5-query benchmark, a 37% reduction in token count, and a 3.5x improvement in response speed. This was not the result of rewriting prompt text — it was the result of narrowing down the information loaded into the context.
The rapid rise in attention to context engineering from mid-2025 onward is rooted in the shift of LLM applications from "one-off question answering" to "multi-turn autonomous agents" as the primary use case.
Agents accumulate retrieval results, tool execution outputs, and intermediate reasoning into the context one after another within their reasoning loops. Anthropic has framed this as "a problem of managing a finite resource" and classifies failure modes into three categories: first, hallucination due to insufficient information; second, context overflow due to excessive information; and third, degraded relevance due to poor placement of information.
This framework makes explicit the limits of the conventional mindset that "clever prompting can solve anything." Without designing the shape of the context itself, long-running agents will inevitably see their performance degrade sooner or later.
The elements loaded into the context window can be broadly classified into six categories. Because each one influences the LLM's judgment differently, designing with an awareness of each element's role is essential.
Decomposing the components rather than lumping everything together as simply "the prompt" is the first step in context design. In our own work, when designing internal agents, we frequently begin with a workshop that breaks down the context into these six elements.
A system prompt is the foundational text that instructs an LLM on its role, constraints, and output format. It is typically kept fixed across all turns and forms the base of the context. System prompts that are too long—for example, exceeding 3,000 tokens—can crowd out the dynamic elements that accumulate on top of them, so caution is warranted.
Task instructions correspond to the specific request delivered by the user each turn. While the system prompt defines "how to behave," task instructions define "what to do this time."
Few-shot examples are input-output pairs inserted to demonstrate the desired output format or reasoning process. More examples are not necessarily better; the established practice is to carefully select 2–5 examples that cover a diverse range of patterns. Tasks where output variability is a concern tend to benefit most from well-designed few-shot examples.
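As a minimal sketch of this selection principle, the helper below renders a small, capped set of input-output pairs as prompt text. The example pairs and the `build_few_shot_block` helper are illustrative assumptions, not from any particular SDK.

```python
# Illustrative few-shot examples for a hypothetical intent-classification task.
FEW_SHOT_EXAMPLES = [
    {"input": "Refund request for order #1234",
     "output": '{"intent": "refund", "order_id": "1234"}'},
    {"input": "Where is my package?",
     "output": '{"intent": "tracking", "order_id": null}'},
    {"input": "Cancel my subscription",
     "output": '{"intent": "cancel", "order_id": null}'},
]

def build_few_shot_block(examples, max_examples=5):
    """Render up to max_examples input/output pairs as prompt text,
    enforcing the 2-5 example guideline at assembly time."""
    lines = []
    for ex in examples[:max_examples]:
        lines.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    return "\n\n".join(lines)
```

The cap lives in code rather than convention, so adding a sixth example to the list cannot silently inflate the context.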
External knowledge inserted via RAG refers to relevant documents retrieved through vector search or full-text search. Too many documents dilute relevance, while too few lead to insufficient information and increased hallucination. For accuracy improvements through hybrid search, see the related article "What Is Hybrid Search? The Mechanism and Implementation for Improving RAG Accuracy with Vector Search × Full-Text Search."
Tool definitions are the specifications (name, arguments, and description) of functions that an agent can call. As demonstrated by Vercel's case study, simply including unnecessary tools causes a notable drop in accuracy. The principle is to limit tools not to "whatever might be useful" but to "what will actually be used in the current task."
Conversation history and long-term memory are the elements that retain the user's utterances, model responses, and tool results from past turns in chronological order. In long-running tasks, this area expands the fastest, making the compression strategies discussed later essential.
Rather than treating context as a static "monolithic prompt," production deployments require a mindset of dynamically assembling the necessary elements for each task. This section organizes three design patterns that are repeatedly used in practice.
The most fundamental pattern is the dynamic slot method, in which slots (placeholders) are defined in a template and filled with the necessary elements for each turn before sending. Slots are divided into sections such as "system instructions," "task definition," "relevant documents," "tool definitions," "recent history," and "user utterance," with the content for each slot generated by separate logic.
The advantage of this approach is that it makes it easy to isolate which elements are contributing to quality. For example, A/B testing only the relevant documents slot allows the impact of improvements to the retrieval strategy to be measured independently. Conversely, if slot separation is neglected and a large prompt string is assembled directly, it becomes extremely difficult to retroactively identify which part caused a failure.
When we designed an internal inquiry agent at our company, we initially operated with a single prompt template, but the handling of history and the handling of search results became entangled, making it impossible to isolate problems. After introducing slots, we were able to discuss improvements to the retrieval layer and improvements to the history strategy in parallel.
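The dynamic slot method can be sketched as a template with named slots, each filled by separate logic per turn. The slot names and the `assemble_context` helper are illustrative assumptions, not a specific framework's API.

```python
# A template with named slots; each slot is produced by its own logic
# (retrieval, history strategy, tool filtering) and filled per turn.
PROMPT_TEMPLATE = """{system_instructions}

## Task
{task_definition}

## Relevant documents
{relevant_documents}

## Recent history
{recent_history}

## User
{user_message}"""

SLOT_NAMES = ["system_instructions", "task_definition", "relevant_documents",
              "recent_history", "user_message"]

def assemble_context(slots: dict) -> str:
    """Fill each slot; unfilled slots get an explicit '(none)' marker so
    a missing element is visible in logs rather than silently blank."""
    defaults = {name: "(none)" for name in SLOT_NAMES}
    return PROMPT_TEMPLATE.format(**{**defaults, **slots})
```

Because each slot is a separate key, an A/B test can swap only `relevant_documents` while holding every other slot constant.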
When the number of documents retrieved via RAG or conversation history entries is large, rather than stuffing everything in, items are fed into the context in priority order based on relevance scores. It is common practice to combine vector similarity, BM25, and reranker models (e.g., Cohere Rerank) and retain only the top N results.
A common pitfall in designing priority control is mechanically cutting off results at a score threshold. If contextually inseparable pieces of information that straddle the threshold—such as a clause and its supplement within the same contract—are cut off, the coherence of the output breaks down. In practice, the selection logic often settles on a hybrid approach of "top N results by score + related sections within the same document."
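The hybrid selection logic described above can be sketched as follows: keep the top-N chunks by rerank score, then pull in sibling sections from the same source document so a threshold cut does not split inseparable clauses. The chunk dictionary shape is an assumption for illustration.

```python
def select_chunks(chunks, top_n=3):
    """chunks: list of dicts with 'doc_id', 'section', 'score'.
    Returns the top-N by score plus lower-scoring sections that
    belong to an already-selected document."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    selected = ranked[:top_n]
    selected_docs = {c["doc_id"] for c in selected}
    # Re-admit same-document sections that fell below the cut,
    # so a clause and its supplement stay together.
    for c in ranked[top_n:]:
        if c["doc_id"] in selected_docs:
            selected.append(c)
    return selected
```

In production this sibling rule is usually bounded (e.g. adjacent sections only) to keep it from re-inflating the context.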
For tasks where evaluating relevance itself is difficult, adding domain-specific correct pairs to the reranker's training data tends to have a significant effect. This is also a key mitigation strategy highlighted in the related article "10 Failure Patterns in RAG Construction and How to Avoid Them."
In long-running agents, conversation history accumulates in the tens of thousands of tokens, making context compression necessary under certain conditions. The "compaction" approach advocated by Anthropic is a technique in which the model itself summarizes older history and replaces it with the resulting summary text.
Another important pattern is memory separation. Short-term conversation history and persistently retained user preferences and project information are stored in separate stores, and selectively injected into the context as needed. The session memory provided by OpenAI's Agents SDK is positioned as an attempt to standardize this separation. Both a trimming approach—retaining only the most recent N entries—and a summarization approach—replacing older history with summaries—are supported.
The timing of compression is determined by operational conditions, such as "when input tokens exceed a threshold" or "when a certain number of turns have elapsed within the same task." Delaying compression too long risks an overflow failure, while compressing too early causes subtle nuances in context to be lost.
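A threshold-triggered compaction step can be sketched as below. The `summarize` callable stands in for an LLM summarization call, and the 4-characters-per-token estimate is a rough heuristic, not an exact tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 chars/token); use a real tokenizer in production."""
    return len(text) // 4

def compact_history(history, budget_tokens, keep_recent=4, summarize=None):
    """history: list of message strings, oldest first.
    When the estimated total exceeds the budget, replace everything
    except the most recent turns with a single summary entry."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget_tokens or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summarize = summarize or (
        lambda msgs: "[summary of %d earlier messages]" % len(msgs))
    return [summarize(old)] + recent
```

Keeping the most recent turns verbatim is what guards against the "compressed too early" failure: fresh nuance survives, only stale detail is summarized.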
The context challenge unique to agents lies in the fact that history grows with every turn as tool outputs and intermediate reasoning accumulate, and unmanaged it quickly dominates the window. Here we organize design guidelines from two perspectives: long-running tasks and preventing context bloat.
In coding agents and autonomous research agents, a single task can involve dozens or more tool calls. Accumulating all tool results as-is causes the context to overflow almost immediately, making summarization of tool results a practical necessity.
The structured note-taking (scratchpad) pattern recommended by Anthropic has the agent write notes to an external store, so that on the next turn it can reconstruct the details simply by referencing those notes. This creates a structure that keeps raw tool output data out of the context while retaining the information needed for decision-making.
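The scratchpad pattern can be sketched minimally as below; the in-memory dictionary stands in for a real external store (a database or file system), and the class shape is an assumption rather than Anthropic's implementation.

```python
class Scratchpad:
    """Keeps raw tool output outside the context; only a short
    reference note is returned for inclusion in the prompt."""

    def __init__(self):
        self._store = {}  # ref_id -> full tool output

    def save(self, ref_id: str, tool_output: str, note: str) -> str:
        """Persist the raw output; return the compact note that
        actually enters the context window."""
        self._store[ref_id] = tool_output
        return f"[{ref_id}] {note}"

    def recall(self, ref_id: str) -> str:
        """Reconstruct full details on a later turn by reference."""
        return self._store[ref_id]
```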
For long-task support, it is also effective to explicitly maintain the overall task goal, completion criteria, and remaining tasks as a "plan," injecting it at the beginning of each turn. This prevents the model from losing track of where it is and helps keep its approach from drifting midway through. Similar techniques are introduced in the related article "How to Deploy AI Agents in Production? Practical Steps from Pilot to Scale."
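Plan injection can be sketched as explicit state rendered at the top of every turn. The plan fields and helper names here are illustrative assumptions.

```python
def render_plan(plan: dict) -> str:
    """Render goal, completion criteria, and remaining tasks as text."""
    lines = [f"Goal: {plan['goal']}",
             f"Done when: {plan['done_when']}",
             "Remaining:"]
    lines += [f"- {task}" for task in plan["remaining"]]
    return "\n".join(lines)

def build_turn_context(plan: dict, history: list, user_message: str) -> str:
    """Inject the plan at the head of each turn so the model never
    loses sight of where it is in a long task."""
    return "\n\n".join([render_plan(plan), "\n".join(history), user_message])
```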
As a concrete measure to prevent bloat, start by allocating a token budget to each context element. For example, set allocations such as "system: 1,500 / task: 500 / RAG: 3,000 / history: 2,000 / tool definitions: 1,000," back-calculate from the total, and build in logic to compress or remove elements when a budget is exceeded.
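Budget enforcement can be sketched as a per-element check before assembly. The budget figures mirror the example allocations above; the character-based truncation is a rough stand-in for real tokenizer-aware compression or removal logic.

```python
BUDGETS = {"system": 1500, "task": 500, "rag": 3000,
           "history": 2000, "tools": 1000}  # tokens per element

def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 chars/token); swap in a real tokenizer."""
    return len(text) // 4

def enforce_budgets(elements: dict, budgets: dict = BUDGETS) -> dict:
    """Truncate any element whose estimated tokens exceed its budget.
    In production, truncation would be replaced by summarization or
    smarter pruning per element type."""
    out = {}
    for name, text in elements.items():
        limit_chars = budgets.get(name, 0) * 4
        out[name] = text if estimate_tokens(text) <= budgets.get(name, 0) \
            else text[:limit_chars]
    return out
```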
On the monitoring side, log the input token count, output token count, cost, and latency for every turn, and visualize trends on a dashboard. As discussed in the related article "What Is AI Observability? A Guide to Monitoring LLMs in Production," structuring logs so they can be filtered by request ID, session ID, and model version will speed up future investigations.
Additionally, periodically auditing for information that is duplicated in both the system instructions and the conversation history often reveals a surprising number of surplus tokens. A common rule of thumb is that roughly 20% of a context consists of content that "remains only for historical reasons and is actually unnecessary."
Context engineering is not something you design once and consider finished. It must be operated as an ongoing process of continuously measuring the quality of reasoning outputs and iteratively refining the context structure.
Measurement of context quality is best designed across three layers. The first is the retrieval layer, which evaluates whether RAG is surfacing the correct documents at the top using Recall@K and MRR. The second is the utilization layer, which measures the proportion of retrieved documents actually used in the output (attribution). The third is the output layer, which verifies the accuracy of the final answer using LLM-as-a-Judge or human evaluation.
By measuring these three layers separately, it becomes possible to isolate whether a poor answer stems from insufficient retrieval, failure to use what was retrieved, or incorrect reasoning. Operating in production without this kind of isolation makes it easy to fall into a state where rewriting the prompt repeatedly yields no performance improvement.
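The retrieval-layer metrics mentioned above are straightforward to compute. A minimal sketch, assuming ranked lists of document IDs and per-query sets of relevant IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(queries):
    """Mean reciprocal rank of the first relevant doc per query.
    queries: iterable of (retrieved_ids, relevant_id_set) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Running these over a fixed golden-query set after each retrieval change is what makes the retrieval layer independently measurable.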
The evaluation methodology is covered in detail in the related article "What Is LLM-as-a-Judge? A Method for Evaluating AI Output with AI and Implementing Hallucination Detection." When it comes to running improvement cycles in context engineering, building an evaluation infrastructure as an upfront investment is well worth the effort.
Changes to context composition should always be validated through A/B testing before being reflected in production. Typical targets for change include "number of documents to retrieve," "timing of history summarization," and "tool filtering logic." Because these tend to interfere with one another, it is a cardinal rule to change only one element at a time.
In a staged rollout, the first step is to check for regressions against an internal evaluation set (typically 100–500 golden queries), then limit exposure to 5–10% of traffic and observe production metrics. If no errors or cost increases are observed, the rollout is gradually expanded. Skipping this procedure frequently leads to situations where performance appeared to improve on the evaluation set but actually degraded in production.
In our own internal agent operations, we once applied a change to tool definition filtering directly to all users, only to discover afterward that answer quality had dropped in certain edge cases. Since then, our practice has settled on never skipping staged rollouts, no matter how "obviously an improvement" a change may seem.
Many of the pitfalls in context engineering arise from the well-intentioned decision to add more information. This section organizes two failure patterns commonly encountered in practice, before moving on to the conclusion.
The most frequently occurring failure is the pattern of stuffing as much information as possible into the context. The idea that "if we include all potentially relevant documents, the model will be smart enough to pick what it needs" almost invariably backfires.
LLM attention is not unlimited, and in long contexts there is a well-known phenomenon called "lost in the middle," where information placed in the middle portion tends to be overlooked. Once the volume of information exceeds a threshold, output quality actually declines. Vercel's case of reducing tools (from 15 to 2) and seeing accuracy rise from 80% to 100% can be read as a textbook example of resolving exactly this problem.
Another side effect is latency. Because processing time scales superlinearly with the number of input tokens, context bloat directly translates to degraded UX. From the user's perspective, this can result in the worst possible combination: slow responses and poor quality.
The hidden nature of token costs is another serious pitfall. In architectures where an agent calls tools repeatedly and accumulates history internally, token consumption per user grows at an accelerating rate with each interaction. It is not uncommon to look at a bill and realize for the first time that "this use case was never economically viable."
Visualizing costs requires more than simply tracking "monthly API charges." It is necessary to break costs down by session, user, and endpoint, and to establish a target for "how many tokens a given task should require." By applying the principles introduced in the related article "LLM Cost Optimization Guide — Token Reduction, Model Selection, and Cache Implementation" at the context design layer as well, runaway costs can be prevented before they occur.
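The per-session breakdown can be sketched as a simple aggregation over turn-level logs. The log record shape and the per-1K-token prices here are illustrative assumptions, not real pricing.

```python
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # hypothetical USD rates

def cost_by_session(turn_logs, prices=PRICE_PER_1K):
    """turn_logs: iterable of dicts with 'session_id', 'input_tokens',
    'output_tokens'. Returns estimated cost per session in USD."""
    totals = {}
    for log in turn_logs:
        cost = (log["input_tokens"] / 1000 * prices["input"]
                + log["output_tokens"] / 1000 * prices["output"])
        totals[log["session_id"]] = totals.get(log["session_id"], 0.0) + cost
    return totals
```

The same aggregation keyed by user or endpoint instead of session gives the other two breakdowns, and comparing totals against a per-task token target surfaces runaway sessions early.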
As a conclusion to this article, we want to emphasize that in context engineering, quality is determined less by "what to include" and more by "what to leave out." While prompt engineering could be advanced by refining individual instruction strings, context design in the age of agents has information filtering and dynamic reconstruction at its core.
Teams running LLM applications in production should begin their improvement cycle by decomposing the context window into six elements and measuring the contribution of each. In our own B2B LLM adoption support work, we have repeatedly seen cases where refactoring context design alone produces a noticeable step-change improvement in perceived quality.

Yusuke Ishihara
Started programming at age 13 with MSX. After graduating from Musashi University, worked on large-scale system development including airline core systems and Japan's first Windows server hosting/VPS infrastructure. Co-founded Site Engine Inc. in 2008. Founded Unimon Inc. in 2010 and Enison Inc. in 2025, leading development of business systems, NLP, and platform solutions. Currently focuses on product development and AI/DX initiatives leveraging generative AI and large language models (LLMs).