Context Engineering as Search
A perspective on why context engineering for LLMs is really a search problem: prompts, RAG, tools, workflows, and agents are all strategies for exploring and compressing vast state into a useful context window.
Written by Jean-Denis Greze
November 26, 2025
A useful mental model for most LLM-powered systems today is simply “how do we get the right data into the model’s context window?”
In this framing, prompts, tools, RAG, skills, sub-agents, and MCPs are just methods for shaping the context window. And there is a long spectrum of ways to engineer this context.
From hand-crafted context to agents
On one end of the spectrum, you have manually hand-crafted context. A human writes the prompt, decides exactly what data to pull from existing systems, and the model is called once with that hand-picked context. This is where early LLM systems started: carefully curated prompts and inputs. Over time, individual LLM calls were chained together in set workflows, each with its own hand-crafted context. Here, “search” for the right context happens in the developer’s head during development.
On the other end of the spectrum, you have automated context engineering for the idealized agent. A human gives it a goal in natural language and a set of tools. The agent plans, calls tools in a loop, and builds its own context until it achieves the goal (or, more frustratingly, wrongly assumes it has achieved its goal, gets confused, or realizes that it can’t for some reason). In Simon Willison’s words: “An LLM agent runs tools in a loop to achieve a goal.” If this worked perfectly for any goal, we’d have something extremely valuable: generalized, reliable, high-complexity automation. You state what you want and AI does it, end-to-end.
Real systems live somewhere between hand-crafted context and that idealized endpoint.
Context engineering as search
Once you move beyond single LLM calls toward agents, it's useful to think of "context engineering" as a search problem.
The model doesn’t start with perfect input. Instead, it calls tools—search, MCPs, APIs, code execution, headless browsers, and so on. Each tool call is a step in a search over possible context windows. Search acts as compression over a vast state space (the web, a codebase, company documents, the legal code, etc.), distilling it down to the few actually important tokens the model needs to solve the problem.
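As a rough sketch of what that loop looks like in code (the `call_model` and `run_tool` helpers here are hypothetical stand-ins, not any particular SDK), each iteration is one search step: the model picks a tool, the result gets folded back into the context window, and the loop ends when the model decides it has enough.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModelReply:
    text: str
    tool_call: Optional[dict] = None  # e.g. {"name": "web_search", "args": {...}}

def agent_loop(
    goal: str,
    call_model: Callable[[list], ModelReply],  # hypothetical: wraps your LLM API
    run_tool: Callable[[dict], str],           # hypothetical: executes one tool call
    max_steps: int = 20,
) -> str:
    context = [{"role": "user", "content": goal}]  # the evolving context window
    for _ in range(max_steps):
        reply = call_model(context)                # model proposes the next search step
        if reply.tool_call is None:
            return reply.text                      # model believes the goal is met
        result = run_tool(reply.tool_call)         # expand one node of the state space
        context.append({"role": "tool", "content": result})  # compress it into context
    return "Stopped: search budget exhausted."
```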
In this framing, common patterns line up:
- RAG is a search strategy over data: given a query, fetch only the chunks relevant to the goal.
- A workflow is a constrained search script: for this class of problems, use these tools in this order.
- A skill is a cached approach to a specific problem: when the problem looks like this, start from this known pattern rather than rethinking from first principles.
- A sub-agent is a way to parallelize search while isolating it from your main context, avoiding long-context degradation and dilution (sketched just after this list).
- A sandbox is search in the space of code and actions.
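To make the sub-agent point concrete, here is a minimal sketch, assuming something like the `agent_loop` above wrapped up as a `run_subagent` callable: each branch of the search runs in its own fresh context, and only the compressed summaries ever reach the parent.

```python
import concurrent.futures
from typing import Callable

def fan_out(
    questions: list[str],
    run_subagent: Callable[[str], str],  # e.g. the agent_loop above, with its own tools
) -> list[str]:
    # Each branch searches in a fresh, isolated context; a dead end stays
    # contained, and only the compressed summaries reach the parent agent.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(run_subagent, questions))
```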
Good tools act like good search operators: they move the system toward useful context quickly, provide feedback to the agent so that it can refine its approach, and avoid blind alleys. They don’t just dump data; they provide navigational cues. If a tool fails, it shouldn’t just return an error, but a hint on how to recover (reinforcement). If it succeeds, it should remind the model of the broader state. “Search” here isn’t just over data (web, internal systems), but also over code paths and pre-defined, reusable skills.
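Here is a sketch of what such a search operator might look like. Everything specific is assumed for illustration: `index` and its `search` and `size` methods stand in for whatever retrieval backend you have, and `list_collections()` is a made-up companion tool. The point is the shape of the response: a recovery hint on failure, a reminder of the broader state on success.

```python
def search_private_docs(query: str, index) -> dict:
    # `index` is an assumed retrieval backend with .search() and .size() methods.
    hits = index.search(query, limit=5)
    if not hits:
        return {
            "status": "no_results",
            # A recovery hint instead of a bare error, to steer the next search step.
            "hint": ("No matches. Try a shorter query, or call list_collections() "
                     "to see which document sets exist before searching again."),
        }
    return {
        "status": "ok",
        "results": [h.snippet for h in hits],
        # Remind the model of the broader state it is navigating.
        "note": f"{len(hits)} of ~{index.size()} documents matched; refine to narrow further.",
    }
```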
If you think this way, then whether it’s viable to automate work with LLMs for a particular goal hinges on the search being stable and tractable.
And this comes down to whether you can build effective tools:
- high-quality search over private data;
- sandboxes for trying code or transformations, pre-equipped with the right libraries and examples;
- reusable code paths for common scenarios;
- memory, so that each agent is better the next time around, avoiding past mistakes automatically;
- tables, forms, docs, and approvals that loop in humans when search is incapable of bringing in the right context.
This last one, notifying humans when the agent gets blocked, is extremely powerful and often underused. We do not (yet) have AGI, and it is not always possible for the agent to retrieve the best context because it does not have access to the requisite knowledge or know-how — some of which may only live in our human brains. At least for now, humans are not obsolete. The agent can treat a human as a high-latency, high-intelligence search node. But be careful: this requires a dedicated output tool with strict heuristics on when to interrupt and how to ask, or your agent will just spam people with bad questions.
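A sketch of what that gating might look like, with hypothetical `notify` and `wait_for_reply` hooks standing in for whatever queue, form, or approval flow you actually use:

```python
from typing import Callable

MAX_QUESTIONS_PER_RUN = 2  # hypothetical budget: how often one run may interrupt a person

def ask_human(
    question: str,
    failed_attempts: int,
    questions_asked: int,
    notify: Callable[[str], None],      # e.g. post to a queue, form, or approval inbox
    wait_for_reply: Callable[[], str],  # blocks until a person answers (high latency)
) -> str:
    # Strict heuristics on when to interrupt: escalate only after the agent has
    # genuinely exhausted its own search, and cap the number of interruptions.
    if failed_attempts < 3:
        return "Denied: retry with your existing tools before escalating to a human."
    if questions_asked >= MAX_QUESTIONS_PER_RUN:
        return "Denied: question budget spent; summarize what is blocking you instead."
    notify(question)
    return wait_for_reply()  # a high-latency, high-intelligence search node
```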
Designed well, great tools don't magically give you "better AI" in the abstract—they constrain and accelerate the search for the best context on specific classes of problems, giving the LLM higher odds of achieving the desired goal.
Success for agentic systems
With a single LLM call, you mostly care about accuracy. Once you let a system loop—planning, calling tools, updating context—you care about at least three things (a rough way to measure them is sketched after this list):
- Is the result correct? Did the agent actually achieve the goal or produce a useful answer? If it did not, did it realize its mistake and ask for human guidance?
- What's the variance? If you run the same goal multiple times, do you get similar output and behavior, or wildly different traces? High variance often signals that the search process is unstable or that failures are not being isolated. If a "search branch" turns out to be a dead end, does that bad context pollute the main thread, or can the system prune it and reset?
- How many turns did it take? How many model calls and tool invocations were needed? Are these turns happening sequentially (slow) or in parallel (breadth-first)? This matters for latency and cost, but also as a proxy for search efficiency. Fewer, more targeted steps generally mean your tools are well-designed for the goal.
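One rough way to measure all three, with hypothetical `run_agent` and `is_correct` hooks: run the same goal several times and look at accuracy, spread, and turn count together.

```python
import statistics
from typing import Callable

def evaluate(
    goal: str,
    run_agent: Callable[[str], tuple[str, int]],  # returns (answer, turns_taken)
    is_correct: Callable[[str], bool],            # grades an answer against the goal
    n_runs: int = 10,
) -> dict:
    outcomes, turns = [], []
    for _ in range(n_runs):
        answer, turns_taken = run_agent(goal)
        outcomes.append(is_correct(answer))
        turns.append(turns_taken)
    return {
        "accuracy": sum(outcomes) / n_runs,        # is the result correct?
        "turns_stddev": statistics.pstdev(turns),  # one proxy for unstable search
        "mean_turns": statistics.mean(turns),      # search efficiency: latency and cost
    }
```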
You can have a system that appears ‘smart’—producing impressive results—but is practically useless because it is either unstable (high variance) or too expensive (high turn count). A correct answer is of little value if the agent has to ‘brute force’ the search with fifty expensive steps to find what a better tool could locate in two. Evaluating all three axes reveals the difference between a viable product and a money-burning prototype.
Putting it to work
The agentic endgame—“state a goal, get the outcome, every time”—is a useful north star, but not where we are.
In the near term, this search framing gives you a concrete checklist for designing or evaluating an AI system:
- Where on the spectrum are you: mostly hand-crafted context, or a loop that searches for its own?
- What tools are you giving the model to search for better context, and what parts of the search still happen in a human’s head?
- For your target tasks, which tools actually shorten the search and reduce variance, and which just add more branching paths?
- When you look at traces, how do result quality, variance, and turn count move as you adjust tools and workflows?
Framed this way, "agents" aren't a separate magic category. RAG, workflows, skills, prompt templates, actions, web search, MCPs, etc. are simply methods for crafting an ideal context window. And agents are just systems that push more of the context search into the loop using these methods, and whose success ultimately depends on how well you can design tools.
About the author
Jean-Denis Greze is CEO & co-founder of Town. Previously he was CTO at Plaid, and prior to that Director of Engineering at Dropbox. Enjoys vibe coding badly named agents.