Building Agentic Systems: From Simple Loops to Production
Part 2 of the On Using AI Better series. A practical guide to building agentic systems — from the core reasoning loop and tool design to production-readiness with MCP, guardrails, and observability.
Most agent frameworks advertise themselves with elaborate architecture diagrams — multi-layered orchestrators, state machines, planning subsystems. In practice, the teams shipping agents tend to end up somewhere simpler. Octomind, an AI testing startup, used LangChain in production for over a year before the friction became untenable — they needed to dynamically change which tools their agents could access based on state, and the framework had no mechanism for it. Their team ended up "spending as much time understanding and debugging LangChain as building features," so they ripped it out. Anthropic's building guide arrives at a similar conclusion: "The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."
In Part 1, we argued that investing in reliable, composable tools beats building ever-more-complex workflow graphs. Here we make that concrete.
TL;DR
- The agent loop is `while not done: think → act → observe`. Most frameworks wrap this with varying amounts of abstraction.
- Tool descriptions and error handling tend to matter more than model choice or prompt tweaks. See Anthropic's context engineering guide.
- Classify errors and handle them in code where possible — LLM reasoning is expensive and slow for retries.
- Production concerns are mostly guardrails, observability, and cost caps.
The Agent Loop
Strip away the abstractions and every agent framework runs the same while loop.
```python
tools = [search, retrieve, send_email, ...]
conversation = [system_prompt, user_message]

while True:
    response = llm(conversation, tools)
    conversation.append(response)  # keep the assistant turn in context
    if response.has_tool_calls:
        for call in response.tool_calls:
            result = execute(call)
            conversation.append(tool_result(call, result))
    else:
        return response.text  # final answer
```

The model gets a conversation, decides whether to call a tool or return a final answer, and the loop continues until it's done. The framework provides the loop and safety rails. The LLM provides the reasoning. Everything else is configuration.
For most tasks, this single loop is enough. More sophisticated patterns exist (multi-step planners, self-improving agents, tree-search), but there's a reason the Anthropic guide says to "start by using LLM APIs directly: many patterns can be implemented in a few lines of code." It's worth trying a while loop and a few functions before reaching for LangGraph.
What the simple loop hides is the cost curve. Agent costs are not linear — they're quadratic. Context grows with each tool call, so every turn is more expensive than the last. In one analysis of real conversations, a single agent session cost ~$13, with cache reads representing 87% of total expenses by the end. By 50,000 tokens, cache reads dominate.
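To see why the curve is quadratic, a back-of-the-envelope sketch helps (the base-prompt and per-turn token counts are illustrative assumptions, not measurements):

```python
def total_input_tokens(turns: int, base: int = 2_000, per_turn: int = 1_500) -> int:
    """Each turn re-sends the whole conversation: the base prompt plus
    every prior tool result. Summing base + k * per_turn over k turns
    gives a total that grows quadratically in the number of turns."""
    return sum(base + k * per_turn for k in range(turns))

print(total_input_tokens(10))  # 87,500 cumulative input tokens
print(total_input_tokens(20))  # 325,000 — doubling turns roughly quadruples the total
```

This is why per-turn cost accounting misleads: the expensive part is re-reading everything that came before.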
Handle errors in code, not in the LLM. Retry transient failures automatically, return structured feedback for input errors, and reserve LLM reasoning for semantic mismatches.
What Happens in One Turn
Each iteration looks straightforward: call a tool, get a result. But production systems need to handle several concerns within that single turn. Here's how to think about them without over-engineering.
Before Acting
Triage is deciding what to do with the input. Most of the time the model handles this implicitly — it reads the message and picks a tool. For production systems, it helps to think in three buckets: is this an instruction (do something), a clarification (ask before doing), or neither (respond directly)? Keep the categories few. The more you add, the more often the model misclassifies.
Planning exists on a spectrum. Explicit plans (structured step lists with dependencies) add value only when steps depend on each other — when step 3 needs the output of step 1. For sequential tasks, the model decomposes and sequences actions on its own. OpenAI's agent guide recommends the same starting point: a single model call, adding planning layers only when you observe the agent failing without them. The common trap is crafting a 20-step plan upfront that breaks on step 3 when reality doesn't match assumptions. Keep plans short. Let the model replan as it learns from tool results.
After Acting
Once a tool returns, the model evaluates: proceed (result looks good), retry (transient error), replan (approach isn't working), or stop (unrecoverable failure). This is where the principle from above pays off — classify the error, route it to the right handler, and only invoke the LLM when it's genuinely needed:
| Type | Example | Handled by | Strategy |
|---|---|---|---|
| Transient | Rate limit exceeded | System | Retry automatically with backoff |
| Input | Missing required field: query | System → Model | Return error to model — let it fix the call |
| Semantic | 0 results found | Model | Model reflects and replans |
| Fatal | Authentication failed — invalid API key | System → User | Stop execution, surface to user |
Tiered error responses — only semantic errors need LLM reasoning.
Transient errors retry automatically. Input errors tell the model what to fix. Only semantic errors need actual reasoning. This classification alone eliminates most unnecessary LLM calls in the error path.
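The routing itself fits in a few lines of code. A minimal sketch of the tiered approach (`ToolError` and its fields are hypothetical names for illustration, not a specific library):

```python
import time
from dataclasses import dataclass

@dataclass
class ToolError:
    type: str            # "transient" | "input" | "semantic" | "fatal"
    message: str
    retry_after_ms: int = 0

def route_error(err: ToolError, call, execute, max_retries: int = 3):
    """Dispatch by error class; only 'semantic' errors reach the model."""
    if err.type == "transient":
        for attempt in range(max_retries):
            time.sleep(err.retry_after_ms / 1000 * (2 ** attempt))  # backoff
            result = execute(call)
            if not isinstance(result, ToolError):
                return result
        raise RuntimeError(f"retries exhausted: {err.message}")
    if err.type == "input":
        # Feed the error back as a tool result; the model fixes its own call
        return f"Tool error: {err.message}. Fix the arguments and call again."
    if err.type == "semantic":
        return f"Observation: {err.message}"  # model reflects and replans
    raise RuntimeError(f"fatal: {err.message}")  # stop, surface to the user
```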
Designing Tools
When an agent underperforms, the tools are usually the first place to look — ahead of the prompt, the model, or the orchestration. A clear tool description that tells the model when to use it and what it returns eliminates more failure modes than prompt engineering alone. Four principles:
1. One tool, one job
Composite tools are tempting but break in practice. If search_and_summarize fails, which half failed? You can't retry just the summary. You can't reuse the search results elsewhere.
Composite:

```python
# Bad: does two things, hard to retry or test
def search_and_summarize(query: str) -> str:
    results = search(query)       # what if this fails?
    summary = summarize(results)  # can't retry just this
    return summary
```

Atomic:

```python
# Good: atomic, composable, independently retryable
def search(query: str) -> list[Document]:
    """Search the corpus. Idempotent, safe to retry."""
    ...

def summarize(documents: list[Document]) -> str:
    """Summarize a list of documents."""
    ...
```

2. Clear contracts
Every tool needs a name the model can understand, a description that says when to use it, and explicit input schemas. The better your descriptions, the less the model guesses wrong. The tool definition below follows the Model Context Protocol (MCP) format — released by Anthropic in November 2024, adopted by OpenAI and Google DeepMind within months, and now backed by 10,000+ public servers. In late 2025, Anthropic donated MCP to the Linux Foundation as part of the Agentic AI Foundation, co-founded with Block and OpenAI. Build your tools as MCP servers and any compliant agent can use them without custom glue.
```json
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "inputSchema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or coordinates"
      }
    },
    "required": ["location"]
  }
}
```

MCP Adoption Timeline
From single-vendor protocol to industry standard in under a year.
- Nov 2024: Anthropic releases MCP
- Mar 2025: OpenAI adopts MCP
- Apr 2025: Google DeepMind adopts MCP
- Mid 2025: 10,000+ public servers
- Nov 2025: Donated to the Linux Foundation
A related concern: tool set size. Every tool definition consumes tokens in every request, and the more tools available, the more likely the model misselects. Anthropic's context engineering guide warns against bloated tool sets and recommends keeping tools token-efficient. Start with the minimum viable toolset. Add tools in response to observed failures, not anticipated needs.
3. Idempotency
Tools should be safely retryable. search is naturally idempotent — same query, same results. send_email is not — retrying sends a duplicate. This isn't hypothetical: n8n users have reported 181 items triggering 542 executions in queue mode, and AI agent tools randomly launching subflows with duplicate data. For tools with side effects, use deduplication keys or confirmation gates. It's a small fix that prevents the most common class of production bug.
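A deduplication key can be derived from the tool's own payload, so a retried call becomes a no-op. A sketch under stated assumptions (the in-memory set stands in for a persistent store; `outbox` stands in for the real mail API):

```python
import hashlib

outbox: list[tuple[str, str]] = []  # stand-in for the real mail API
_sent: set[str] = set()             # in production: a persistent store with a TTL

def send_email_idempotent(to: str, subject: str, body: str) -> str:
    """Hash the payload into a dedup key; a retry with identical
    arguments is suppressed instead of sending a duplicate."""
    key = hashlib.sha256(f"{to}|{subject}|{body}".encode()).hexdigest()
    if key in _sent:
        return f"duplicate suppressed ({key[:8]})"
    _sent.add(key)
    outbox.append((to, subject))    # the actual side effect
    return f"sent ({key[:8]})"
```

The same pattern generalizes to any side-effecting tool: payments, ticket creation, record writes.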
4. Structured errors
Return an error type, a message, and a machine-actionable suggestion — not a raw stack trace. The error classification from the previous section applies directly here: let tools tell the system exactly what went wrong and what to do about it.
```python
# Bad: raw exception the model can't act on
raise Exception("Request failed")

# Good: structured error the system can route
return ToolError(
    type="transient",
    message="Rate limit exceeded",
    retry_after_ms=2000
)
```

From Demo to Production
Getting an agent to work in a demo takes an afternoon. Making it reliable in production is where most of the engineering effort goes.
Guardrails
Without constraints, models will execute destructive actions, blow through rate limits, or spend $50 on a task worth $0.05. In one widely reported case, two agents entered an infinite conversation loop, ran undetected for 11 days, and burned $47,000. Autonomy caps — max steps, retries, and spend per run — have been the most reliable safeguard in practice. Tier your tool permissions: read-only tools run freely, reversible writes get logged, and irreversible actions require explicit confirmation.
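Autonomy caps can be enforced directly in the loop. A minimal sketch (the limits, `llm_step`, and `cost_of` are illustrative names, not a specific framework API):

```python
MAX_STEPS = 20        # hard cap on loop iterations
MAX_SPEND_USD = 1.00  # hard cap on spend per run

def run_agent(task, llm_step, cost_of) -> str:
    """Wrap the agent loop in hard caps on steps and spend.
    llm_step runs one think-act-observe turn; cost_of prices it."""
    spent = 0.0
    for step in range(MAX_STEPS):
        result = llm_step(task)
        spent += cost_of(result)
        if spent > MAX_SPEND_USD:
            return f"aborted: spend cap hit at step {step} (${spent:.2f})"
        if result.get("done"):
            return result["answer"]
    return "aborted: step cap hit"
```

The caps are deliberately dumb: they don't need to understand the task, only to bound the damage when something goes wrong.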
For irreversible actions, pause and show the user what's about to happen. A surprisingly common failure mode: the agent completes a task correctly — just not the task the user actually wanted. A brief confirmation step catches both failure modes: the destructive action and the correctly executed wrong task.
Observability
You can't improve what you can't measure. Every agent run should produce a structured log: what tool was called, with what arguments, how long it took, whether it succeeded, and what it cost.
Structured Event Log

Every tool call should produce a structured record — not a raw log line. A single search event, for example: query "Q3 financial performance", 1,200 input tokens, 350 output tokens, $0.002 cost.
Track task success rate (did the user's goal get met?), tool success rate (which tools are unreliable?), p95 latency (where are the bottlenecks?), and cost per run (is it trending up?). High retry rates signal brittle tools. High replan rates signal poor error handling. Without these metrics, debugging usually comes down to reading raw transcripts.
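A structured event record can be as simple as a dataclass serialized to one JSON line per tool call (field names and values here are illustrative):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToolCallEvent:
    tool: str
    arguments: dict
    duration_ms: int
    success: bool
    input_tokens: int
    output_tokens: int
    cost_usd: float

def log_event(event: ToolCallEvent) -> str:
    """Emit one JSON line per tool call; feed these to a metrics pipeline
    to compute success rates, p95 latency, and cost per run."""
    return json.dumps(asdict(event))

line = log_event(ToolCallEvent(
    tool="search",
    arguments={"query": "Q3 financial performance"},
    duration_ms=412,
    success=True,
    input_tokens=1_200,
    output_tokens=350,
    cost_usd=0.002,
))
```

JSON lines are boring on purpose: they aggregate trivially, and every observability stack can ingest them.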
Cost Discipline
Agent loops consume tokens every turn, and the costs are not linear — they're quadratic. Context grows with each tool call, and every turn re-reads the entire conversation. The cost scales with tokens times number of calls, not tokens alone. A missing retry cap or a swallowed error can turn a $0.10 run into a $100 one before anyone notices.
The defenses are straightforward: set per-run token budgets (if an agent approaches its limit, it wraps up or asks before continuing), use model tiering (Haiku for triage, Sonnet for execution — not every step needs a frontier model), and manage context aggressively. JetBrains Research found that observation masking — simply truncating old tool outputs instead of keeping the full context — cut costs by over 50% while improving solve rates by 2.6%. Without these measures, you'll discover the costs in your invoice instead of your logs.
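Observation masking is straightforward to implement: replace tool outputs older than the last few turns with a short placeholder instead of carrying them forever. A sketch under stated assumptions (the message shape, placeholder text, and threshold are mine, not the JetBrains implementation):

```python
def mask_old_observations(conversation: list[dict], keep_last: int = 3) -> list[dict]:
    """Truncate tool outputs older than the last `keep_last`, so the
    context stops growing without losing the recent working set."""
    tool_idxs = [i for i, m in enumerate(conversation) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[output elided: observation masked]"} if i in stale else m
        for i, m in enumerate(conversation)
    ]
```

Run it before each LLM call; the model keeps its recent observations and a visible marker that older ones existed.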
Model Tiering: Same Task, Different Cost

Not every step needs a frontier model. Triage with Haiku, execute with Sonnet: same result at a fraction of the cost. Illustrative costs for a single agent run: $0.065 tiered (Haiku for low-stakes steps, Sonnet for reasoning-heavy execution) versus $0.280 all-frontier.
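In code, tiering can be a one-line routing decision per step. A sketch (the step categories and model labels are illustrative placeholders, not exact API model identifiers):

```python
def pick_model(step_kind: str) -> str:
    """Route low-stakes steps to a small model, reasoning-heavy
    steps to a larger one."""
    small = "haiku"   # triage, classification, formatting
    large = "sonnet"  # planning, synthesis, tool-use decisions
    return small if step_kind in {"triage", "classify", "format"} else large
```

The classification doesn't need to be clever; even a static mapping from step type to model captures most of the savings.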
Multi-agent architectures — handoff chains, supervisor-worker patterns — exist for genuinely complex orchestration, but they multiply every cost and observability problem listed above. In our experience, improving tools has more often been the answer than adding agents.
What Compounds
The model improves every quarter on its own. Everything else is yours to build — and it compounds across every agent on your platform.
If you missed the context, start with Part 1: Workflow Automations vs Agents.