Building Agentic Systems: From Simple Loops to Production
Part 2 of the On Using AI Better series. A practical guide to building agentic systems — from the core reasoning loop and tool design to production-readiness with MCP, guardrails, and observability.
Every agent framework ships with a diagram that looks like mission control — dozens of boxes, arrows everywhere, state machines within state machines. But the teams actually shipping agents to production? They keep telling the same story: they started with the framework, hit a wall, ripped it out, and replaced it with something embarrassingly simple.
On Hacker News, a developer from Octomind put it plainly: "Most LLM applications require nothing more than string handling, API calls, loops, and maybe a vector DB if you're doing RAG. You don't need several layers of abstraction." Anthropic's own building guide arrives at the same conclusion more diplomatically: "The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."
In Part 1, we argued that investing in reliable, composable tools beats building ever-more-complex workflow graphs. Here we make that concrete.
TL;DR
- The agent loop is just `while not done: think → act → observe`. Most frameworks are wrappers around this.
- Tool quality matters more than model quality — fix your tools before you upgrade your model.
- Handle errors in the tool layer, not by asking the LLM to reflect on what went wrong. It's faster and cheaper.
- Production readiness is guardrails, observability, and cost caps — not more sophisticated orchestration.
The Agent Loop
Strip away the abstractions and every agent framework runs the same while loop.
```python
def run_agent(system_prompt, user_message):
    tools = [search, retrieve, send_email, ...]
    conversation = [system_prompt, user_message]

    while True:
        response = llm(conversation, tools)
        if response.has_tool_calls:
            conversation.append(response)  # keep the model's tool calls in the transcript
            for call in response.tool_calls:
                result = execute(call)
                conversation.append(tool_result(result))
        else:
            return response.text  # final answer
```

The model gets a conversation, decides whether to call a tool or return a final answer, and the loop continues until it's done. The framework provides the loop and safety rails. The LLM provides the reasoning. Everything else is configuration.
For most tasks, this single loop is enough. More sophisticated patterns exist (multi-step planners, self-improving agents, tree-search), but there's a reason the Anthropic guide says to "start by using LLM APIs directly: many patterns can be implemented in a few lines of code." Don't reach for LangGraph when a while loop and three functions will do. If you've spent time on r/LangChain, you've seen the pattern: someone builds with the framework, hits a wall the moment they go off-script, and ends up rewriting in plain Python anyway.
What Happens in One Turn
Each iteration looks straightforward: call a tool, get a result. But production systems need to handle several concerns within that single turn. Here's how to think about them without over-engineering.
Before Acting
Triage is deciding what to do with the input. Most of the time the model handles this implicitly — it reads the message and picks a tool. For production systems, it helps to think in three buckets: is this an instruction (do something), a clarification (ask before doing), or neither (respond directly)? Keep the categories few. The more you add, the more often the model misclassifies.
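If you do make triage explicit, it can be a single cheap classification call that returns one of the three buckets. A sketch, reusing the hypothetical `llm` helper from the loop above (the prompt and label names are illustrative):

```python
TRIAGE_LABELS = ("instruction", "clarification", "other")


def triage(user_message: str) -> str:
    """Classify the input before the main loop runs; default to the safe bucket."""
    prompt = (
        "Classify the user message as exactly one of: 'instruction' (do something), "
        "'clarification' (ask before doing), or 'other' (respond directly). "
        "Answer with the label only.\n\nMessage: " + user_message
    )
    label = llm([prompt], tools=[]).text.strip().lower()
    return label if label in TRIAGE_LABELS else "other"
```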
Planning exists on a spectrum. Modern reasoning models are remarkably good at implicit planning — they decompose problems and sequence actions without being told to. Explicit plans (structured step lists with dependencies) add value only when a task is complex enough that you'd want a human to write a plan too. For most single-turn tasks, let the model think and get out of the way.
A common trap: crafting a 20-step plan upfront that breaks on step 3 when reality doesn't match assumptions. Keep plans short. Let the model replan as it learns from tool results.
After Acting
Once a tool returns, the model evaluates: proceed (result looks good), retry (transient error), replan (approach isn't working), or stop (unrecoverable failure). The instinct is to let the LLM reason about every failure. Resist it.
Handle errors in tools, not in the LLM. A rate-limit error doesn't need the model to "think about what went wrong." A structured error code and an automatic retry are faster and cheaper. Reserve LLM reflection for semantic mismatches — the tool succeeded, but the results are about the wrong topic.
| Type | Example | Handling |
|---|---|---|
| Transient | Rate limit exceeded | Retry automatically with backoff |
| Input | Missing required field: query | Return error to model — let it fix the call |
| Semantic | 0 results found | Model reflects and replans |
| Fatal | Authentication failed — invalid API key | Stop execution, surface to user |
Tiered error responses — each type maps to a different handling strategy.
Transient errors retry automatically. Input errors tell the model what to fix. Only semantic errors need actual reasoning. This classification alone eliminates most unnecessary LLM calls in the error path.
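A minimal sketch of what this looks like in the tool-execution layer, assuming a hypothetical `ToolError` type and `execute_with_policy` wrapper (neither comes from a specific framework):

```python
import time


class ToolError(Exception):
    """Raised by tools with a machine-readable error type."""

    def __init__(self, error_type: str, message: str, suggestion: str = ""):
        super().__init__(message)
        self.error_type = error_type  # "transient" | "input" | "semantic" | "fatal"
        self.message = message
        self.suggestion = suggestion


def execute_with_policy(tool, args: dict, max_retries: int = 3) -> dict:
    """Run one tool call; handle transient failures here instead of asking the LLM."""
    for attempt in range(max_retries):
        try:
            return {"status": "success", "result": tool(**args)}
        except ToolError as err:
            if err.error_type == "transient" and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # automatic retry with backoff, no LLM call
                continue
            if err.error_type == "fatal":
                raise  # stop the run and surface the failure to the user
            # Input and semantic errors go back to the model as structured feedback.
            return {
                "status": "error",
                "type": err.error_type,
                "message": err.message,
                "suggestion": err.suggestion,
            }
```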
Designing Tools
When your agent underperforms, fix the tools first. Not the prompt, not the model, not the orchestration. The tools. A clear tool description that tells the model when to use it and what it returns eliminates more failure modes than any amount of prompt engineering. Four principles:
1. One tool, one job
Composite tools are tempting but break in practice. If search_and_summarize fails, which half failed? You can't retry just the summary. You can't reuse the search results elsewhere.
```python
# Bad: does two things, hard to retry or test
def search_and_summarize(query: str) -> str:
    results = search(query)        # what if this fails?
    summary = summarize(results)   # can't retry just this
    return summary


# Good: atomic, composable, independently retryable
def search(query: str) -> list[Document]:
    """Search the corpus. Idempotent, safe to retry."""
    ...


def summarize(documents: list[Document]) -> str:
    """Summarize a list of documents."""
    ...
```

2. Clear contracts
Every tool needs a name the model can understand, a description that says when to use it, and explicit input schemas. The better your descriptions, the less the model guesses wrong. The tool definition below follows the Model Context Protocol (MCP) format. MCP is becoming the standard for tool integration: build your tools as MCP servers and any compliant agent can use them without custom glue.
```json
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "inputSchema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or coordinates"
      }
    },
    "required": ["location"]
  }
}
```

3. Idempotency
Tools should be safely retryable. search is naturally idempotent — same query, same results. send_email is not — retrying sends a duplicate. This isn't hypothetical: on the n8n community forums, developers regularly report agent nodes executing twice on the same input, producing duplicate actions seconds apart. For tools with side effects, use deduplication keys or confirmation gates. It's a small fix that prevents the most common class of production bug.
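For a tool like `send_email`, a deduplication key can be as simple as hashing the arguments and refusing to repeat a send. A minimal sketch, assuming an illustrative `send_email_once` wrapper around the underlying tool and an in-memory key store (production systems would persist the keys):

```python
import hashlib

_sent_keys: set[str] = set()  # illustrative; use a persistent store in production


def send_email_once(to: str, subject: str, body: str) -> str:
    """Send an email at most once per unique (to, subject, body) combination."""
    dedup_key = hashlib.sha256(f"{to}|{subject}|{body}".encode()).hexdigest()
    if dedup_key in _sent_keys:
        return "skipped: identical email already sent"  # retries become harmless
    _sent_keys.add(dedup_key)
    send_email(to, subject, body)  # the underlying, non-idempotent tool
    return "sent"
```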
4. Structured errors
Return an error type, a message, and a machine-actionable suggestion — not a raw stack trace. The error classification from the previous section applies directly here: let tools tell the system exactly what went wrong and what to do about it.
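As a sketch, the payload a tool might return for an input error, following the classification above (the field names are illustrative, not a standard):

```json
{
  "error": {
    "type": "input",
    "message": "Missing required field: query",
    "suggestion": "Retry the call with a non-empty 'query' string"
  }
}
```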
From Demo to Production
Getting an agent to work in a demo takes an afternoon. Getting it to work reliably in production — that's where the engineering lives.
Guardrails
Without boundaries, a confident model will cheerfully execute destructive actions, blow through rate limits, or spend $50 on a task worth $0.05. Autonomy caps — max steps, retries, and spend per run — are non-negotiable. Tier your tool permissions so that riskier actions require more oversight, as in the sketch below.
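A minimal sketch of tiered permissions, assuming three illustrative tiers and example tool names (nothing here is a standard):

```python
# Illustrative tiers: read-only tools run freely, irreversible ones need a human.
TOOL_TIERS = {
    "read_only": {"tools": ["search", "retrieve", "get_weather"], "needs_confirmation": False},
    "reversible": {"tools": ["create_draft", "add_label"], "needs_confirmation": False},
    "irreversible": {"tools": ["send_email", "delete_record"], "needs_confirmation": True},
}


def requires_confirmation(tool_name: str) -> bool:
    """Irreversible tools pause the loop and ask the user before executing."""
    for tier in TOOL_TIERS.values():
        if tool_name in tier["tools"]:
            return tier["needs_confirmation"]
    return True  # unknown tools default to the strictest tier
```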
For irreversible actions, pause and show the user what's about to happen. The cost of a 5-second confirmation is negligible compared to an unintended deletion. And remember: the most common production failure isn't a tool crashing. It's the agent completing a task correctly that the user didn't actually want. Confirmation steps catch both.
Observability
You can't improve what you can't measure. Every agent run should produce a structured log: what tool was called, with what arguments, how long it took, whether it succeeded, and what it cost.
```json
{
  "run_id": "run_a1b2c3d4",
  "turn": 3,
  "tool": "search_documents",
  "args": {
    "query": "Q3 financial performance"
  },
  "duration_ms": 840,
  "status": "success",
  "token_usage": {
    "input": 1200,
    "output": 350
  },
  "cost_usd": 0.002,
  "timestamp": "2025-10-01T14:32:15Z"
}
```

Track task success rate (did the user's goal get met?), tool success rate (which tools are unreliable?), p95 latency (where are the bottlenecks?), and cost per run (is it trending up?). High retry rates signal brittle tools. High replan rates signal poor error handling. If you're not logging these, you're flying blind.
Cost Discipline
Agent loops consume tokens every turn, and the costs are not linear. Context grows with each tool call, so turn 20 is dramatically more expensive than turn 2. A missing retry cap or a swallowed error can turn a $0.10 run into a $100 one before anyone notices. It happens more often than teams admit, usually because the failure path was never instrumented.
The defenses are straightforward: set per-run token budgets (if an agent approaches its limit, it wraps up or asks before continuing), use model tiering (Haiku for triage, Sonnet for execution — not every step needs a frontier model), and enable prompt caching for large tool definitions. Without these, you'll discover the costs in your invoice instead of your logs.
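A sketch of a per-run token budget check, assuming a cumulative usage log with the same `input`/`output` fields as the log record above (the budget number is an arbitrary example):

```python
MAX_TOKENS_PER_RUN = 200_000  # arbitrary example budget; tune per task


def over_budget(usage_log: list[dict]) -> bool:
    """True once cumulative input + output tokens exceed the per-run budget."""
    total = sum(u["input"] + u["output"] for u in usage_log)
    return total > MAX_TOKENS_PER_RUN

# Inside the loop: once over_budget(usage_log) is True, stop calling tools and
# either wrap up with what you have or ask the user before continuing.
```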
Resist multi-agent architectures until you've hit a clear ceiling with one agent. Handoff chains and supervisor-worker patterns exist for genuinely complex orchestration, but they multiply every cost and observability problem listed above. Better tools almost always beat more agents.
A loop, some tools, and basic instrumentation gets you surprisingly far. The work that follows is less glamorous: making each tool reliable enough to retry safely, each error structured enough to act on without burning an LLM call, and each run observable enough to debug on Monday morning. That's the work. And it compounds, because every tool you harden makes every agent that uses it better.
If you missed the context, start with Part 1: Workflow Automations vs Agents.