Building Agentic Systems: From Simple Loops to Production
Part 2 of the On Using AI Better series. A practical guide to building agentic systems — from the core reasoning loop and tool design to production-readiness with MCP, guardrails, and observability.
Most agent frameworks advertise themselves with elaborate architecture diagrams — multi-layered orchestrators, state machines, planning subsystems. In practice, the teams shipping agents tend to end up somewhere simpler. Octomind, an AI testing startup, used LangChain in production for over a year before the friction became untenable — they needed to dynamically change which tools their agents could access based on state, and the framework had no mechanism for it. Their team ended up "spending as much time understanding and debugging LangChain as building features," so they ripped it out. Anthropic's building guide arrives at a similar conclusion: "The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."
In Part 1, we argued that investing in reliable, composable tools beats building ever-more-complex workflow graphs. Here we make that concrete.
TL;DR
- The agent loop is `while not done: think → act → observe`. Most frameworks wrap this with varying amounts of abstraction.
- Tool descriptions and error handling tend to matter more than model choice or prompt tweaks. See Anthropic's context engineering guide.
- Classify errors and handle them in code where possible — LLM reasoning is expensive and slow for retries.
- Production concerns are mostly guardrails, observability, and cost caps.
The Agent Loop
Strip away the abstractions and every agent framework runs the same while loop.
```python
tools = [search, retrieve, send_email, ...]
conversation = [system_prompt, user_message]

while True:
    response = llm(conversation, tools)
    conversation.append(response)  # keep the assistant turn in context
    if response.has_tool_calls:
        for call in response.tool_calls:
            result = execute(call)
            conversation.append(tool_result(call, result))
    else:
        return response.text  # final answer
```

The model gets a conversation, decides whether to call a tool or return a final answer, and the loop continues until it's done. The framework provides the loop and safety rails. The LLM provides the reasoning. Everything else is configuration.
For most tasks, this single loop is enough. More sophisticated patterns exist (multi-step planners, self-improving agents, tree-search), but there's a reason the Anthropic guide says to "start by using LLM APIs directly: many patterns can be implemented in a few lines of code." It's worth trying a while loop and a few functions before reaching for LangGraph.
What the simple loop hides is the cost curve. Agent costs are not linear — they're quadratic. Context grows with each tool call, so every turn is more expensive than the last. In one analysis of real conversations, a single agent session cost ~$13, with cache reads representing 87% of total expenses by the end. By 50,000 tokens, cache reads dominate.
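To see why the curve is quadratic, a back-of-the-envelope sketch helps (the base-prompt and per-turn token counts are illustrative assumptions, not measurements):

```python
def total_input_tokens(turns: int, base: int = 2_000, per_turn: int = 1_500) -> int:
    """Each turn re-sends the whole conversation: the base prompt plus
    every prior tool result. Summing base + k * per_turn over k turns
    gives a total that grows quadratically in the number of turns."""
    return sum(base + k * per_turn for k in range(turns))

print(total_input_tokens(10))  # 87,500 cumulative input tokens
print(total_input_tokens(20))  # 325,000 — doubling turns roughly quadruples the total
```

This is why per-turn cost accounting misleads: the expensive part is re-reading everything that came before.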
Handle errors in code, not in the LLM. Retry transient failures automatically, return structured feedback for input errors, and reserve LLM reasoning for semantic mismatches.
What Happens in One Turn
Each iteration looks straightforward: call a tool, get a result. But production systems need to handle several concerns within that single turn. Here's how to think about them without over-engineering.
Before Acting
Triage is deciding what to do with the input. Most of the time the model handles this implicitly — it reads the message and picks a tool. For production systems, it helps to think in three buckets: is this an instruction (do something), a clarification (ask before doing), or neither (respond directly)? Keep the categories few. The more you add, the more often the model misclassifies.
Planning exists on a spectrum. Explicit plans (structured step lists with dependencies) add value only when steps depend on each other — when step 3 needs the output of step 1. For sequential tasks, the model decomposes and sequences actions on its own. OpenAI's agent guide recommends the same starting point: a single model call, adding planning layers only when you observe the agent failing without them. The common trap is crafting a 20-step plan upfront that breaks on step 3 when reality doesn't match assumptions. Keep plans short. Let the model replan as it learns from tool results.
After Acting
Once a tool returns, the model evaluates: proceed (result looks good), retry (transient error), replan (approach isn't working), or stop (unrecoverable failure). This is where the principle from above pays off — classify the error, route it to the right handler, and only invoke the LLM when it's genuinely needed:
| Type | Example | Handled by | Strategy |
|---|---|---|---|
| Transient | Rate limit exceeded | System | Retry automatically with backoff |
| Input | Missing required field: query | System → Model | Return error to model — let it fix the call |
| Semantic | 0 results found | Model | Model reflects and replans |
| Fatal | Authentication failed — invalid API key | System → User | Stop execution, surface to user |
Tiered error responses — only semantic errors need LLM reasoning.
Transient errors retry automatically. Input errors tell the model what to fix. Only semantic errors need actual reasoning. This classification alone eliminates most unnecessary LLM calls in the error path.
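The routing itself fits in a few lines of code. A minimal sketch of the tiered approach (`ToolError` and its fields are hypothetical names for illustration, not a specific library):

```python
import time
from dataclasses import dataclass

@dataclass
class ToolError:
    type: str            # "transient" | "input" | "semantic" | "fatal"
    message: str
    retry_after_ms: int = 0

def route_error(err: ToolError, call, execute, max_retries: int = 3):
    """Dispatch by error class; only 'semantic' errors reach the model."""
    if err.type == "transient":
        for attempt in range(max_retries):
            time.sleep(err.retry_after_ms / 1000 * (2 ** attempt))  # backoff
            result = execute(call)
            if not isinstance(result, ToolError):
                return result
        raise RuntimeError(f"retries exhausted: {err.message}")
    if err.type == "input":
        # Feed the error back as a tool result; the model fixes its own call
        return f"Tool error: {err.message}. Fix the arguments and call again."
    if err.type == "semantic":
        return f"Observation: {err.message}"  # model reflects and replans
    raise RuntimeError(f"fatal: {err.message}")  # stop, surface to the user
```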
Designing Tools
When an agent underperforms, the tools are usually the first place to look — ahead of the prompt, the model, or the orchestration. A clear tool description that tells the model when to use it and what it returns eliminates more failure modes than prompt engineering alone. Four principles:
1. One tool, one job
Composite tools are tempting but break in practice. If search_and_summarize fails, which half failed? You can't retry just the summary. You can't reuse the search results elsewhere.
Composite:

```python
# Bad: does two things, hard to retry or test
def search_and_summarize(query: str) -> str:
    results = search(query)       # what if this fails?
    summary = summarize(results)  # can't retry just this
    return summary
```

Atomic:

```python
# Good: atomic, composable, independently retryable
def search(query: str) -> list[Document]:
    """Search the corpus. Idempotent, safe to retry."""
    ...

def summarize(documents: list[Document]) -> str:
    """Summarize a list of documents."""
    ...
```

2. Clear contracts
Every tool needs a name the model can understand, a description that says when to use it, and explicit input schemas. The better your descriptions, the less the model guesses wrong. The tool definition below follows the Model Context Protocol (MCP) format — released by Anthropic in November 2024, adopted by OpenAI and Google DeepMind within months, and now backed by 10,000+ public servers. In late 2025, Anthropic donated MCP to the Linux Foundation as part of the Agentic AI Foundation, co-founded with Block and OpenAI. Build your tools as MCP servers and any compliant agent can use them without custom glue.
```json
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "inputSchema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or coordinates"
      }
    },
    "required": ["location"]
  }
}
```

MCP Adoption Timeline
From single-vendor protocol to industry standard in under a year.
- Nov 2024: Anthropic releases MCP
- Mar 2025: OpenAI adopts MCP
- Apr 2025: Google DeepMind adopts MCP
- Mid 2025: 10,000+ public servers
- Nov 2025: Donated to the Linux Foundation
A related concern: tool set size. Every tool definition consumes tokens in every request, and the more tools available, the more likely the model misselects. Anthropic's context engineering guide warns against bloated tool sets and recommends keeping tools token-efficient. Start with the minimum viable toolset. Add tools in response to observed failures, not anticipated needs.
3. Idempotency
Tools should be safely retryable. search is naturally idempotent — same query, same results. send_email is not — retrying sends a duplicate. This isn't hypothetical: n8n users have reported 181 items triggering 542 executions in queue mode, and AI agent tools randomly launching subflows with duplicate data. For tools with side effects, use deduplication keys or confirmation gates. It's a small fix that prevents the most common class of production bug.
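A deduplication key can be derived from the tool's own payload, so a retried call becomes a no-op. A sketch under stated assumptions (the in-memory set stands in for a persistent store; `outbox` stands in for the real mail API):

```python
import hashlib

outbox: list[tuple[str, str]] = []  # stand-in for the real mail API
_sent: set[str] = set()             # in production: a persistent store with a TTL

def send_email_idempotent(to: str, subject: str, body: str) -> str:
    """Hash the payload into a dedup key; a retry with identical
    arguments is suppressed instead of sending a duplicate."""
    key = hashlib.sha256(f"{to}|{subject}|{body}".encode()).hexdigest()
    if key in _sent:
        return f"duplicate suppressed ({key[:8]})"
    _sent.add(key)
    outbox.append((to, subject))    # the actual side effect
    return f"sent ({key[:8]})"
```

The same pattern generalizes to any side-effecting tool: payments, ticket creation, record writes.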
4. Structured errors
Return an error type, a message, and a machine-actionable suggestion — not a raw stack trace. The error classification from the previous section applies directly here: let tools tell the system exactly what went wrong and what to do about it.
```python
# Bad: raw exception the model can't act on
raise Exception("Request failed")

# Good: structured error the system can route
return ToolError(
    type="transient",
    message="Rate limit exceeded",
    retry_after_ms=2000
)
```

From Demo to Production
Getting an agent to work in a demo takes an afternoon. Making it reliable in production is where most of the engineering effort goes.
Guardrails
Without constraints, models will execute destructive actions, blow through rate limits, or spend $50 on a task worth $0.05. In one widely reported case, two agents entered an infinite conversation loop, ran undetected for 11 days, and burned $47,000. Autonomy caps — max steps, retries, and spend per run — have been the most reliable safeguard in practice. Tier your tool permissions: read-only tools run freely, reversible writes get logged, and irreversible actions require explicit confirmation.
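Autonomy caps can be enforced directly in the loop. A minimal sketch (the limits, `llm_step`, and `cost_of` are illustrative names, not a specific framework API):

```python
MAX_STEPS = 20        # hard cap on loop iterations
MAX_SPEND_USD = 1.00  # hard cap on spend per run

def run_agent(task, llm_step, cost_of) -> str:
    """Wrap the agent loop in hard caps on steps and spend.
    llm_step runs one think-act-observe turn; cost_of prices it."""
    spent = 0.0
    for step in range(MAX_STEPS):
        result = llm_step(task)
        spent += cost_of(result)
        if spent > MAX_SPEND_USD:
            return f"aborted: spend cap hit at step {step} (${spent:.2f})"
        if result.get("done"):
            return result["answer"]
    return "aborted: step cap hit"
```

The caps are deliberately dumb: they don't need to understand the task, only to bound the damage when something goes wrong.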
For irreversible actions, pause and show the user what's about to happen. A surprisingly common failure mode: the agent completes a task correctly — just not the task the user actually wanted. A brief confirmation step catches both failure modes: the destructive action and the correctly executed wrong task.
Observability
You can't improve what you can't measure. Every agent run should produce a structured log: what tool was called, with what arguments, how long it took, whether it succeeded, and what it cost.
Structured Event Log

Every tool call should produce a structured record — not a raw log line. A single search event, for example: query "Q3 financial performance", 1,200 input tokens, 350 output tokens, $0.002 cost.
Track task success rate (did the user's goal get met?), tool success rate (which tools are unreliable?), p95 latency (where are the bottlenecks?), and cost per run (is it trending up?). High retry rates signal brittle tools. High replan rates signal poor error handling. Without these metrics, debugging usually comes down to reading raw transcripts.
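A structured event record can be as simple as a dataclass serialized to one JSON line per tool call (field names and values here are illustrative):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToolCallEvent:
    tool: str
    arguments: dict
    duration_ms: int
    success: bool
    input_tokens: int
    output_tokens: int
    cost_usd: float

def log_event(event: ToolCallEvent) -> str:
    """Emit one JSON line per tool call; feed these to a metrics pipeline
    to compute success rates, p95 latency, and cost per run."""
    return json.dumps(asdict(event))

line = log_event(ToolCallEvent(
    tool="search",
    arguments={"query": "Q3 financial performance"},
    duration_ms=412,
    success=True,
    input_tokens=1_200,
    output_tokens=350,
    cost_usd=0.002,
))
```

JSON lines are boring on purpose: they aggregate trivially, and every observability stack can ingest them.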
Cost Discipline
Agent loops consume tokens every turn, and the costs are not linear — they're quadratic. Context grows with each tool call, and every turn re-reads the entire conversation. The cost scales with tokens times number of calls, not tokens alone. A missing retry cap or a swallowed error can turn a $0.10 run into a $100 one before anyone notices.
The defenses are straightforward: set per-run token budgets (if an agent approaches its limit, it wraps up or asks before continuing), use model tiering (Haiku for triage, Sonnet for execution — not every step needs a frontier model), and manage context aggressively. JetBrains Research found that observation masking — simply truncating old tool outputs instead of keeping the full context — cut costs by over 50% while improving solve rates by 2.6%. Without these measures, you'll discover the costs in your invoice instead of your logs.
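Observation masking is straightforward to implement: replace tool outputs older than the last few turns with a short placeholder instead of carrying them forever. A sketch under stated assumptions (the message shape, placeholder text, and threshold are mine, not the JetBrains implementation):

```python
def mask_old_observations(conversation: list[dict], keep_last: int = 3) -> list[dict]:
    """Truncate tool outputs older than the last `keep_last`, so the
    context stops growing without losing the recent working set."""
    tool_idxs = [i for i, m in enumerate(conversation) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[output elided: observation masked]"} if i in stale else m
        for i, m in enumerate(conversation)
    ]
```

Run it before each LLM call; the model keeps its recent observations and a visible marker that older ones existed.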
Model Tiering: Same Task, Different Cost

Not every step needs a frontier model. Triage with Haiku, execute with Sonnet: same result at a fraction of the cost. Illustrative costs for a single agent run: $0.065 tiered (Haiku for low-stakes steps, Sonnet for reasoning-heavy execution) versus $0.280 all-frontier.
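In code, tiering can be a one-line routing decision per step. A sketch (the step categories and model labels are illustrative placeholders, not exact API model identifiers):

```python
def pick_model(step_kind: str) -> str:
    """Route low-stakes steps to a small model, reasoning-heavy
    steps to a larger one."""
    small = "haiku"   # triage, classification, formatting
    large = "sonnet"  # planning, synthesis, tool-use decisions
    return small if step_kind in {"triage", "classify", "format"} else large
```

The classification doesn't need to be clever; even a static mapping from step type to model captures most of the savings.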
Multi-agent architectures — handoff chains, supervisor-worker patterns — exist for genuinely complex orchestration, but they multiply every cost and observability problem listed above. In our experience, improving tools has more often been the answer than adding agents.
What Compounds
The model improves every quarter on its own. Everything else is yours to build — and it compounds across every agent on your platform.
If you missed the context, start with Part 1: Workflow Automations vs Agents.