Building Agentic Systems: From Simple Loops to Production

Part 2 of the On Using AI Better series. A practical guide to building agentic systems — from the core reasoning loop and tool design to production-readiness with MCP, guardrails, and observability.

Most agent frameworks advertise themselves with elaborate architecture diagrams — multi-layered orchestrators, state machines, planning subsystems. In practice, the teams shipping agents tend to end up somewhere simpler. Octomind, an AI testing startup, used LangChain in production for over a year before the friction became untenable — they needed to dynamically change which tools their agents could access based on state, and the framework had no mechanism for it. Their team ended up "spending as much time understanding and debugging LangChain as building features," so they ripped it out. Anthropic's building guide arrives at a similar conclusion: "The most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns."

In Part 1, we argued that investing in reliable, composable tools beats building ever-more-complex workflow graphs. Here we make that concrete.

TL;DR

  • The agent loop is while not done: think → act → observe. Most frameworks wrap this with varying amounts of abstraction.
  • Tool descriptions and error handling tend to matter more than model choice or prompt tweaks. See Anthropic's context engineering guide.
  • Classify errors and handle them in code where possible — LLM reasoning is expensive and slow for retries.
  • Production concerns are mostly guardrails, observability, and cost caps.

The Agent Loop

Strip away the abstractions and every agent framework runs the same while loop.

python
tools = [search, retrieve, send_email, ...]
conversation = [system_prompt, user_message]

while True:
    response = llm(conversation, tools)
    conversation.append(response)  # keep the assistant turn in context
    if response.has_tool_calls:
        for call in response.tool_calls:
            result = execute(call)
            conversation.append(tool_result(result))
    else:
        return response.text  # final answer

The model gets a conversation, decides whether to call a tool or return a final answer, and the loop continues until it's done. The framework provides the loop and safety rails. The LLM provides the reasoning. Everything else is configuration.

Think, act, observe, repeat.

For most tasks, this single loop is enough. More sophisticated patterns exist (multi-step planners, self-improving agents, tree-search), but there's a reason the Anthropic guide says to "start by using LLM APIs directly: many patterns can be implemented in a few lines of code." It's worth trying a while loop and a few functions before reaching for LangGraph.

What the simple loop hides is the cost curve. Agent costs are not linear — they're quadratic. Context grows with each tool call, so every turn is more expensive than the last. In one analysis of real conversations, a single agent session cost ~$13, with cache reads representing 87% of total expenses by the end. By 50,000 tokens, cache reads dominate.
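The arithmetic behind that curve: if every turn re-reads the whole conversation, total tokens processed grow with the square of the turn count. A toy calculation (the token counts here are hypothetical, chosen only to show the shape):

```python
def total_tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Each turn re-reads the cumulative context, so the model processes
    k + 2k + ... + nk = k * n * (n + 1) / 2 tokens across the run."""
    return sum(tokens_per_turn * turn for turn in range(1, turns + 1))

# Doubling the number of turns roughly quadruples total tokens processed.
print(total_tokens_processed(10, 1000))  # 55000 tokens over 10 turns
print(total_tokens_processed(20, 1000))  # 210000 tokens over 20 turns
```

This is why prompt caching dominates the bill on long runs: most of those tokens are re-reads of context the model has already seen.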

Handle errors in code, not in the LLM. Retry transient failures automatically, return structured feedback for input errors, and reserve LLM reasoning for semantic mismatches.

What Happens in One Turn

Each iteration looks straightforward: call a tool, get a result. But production systems need to handle several concerns within that single turn. Here's how to think about them without over-engineering.

A multi-tool execution flow — concrete example of what happens in a single turn.

Before Acting

Triage is deciding what to do with the input. Most of the time the model handles this implicitly — it reads the message and picks a tool. For production systems, it helps to think in three buckets: is this an instruction (do something), a clarification (ask before doing), or neither (respond directly)? Keep the categories few. The more you add, the more often the model misclassifies.

A chat message being classified as INSTRUCTION during triage.

Planning exists on a spectrum. Explicit plans (structured step lists with dependencies) add value only when steps depend on each other — when step 3 needs the output of step 1. For sequential tasks, the model decomposes and sequences actions on its own. OpenAI's agent guide recommends the same starting point: a single model call, adding planning layers only when you observe the agent failing without them. The common trap is crafting a 20-step plan upfront that breaks on step 3 when reality doesn't match assumptions. Keep plans short. Let the model replan as it learns from tool results.

After Acting

Once a tool returns, the model evaluates: proceed (result looks good), retry (transient error), replan (approach isn't working), or stop (unrecoverable failure). This is where the principle from above pays off — classify the error, route it to the right handler, and only invoke the LLM when it's genuinely needed:

  • Transient (handled by the system). Example: "Rate limit exceeded." Strategy: retry automatically with backoff.
  • Input (handled by the system, then the model). Example: "Missing required field: query." Strategy: return the error to the model so it can fix the call.
  • Semantic (handled by the model). Example: "0 results found." Strategy: the model reflects and replans.
  • Fatal (handled by the system, then the user). Example: "Authentication failed: invalid API key." Strategy: stop execution and surface the error to the user.

Tiered error responses — only semantic errors need LLM reasoning.

Transient errors retry automatically. Input errors tell the model what to fix. Only semantic errors need actual reasoning. This classification alone eliminates most unnecessary LLM calls in the error path.
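A minimal sketch of this routing, assuming a hypothetical ToolError type carried in tool results (the names and fields are illustrative, not a real library's API):

```python
import time
from dataclasses import dataclass

@dataclass
class ToolError:
    type: str  # "transient" | "input" | "semantic" | "fatal"
    message: str
    retry_after_ms: int = 0

def route_error(err: ToolError, call, execute, max_retries: int = 3):
    """Route an error to the cheapest handler that can fix it."""
    if err.type == "transient":
        # System handles it: retry with backoff, no LLM call needed.
        for attempt in range(max_retries):
            time.sleep(err.retry_after_ms / 1000 * (2 ** attempt))
            result = execute(call)
            if not isinstance(result, ToolError):
                return result
        return ToolError("fatal", f"Retries exhausted: {err.message}")
    if err.type == "input":
        # Feed the structured message back so the model can fix the call.
        return f"Tool error: {err.message}. Fix the arguments and retry."
    if err.type == "semantic":
        # Only here does the LLM need to reflect and replan.
        return f"Observation: {err.message}. Consider a different approach."
    # Fatal: stop and surface to the user.
    raise RuntimeError(err.message)
```

Note that only the semantic branch costs an extra model turn; the other three resolve in plain code.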

Designing Tools

When an agent underperforms, the tools are usually the first place to look — ahead of the prompt, the model, or the orchestration. A clear tool description that tells the model when to use it and what it returns eliminates more failure modes than prompt engineering alone. Four principles:

1. One tool, one job

Composite tools are tempting but break in practice. If search_and_summarize fails, which half failed? You can't retry just the summary. You can't reuse the search results elsewhere.

Composite:

python
# Bad: does two things, hard to retry or test
def search_and_summarize(query: str) -> str:
    results = search(query)      # what if this fails?
    summary = summarize(results) # can't retry just this
    return summary

Drawbacks: can't retry independently, hard to test, single point of failure.

Atomic:

python
# Good: atomic, composable, independently retryable
def search(query: str) -> list[Document]:
    """Search the corpus. Idempotent, safe to retry."""
    ...

def summarize(documents: list[Document]) -> str:
    """Summarize a list of documents."""
    ...

Benefits: independently retryable, composable, idempotent.

2. Clear contracts

Every tool needs a name the model can understand, a description that says when to use it, and explicit input schemas. The better your descriptions, the less the model guesses wrong. The tool definition below follows the Model Context Protocol (MCP) format — released by Anthropic in November 2024, adopted by OpenAI and Google DeepMind within months, and now backed by 10,000+ public servers. In late 2025, Anthropic donated MCP to the Linux Foundation as part of the Agentic AI Foundation, co-founded with Block and OpenAI. Build your tools as MCP servers and any compliant agent can use them without custom glue.

Tool definition (MCP format)
{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "inputSchema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name or coordinates"
      }
    },
    "required": ["location"]
  }
}

MCP Adoption Timeline

From single-vendor protocol to industry standard in under a year.

  • Nov 2024: Anthropic releases MCP
  • Mar 2025: OpenAI adopts MCP
  • Apr 2025: Google DeepMind adopts MCP
  • Mid 2025: 10,000+ public servers
  • Nov 2025: Donated to the Linux Foundation

A related concern: tool set size. Every tool definition consumes tokens in every request, and the more tools available, the more likely the model misselects. Anthropic's context engineering guide warns against bloated tool sets and recommends keeping tools token-efficient. Start with the minimum viable toolset. Add tools in response to observed failures, not anticipated needs.

3. Idempotency

Tools should be safely retryable. search is naturally idempotent — same query, same results. send_email is not — retrying sends a duplicate. This isn't hypothetical: n8n users have reported 181 items triggering 542 executions in queue mode, and AI agent tools randomly launching subflows with duplicate data. For tools with side effects, use deduplication keys or confirmation gates. It's a small fix that prevents the most common class of production bug.
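A sketch of the deduplication-key approach, using a hypothetical send_email_once wrapper (the in-memory set stands in for what would be a persistent store in production):

```python
import hashlib

_sent: set[str] = set()  # in production: a database or cache, not a module global

def send_email_once(to: str, subject: str, body: str, run_id: str) -> str:
    """Wrap a side-effecting tool with a deduplication key so retries are safe."""
    # Same run + same arguments -> same key -> at most one actual send.
    key = hashlib.sha256(f"{run_id}|{to}|{subject}|{body}".encode()).hexdigest()
    if key in _sent:
        return "skipped: duplicate send suppressed"
    _sent.add(key)
    # send_email(to, subject, body)  # the real side effect would go here
    return "sent"
```

Scoping the key to the run means a legitimate second email in a later run still goes out; only retries within the same run are suppressed.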

4. Structured errors

Return an error type, a message, and a machine-actionable suggestion — not a raw stack trace. The error classification from the previous section applies directly here: let tools tell the system exactly what went wrong and what to do about it.

Structured errors vs raw exceptions

python
# Bad: raw exception the model can't act on
raise Exception("Request failed")

# Good: structured error the system can route
return ToolError(
    type="transient",
    message="Rate limit exceeded",
    retry_after_ms=2000
)

From Demo to Production

Getting an agent to work in a demo takes an afternoon. Making it reliable in production is where most of the engineering effort goes.

Guardrails

Without constraints, models will execute destructive actions, blow through rate limits, or spend $50 on a task worth $0.05. In one widely reported case, two agents entered an infinite conversation loop, ran undetected for 11 days, and burned $47,000. Autonomy caps — max steps, retries, and spend per run — have been the most reliable safeguard in practice. Tier your tool permissions:

  • Read-only: runs freely
  • Write: needs confirmation
  • Destructive: requires explicit approval
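One way to sketch these caps and tiers in code. The tier mapping, confirm callback, and cost fields are assumptions for illustration, not any specific framework's API:

```python
# Hypothetical tier mapping: every tool name gets a permission tier.
TIERS = {"search": "read", "draft_email": "write", "delete_record": "destructive"}

def guarded_execute(call: dict, execute, confirm, state: dict,
                    max_steps: int = 20, max_cost_usd: float = 1.00):
    """Enforce autonomy caps and permission tiers around one tool call."""
    # Autonomy caps: stop before the run can overspend or loop forever.
    if state["steps"] >= max_steps or state["cost_usd"] >= max_cost_usd:
        raise RuntimeError("Autonomy cap reached: stopping run")
    # Permission tiers: unknown tools default to the strictest tier.
    tier = TIERS.get(call["name"], "destructive")
    if tier != "read" and not confirm(call, tier):
        return {"error": "input", "message": f"{tier} action not approved"}
    state["steps"] += 1
    result = execute(call)
    state["cost_usd"] += result.get("cost_usd", 0.0)
    return result
```

Defaulting unknown tools to the destructive tier fails safe: a tool someone forgot to classify requires approval rather than running freely.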

For irreversible actions, pause and show the user what's about to happen. A surprisingly common failure mode: the agent completes a task correctly — just not the task the user actually wanted. A brief confirmation step catches both categories: destructive mistakes and misunderstood intent.

Observability

You can't improve what you can't measure. Every agent run should produce a structured log: what tool was called, with what arguments, how long it took, whether it succeeded, and what it cost.

Structured Event Log

Every tool call should produce a structured record — not a raw log line. For example:

{
  "run_id": "run_a1b2c3d4",
  "turn": 3,
  "timestamp": "2025-10-01T14:32:15Z",
  "tool": "search_documents",
  "arguments": { "query": "Q3 financial performance" },
  "status": "success",
  "duration_ms": 840,
  "input_tokens": 1200,
  "output_tokens": 350,
  "cost_usd": 0.002
}
Track task success rate (did the user's goal get met?), tool success rate (which tools are unreliable?), p95 latency (where are the bottlenecks?), and cost per run (is it trending up?). High retry rates signal brittle tools. High replan rates signal poor error handling. Without these metrics, debugging usually comes down to reading raw transcripts.
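These metrics fall straight out of the structured records. A sketch, assuming each event carries the status and duration_ms fields shown above:

```python
import math

def tool_success_rate(events: list[dict]) -> float:
    """Fraction of tool calls that succeeded."""
    ok = sum(1 for e in events if e["status"] == "success")
    return ok / len(events)

def p95_latency_ms(events: list[dict]) -> int:
    """p95 latency via the nearest-rank percentile over recorded durations."""
    durations = sorted(e["duration_ms"] for e in events)
    rank = math.ceil(0.95 * len(durations)) - 1
    return durations[rank]
```

Run these per tool rather than globally; an aggregate success rate hides the one brittle tool dragging everything down.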

Cost Discipline

Agent loops consume tokens every turn, and as noted above the costs grow quadratically: context grows with each tool call, and every turn re-reads the entire conversation. The cost scales with tokens times number of calls, not tokens alone. A missing retry cap or a swallowed error can turn a $0.10 run into a $100 one before anyone notices.

The defenses are straightforward: set per-run token budgets (if an agent approaches its limit, it wraps up or asks before continuing), use model tiering (Haiku for triage, Sonnet for execution — not every step needs a frontier model), and manage context aggressively. JetBrains Research found that observation masking — simply truncating old tool outputs instead of keeping the full context — cut costs by over 50% while improving solve rates by 2.6%. Without these measures, you'll discover the costs in your invoice instead of your logs.
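A minimal sketch of observation masking, assuming a conversation of role-tagged message dicts (this illustrates the idea behind the JetBrains result, not their implementation):

```python
def mask_old_observations(conversation: list[dict], keep_last: int = 3,
                          placeholder: str = "[output truncated]") -> list[dict]:
    """Replace tool outputs older than the last `keep_last` observations
    with a short placeholder, keeping full text only for recent ones."""
    tool_indices = [i for i, m in enumerate(conversation) if m["role"] == "tool"]
    recent = set(tool_indices[-keep_last:])
    masked = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "tool" and i not in recent:
            masked.append({**msg, "content": placeholder})  # copy, don't mutate
        else:
            masked.append(msg)
    return masked
```

Apply it to the conversation before each llm call in the agent loop; old observations rarely matter once the model has acted on them, and the savings compound with the quadratic cost curve.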

Model Tiering: Same Task, Different Cost

Not every step needs a frontier model. Triage with Haiku, execute with Sonnet: same result, fraction of the cost.

Tiered run ($0.065 total):
  • Triage (Haiku): $0.001
  • Planning (Sonnet): $0.012
  • Execution (Sonnet): $0.050
  • Summary (Haiku): $0.002

Frontier run ($0.280 total):
  • Triage (Opus): $0.020
  • Planning (Opus): $0.080
  • Execution (Opus): $0.150
  • Summary (Opus): $0.030

The tiered approach is about 4.3x cheaper for the same task.

Illustrative costs for a single agent run. Tiered approach uses Haiku for low-stakes steps, Sonnet for reasoning-heavy execution.

Multi-agent architectures — handoff chains, supervisor-worker patterns — exist for genuinely complex orchestration, but they multiply every cost and observability problem listed above. In our experience, improving tools has more often been the answer than adding agents.

What Compounds

The model improves every quarter on its own. Everything else is yours to build — and it compounds across every agent on your platform.

Improves on its own

  • Model reasoning
  • Context windows
  • Inference speed
  • Capabilities

Yours to build

  • Tool quality
  • Error handling
  • Cost controls
  • Guardrails
  • Observability

If you missed the context, start with Part 1: Workflow Automations vs Agents.