How AI agents actually work (under the hood)

Updated May 21, 2026 · 6 min read

The reasoning loop

Strip away the libraries and frameworks, and an AI agent is one tight loop:

while True:                              # the only exit is the "finish" action below
    plan = model.think(goal, context, history)
    action = plan.next_action            # call a tool, ask a question, finish
    if action == "finish":
        break
    result = execute(action)             # actually run the tool
    history.append((action, result))
    context = update_context(history)

That's it. The model thinks, picks an action, the system executes it, the result feeds back into the next iteration. Everything else — function calling APIs, retrieval, guardrails, observability — is plumbing around this loop.
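To make the loop concrete, here is a minimal runnable sketch. Everything in it is an assumption for illustration: `ScriptedModel` is a stand-in that replays a fixed plan instead of calling an LLM, and `Action`, `run_agent`, and the `add` tool are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                              # "tool" or "finish"
    tool: str = ""
    args: dict = field(default_factory=dict)


class ScriptedModel:
    """Stand-in for an LLM: replays a fixed plan, one action per call."""

    def __init__(self, script):
        self._script = list(script)

    def think(self, goal, context, history):
        # A real model would reason over goal/context/history.
        # This stub only uses the number of completed steps.
        return self._script[len(history)]


def run_agent(model, goal, tools, max_steps=10):
    history = []
    context = {"goal": goal, "observations": []}
    for _ in range(max_steps):             # bounded, so a confused model cannot loop forever
        action = model.think(goal, context, history)
        if action.kind == "finish":
            break
        result = tools[action.tool](**action.args)   # actually run the tool
        history.append((action, result))
        context = {"goal": goal, "observations": [r for _, r in history]}
    return history


tools = {"add": lambda a, b: a + b}
model = ScriptedModel([Action("tool", "add", {"a": 2, "b": 3}), Action("finish")])
history = run_agent(model, "add two numbers", tools)
```

The sketch swaps the open-ended `while` for a bounded `for`, which anticipates the step ceiling discussed under guardrails.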

Layer 1: the reasoning model

The LLM does the planning step. In 2026, the production-grade options are:

| Model | Best at | Caveat |
|---|---|---|
| Claude Sonnet 4.6 / 4.7 | Reliable tool calling, long-context reasoning, structured outputs | Most expensive |
| GPT-4o / GPT-4.1 | General-purpose, voice agents, large ecosystem | Slightly weaker at long-context |
| Gemini 2.5 Pro / 2.0 Flash | Long context (1M+ tokens), vision, cost-sensitive workloads | Tool-calling stability less mature |
| Llama 3.3 70B (self-hosted) | Privacy-sensitive deployments | Engineering overhead to host |

For most production agents, Claude Sonnet is our default: its tool-calling reliability and structured-output behavior are the strongest in the field as of early 2026.

For deeper model comparison see our ChatGPT vs Claude vs Gemini post.

Layer 2: tool calling

Tools are what turn the LLM from a text generator into something that acts on the world.

In the model SDK, tools are typed:

const findAvailabilityTool = {
  name: "findAvailability",
  description: "Find open appointment slots on a given date",
  input_schema: {
    type: "object",
    properties: {
      date: { type: "string", format: "date" },
      durationMinutes: { type: "number" },
    },
    required: ["date", "durationMinutes"],
  },
};

When the model decides to call this tool, it emits structured JSON:

{
  "tool_name": "findAvailability",
  "input": { "date": "2026-05-26", "durationMinutes": 30 }
}

Your runtime validates the JSON against the schema, calls the actual function, returns the result to the model, and the model continues planning.
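A rough sketch of that validate-then-dispatch step, using only the standard library. The hand-rolled `validate` function is a deliberately small stand-in for a real JSON Schema validator (it ignores `format`, for instance), and `find_availability`, `REGISTRY`, and `handle_tool_call` are hypothetical names introduced here. The tool-call shape mirrors the JSON shown above.

```python
import json

FIND_AVAILABILITY_SCHEMA = {
    "type": "object",
    "properties": {
        "date": {"type": "string", "format": "date"},
        "durationMinutes": {"type": "number"},
    },
    "required": ["date", "durationMinutes"],
}

PY_TYPES = {"string": str, "number": (int, float)}


def validate(args, schema):
    """Return a list of problems; an empty list means the input is acceptable."""
    if not isinstance(args, dict):
        return ["input must be an object"]
    errors = [f"missing field: {name}" for name in schema["required"] if name not in args]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"wrong type for field: {name}")
    return errors


def find_availability(date, durationMinutes):
    # Hypothetical implementation; a real one would query a calendar.
    return [f"{date}T09:00", f"{date}T14:30"]


REGISTRY = {"findAvailability": (FIND_AVAILABILITY_SCHEMA, find_availability)}


def handle_tool_call(raw):
    """Parse, validate, and run one tool call; always return a structured result."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "errors": [f"invalid JSON: {exc.msg}"]}
    if not isinstance(call, dict):
        return {"ok": False, "errors": ["tool call must be an object"]}
    name = call.get("tool_name")
    entry = REGISTRY.get(name) if isinstance(name, str) else None
    if entry is None:
        return {"ok": False, "errors": ["unknown tool"]}
    schema, fn = entry
    errors = validate(call.get("input"), schema)
    if errors:
        return {"ok": False, "errors": errors}
    return {"ok": True, "result": fn(**call["input"])}
```

Returning the error list to the model, rather than raising, gives it a chance to correct the call on the next iteration.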

Three things make tools work in production:

  1. Typed schemas: the runtime rejects invented fields and malformed JSON at the boundary, before any function runs.
  2. Idempotency keys: calling the same tool twice with the same args is safe (no duplicate bookings, payments, etc.).
  3. Failure semantics: clear errors when a tool fails, so the model can recover (retry, ask a clarifying question, or escalate to a human).

Skip any of these and the agent breaks in production within days.
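Idempotency keys and failure semantics can be sketched in a few lines. This is an illustration under assumptions, not a prescribed design: `BookingTool`, `ToolError`, `call_tool`, and the key format `run-42:step-3` are all hypothetical, and the in-memory dict stands in for a durable store.

```python
class ToolError(Exception):
    """A failure the model can act on; `retryable` says whether trying again can help."""

    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable


class BookingTool:
    def __init__(self):
        self._by_key = {}      # idempotency key -> stored result (a real system would persist this)
        self.bookings = []

    def book(self, idempotency_key, slot):
        # Same key seen before: return the stored result instead of booking again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        if slot in self.bookings:
            raise ToolError(f"slot {slot} is already taken", retryable=False)
        self.bookings.append(slot)
        result = {"status": "booked", "slot": slot}
        self._by_key[idempotency_key] = result
        return result


def call_tool(fn, *args):
    """Turn tool exceptions into a structured result the model can read."""
    try:
        return {"ok": True, "result": fn(*args)}
    except ToolError as exc:
        return {"ok": False, "error": str(exc), "retryable": exc.retryable}


tool = BookingTool()
first = call_tool(tool.book, "run-42:step-3", "2026-05-26T09:00")
again = call_tool(tool.book, "run-42:step-3", "2026-05-26T09:00")   # retry of the same step
clash = call_tool(tool.book, "run-43:step-1", "2026-05-26T09:00")   # different run, same slot
```

The retry returns the original booking rather than creating a second one, while the clash comes back as a non-retryable error, which tells the model to pick another slot or ask the user instead of trying again.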

Layer 3: retrieval

The LLM doesn't know your data. Retrieval is how it finds out.

Three retrieval patterns:

Inline — the model fetches what it needs

model: I need to know which POs exist for vendor X from last month.
agent runtime: [calls searchPurchaseOrders({vendor: 'X', from: '2026-04-01'})]
runtime returns: [{id: 'PO-1234', total: 1500}, ...]
model: Continues planning with PO data in context.

Best for: when the right query depends on the goal.
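On the runtime side, inline retrieval is just another tool. A minimal sketch of what might sit behind `searchPurchaseOrders`, assuming an in-memory list in place of a real ERP or database; the records other than PO-1234 and the optional `to_date` parameter are invented for illustration.

```python
from datetime import date

# Hypothetical data; a real tool would query the system of record.
PURCHASE_ORDERS = [
    {"id": "PO-1234", "vendor": "X", "date": date(2026, 4, 3), "total": 1500},
    {"id": "PO-1240", "vendor": "Y", "date": date(2026, 4, 9), "total": 320},
    {"id": "PO-1198", "vendor": "X", "date": date(2026, 3, 28), "total": 980},
]


def search_purchase_orders(vendor, from_date, to_date=None):
    """Return POs for one vendor within an inclusive ISO date range."""
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date) if to_date else date.max
    return [
        {"id": po["id"], "total": po["total"]}      # return only what the model needs
        for po in PURCHASE_ORDERS
        if po["vendor"] == vendor and start <= po["date"] <= end
    ]


results = search_purchase_orders("X", "2026-04-01")
```

Trimming each record to the fields the model needs keeps the context small, which matters once result sets grow.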

Pre-fetched — the system fills context up front

agent runtime: Embeds the user goal, retrieves top-10 relevant chunks, builds the system prompt with those chunks included.
model: Reasons with everything already in context.

Best for: knowledge-base Q&A where the relevant content is predictable from the query.
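A toy version of the pre-fetch step, to show the shape of it. The `embed` function here is only a word-count stand-in for a real embedding model, and `top_k`, `build_system_prompt`, and the sample chunks are hypothetical; a production system would typically use learned embeddings and a vector index.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k):
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_system_prompt(goal, chunks, k=2):
    context = "\n".join(f"- {c}" for c in top_k(goal, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}"


CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original order number.",
]
prompt = build_system_prompt("How long do refunds take after a return?", CHUNKS)
```

With `k=2`, the two return-related chunks make it into the prompt and the unrelated office-hours chunk is left out.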

Hybrid

Most production agents combine both: pre-fetch the obvious context, let the model inline-call retrieval for anything else.

For production RAG patterns see our RAG patterns post.

Layer 4: guardrails

Without guardrails, an agent can do things you very much don't want it to.

Production guardrails:

  • Approval gates on irreversible actions. The agent stops in a "pending approval" state until a designated human approves via Slack / dashboard / email.
  • Spend caps per agent run. Hard stop if LLM token cost or downstream API cost exceeds threshold.
  • Step ceilings. Maximum N tool calls per agent run. Beyond N, the agent gives up cleanly.
  • Scope refusals. System prompt defines what's in scope; user inputs that try to push outside scope get politely refused.
  • Prompt-injection detection. User-supplied content (emails, document contents) gets sanitized; "ignore previous instructions" patterns are detected and refused.

We design every agent assuming it will receive input that pushes it off course. Sometimes that's adversarial; usually it's accidental (a user being weird, a document containing unexpected content). Either way, guardrails are what stop the agent from doing something it shouldn't.
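Three of these guardrails (step ceilings, spend caps, approval gates) reduce to a check that runs before every tool call. A minimal sketch, assuming hypothetical names throughout (`RunGuard`, `GuardrailStop`, the `sendPayment` tool) and a cost estimate supplied by the caller:

```python
class GuardrailStop(Exception):
    """Raised when a run must halt or wait; the message says why."""


class RunGuard:
    def __init__(self, max_steps, max_cost_usd, irreversible_tools):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.irreversible_tools = set(irreversible_tools)
        self.steps = 0
        self.cost_usd = 0.0

    def check(self, tool_name, estimated_cost_usd, approved=False):
        """Call before every tool execution; raises instead of letting the run continue."""
        if self.steps + 1 > self.max_steps:
            raise GuardrailStop("step ceiling reached")
        if self.cost_usd + estimated_cost_usd > self.max_cost_usd:
            raise GuardrailStop("spend cap reached")
        if tool_name in self.irreversible_tools and not approved:
            raise GuardrailStop(f"pending approval: {tool_name}")
        # Only count the step once every check has passed.
        self.steps += 1
        self.cost_usd += estimated_cost_usd


guard = RunGuard(max_steps=3, max_cost_usd=0.50, irreversible_tools=["sendPayment"])
guard.check("searchInvoices", 0.10)          # allowed
try:
    guard.check("sendPayment", 0.10)         # irreversible and not yet approved
except GuardrailStop as stop:
    reason = str(stop)
```

In a real runtime the "pending approval" stop would park the run and notify a human; calling `check` again with `approved=True` lets it proceed.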

Layer 5: evals

How you know the agent is good.

An eval suite is a set of representative cases — typical inputs and expected behaviors — that we run on every prompt or model change. Three flavors:

Snapshot evals

For each input, the expected output is fixed. The eval checks for exact or semantic match. Useful for: structured extraction, classification, deterministic outputs.
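A snapshot eval can be as small as a list of (input, expected) pairs and a loop. In this sketch `classify_ticket` is a hypothetical keyword classifier standing in for the agent under test, and the cases and the 0.9 threshold are invented for illustration.

```python
def classify_ticket(text):
    """Hypothetical system under test; in practice this would call the agent."""
    lowered = text.lower()
    if "refund" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return "other"


CASES = [
    ("I want a refund for my last order", "billing"),
    ("I forgot my password", "account"),
    ("Where is your office?", "other"),
    ("Please reset my login", "account"),
]


def run_snapshot_evals(fn, cases):
    """Run every case and return (score, failures) using exact-match comparison."""
    failures = []
    for inp, expected in cases:
        actual = fn(inp)
        if actual != expected:
            failures.append((inp, expected, actual))
    return 1 - len(failures) / len(cases), failures


score, failures = run_snapshot_evals(classify_ticket, CASES)
passed = score >= 0.9        # a CI job could fail the build when this is False
```

Here the last case fails because the stub never learned that "login" means an account issue, so the score lands at 0.75 and the gate trips. That is the point of the suite: the regression shows up in CI rather than in production.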

Behavior evals

For each input, the expected behavior is described. A grader (often another LLM, sometimes a human) checks whether the actual behavior matches. Useful for: judgment-heavy tasks where there are many valid answers.

Live evals

Sampled real-production traces, scored after the fact. Useful for catching drift you didn't anticipate.

We build evals during discovery. They run in CI. They surface scores on a dashboard. Without them, the agent silently degrades.

Layer 6: observability

Per-trace logging. Every input, every tool call, every model decision, every output, every cost.

We use Langfuse as the default observability layer for AI-touched flows. It gives us:

  • Per-trace timeline (what happened, in order, with timing).
  • Cost attribution per trace.
  • Filter / search by user, by outcome, by tool used.
  • Replay any past trace.
  • Eval scores attached to traces.

Plus structured logs to Sentry for errors and a dashboard for the operator (success rate, cost, latency, queue depth).

Without observability you can't debug, evaluate, or improve. The first day after deploy you'll need to know why something went wrong; observability is how.
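To show what per-trace logging captures, here is a standard-library sketch. It is not the Langfuse API; `Trace` and `span` are hypothetical names, and a real setup would send these events to an observability backend instead of keeping them in memory.

```python
import json
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects ordered, timed events for one agent run."""

    def __init__(self, user_id):
        self.trace_id = uuid.uuid4().hex
        self.user_id = user_id
        self.events = []

    @contextmanager
    def span(self, kind, name, cost_usd=0.0):
        start = time.perf_counter()
        event = {"kind": kind, "name": name, "cost_usd": cost_usd, "status": "ok"}
        try:
            yield event                      # caller may attach inputs and outputs
        except Exception as exc:
            event["status"] = f"error: {exc}"
            raise
        finally:
            # Runs on success and failure, so every step is recorded.
            event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self.events.append(event)

    def total_cost(self):
        return sum(e["cost_usd"] for e in self.events)

    def to_json(self):
        return json.dumps(
            {"trace_id": self.trace_id, "user_id": self.user_id, "events": self.events}
        )


trace = Trace("user-1")
with trace.span("model", "plan", cost_usd=0.002) as event:
    event["output"] = "call findAvailability"
with trace.span("tool", "findAvailability"):
    pass
```

Because each event carries its kind, name, cost, and timing, the same records can drive a timeline view, per-trace cost attribution, and filtering by tool.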

Putting it together

A production-grade agent has, at minimum:

  1. A defined goal interface (webhook, button, schedule).
  2. A reasoning model (Claude / GPT / Gemini), abstracted behind a provider interface.
  3. A set of typed tools with idempotency and failure semantics.
  4. Retrieval against your data.
  5. Guardrails: approval gates, spend caps, step ceilings, scope, prompt-injection detection.
  6. An eval suite that runs in CI.
  7. Per-trace observability with a dashboard.

Skip any of those and the agent works in the demo and fails in production. Honor them all and the agent is no more mysterious than a well-written background worker — just one that happens to use an LLM as its decision engine.

Where it pairs

If you want to see how this architecture lands in concrete agent shapes, see our agent type pages — six concrete shapes with stacks, costs, and examples.

If you want to see it shipped, the Document Intake Agent case study walks through every layer in a real AP automation build.

If you have an agent in mind and want a feasibility take, drop us a note — one paragraph is enough.
