AI Agents Development
Custom agents that read documents, hold conversations, take phone calls, and execute multi-step workflows — wired into the systems you already run.
What an AI agent actually is (and isn't)
An AI agent is a system that turns a goal into a sequence of tool calls. Give it an objective — "extract every line item from this PDF and post it to NetSuite" — and it plans the steps, picks the right tools (a vision model, a schema validator, an API call to your ERP), executes them, watches for failure modes, and either finishes the job or hands off to a human with context.
That is materially different from:
- A chatbot (a single-turn or short conversational interface, usually without tool calls)
- A ChatGPT integration (a thin wrapper around a hosted model)
- A prompt template (one static prompt with variables substituted in)
The difference matters because agents are how AI starts replacing real work. Chatbots answer questions. Agents complete jobs.
If you want the deeper definition, our What is an AI agent guide walks through the anatomy with diagrams.
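To make the definition above concrete, here is a minimal sketch of the loop: plan a step, call a tool, record the result, and either finish or hand off with context. Every name here (`runAgent`, `Planner`, `Step`) is hypothetical, and the planner stands in for the model call; this illustrates the shape of an agent, not any particular framework's API.

```typescript
// A tool is a named async function the agent may call. Shapes are illustrative.
type Tool = (input: string) => Promise<string>;

type Step =
  | { kind: "call"; tool: string; input: string }
  | { kind: "done"; result: string }
  | { kind: "handoff"; reason: string };

// The planner stands in for the model: given the goal and the history so far,
// it decides the next step. In a real agent this would be an LLM call.
type Planner = (goal: string, history: string[]) => Promise<Step>;

type Outcome =
  | { status: "done"; result: string; history: string[] }
  | { status: "handoff"; reason: string; history: string[] };

async function runAgent(
  goal: string,
  planner: Planner,
  tools: Record<string, Tool>,
  maxSteps = 10,
): Promise<Outcome> {
  const history: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = await planner(goal, history);
    if (step.kind === "done") return { status: "done", result: step.result, history };
    if (step.kind === "handoff") return { status: "handoff", reason: step.reason, history };

    const tool = tools[step.tool];
    if (!tool) {
      history.push(`error: unknown tool ${step.tool}`);
      continue;
    }
    try {
      history.push(`${step.tool} -> ${await tool(step.input)}`);
    } catch (err) {
      // Failures are recorded so the planner can react on the next turn.
      history.push(`${step.tool} failed: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  // Step budget exhausted: hand off with the history rather than loop forever.
  return { status: "handoff", reason: "step budget exhausted", history };
}
```

The step budget and the explicit `handoff` outcome are the two pieces a chatbot or prompt template does not have: the loop can fail, and when it does, a human receives the full history.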
When you actually need an agent
You need an agent (vs. plain automation, vs. a custom build, vs. a SaaS tool) when all of these are true:
- The work involves unstructured inputs — PDFs, emails, voice, free-form chat, scanned forms.
- Each instance requires judgement — not a fixed rule (if X then Y), but a decision based on context.
- The volume is non-trivial — at least dozens of items per day, ideally hundreds or thousands.
- Errors are visible — you can detect when the agent gets something wrong, either via downstream signal (a returned invoice, a complaint) or via a human review queue.
If your workflow is deterministic (always the same rules, structured input), don't build an agent — build automation. Our AI agents vs automation deep-dive walks through the decision in detail.
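The checklist above can be written down as a small decision function. This is a sketch under stated assumptions: the `WorkflowProfile` fields, the `"reconsider"` label, and the default volume threshold of 24 items per day (one reading of "dozens") are all hypothetical, not figures from an engagement.

```typescript
interface WorkflowProfile {
  unstructuredInputs: boolean; // PDFs, emails, voice, free-form chat, scanned forms
  requiresJudgement: boolean;  // a contextual decision, not a fixed if-X-then-Y rule
  itemsPerDay: number;
  errorsDetectable: boolean;   // downstream signal or a human review queue
}

type Recommendation = "agent" | "automation" | "reconsider";

function recommend(p: WorkflowProfile, minItemsPerDay = 24): Recommendation {
  // Deterministic rules on structured input: plain automation is the better fit.
  if (!p.unstructuredInputs && !p.requiresJudgement) return "automation";

  // An agent is only recommended when all four conditions hold.
  const allTrue =
    p.unstructuredInputs &&
    p.requiresJudgement &&
    p.itemsPerDay >= minItemsPerDay &&
    p.errorsDetectable;
  return allTrue ? "agent" : "reconsider";
}
```

For example, `recommend({ unstructuredInputs: true, requiresJudgement: true, itemsPerDay: 5, errorsDetectable: true })` returns `"reconsider"`, because the volume does not justify the build.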
How we build agents — the six-phase loop
Every agent we ship goes through the same six phases. The structure exists because the failure modes are predictable: skip discovery and you build the wrong thing; skip evals and you can't tell when it regresses; skip observability and you can't fix it in production.
1. Discover — one to two weeks
We sit with the people doing the actual work. We map inputs, outputs, and the messy bits the documentation always misses. We identify the success metric in finance-grade language — hours saved, errors avoided, cycle time cut — not in vanity AI metrics like "accuracy." We write the spec.
Deliverables: a workflow map, a ranked opportunity list, success criteria, a draft agent specification, and a go/no-go decision template.
2. Design — three to five days
We pick the right shape: agent vs. automation vs. SaaS vs. custom build. We pick the LLM provider (Claude / OpenAI / Gemini / self-hosted) based on the task. We pick the framework (LangGraph for complex multi-step, plain SDK for simple jobs, Anthropic's Computer Use API for browser tasks). We write the architecture diagram and the cost/timeline estimate.
Deliverables: architecture diagram, risk register, cost & timeline estimate, signed spec.
3. Prototype — two to three weeks
We build a working slice on real data, real failure modes, real numbers. Not slideware. The prototype is end-to-end — it ingests the actual inputs, calls the actual tools, produces the actual outputs — but with thin scaffolding around the parts that need production hardening. We run it side by side with the human process for a week.
Deliverables: working prototype, evaluation results on real data, go/no-go decision.
4. Build — three to eight weeks
Production engineering. Authentication. Logging structured for replay. Idempotency keys on every action. Retry policies. Rate limit handling. Cost guardrails. Tests, types, and code review. CI/CD pipeline configured for your team. Eval suite running on every PR. Observability dashboard.
Deliverables: production code, CI/CD, test suite, eval dashboard, runbooks.
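Two of the items above, idempotency keys and retry policies, only work as a pair: a retry is safe when the receiver can recognise a repeated request. The sketch below shows one way that could look. `withRetry`, `IdempotentStore`, and the exponential backoff schedule are illustrative assumptions, and the store is in-memory only for the sake of the example.

```typescript
type Action<T> = (idempotencyKey: string) => Promise<T>;

interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(key: string, action: Action<T>, opts: RetryOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      // The same key is sent on every attempt so the receiver can deduplicate.
      return await action(key);
    } catch (err) {
      lastError = err;
      if (attempt < opts.maxAttempts) {
        // Exponential backoff: base, 2x base, 4x base, ...
        await sleep(opts.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}

// Receiver side: remember results by key so a retried request is not applied twice.
class IdempotentStore<T> {
  private readonly results = new Map<string, T>();

  apply(key: string, effect: () => T): T {
    if (this.results.has(key)) return this.results.get(key) as T;
    const result = effect();
    this.results.set(key, result);
    return result;
  }
}
```

The case this guards against is a timeout after the write has already committed: the caller retries, the receiver sees the same key, and the invoice is posted once rather than twice.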
5. Deploy — one week
We roll out in waves: 10% of traffic, 50%, 100%. We watch the dashboards. We compare outputs to the human baseline. We tune. We document what we learn.
Deliverables: phased rollout, on-call rota, monitoring dashboards, post-launch summary.
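A wave-based rollout needs a stable way to decide which items the agent handles at each percentage. One common approach, sketched here as an assumption rather than a description of any specific deployment, is to hash each item ID into one of 100 buckets. The helper names are hypothetical, and the hash is FNV-1a computed over UTF-16 code units.

```typescript
// FNV-1a 32-bit hash: small, dependency-free, and stable across runs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// The waves named in the rollout plan.
const WAVES = [10, 50, 100] as const;

// An item is in the rollout when its bucket (0-99) falls below the percentage.
function inRollout(itemId: string, percent: number): boolean {
  return fnv1a(itemId) % 100 < percent;
}
```

Because the bucket depends only on the ID, anything routed to the agent at 10% stays with the agent at 50% and 100%, which keeps the comparison against the human baseline consistent between waves.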
6. Iterate — ongoing
Agents drift. Models upgrade. Schemas change. We keep a retainer slot for monthly eval runs, prompt tuning, and the inevitable "while you're at it" features.
Deliverables: monthly evals, prompt versioning, quarterly roadmap.
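A recurring eval run reduces to two questions: what fraction of known cases still pass, and is that fraction worse than last time? The sketch below assumes exact-match scoring and a 2-point tolerance; both are simplifying assumptions, and `runEvals` and `hasRegressed` are hypothetical names rather than Promptfoo or Braintrust APIs.

```typescript
interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

interface EvalReport {
  passRate: number;   // between 0 and 1
  failures: string[]; // ids of failing cases
}

async function runEvals(
  cases: EvalCase[],
  agent: (input: string) => Promise<string>,
): Promise<EvalReport> {
  if (cases.length === 0) throw new Error("eval suite is empty");
  const failures: string[] = [];
  for (const c of cases) {
    try {
      const actual = await agent(c.input);
      if (actual.trim() !== c.expected.trim()) failures.push(c.id);
    } catch {
      // An agent error counts as a failed case rather than aborting the run.
      failures.push(c.id);
    }
  }
  return { passRate: (cases.length - failures.length) / cases.length, failures };
}

// Flags a regression when the pass rate drops more than `tolerance` below baseline.
function hasRegressed(current: EvalReport, baselinePassRate: number, tolerance = 0.02): boolean {
  return current.passRate < baselinePassRate - tolerance;
}
```

The same check can run on every PR and on a monthly schedule; only the baseline it compares against changes.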
The tech we use
We optimise for shipping, not for résumé-driven engineering. Our typical stack:
| Layer | Default choice | Why |
|---|---|---|
| Reasoning model | Claude Sonnet 4.6 / 4.7 | Best reliability for tool calling and long-context reasoning |
| Vision model | Claude or Gemini 2.0 Flash | Strong at structured extraction from PDFs and images |
| Voice model | GPT-4o Realtime / Whisper | Sub-second response, multilingual, robust to accents |
| Orchestration | LangGraph + plain SDK | LangGraph for multi-step agents with branches; plain SDK when one tool call is enough |
| Retrieval | pgvector / Pinecone / Vectorize | pgvector if you already have Postgres; managed for everything else |
| Observability | Langfuse / Helicone / OpenTelemetry | Per-call traces, cost attribution, regression detection |
| Eval framework | Custom + Promptfoo / Braintrust | Promptfoo for offline, Braintrust for production traces |
| Deploy target | Firebase / Vercel / your cloud | Firebase App Hosting is our default; we go wherever your data lives |
| Tooling glue | TypeScript everywhere | One language end-to-end keeps the team small |
We are provider-agnostic by design. Every agent is abstracted behind a thin interface so you can swap models without rewriting the agent.
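A thin interface of this kind might look like the following sketch. `ModelProvider` and `completeWithFallback` are hypothetical names; a real adapter would wrap a vendor SDK behind `complete`, which is deliberately left out here.

```typescript
// The only surface the agent sees. Each vendor gets its own adapter behind it.
interface ModelProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Tries providers in order and returns the first successful completion.
async function completeWithFallback(
  providers: ModelProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, text: await p.complete(prompt) };
    } catch (err) {
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Swapping models then means changing the order or contents of the `providers` array, not the agent code that calls it.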
Pricing and timeline — the honest version
Generic agency answers like "it depends" waste your time. Here are real ranges from real engagements.
| Engagement | Duration | Investment |
|---|---|---|
| Discovery sprint (workflow map, spec, go/no-go) | 1–2 weeks | €4,000–6,000 |
| Working prototype on real data | 2–3 weeks | €8,000–15,000 |
| Production agent (single-purpose) | 6–10 weeks | €25,000–50,000 |
| Production agent (multi-purpose / multi-channel) | 10–16 weeks | €50,000–100,000 |
| Ongoing retainer (evals, tuning, on-call) | Monthly | from €2,000/month |
We always quote a firm price before work begins. If discovery shows that an agent is the wrong shape, we tell you and refund the rest of the engagement. That has happened twice in our history, and we'd do it again.
What an actual agent looks like in production
Two short examples from real builds.
Document intake — accounts payable
A mid-market distributor was keying ~400 supplier invoices per week into NetSuite. The team wanted "AI." What we shipped:
- Ingestion — Resend webhook captures invoice emails into a Firestore queue.
- Vision extraction — Claude vision pass with a Zod schema for line items, totals, tax codes.
- PO matching — agent calls NetSuite to find candidate POs, applies tolerance rules (€5 or 1% mismatch is OK).
- Confidence routing — high confidence + matched PO → auto-post. Medium → review queue. Low → reject with structured reason.
- Posting — agent posts to NetSuite, attaches the original PDF, marks the email as processed.
Outcome: 87% auto-post rate after tuning, 60% reduction in AP time spent on invoice keying. Read the full breakdown in the document intake case study.
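The PO matching and confidence routing steps can be sketched as two small pure functions. Several details here are assumptions, not facts from the build: "€5 or 1%" is read as "within either bound", amounts are integer cents, the confidence thresholds (0.9 and 0.5) are invented for illustration, and a high-confidence invoice with no matched PO is sent to review.

```typescript
// True when the invoice total is within 5 EUR or 1% of the PO total.
// Amounts are integer cents to avoid floating-point drift on money.
function withinTolerance(
  invoiceCents: number,
  poCents: number,
  absCents = 500,
  relative = 0.01,
): boolean {
  const diff = Math.abs(invoiceCents - poCents);
  return diff <= absCents || diff <= Math.abs(poCents) * relative;
}

type Route =
  | { action: "auto_post" }
  | { action: "review" }
  | { action: "reject"; reason: string };

// Thresholds are hypothetical; in practice they would be tuned against eval data.
function route(confidence: number, poMatched: boolean, high = 0.9, low = 0.5): Route {
  if (confidence < low) {
    return { action: "reject", reason: `confidence ${confidence.toFixed(2)} below ${low}` };
  }
  if (confidence >= high && poMatched) return { action: "auto_post" };
  return { action: "review" };
}
```

Keeping both functions free of I/O means the tolerance rule and the routing tiers can be tested and tuned without touching NetSuite.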
Voice concierge — service business
A boutique clinic was losing after-hours bookings to voicemail. What we shipped:
- Twilio number routes to a GPT-4o Realtime model.
- The agent has function-calling tools: `findAvailability(date, durationMinutes)`, `bookAppointment(slotId, patientInfo)`, `transferToHuman(reason)`.
- The agent books the slot, writes the lead into HubSpot, and texts a confirmation.
- Recording and full transcript are stored in Firestore for every call.
Outcome: ~80% of after-hours calls now convert to bookings (previously near-zero). Read Voice Concierge.
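The three tools named above can be modelled as a typed union with a single dispatcher. Only the function names and parameter names come from the build; the `PatientInfo` and `Slot` fields, the return shapes, and the `Backend` interface are hypothetical, and the wiring to the realtime model and Twilio is omitted.

```typescript
interface PatientInfo { name: string; phone: string; }
interface Slot { slotId: string; start: string; durationMinutes: number; }

// One variant per tool, so the compiler checks arguments at every call site.
type ToolCall =
  | { name: "findAvailability"; args: { date: string; durationMinutes: number } }
  | { name: "bookAppointment"; args: { slotId: string; patientInfo: PatientInfo } }
  | { name: "transferToHuman"; args: { reason: string } };

// Whatever actually talks to the calendar and phone system sits behind this.
interface Backend {
  findAvailability(date: string, durationMinutes: number): Promise<Slot[]>;
  bookAppointment(slotId: string, patientInfo: PatientInfo): Promise<{ confirmationId: string }>;
  transferToHuman(reason: string): Promise<{ transferred: boolean }>;
}

// Executes a tool call and returns a JSON string to feed back to the model.
async function dispatch(call: ToolCall, backend: Backend): Promise<string> {
  switch (call.name) {
    case "findAvailability":
      return JSON.stringify(
        await backend.findAvailability(call.args.date, call.args.durationMinutes),
      );
    case "bookAppointment":
      return JSON.stringify(
        await backend.bookAppointment(call.args.slotId, call.args.patientInfo),
      );
    case "transferToHuman":
      return JSON.stringify(await backend.transferToHuman(call.args.reason));
  }
}
```

The small tool surface is a design choice: with only three actions available, the agent can book, or escalate, and nothing else.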
What we will not do
A short list of things we have learned, painfully, to refuse.
- Build an agent without evals. It is like shipping a car without seatbelts.
- Promise specific accuracy numbers before the prototype. Anyone who quotes "99% accurate" pre-prototype is bluffing.
- Use a single closed-source model with no fallback. Every agent has at least one provider escape hatch.
- Skip the human-in-the-loop on irreversible actions. Money moves, contracts sign, emails send. All gated.
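The last rule, gating irreversible actions behind a human, can be sketched as a wrapper around execution. The action kinds, the `Approver` callback, and `runGated` are all hypothetical names; how approval is actually collected (a review queue, a chat prompt) is left open.

```typescript
type ActionKind = "send_email" | "move_money" | "sign_contract" | "read_record";

// Actions that cannot be undone once executed.
const IRREVERSIBLE: ReadonlySet<ActionKind> = new Set<ActionKind>([
  "send_email",
  "move_money",
  "sign_contract",
]);

interface ProposedAction {
  kind: ActionKind;
  description: string;
  execute: () => Promise<void>;
}

// Resolves to true when a human approves the proposed action.
type Approver = (action: ProposedAction) => Promise<boolean>;

async function runGated(
  action: ProposedAction,
  approve: Approver,
): Promise<"executed" | "blocked"> {
  // Reversible actions run straight through; irreversible ones wait for approval.
  if (IRREVERSIBLE.has(action.kind) && !(await approve(action))) {
    return "blocked";
  }
  await action.execute();
  return "executed";
}
```

Because the gate is keyed on the action kind rather than on model confidence, a very confident agent still cannot move money on its own.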
Frequently asked questions
See the FAQ section below — or jump to the agent type taxonomy if you want to see the six concrete shapes we build, with examples and costs per shape.
If you have a workflow in mind and want a fast take on whether an agent is the right shape, send a short note. We reply within one business day.
Related work
Document Processing Agent
Invoices, contracts, receipts, and forms → structured data with confidence-tier human review
Conversational Agent
Internal or customer chat grounded in your knowledge base with citations and escalation
Voice & Phone Agent
After-hours bookings, lead qualification, customer service overflow, FAQ lines
Mastery
AI-powered learning platform on Google Generative AI
Document Intake Agent
Supplier invoices end-to-end with an agentic pipeline
Voice Concierge
AI phone agent for after-hours bookings
Ready to scope AI agents development?
A discovery call is the fastest way to know if there's a fit.