Question 1

How is a custom AI agent different from a ChatGPT plugin or a chatbot?

Accepted Answer

A ChatGPT plugin and a basic chatbot are surface-level integrations — they hand a model a prompt and pipe back text. A production AI agent is a system: it owns its prompts (versioned in git), uses retrieval against your data, calls tools/APIs to take action, has guardrails, runs against an evaluation suite before every deploy, and emits telemetry you can audit. We build the second kind. The first kind you can build in a weekend.

Question 2

Which LLM provider do you use? Are we locked in?

Accepted Answer

We pick per task. Claude (Anthropic) is our default for reasoning-heavy work, OpenAI's GPT-4o Realtime for voice, and Google Gemini for vision-heavy document extraction or long-context work. Every agent sits behind a thin provider interface, so you can swap models without rewriting the agent. We never lock you in.
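
As a rough illustration of the thin provider interface described above, here is a minimal TypeScript sketch. The names `Task`, `ModelProvider`, `stubProvider`, `routing`, `providerFor`, and `runAgent` are hypothetical, and the providers are stubs that echo the prompt rather than calls to any vendor SDK.

```typescript
// Hypothetical task categories, mirroring the per-task defaults described above.
type Task = "reasoning" | "voice" | "vision";

// The only surface the agent code is allowed to depend on.
interface ModelProvider {
  readonly name: string;
  complete(prompt: string): Promise<string>;
}

// Stand-in for a real SDK adapter; a real one would call the vendor API here.
function stubProvider(name: string): ModelProvider {
  return {
    name,
    complete: async (prompt: string) => `[${name}] ${prompt}`,
  };
}

// Routing table (assumed labels): swapping a model is a one-line change here.
const routing: Record<Task, ModelProvider> = {
  reasoning: stubProvider("claude"),
  voice: stubProvider("gpt-4o-realtime"),
  vision: stubProvider("gemini"),
};

function providerFor(task: Task): ModelProvider {
  return routing[task];
}

// Agent logic names a task, never a vendor.
async function runAgent(task: Task, prompt: string): Promise<string> {
  return providerFor(task).complete(prompt);
}
```

Because `runAgent` only sees `ModelProvider`, replacing `routing.reasoning` with another adapter changes the model without touching agent code.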

Question 3

How do you handle data privacy and security?

Accepted Answer

Three layers. (1) We default to API endpoints with zero-retention contracts (Anthropic, OpenAI Enterprise, AWS Bedrock). Your data does not train a foundation model. (2) For regulated workloads we deploy self-hosted or in-region open models (Llama, Mistral). (3) We sign NDAs and DPAs, document data flows, keep secrets out of code, and run agents in your cloud account when needed.

Question 4

What is an 'eval suite' and why does it matter?

Accepted Answer

An eval suite is a set of test cases — typical inputs and expected behaviors — that we run against the agent on every deploy. It's the difference between 'works on my laptop' and 'we know exactly when this regresses.' We build evals during discovery, run them in CI, and surface scores on a dashboard. Without evals, every prompt change is a coin flip.
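
To make the idea concrete, the following TypeScript sketch shows one possible shape for an eval suite and a deploy gate. Everything here is an assumption for illustration: the `EvalCase` and `EvalReport` types, the `runEvals` and `gateDeploy` helpers, the toy agent, and the thresholds. It is not the tooling named elsewhere on this page.

```typescript
interface EvalCase {
  label: string;
  input: string;
  // Returns true when the agent output shows the expected behavior.
  check: (output: string) => boolean;
}

interface EvalReport {
  passed: number;
  total: number;
  score: number; // fraction of cases passed, in [0, 1]
  failures: string[];
}

// Simplified: a real agent would be asynchronous and call a model.
type Agent = (input: string) => string;

function runEvals(agent: Agent, evalCases: EvalCase[]): EvalReport {
  const failures: string[] = [];
  for (const c of evalCases) {
    let ok = false;
    try {
      ok = c.check(agent(c.input));
    } catch {
      ok = false; // a crash counts as a failed case
    }
    if (!ok) failures.push(c.label);
  }
  const passed = evalCases.length - failures.length;
  const score = evalCases.length === 0 ? 0 : passed / evalCases.length;
  return { passed, total: evalCases.length, score, failures };
}

// CI would block the deploy when the score drops below the agreed minimum.
function gateDeploy(evalReport: EvalReport, minScore: number): boolean {
  return evalReport.score >= minScore;
}

// Toy agent and cases, purely illustrative.
const toyAgent: Agent = (input) =>
  input.toLowerCase().indexOf("refund") !== -1 ? "ESCALATE" : "ANSWER";

const toyCases: EvalCase[] = [
  { label: "refund requests escalate", input: "I want a refund", check: (o) => o === "ESCALATE" },
  { label: "greetings are answered", input: "Hello there", check: (o) => o === "ANSWER" },
  { label: "chargebacks escalate", input: "I will file a chargeback", check: (o) => o === "ESCALATE" },
];

const toyReport = runEvals(toyAgent, toyCases);
```

In this toy run the chargeback case fails, so a gate such as `gateDeploy(toyReport, 0.9)` would stop the deploy and name the regressed case in `toyReport.failures`.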

Question 5

What does the 'human-in-the-loop' part actually mean in practice?

Accepted Answer

It means the agent recognises when it's unsure and routes the work to a human queue instead of guessing. For document agents that's confidence-tier routing (auto-post / review / reject). For voice agents it's a warm transfer to a human during business hours. For workflow orchestrators it's a Slack approval gate before any irreversible action. All of this is built into the agent from day one, not bolted on later.
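
Confidence-tier routing for a document agent could look roughly like the sketch below. The threshold values (0.9 and 0.6) and the names `Route`, `Thresholds`, `defaultThresholds`, and `routeByConfidence` are assumptions for illustration; real cut-offs would be tuned per workflow.

```typescript
type Route = "auto-post" | "review" | "reject";

interface Thresholds {
  autoPost: number; // at or above: post without human review
  review: number; // at or above (but below autoPost): send to a human queue
}

// Hypothetical cut-offs; real values would be tuned against the eval suite.
const defaultThresholds: Thresholds = { autoPost: 0.9, review: 0.6 };

function routeByConfidence(
  confidence: number,
  t: Thresholds = defaultThresholds
): Route {
  // A missing or out-of-range score is treated as "unsure", so a human looks at it.
  if (!(confidence >= 0 && confidence <= 1)) return "review";
  if (confidence >= t.autoPost) return "auto-post";
  if (confidence >= t.review) return "review";
  return "reject";
}
```

Sending invalid scores to review rather than reject is a design choice in this sketch: it keeps the agent from silently discarding work when its own confidence signal is broken.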

Question 6

How long does a production AI agent take to build?

Accepted Answer

One to two weeks of discovery to map the actual workflow, write the spec, and build evals. Two to three weeks for a working prototype on real data. Six to twelve weeks for production hardening: integrations, retry/idempotency, observability, on-call runbooks. We always ship the prototype first so you can decide on evidence before signing the bigger check.
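
The retry/idempotency part of hardening can be illustrated with a small TypeScript sketch. It assumes a downstream system that deduplicates writes by an idempotency key; the in-memory `makeLedger`, the `withRetry` helper, and the key `invoice-42` are hypothetical stand-ins, and a real system would use a durable store and backoff between attempts.

```typescript
interface Ledger {
  post(idempotencyKey: string, amount: number): string;
  entryCount(): number;
}

// In-memory stand-in for a downstream system that deduplicates by key.
// When failFirstAck is true, the first call commits but then throws,
// simulating a timeout that happens after the write succeeded.
function makeLedger(failFirstAck: boolean): Ledger {
  const byKey: Record<string, string> = {};
  let count = 0;
  let ackAlreadyFailed = false;
  return {
    post(idempotencyKey: string, amount: number): string {
      if (!Object.prototype.hasOwnProperty.call(byKey, idempotencyKey)) {
        count += 1;
        byKey[idempotencyKey] = "entry-" + count + ":" + amount;
      }
      if (failFirstAck && !ackAlreadyFailed) {
        ackAlreadyFailed = true;
        throw new Error("timeout after commit");
      }
      return byKey[idempotencyKey];
    },
    entryCount(): number {
      return count;
    },
  };
}

// Retries a synchronous operation; rethrows the last error if all attempts fail.
function withRetry<T>(attempts: number, fn: () => T): T {
  if (attempts < 1) throw new Error("attempts must be at least 1");
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Calling `withRetry(3, () => ledger.post("invoice-42", 100))` against `makeLedger(true)` retries after the simulated timeout, and because the key is reused the ledger still holds a single entry. Without the key, the retry would post the invoice twice.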

Question 7

Will the agent get worse over time?

Accepted Answer

Yes, if you ignore it. Prompt drift, model upgrades, schema changes in the systems the agent talks to, and edge cases in production data all accumulate. We bake an eval re-run cadence into the engagement and offer a retainer that covers prompt tuning, model upgrades, and quarterly architecture reviews. Without that, agents quietly degrade.

Question 8

Can you build the agent and hand it over so we maintain it ourselves?

Accepted Answer

Absolutely. We deliver code, runbooks, eval suites, and CI/CD configured to your team's preferences. We can also pair with your engineers during the build so knowledge transfer is built in, not a one-off handoff. Roughly 60% of our clients keep us on retainer for ongoing operations; the rest take it fully in-house.

Layer	Default choice	Why
Reasoning model	Claude Sonnet 4.6 / 4.7	Best reliability for tool calling and long-context reasoning
Vision model	Claude or Gemini 2.0 Flash	Strong at structured extraction from PDFs and images
Voice model	GPT-4o Realtime / Whisper	Sub-second response, multilingual, robust to accents
Orchestration	LangGraph + plain SDK	LangGraph for multi-step agents with branches; plain SDK when one tool call is enough
Retrieval	pgvector / Pinecone / Vectorize	pgvector if you already have Postgres; a managed vector store (Pinecone or Vectorize) for everything else
Observability	Langfuse / Helicone / OpenTelemetry	Per-call traces, cost attribution, regression detection
Eval framework	Custom + Promptfoo / Braintrust	Promptfoo for offline, Braintrust for production traces
Deploy target	Firebase / Vercel / your cloud	Firebase App Hosting is our default; we go wherever your data lives
Tooling glue	TypeScript everywhere	One language end-to-end keeps the team small

Engagement	Duration	Investment
Discovery sprint (workflow map, spec, go/no-go)	1–2 weeks	€4,000–6,000
Working prototype on real data	2–3 weeks	€8,000–15,000
Production agent (single-purpose)	6–10 weeks	€25,000–50,000
Production agent (multi-purpose / multi-channel)	10–16 weeks	€50,000–100,000
Ongoing retainer (evals, tuning, on-call)	Monthly	from €2,000/month

AI Agents Development

What an AI agent actually is (and isn't)

When you actually need an agent

How we build agents — the six-phase loop

1. Discover — one to two weeks

2. Design — three to five days

3. Prototype — two to three weeks

4. Build — three to eight weeks

5. Deploy — one week

6. Iterate — ongoing

The tech we use

Pricing and timeline — the honest version

What an actual agent looks like in production

Document intake — accounts payable

Voice concierge — service business

What we will not do

Frequently asked questions
