
The AI Development playbook: how we ship agents in 6 weeks

Updated May 21, 2026 · 5 min read

What the 6 weeks actually look like

For a typical single-purpose production agent (say, a document intake pipeline or a voice booking agent), the 6 weeks break down as follows:

| Week | What we ship |
|------|--------------|
| 1 | Discovery + spec + eval set + go/no-go |
| 2-3 | Working prototype on real data; tune until evals pass |
| 4-5 | Production hardening (auth, retries, idempotency, observability, integration polish, reviewer UI if needed) |
| 6 | Phased rollout, runbooks, handover |

Bigger builds extend each phase proportionally. Smaller ones compress. The shape is consistent.

What we will not do

The hard part isn't what we do — it's what we refuse to do.

No free pitch for a multi-month build. A one-hour intro call doesn't surface real scope. We charge for discovery, and the scope that comes out of it is real.

No agent without evals. Evals are what separate something that works in a demo from something that breaks in production.

No skipping the reviewer UI. For document agents, reviewer ergonomics dictate whether the system actually saves time. We treat reviewer UX as a first-class component, not an afterthought.

No approval gates skipped on irreversible actions. Money moves, contracts sign, emails send to customers. All gated. Always.
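As a rough illustration of the gating rule above, the sketch below pauses irreversible actions until a human approval is on record. The action kinds, the `gate` function, and the in-memory `approvals` map are hypothetical names chosen for this example; a real system would persist approvals and tie them to authenticated reviewers.

```typescript
// Hypothetical action kinds; the first three mirror the examples in the text.
type ActionKind = "move_money" | "sign_contract" | "send_email" | "update_draft";

interface PendingAction {
  id: string;
  kind: ActionKind;
  payload: Record<string, unknown>;
}

type GateDecision =
  | { status: "execute" }
  | { status: "awaiting_approval"; reason: string };

// Actions that cannot be undone once executed.
const IRREVERSIBLE: ReadonlySet<ActionKind> = new Set<ActionKind>([
  "move_money",
  "sign_contract",
  "send_email",
]);

// Approvals recorded by a human reviewer: action id -> approver.
// In-memory only for the sketch (assumption).
const approvals = new Map<string, string>();

function recordApproval(actionId: string, approver: string): void {
  approvals.set(actionId, approver);
}

// Reversible actions run immediately; irreversible ones run only
// once an approval exists for that specific action id.
function gate(action: PendingAction): GateDecision {
  if (!IRREVERSIBLE.has(action.kind)) return { status: "execute" };
  if (approvals.has(action.id)) return { status: "execute" };
  return {
    status: "awaiting_approval",
    reason: `${action.kind} is irreversible`,
  };
}
```

Calling `gate` on a `move_money` action returns `awaiting_approval` until `recordApproval` has been called with that action's id, while an `update_draft` action passes straight through.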

No lock-in to a single LLM provider. Every agent sits behind a thin provider interface. You can swap providers in hours.
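To make "thin provider interface" concrete, here is a minimal sketch of what such a seam could look like. `LLMProvider`, `CompletionRequest`, `ProviderRegistry`, and `EchoProvider` are illustrative names, not a real SDK; an actual adapter would wrap a vendor client behind the same two members.

```typescript
interface CompletionRequest {
  system: string;
  user: string;
  maxTokens: number;
}

// The only surface agent code is allowed to depend on.
interface LLMProvider {
  readonly name: string;
  complete(req: CompletionRequest): Promise<string>;
}

// Stand-in provider for local runs and tests (hypothetical).
// It ignores maxTokens and simply echoes the user message.
class EchoProvider implements LLMProvider {
  readonly name = "echo";
  async complete(req: CompletionRequest): Promise<string> {
    return `[echo] ${req.user}`;
  }
}

class ProviderRegistry {
  private readonly providers = new Map<string, LLMProvider>();

  register(provider: LLMProvider): void {
    this.providers.set(provider.name, provider);
  }

  get(name: string): LLMProvider {
    const provider = this.providers.get(name);
    if (!provider) throw new Error(`Unknown provider: ${name}`);
    return provider;
  }
}

const registry = new ProviderRegistry();
registry.register(new EchoProvider());

// Agent logic receives an LLMProvider and never imports a vendor SDK,
// so switching vendors means registering a different adapter.
async function summarize(provider: LLMProvider, text: string): Promise<string> {
  return provider.complete({
    system: "Summarize in one sentence.",
    user: text,
    maxTokens: 200,
  });
}
```

Under this design the provider name can live in configuration, and a swap touches one adapter file rather than every call site.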

No "we'll figure out the data flows during implementation." We map them in discovery. Surprises in production cost 10× what discovery costs.

The opinionated stack

We try not to be cargo-cult about tools. We do have defaults that earn their keep.

| Layer | Default |
|-------|---------|
| Reasoning model | Claude Sonnet 4.6 / 4.7 |
| Voice realtime | OpenAI GPT-4o Realtime |
| Cost-sensitive volume | Gemini 2.0 Flash / Claude Haiku |
| Application framework | Next.js 16 + React Server Components |
| Type safety | TypeScript end-to-end with Zod schemas |
| Styling | Tailwind + shadcn/ui |
| Database (relational) | Postgres (Supabase or Cloud SQL) |
| Database (NoSQL / docs) | Firestore |
| Vector | pgvector if Postgres exists, Pinecone otherwise |
| Auth | NextAuth / Firebase Auth / Microsoft Entra |
| Observability | Langfuse for AI traces, Sentry for errors |
| Telephony | Twilio |
| Deploy | Firebase App Hosting / Vercel / Cloud Run |
| CI | GitHub Actions |
| Email | Resend |

This stack is boring on purpose. It's the one we and our clients can still maintain 2 years from now.

How AI-assisted development changed our process

In 2024, a 6-week production agent build needed a 3-person team. In 2026, the same build needs a 2-person team with AI accelerators.

What changed:

  • Code agents write the boring engineering. Typed API clients, schema mappings, migration scripts, test fixtures, glue logic. 5-10× faster than humans.
  • Eval-driven development is easier. LLMs grade eval outputs at scale. Manual review for edge cases only.
  • Documentation generates itself. Reasonable first drafts of READMEs, runbooks, API docs.
  • Pair programming with AI replaces solo work. Two humans + AI is roughly as productive as 3-4 humans without.

What didn't change:

  • Product judgement still human-driven. Novel architecture, ambiguous requirements, edge case design.
  • Discovery still in-person (or video). Talking to the people who do the work.
  • Reviewer UX, system design, integration boundaries all still human work.

See Code & Integration Agent for how we use code agents internally.

The discovery sprint pattern

Week 1 is where the engagement is won or lost. The pattern:

Day 1: kick-off call with sponsors + the operational owner. Map the current workflow. Identify success metrics in finance-grade language.

Days 2-3: shadow the people doing the actual work. Screen-record (with consent). Note the failure modes the documentation misses.

Day 4: write the eval set. 50-100 representative cases with expected behaviors.
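An eval set of this kind can be as simple as a typed list of cases plus a grader. The sketch below assumes a document-extraction agent whose output is a flat map of field names to strings; the `EvalCase` shape, the exact-match grading, and the threshold parameter are assumptions made for illustration, and the sample case is invented.

```typescript
interface EvalCase {
  id: string;
  input: string;
  // Fields the agent must extract, with their expected values.
  expected: Record<string, string>;
}

interface EvalResult {
  id: string;
  passed: boolean;
  mismatches: string[]; // names of fields that did not match
}

// Invented sample case, only to show the shape.
const sampleCase: EvalCase = {
  id: "invoice-001",
  input: "Invoice from Acme, total EUR 1250.00",
  expected: { vendor: "Acme", total: "1250.00" },
};

// Exact-match grading per expected field. Fuzzy or LLM-graded checks
// would slot in here for free-text outputs.
function gradeCase(testCase: EvalCase, actual: Record<string, string>): EvalResult {
  const mismatches = Object.keys(testCase.expected).filter(
    (field) => actual[field] !== testCase.expected[field],
  );
  return { id: testCase.id, passed: mismatches.length === 0, mismatches };
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Gate for go/no-go or CI: the threshold is whatever was agreed in the spec.
function meetsThreshold(results: EvalResult[], threshold: number): boolean {
  return passRate(results) >= threshold;
}
```

The same `meetsThreshold` check can back both the week-1 go/no-go criteria and a later CI step that blocks merges on regressions.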

Day 5: write the spec. Architecture, technology choices, cost & timeline estimate, risk register, go/no-go criteria.

Days 6-7: review with the client. Iterate. Sign off on the build engagement or stop here.

Cost: €4,000-8,000 depending on complexity. Value: clients consistently say the discovery alone was worth what they paid. Even when we decide together not to do the build, the client walks away with a clear understanding of their workflow and a documented spec.

The prototype phase pattern

Weeks 2-3: real end-to-end pipeline on real data. Thin scaffolding around production-grade core.

Production-grade in the prototype:

  • Real auth.
  • Real database writes.
  • Real LLM calls (not mocked).
  • Real integration with at least one downstream system.

Thin scaffolding (deferred to build phase):

  • Retry / idempotency polish.
  • Full reviewer UI (basic version only).
  • Observability beyond basic logging.
  • Edge case handling.

The prototype runs in shadow mode — alongside the human process — for a week. We compare outcomes. We tune. The go/no-go decision for full build is made on evidence, not promises.
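The shadow-mode comparison boils down to pairing each case's human outcome with the agent's and measuring agreement. A minimal sketch, assuming outcomes can be reduced to comparable strings (the `ShadowRecord` shape and the trim/lowercase normalization are assumptions for the example):

```typescript
interface ShadowRecord {
  caseId: string;
  humanOutcome: string; // what the human process decided
  agentOutcome: string; // what the agent would have done
}

interface ShadowReport {
  total: number;
  agreed: number;
  agreementRate: number;
  disagreements: ShadowRecord[]; // the cases worth reviewing by hand
}

// Ignore casing and surrounding whitespace when comparing outcomes.
function normalizeOutcome(value: string): string {
  return value.trim().toLowerCase();
}

function compareShadowRun(records: ShadowRecord[]): ShadowReport {
  const disagreements = records.filter(
    (r) => normalizeOutcome(r.humanOutcome) !== normalizeOutcome(r.agentOutcome),
  );
  const agreed = records.length - disagreements.length;
  return {
    total: records.length,
    agreed,
    agreementRate: records.length === 0 ? 0 : agreed / records.length,
    disagreements,
  };
}
```

The `disagreements` list is usually the more useful output: each entry is either an agent error to tune against or a human inconsistency worth raising with the operational owner.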

The build phase pattern

Weeks 4-5: production engineering.

  • Auth + authorization: real role-based access, never "everyone is admin."
  • Idempotency: every external action gets a key.
  • Retry: exponential backoff with jitter; dead-letter queue.
  • Observability: per-trace logging, dashboards, alerting.
  • Eval suite in CI: every PR runs against the eval set; regressions block merge.
  • Cost guardrails: per-run ceilings, per-day budgets, automated alerts.
  • Approval gates: irreversible actions pause for human approval.
  • Reviewer UI (for document agents): keyboard-driven, side-by-side document view, fast.
  • Runbooks: what to do when X breaks.
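Two items from the list above, idempotency keys and retry with backoff, lend themselves to a short sketch. Everything here is illustrative: the key format, the FNV-1a-style hash (non-cryptographic, used only to keep the example dependency-free), the "full jitter" variant, and the 200 ms base and 10 s cap are assumptions rather than fixed choices.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units, as 8 hex characters.
// Non-cryptographic; a production key might use a real digest instead.
function fnv1aHex(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Same run + step + payload always yields the same key, regardless of
// property order, so a retried call can be deduplicated downstream.
function idempotencyKey(
  runId: string,
  step: string,
  payload: Record<string, string | number>,
): string {
  const canonical = Object.keys(payload)
    .sort()
    .map((key) => `${key}=${payload[key]}`)
    .join("&");
  return `${runId}:${step}:${fnv1aHex(canonical)}`;
}

// Exponential backoff with full jitter: a random delay between 0 and
// min(cap, base * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 200,
  capMs: number = 10000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

type RetryPlan =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// `attempt` is the zero-based index of the attempt that just failed.
// Once the attempt budget is spent, the job goes to the dead-letter queue.
function planRetry(
  attempt: number,
  maxAttempts: number,
  random: () => number = Math.random,
): RetryPlan {
  if (attempt + 1 >= maxAttempts) return { action: "dead_letter" };
  return { action: "retry", delayMs: backoffDelayMs(attempt, 200, 10000, random) };
}
```

Keeping the retry decision as a pure function makes it easy to unit-test; the caller is responsible for actually sleeping, re-sending the request with the same idempotency key, or enqueueing to the dead-letter queue.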

By end of week 5, the system is ready for phased rollout.

The rollout phase pattern

Week 6: production traffic shifts to the agent in steps, 10% → 50% → 100%, over the week. Daily standup with the operational team. Aggressive monitoring.
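One common way to implement a percentage rollout is deterministic bucketing: hash a stable key (a customer or document id, say) into one of 100 buckets and route it to the agent when its bucket falls below the current percentage. The sketch below assumes that approach; `stableBucket`, `routeToAgent`, and the FNV-1a-style hash are illustrative names and choices.

```typescript
// Map a key to a stable bucket in 0..99 using an FNV-1a-style hash.
// Non-cryptographic; it only needs to be deterministic and well spread.
function stableBucket(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// True when this key should be handled by the agent at the given
// rollout percentage (0 = nobody, 100 = everybody).
function routeToAgent(key: string, rolloutPercent: number): boolean {
  return stableBucket(key) < rolloutPercent;
}

// The stages named in the rollout plan.
const ROLLOUT_STAGES: readonly number[] = [10, 50, 100];
```

Because the bucket depends only on the key, anything routed to the agent at 10% stays with the agent at 50% and 100%, which keeps individual customers from flipping between the old and new process mid-week.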

By end of week 6:

  • Production traffic is live.
  • Eval cadence is set (typically monthly).
  • On-call rota is established.
  • Handover docs delivered.
  • Optionally: retainer agreement signed for ongoing operations.

Where to go next

See our full services menu or the agent types we routinely build, or drop us a note with a workflow in mind. We respond within one business day with an honest take on whether the 6-week shape fits.

For our take on the broader industry context, see The state of AI development in 2026. For the buyer's checklist when evaluating any agency, see 12 questions to ask.
