The AI Development playbook: how we ship agents in 6 weeks
What the 6 weeks actually look like
For a typical single-purpose production agent — say, a document intake pipeline or a voice booking agent — our 6-week shape:
| Week | What we ship |
|---|---|
| 1 | Discovery + spec + eval set + go/no-go |
| 2-3 | Working prototype on real data; tune until evals pass |
| 4-5 | Production hardening (auth, retries, idempotency, observability, integration polish, reviewer UI if needed) |
| 6 | Phased rollout, runbooks, handover |
Bigger builds extend each phase proportionally. Smaller ones compress. The shape is consistent.
What we will not do
The hard part isn't what we do — it's what we refuse to do.
No free pitch for a multi-month build. A one-hour intro call doesn't surface real scope. We charge for discovery, and the scope that comes out of it is real.
No agent without evals. Evals are what separate an agent that only works in a demo from one that holds up in production.
No skip on the reviewer UI. For document agents, the reviewer ergonomics dictate whether the system actually saves time. We treat reviewer UX as a first-class component, not an afterthought.
No approval gates skipped on irreversible actions. Money moves, contracts sign, emails send to customers. All gated. Always.
No lock-in to a single LLM provider. Every agent sits behind a thin provider interface. You can swap in hours.
No "we'll figure out the data flows during implementation." We map them in discovery. Surprises in production cost 10× what discovery costs.
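As a rough illustration of the thin provider interface mentioned above, here is a minimal TypeScript sketch. Every name in it (`LlmProvider`, `CompletionRequest`, `makeStubProvider`, `selectProvider`) is hypothetical, and the stub adapters stand in for real vendor SDK calls; it shows the shape of the idea rather than our actual implementation.

```typescript
// Hypothetical names throughout; nothing here comes from a vendor SDK.
interface CompletionRequest {
  system: string;
  user: string;
  maxOutputTokens: number;
}

interface CompletionResult {
  text: string;
  provider: string;
}

// The only surface agent code is allowed to depend on.
interface LlmProvider {
  readonly name: string;
  complete(req: CompletionRequest): Promise<CompletionResult>;
}

// Stub adapter. A real adapter would call the vendor SDK inside complete().
function makeStubProvider(name: string): LlmProvider {
  return {
    name,
    async complete(req: CompletionRequest): Promise<CompletionResult> {
      return { text: `[${name}] ${req.user}`, provider: name };
    },
  };
}

const providers: Record<string, LlmProvider> = {
  primary: makeStubProvider("primary"),
  fallback: makeStubProvider("fallback"),
};

// Swapping providers becomes a configuration change, not a code change.
function selectProvider(key: string): LlmProvider {
  const provider = providers[key];
  if (!provider) {
    throw new Error(`Unknown provider: ${key}`);
  }
  return provider;
}
```

Because agent code only sees `LlmProvider`, moving to a different vendor means writing one new adapter and changing the key passed to `selectProvider`.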
The opinionated stack
We try not to be cargo-cult about tools. We do have defaults that earn their keep.
| Layer | Default |
|---|---|
| Reasoning model | Claude Sonnet 4.6 / 4.7 |
| Voice realtime | OpenAI GPT-4o Realtime |
| Cost-sensitive volume | Gemini 2.0 Flash / Claude Haiku |
| Application framework | Next.js 16 + React Server Components |
| Type safety | TypeScript end-to-end with Zod schemas |
| Styling | Tailwind + shadcn/ui |
| Database (relational) | Postgres (Supabase or Cloud SQL) |
| Database (NoSQL / docs) | Firestore |
| Vector | pgvector if Postgres exists, Pinecone otherwise |
| Auth | NextAuth / Firebase Auth / Microsoft Entra |
| Observability | Langfuse for AI traces, Sentry for errors |
| Telephony | Twilio |
| Deploy | Firebase App Hosting / Vercel / Cloud Run |
| CI | GitHub Actions |
| Email | Resend |
This stack is boring on purpose. It's the one we and our clients can maintain in 2 years.
How AI-assisted development changed our process
In 2024, a 6-week production agent build needed a 3-person team. In 2026, the same build needs a 2-person team with AI accelerators.
What changed:
- Code agents write the boring engineering. Typed API clients, schema mappings, migration scripts, test fixtures, glue logic. 5-10× faster than humans.
- Eval-driven development is easier. LLMs grade eval outputs at scale. Manual review for edge cases only.
- Documentation generates itself. Reasonable first drafts of READMEs, runbooks, API docs.
- Pair programming with AI replaces solo work. Two humans + AI is roughly as productive as 3-4 humans without.
What didn't change:
- Product judgement still human-driven. Novel architecture, ambiguous requirements, edge case design.
- Discovery still in-person (or video). Talking to the people who do the work.
- Reviewer UX, system design, integration boundaries all still human work.
See Code & Integration Agent for how we use code agents internally.
The discovery sprint pattern
Week 1 is where the engagement is won or lost. The pattern:
Day 1: kick-off call with sponsors + the operational owner. Map the current workflow. Identify success metrics in finance-grade language.
Day 2-3: shadow the people doing the actual work. Screen-record (with consent). Note the failure modes the documentation misses.
Day 4: write the eval set. 50-100 representative cases with expected behaviors.
Day 5: write the spec. Architecture, technology choices, cost & timeline estimate, risk register, go/no-go criteria.
Day 6-7: review with the client. Iterate. Sign off on the build engagement or stop here.
Cost: €4,000-8,000 depending on complexity. Value: clients consistently say the discovery alone was worth what they paid. Even when we decide together not to do the build, the client walks away with a clear understanding of their workflow and a documented spec.
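To make the Day 4 eval set concrete, here is a minimal sketch of how cases with expected behaviors could be represented and scored. The `EvalCase` shape, the `runEvals` function, the two sample cases, and the toy agent are all hypothetical; a real agent would be asynchronous and the set would hold 50-100 cases.

```typescript
// Hypothetical shapes; the sample cases are invented for illustration.
interface EvalCase {
  id: string;
  input: string;
  // Expected behavior expressed as a check, not an exact string match.
  check: (output: string) => boolean;
}

interface EvalReport {
  total: number;
  passed: number;
  failedIds: string[];
  passRate: number;
}

function runEvals(cases: EvalCase[], agent: (input: string) => string): EvalReport {
  const failedIds: string[] = [];
  for (const c of cases) {
    if (!c.check(agent(c.input))) {
      failedIds.push(c.id);
    }
  }
  const passed = cases.length - failedIds.length;
  const passRate = cases.length === 0 ? 0 : passed / cases.length;
  return { total: cases.length, passed, failedIds, passRate };
}

const cases: EvalCase[] = [
  {
    id: "invoice-total",
    input: "Invoice total: 120.50 EUR",
    check: (output) => output.includes("120.50"),
  },
  {
    id: "missing-total",
    input: "Invoice with no amount",
    check: (output) => output === "NEEDS_REVIEW",
  },
];

// Toy agent standing in for the real pipeline (synchronous for brevity).
function toyAgent(input: string): string {
  const match = input.match(/(\d+\.\d{2})/);
  return match?.[1] ?? "NEEDS_REVIEW";
}

const report = runEvals(cases, toyAgent);
// A CI gate could block the merge whenever this is false.
const mergeAllowed = report.failedIds.length === 0;
```

Expressing expectations as checks rather than exact outputs keeps the set stable when wording changes but behavior does not.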
The prototype phase pattern
Weeks 2-3: a real end-to-end pipeline on real data. Thin scaffolding around a production-grade core.
Production-grade in the prototype:
- Real auth.
- Real database writes.
- Real LLM calls (not mocked).
- Real integration with at least one downstream system.
Thin scaffolding (deferred to build phase):
- Retry / idempotency polish.
- Full reviewer UI (basic version only).
- Observability beyond basic logging.
- Edge case handling.
The prototype runs in shadow mode — alongside the human process — for a week. We compare outcomes. We tune. The go/no-go decision for full build is made on evidence, not promises.
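One way to turn the shadow-mode comparison into evidence is an agreement rate between the human outcome and the agent outcome for each case. The sketch below assumes outcomes can be normalized to comparable strings; `ShadowRecord` and `summarizeShadowRun` are hypothetical names, not part of any library.

```typescript
// Hypothetical record shape: one row per case processed during the shadow week.
interface ShadowRecord {
  caseId: string;
  humanOutcome: string;
  agentOutcome: string;
}

interface ShadowSummary {
  total: number;
  agreed: number;
  agreementRate: number;
  disagreementIds: string[];
}

function summarizeShadowRun(records: ShadowRecord[]): ShadowSummary {
  // Disagreements are the cases worth reading by hand during tuning.
  const disagreementIds = records
    .filter((r) => r.humanOutcome !== r.agentOutcome)
    .map((r) => r.caseId);
  const agreed = records.length - disagreementIds.length;
  const agreementRate = records.length === 0 ? 0 : agreed / records.length;
  return { total: records.length, agreed, agreementRate, disagreementIds };
}

const summary = summarizeShadowRun([
  { caseId: "c1", humanOutcome: "approve", agentOutcome: "approve" },
  { caseId: "c2", humanOutcome: "reject", agentOutcome: "reject" },
  { caseId: "c3", humanOutcome: "approve", agentOutcome: "escalate" },
  { caseId: "c4", humanOutcome: "approve", agentOutcome: "approve" },
]);
```

The list of disagreeing case IDs is usually more useful than the rate itself, since a disagreement can mean the agent was wrong or that the human process was inconsistent.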
The build phase pattern
Weeks 4-5: production engineering.
- Auth + authorization: real role-based access, never "everyone is admin."
- Idempotency: every external action gets a key.
- Retry: exponential backoff with jitter; dead-letter queue.
- Observability: per-trace logging, dashboards, alerting.
- Eval suite in CI: every PR runs against the eval set; regressions block merge.
- Cost guardrails: per-run ceilings, per-day budgets, automated alerts.
- Approval gates: irreversible actions pause for human approval.
- Reviewer UI (for document agents): keyboard-driven, side-by-side document view, fast.
- Runbooks: what to do when X breaks.
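The idempotency and retry items above can be sketched together. This is a minimal illustration under stated assumptions: the key format, the `RetryOptions` shape, and the in-memory `deadLetterQueue` are hypothetical, and the jitter variant shown is "full jitter" (a random delay between zero and the exponential ceiling), which is one common choice among several.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Hypothetical key format: stable for a given run and step, so retries reuse it.
function idempotencyKeyFor(runId: string, step: string): string {
  return `${runId}:${step}`;
}

// Full jitter: random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  opts: RetryOptions,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

interface DeadLetter {
  idempotencyKey: string;
  error: string;
  attempts: number;
}

// In-memory stand-in; a real system would use a durable queue or table.
const deadLetterQueue: DeadLetter[] = [];

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function callWithRetry<T>(
  idempotencyKey: string,
  action: (key: string) => Promise<T>,
  opts: RetryOptions,
): Promise<T | undefined> {
  let lastError = "";
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      // The same key goes out on every attempt so the downstream system can deduplicate.
      return await action(idempotencyKey);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, opts));
      }
    }
  }
  // Retries exhausted: park the failure for a human instead of dropping it.
  deadLetterQueue.push({ idempotencyKey, error: lastError, attempts: opts.maxAttempts });
  return undefined;
}
```

Passing `random` as a parameter keeps the delay calculation deterministic in tests while production code uses `Math.random`.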
By end of week 5, the system is ready for phased rollout.
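As a hedged sketch of how the approval gates and per-run cost ceilings from the build phase might fit together, the function below decides what happens to a planned action before it runs. The action kinds, the `RunBudget` shape, and the `decide` function are hypothetical illustrations, not a description of our production code.

```typescript
type ActionKind = "send_email" | "move_money" | "sign_contract" | "update_record";

// Hypothetical classification: which action kinds cannot be undone.
const IRREVERSIBLE: ReadonlySet<ActionKind> = new Set<ActionKind>([
  "send_email",
  "move_money",
  "sign_contract",
]);

interface PlannedAction {
  kind: ActionKind;
  estimatedCostEur: number;
}

interface RunBudget {
  spentEur: number;
  ceilingEur: number;
}

type Decision =
  | { status: "execute" }
  | { status: "await_approval" }
  | { status: "blocked"; reason: string };

function decide(action: PlannedAction, budget: RunBudget): Decision {
  // Cost guardrail first: never queue an approval for a run that is over budget.
  if (budget.spentEur + action.estimatedCostEur > budget.ceilingEur) {
    return { status: "blocked", reason: "per-run cost ceiling exceeded" };
  }
  // Irreversible actions pause until a human approves them.
  if (IRREVERSIBLE.has(action.kind)) {
    return { status: "await_approval" };
  }
  return { status: "execute" };
}
```

Keeping the decision in one pure function makes the policy easy to test and to audit: the agent proposes an action, and only an `execute` decision lets it proceed unattended.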
The rollout phase pattern
Week 6: 10% → 50% → 100% over the week. Daily standup with the operational team. Aggressive monitoring.
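One simple way to implement a percentage rollout is deterministic bucketing: hash a stable case identifier into one of 100 buckets and send it to the agent only if the bucket falls below the current percentage. The sketch below uses a 32-bit FNV-1a hash over UTF-16 code units, which matches byte-wise FNV-1a for ASCII identifiers; `routeToAgent` and the identifier format are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units; stable across processes and restarts.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// True if this case goes to the agent at the given rollout percentage (0-100).
function routeToAgent(caseId: string, rolloutPercent: number): boolean {
  const bucket = fnv1a(caseId) % 100;
  return bucket < rolloutPercent;
}

// Hypothetical rollout schedule for the week.
const rolloutStages = [10, 50, 100];
```

Because the bucket depends only on the identifier, a case routed to the agent at 10% stays with the agent at 50% and 100%, which keeps comparisons across stages consistent.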
By end of week 6:
- Production traffic is live.
- Eval cadence is set (typically monthly).
- On-call rota is established.
- Handover docs delivered.
- Optionally: retainer agreement signed for ongoing operations.
Where to go next
See our full services menu and the agent types we routinely build, or drop us a note with a workflow in mind. We respond within one business day with an honest take on whether the 6-week shape fits.
For our take on the broader industry context see The state of AI development in 2026. For the buyer's checklist when evaluating any agency see 12 questions to ask.