Voice & Phone AI Agents
AI receptionists, booking lines, and qualification calls — wired to your calendar, CRM, and ticketing.
What a voice agent actually does
You give it a phone number. It picks up calls. It has a conversation with the caller — listening, speaking, interrupting cleanly, recovering from confusion. It executes the tools you give it (find calendar availability, book a slot, look up an order, qualify a lead, transfer to a human). It hangs up. The transcript and recording land in your system.
That is the whole product. The interesting parts are everything around it (the choice of voice, the pacing, the failure modes, the integrations, the observability), but the surface is exactly that simple.
For a deeper walkthrough of how we built one from scratch, see our Twilio + GPT-4o walkthrough post.
When a voice agent is the right call
You should consider a voice agent when:
- You are losing calls. After-hours, peak hours, vacation cover, unanswered inbound. Voicemail-to-callback has a brutal drop-off rate.
- The conversation has shape. Booking, qualification, order status, FAQ — calls with predictable structure where the human work is mostly transactional.
- Volume justifies the build. Roughly 100+ calls per month at the bottom end. Below that, a smart IVR or a virtual assistant is cheaper.
- You can record the call. Most jurisdictions require you to disclose the recording to the caller. If you cannot record, we cannot evaluate, and the agent will silently drift.
You should not use a voice agent when:
- The conversation is high-empathy (medical triage, mental health, complaint resolution). Use it as a router to a human, not as the conversation itself.
- Legal compliance forbids automation (e.g. some debt collection laws). Check first.
- Your customer base will react badly. Some markets and demographics still strongly prefer humans on the phone. We will tell you if we think your audience won't tolerate it.
Our Voice AI buyer's guide goes deeper on the buy/build decision.
The stack
| Layer | Default choice |
|---|---|
| Telephony | Twilio Voice (Media Streams) |
| Realtime model | OpenAI GPT-4o Realtime |
| ASR fallback (if non-realtime) | Whisper-large-v3 |
| TTS fallback | OpenAI TTS / ElevenLabs |
| Function calling | OpenAI tools / Anthropic tools |
| Backend | Node.js on Cloud Run with WebSocket support |
| State | Firestore or Postgres |
| Recording / transcript | Twilio + your cloud storage |
| Observability | Langfuse + structured logs + Slack alerts |
We use GPT-4o Realtime as the default because it gives us proper barge-in (the agent stops talking when the caller starts) and sub-second response times, which are the two things that make a call feel like a conversation rather than an interrogation.
For longer / slower workflows or when realtime isn't necessary, the Whisper + TTS pattern is cheaper and easier to reason about.
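As a rough illustration of the turn-based pattern, the sketch below handles one caller turn as three sequential steps: transcribe, generate a reply, synthesise speech. The `transcribe`, `generateReply`, and `synthesize` functions are hypothetical placeholders injected by the caller (for example, thin wrappers around Whisper, a chat model, and a TTS service); none of them is a real SDK call.

```javascript
// One caller turn in a non-realtime pipeline: ASR -> LLM -> TTS.
// All three steps are injected, hypothetical functions; swap in real clients.
async function handleTurn(audioChunk, history, deps) {
  const { transcribe, generateReply, synthesize } = deps;

  const callerText = await transcribe(audioChunk);
  if (!callerText.trim()) {
    // Nothing intelligible was heard; skip the model call entirely.
    return { history, replyAudio: null };
  }

  const withCaller = [...history, { role: "user", content: callerText }];
  const replyText = await generateReply(withCaller);
  const replyAudio = await synthesize(replyText);

  return {
    history: [...withCaller, { role: "assistant", content: replyText }],
    replyAudio,
  };
}
```

Because each step finishes before the next starts, latency is the sum of all three, which is the trade-off against the realtime model.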
Anatomy of a production voice agent
A simple booking-line agent looks like this in code, abbreviated:
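    // Abbreviated sketch: `openai`, `twilio`, `firestore`, `TOOLS`, and the helper
    // functions are set up elsewhere. `call` is assumed to be the Twilio call
    // context for this connection (caller number, duration, recording URL).

    // On call connect
    const session = await openai.realtime.connect({
      model: "gpt-4o-realtime-preview",
      voice: "alloy",
      instructions: SYSTEM_PROMPT,
      tools: [
        findAvailabilityTool,
        bookAppointmentTool,
        transferToHumanTool,
        logLeadTool,
      ],
    });

    // Bridge audio in both directions: caller -> model, model -> caller
    twilio.stream.pipe(session.audio.in);
    session.audio.out.pipe(twilio.playback);

    // Run each tool the model requests and hand the result back to the session.
    // The parameter is named `toolCall` so it does not shadow the Twilio `call`.
    session.on("tool_call", async (toolCall) => {
      const result = await TOOLS[toolCall.name].run(toolCall.args);
      await session.tool_result(toolCall.id, result);
    });

    // On hang-up, persist the call record
    session.on("end", async (transcript) => {
      await firestore.collection("calls").add({
        callerNumber: call.from,
        durationMs: call.duration,
        transcript,
        audioUrl: await uploadRecording(call.recordingUrl),
        outcome: extractOutcome(transcript),
        crmLogged: await syncToCRM(transcript),
        createdAt: serverTimestamp(),
      });
    });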
In production we add retry logic on tool calls, timeout guardrails, profanity and harassment detection that escalates, structured logs, an eval harness for replaying past calls against prompt changes, and per-call cost attribution.
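To make the first two items concrete, here is a minimal sketch of a tool wrapper with a per-attempt timeout and bounded retries. The function name `runToolWithGuardrails` and the default limits are illustrative assumptions, not our production values.

```javascript
// Run a tool with a per-attempt timeout and a bounded number of retries.
// `toolFn` is any async function; the defaults below are illustrative only.
async function runToolWithGuardrails(toolFn, args, opts = {}) {
  const { timeoutMs = 3000, maxAttempts = 3, backoffMs = 200 } = opts;
  let lastError;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`tool timed out after ${timeoutMs} ms`)),
        timeoutMs
      );
    });
    try {
      const result = await Promise.race([toolFn(args), timeout]);
      return { ok: true, result, attempts: attempt };
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Linear backoff before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
      }
    } finally {
      clearTimeout(timer);
    }
  }
  // Return a failure the model can explain to the caller instead of throwing mid-call.
  return { ok: false, error: String(lastError && lastError.message), attempts: maxAttempts };
}
```

Two caveats with this sketch: a timed-out tool keeps running in the background because nothing cancels it, and retrying is only safe for idempotent tools. A lookup can be retried freely; a booking probably needs an idempotency key first.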
Process
1. Discovery — 3 to 5 days
We map the call surface area: typical intents, edge cases, current handoff to humans, success metric (booking rate, qualified-lead rate, resolved-without-escalation rate). We pull 20+ historic call recordings or transcripts where available and analyse the shape of real conversations. We propose the intent taxonomy.
2. Script + prompt design — 3 to 5 days
We write the system prompt, the tool descriptions, the opening message, and the escalation triggers. We define the voice (which TTS voice, pacing, persona). We define what is in scope and what is not, so the agent doesn't try to handle things it shouldn't.
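A tool description is mostly a name, a sentence telling the model when to use it, and a JSON Schema for the arguments. The sketch below shows what one might look like for an availability lookup; the tool name, fields, and wording are hypothetical, and the exact object layout should be checked against the current API reference for whichever model you use. The small `missingArgs` helper is also an assumption, added to show one way of rejecting incomplete calls before they reach your calendar.

```javascript
// Hypothetical definition for the booking line's availability tool.
const findAvailabilityTool = {
  type: "function",
  name: "find_availability",
  description:
    "Look up open appointment slots. Use only after the caller has named a service and a preferred day.",
  parameters: {
    type: "object",
    properties: {
      service: { type: "string", description: "Service the caller wants to book" },
      date: { type: "string", description: "Preferred day, ISO 8601 (YYYY-MM-DD)" },
    },
    required: ["service", "date"],
  },
};

// List the required arguments the model left out or sent as empty strings.
function missingArgs(tool, args) {
  return tool.parameters.required.filter(
    (key) => typeof args[key] !== "string" || args[key].trim() === ""
  );
}
```

The description doubles as a scope statement: telling the model when not to call the tool matters as much as telling it what the tool does.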
3. Build — 1 to 3 weeks
Twilio number provisioning. WebSocket bridge to the model. Tool implementations. CRM integration. Recording / transcript pipeline. Observability dashboard. Test calls every day.
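The Twilio side of the WebSocket bridge comes down to handling a few JSON message types. The sketch below assumes the Twilio Media Streams message shapes (`start`, `media`, `stop` inbound; `media` and `clear` outbound) and leaves out the model connection entirely. The function names and the returned action objects are our own illustrative choices.

```javascript
// Translate one inbound Twilio Media Streams message into an action for the bridge.
function handleTwilioMessage(raw, state) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "start":
      // Remember the stream and call identifiers for later outbound messages.
      state.streamSid = msg.start.streamSid;
      state.callSid = msg.start.callSid;
      return { type: "call_started" };
    case "media":
      // Payload is base64-encoded caller audio.
      return { type: "caller_audio", audio: Buffer.from(msg.media.payload, "base64") };
    case "stop":
      return { type: "call_ended" };
    default:
      return { type: "ignored" }; // e.g. "connected", "mark"
  }
}

// Outbound: agent audio to play to the caller over the same WebSocket.
const agentAudio = (state, audio) =>
  JSON.stringify({
    event: "media",
    streamSid: state.streamSid,
    media: { payload: audio.toString("base64") },
  });

// Outbound: on barge-in, ask Twilio to drop any agent audio it has buffered.
const clearPlayback = (state) =>
  JSON.stringify({ event: "clear", streamSid: state.streamSid });
```

Keeping the message handling as pure functions like this makes it easy to replay recorded message streams in tests without a live call.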
4. Eval + tuning — 1 week
We run the agent against your real historic call set (or simulated calls) and score it. We tune the prompt, the voice, the tools. We re-run. We do not deploy until the eval shows the agent meets the success metric.
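The deploy gate can be as simple as comparing each replayed call's outcome with the expected one and checking the pass rate against the agreed threshold. The sketch below assumes each eval case has already been reduced to an `expectedOutcome` and an `actualOutcome`; those field names and the `scoreEval` function are hypothetical.

```javascript
// Score replayed calls against expected outcomes and decide whether to ship.
// Each case: { id, expectedOutcome, actualOutcome }.
function scoreEval(cases, threshold) {
  const failures = cases.filter((c) => c.actualOutcome !== c.expectedOutcome);
  const passRate =
    cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return {
    passRate,
    failures: failures.map((c) => c.id), // the calls to listen to first
    deploy: cases.length > 0 && passRate >= threshold,
  };
}
```

Real scoring is usually richer than an exact outcome match, but returning the failing call IDs is the part worth keeping: it tells you which recordings to review before the next tuning round.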
5. Launch — phased
Single phone number → 10% of inbound traffic → 50% → 100%. Or a separate after-hours line first, with humans during business hours. Whatever lets us catch regressions early.
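One way to implement the percentage split is to hash the caller's number into a stable bucket, so the same caller always gets the same experience during a given phase. This is a sketch under that assumption; the FNV-1a hash is just a convenient dependency-free choice, and `routeCall` is a hypothetical name.

```javascript
// FNV-1a 32-bit hash: small, deterministic, no dependencies.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Route a stable share of callers to the agent; everyone else stays on the existing line.
function routeCall(callerNumber, rolloutPercent) {
  const bucket = fnv1a(callerNumber) % 100; // 0..99, stable per caller
  return bucket < rolloutPercent ? "agent" : "human";
}
```

Because the bucket is fixed per caller, raising the percentage from 10 to 50 only adds callers to the agent; nobody who already reached the agent is moved back.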
6. Iterate
Voice agents drift more than text agents because reality keeps changing the conversation. We keep a monthly eval cadence on retainer.
What good observability looks like for voice
A one-glance dashboard:
- Calls per day, week, month
- Outcome distribution (booked / qualified / transferred / failed)
- Average call duration
- Cost per call (Twilio + model + downstream tools)
- Caller sentiment (sampled from transcripts)
- Escalation rate over time (trending up = something to fix)
- Latency p50 / p95 (time to first agent word, time-to-resolution)
Plus per-call drilldown with the full transcript, the audio playback, the tool calls made, and the outcome.
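Most of those dashboard numbers are simple aggregations over the stored call records. The sketch below assumes each record carries an `outcome` string and a hypothetical `firstWordMs` field (time to first agent word); it uses a nearest-rank percentile, which is one of several common definitions.

```javascript
// Nearest-rank percentile over a list of numbers (p in 0..100).
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarise stored call records into the headline dashboard numbers.
function summariseCalls(calls) {
  const outcomes = {};
  for (const c of calls) outcomes[c.outcome] = (outcomes[c.outcome] || 0) + 1;
  const latencies = calls.map((c) => c.firstWordMs);
  const transferred = outcomes.transferred || 0;
  return {
    total: calls.length,
    outcomes,
    escalationRate: calls.length ? transferred / calls.length : 0,
    firstWordP50: percentile(latencies, 50),
    firstWordP95: percentile(latencies, 95),
  };
}
```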
Pricing — the honest version
| Engagement | Timeline | Investment |
|---|---|---|
| Discovery + scripted demo | 5–7 days | €3,500–6,000 |
| Single-intent booking line | 3–4 weeks | €15,000–25,000 |
| Multi-intent qualification line + CRM | 4–6 weeks | €25,000–50,000 |
| Multi-language or multi-line program | 6–10 weeks | €50,000–100,000 |
| Ongoing retainer (eval + tuning) | Monthly | from €1,500/month |
Plus pass-through Twilio + LLM costs (typically €0.10–€0.40 per call).
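For a feel of how the per-call figure is assembled, here is a minimal sketch of cost attribution. Every rate in the usage example is a made-up placeholder, not a quote from Twilio or any model provider, and billing the model per minute is a simplifying assumption (realtime models are typically metered by tokens).

```javascript
// Per-call pass-through cost in euros. All rates are placeholders, not quotes.
function callCost(call, rates) {
  const minutes = call.durationMs / 60000;
  // Telephony is assumed to be billed per started minute.
  const telephony = Math.ceil(minutes) * rates.telephonyPerMinute;
  // Simplification: model usage approximated as a flat per-minute rate.
  const model = minutes * rates.modelPerMinute;
  const tools = call.toolCalls * rates.perToolCall;
  const total = telephony + model + tools;
  return { telephony, model, tools, total: Math.round(total * 100) / 100 };
}

// Usage with placeholder rates: a 2.5-minute call that made two tool calls.
const example = callCost(
  { durationMs: 150000, toolCalls: 2 },
  { telephonyPerMinute: 0.01, modelPerMinute: 0.08, perToolCall: 0.005 }
);
console.log(example.total); // 0.24 with these placeholder rates
```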
What we will not do
- Build a voice agent without recording (we cannot evaluate, and the agent will silently degrade).
- Build a voice agent for legally sensitive workflows without explicit legal sign-off from your team.
- Promise specific resolution rates before the eval phase is done.
- Ship without warm-transfer paths to humans.
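A warm transfer means the human picks up already knowing who is calling and why. A minimal sketch of that decision and the handoff note is below; the trigger fields (`callerAskedForHuman`, `failedToolCalls`, `clarificationTurns`), the thresholds, and the `planTransfer` name are all hypothetical and would come out of the escalation triggers agreed during prompt design.

```javascript
// Decide whether to hand the call to a human, and build the note the human sees first.
function planTransfer(call) {
  const reasons = [];
  if (call.callerAskedForHuman) reasons.push("caller asked for a human");
  if (call.failedToolCalls >= 2) reasons.push("tools failed repeatedly");
  if (call.clarificationTurns >= 3) reasons.push("agent could not understand the caller");

  if (reasons.length === 0) return { transfer: false };
  return {
    transfer: true,
    handoffNote:
      `Caller ${call.callerNumber}: ${call.summary}. ` +
      `Transferring because ${reasons.join(" and ")}.`,
  };
}
```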
If you have a call surface in mind — after-hours bookings, qualification, FAQ — send a note and we'll come back within one business day with a feasibility take.
Related work
Voice & Phone Agent
After-hours bookings, lead qualification, customer service overflow, FAQ lines
Voice Concierge
AI phone agent for after-hours bookings
Building a phone agent with Twilio + GPT-4o: a complete walkthrough
Build a phone agent: Twilio provisions the number and streams audio, a Node.js bridge on Cloud Run pipes the audio to GPT-4o Realtime, function-calling tools execute real actions (book appointment, log lead, transfer). Recording, transcript, and observability on every call. Production deployment in 3-6 weeks.
Voice AI for service businesses: a buyer's guide
Voice AI works for service businesses with predictable call patterns and meaningful inbound volume. Booking, qualification, status, FAQ. Real cost ~€0.10-0.40/call. Real build cost €15-50k for a single-line deployment. Evaluate vendors on recording, escalation paths, and CRM integration — not on the demo.
Why your AI chatbot fails (and what to fix)
Most chatbots that fail in production fail for one of six reasons: no retrieval, bad retrieval, no evals, no escalation, no observability, no scope. Tuning the prompt won't fix any of them. The fix is engineering — and the engineering is well-understood by now.
Ready to scope voice & phone AI agents?
A discovery call is the fastest way to know if there's a fit.