Agent type

Voice & Phone Agent

After-hours bookings, lead qualification, customer service overflow, FAQ lines

What a voice agent actually does

You give it a phone number. It picks up. It listens, speaks, handles interruptions cleanly, and recovers from confusion. It executes the tools you give it. It hangs up.

That's the whole surface. The interesting parts are everything around it — the model choice, the voice, the pacing, the tool implementations, the observability, the escalation paths — but the user-facing product is exactly that simple.

A typical 90-second booking call goes:

  1. Agent: "Thanks for calling [practice]. You're speaking with an AI assistant — I can help you book, reschedule, or get you to a human if you prefer. How can I help?"
  2. Caller: "I need to book a check-up for next Tuesday."
  3. Agent: [calls findAvailability(2026-05-26, 30)] "I have 10am, 11:30am, or 2pm. Which works?"
  4. Caller: "10am please."
  5. Agent: "Can I have your name and date of birth?"
  6. Caller: "[gives details]"
  7. Agent: [calls bookAppointment(slotId, patient)] "Booked. You'll get a confirmation by SMS. Anything else?"

End: transcript and recording in Firestore, CRM updated, SMS sent, call counted in the dashboard.
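The two tool calls in the walkthrough above can be sketched as plain functions. This is a minimal illustration, not the production implementation: the in-memory `calendar`, the `Slot` and `Patient` shapes, and the result format are all assumptions made for the example, and a real build would query the practice's booking system instead.

```typescript
// Hypothetical in-memory calendar standing in for the real booking system.
interface Slot {
  slotId: string;
  start: string; // local time, e.g. "2026-05-26T10:00"
  durationMinutes: number;
  bookedBy?: string;
}

interface Patient {
  name: string;
  dateOfBirth: string; // "YYYY-MM-DD"
}

type BookingResult =
  | { ok: true; slotId: string; start: string }
  | { ok: false; reason: string };

const calendar: Slot[] = [
  { slotId: "s1", start: "2026-05-26T10:00", durationMinutes: 30 },
  { slotId: "s2", start: "2026-05-26T11:30", durationMinutes: 30 },
  { slotId: "s3", start: "2026-05-26T14:00", durationMinutes: 30 },
  { slotId: "s4", start: "2026-05-27T09:00", durationMinutes: 30 },
];

// Returns the free slots on a given date that are long enough for the visit.
function findAvailability(date: string, durationMinutes: number): Slot[] {
  return calendar.filter(
    (s) =>
      s.start.slice(0, 10) === date &&
      s.durationMinutes >= durationMinutes &&
      s.bookedBy === undefined
  );
}

// Books a slot if it is still free. The result is a small object the model
// can turn into a spoken confirmation or an apology.
function bookAppointment(slotId: string, patient: Patient): BookingResult {
  const slot = calendar.find((s) => s.slotId === slotId);
  if (!slot) return { ok: false, reason: "unknown_slot" };
  if (slot.bookedBy !== undefined) return { ok: false, reason: "slot_taken" };
  slot.bookedBy = `${patient.name} (${patient.dateOfBirth})`;
  return { ok: true, slotId: slot.slotId, start: slot.start };
}
```

Returning a structured failure such as `slot_taken` rather than throwing lets the agent offer another time instead of dropping the call.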

When this agent is the right call

You should consider a voice agent when:

  • You are losing calls. After-hours, peak hours, vacation cover, unanswered inbound. Voicemail-to-callback has brutal drop-off.
  • The conversation has shape. Booking, qualification, order status, FAQ — predictable structure where the human work is mostly transactional.
  • Volume justifies it. Roughly 100 or more relevant calls per month is the bottom end.
  • You can record. Recording is non-negotiable for evaluation; if regulation forbids it, you cannot review calls, and the agent's quality will drift without anyone noticing.

You should not use a voice agent when:

  • The work is high-empathy (medical triage, mental health, complaint resolution). Use the agent as a router to a human, not as the conversation itself.
  • Compliance forbids automation (some debt collection laws, some healthcare scenarios).
  • Your audience hates phone bots categorically. (More common in some markets than others.)

Our Voice AI buyer's guide covers the decision in detail.

The stack

Layer | Default
Telephony | Twilio Voice (Stream API for realtime audio)
Realtime model | OpenAI GPT-4o Realtime
Fallback ASR | Whisper-large-v3
Fallback TTS | OpenAI TTS / ElevenLabs
Function calling | OpenAI tools
Backend | Node.js on Cloud Run with WebSocket
State | Firestore or Postgres
Recording | Twilio + your cloud storage
Observability | Langfuse + structured logs + Slack alerts

Anatomy of a production voice agent

The minimum production voice agent has:

  • System prompt that defines persona, scope, escalation triggers, and the voice's "rules of engagement."
  • Tool definitions that map intents to concrete actions in your systems.
  • Function-calling loop that lets the model invoke tools mid-conversation.
  • Recording + transcription for every call, stored in your cloud.
  • Eval harness that replays past calls against new prompts/models.
  • Observability dashboard showing per-call cost, latency, outcomes, and trends.
  • Escalation paths — warm transfer during business hours, callback queue outside.
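The tool-definition and function-calling items above can be sketched as a small registry plus a dispatcher. This is an illustration under assumptions: `registerTool` and `dispatchToolCall` are hypothetical names, the flat `{ type, name, description, parameters }` tool shape is assumed and should be checked against the API version you use, and the handler is a stub rather than a real calendar lookup.

```typescript
// Assumed tool shape; `parameters` holds a JSON Schema object.
interface ToolDefinition {
  type: "function";
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const tools = new Map<string, { definition: ToolDefinition; handler: ToolHandler }>();

function registerTool(definition: ToolDefinition, handler: ToolHandler): void {
  tools.set(definition.name, { definition, handler });
}

// Runs one tool call requested by the model. It always resolves to a JSON
// string, so a failure becomes something the model can explain to the caller.
async function dispatchToolCall(name: string, rawArguments: string): Promise<string> {
  const tool = tools.get(name);
  if (!tool) return JSON.stringify({ error: "unknown_tool", name });

  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return JSON.stringify({ error: "invalid_arguments" });
  }
  if (typeof parsed !== "object" || parsed === null) {
    return JSON.stringify({ error: "invalid_arguments" });
  }

  try {
    const result = await tool.handler(parsed as Record<string, unknown>);
    return JSON.stringify({ result });
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return JSON.stringify({ error: "tool_failed", message });
  }
}

registerTool(
  {
    type: "function",
    name: "findAvailability",
    description: "List free appointment slots on a date.",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string", description: "YYYY-MM-DD" },
        durationMinutes: { type: "number" },
      },
      required: ["date", "durationMinutes"],
    },
  },
  // Hypothetical stub handler; a real one would query the booking system.
  async (args) => (args.date === "2026-05-26" ? ["10:00", "11:30", "14:00"] : [])
);
```

In a live call, the string returned by `dispatchToolCall` would be sent back to the model as the tool output, and the definitions in the registry would be passed to the model when the session is configured.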

A reference implementation in code lives in our Twilio + GPT-4o walkthrough post.

Cost economics

Roughly €0.10–€0.40 per call in production:

  • Twilio voice: ~€0.015/minute
  • GPT-4o Realtime: ~€0.10 per minute of conversation (audio in both directions is counted)
  • Tool calls (latency and cost): variable
  • Storage and observability: trivial

A typical 90-second booking call: ~€0.12. A 5-minute support call: ~€0.40. Surface this on the dashboard so you can correlate spend to outcomes.
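A per-call estimate and a spend-by-outcome rollup for the dashboard can be sketched as follows. The rates are the per-minute figures listed above; the `Outcome` labels and function names are assumptions made for the example, and tool and storage costs are left out. A straight per-minute multiplication with these rates comes out somewhat above the per-call examples quoted above (about €0.17 for 90 seconds), so treat it as a rough estimate and prefer measured cost per call.

```typescript
interface Rates {
  telephonyPerMinute: number; // EUR
  modelPerMinute: number; // EUR
}

// Per-minute figures from the list above; tool and storage costs are ignored.
const listedRates: Rates = { telephonyPerMinute: 0.015, modelPerMinute: 0.1 };

// Linear estimate: minutes on the line times the combined per-minute rate,
// rounded to a tenth of a cent.
function estimateCallCost(durationSeconds: number, rates: Rates = listedRates): number {
  const minutes = durationSeconds / 60;
  const cost = minutes * (rates.telephonyPerMinute + rates.modelPerMinute);
  return Math.round(cost * 1000) / 1000;
}

// Hypothetical outcome labels; use whatever your dashboard tracks.
type Outcome = "booked" | "transferred" | "abandoned";

interface CallRecord {
  durationSeconds: number;
  outcome: Outcome;
}

interface OutcomeSpend {
  calls: number;
  totalCost: number;
}

// Groups estimated spend by outcome so cost can be read against results.
function spendByOutcome(
  calls: CallRecord[],
  rates: Rates = listedRates
): Record<Outcome, OutcomeSpend> {
  const summary: Record<Outcome, OutcomeSpend> = {
    booked: { calls: 0, totalCost: 0 },
    transferred: { calls: 0, totalCost: 0 },
    abandoned: { calls: 0, totalCost: 0 },
  };
  for (const call of calls) {
    summary[call.outcome].calls += 1;
    summary[call.outcome].totalCost += estimateCallCost(call.durationSeconds, rates);
  }
  return summary;
}
```

Dividing `totalCost` for `booked` by its `calls` count gives an estimated cost per successful booking, which is usually a more useful number than cost per minute.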

Timeline

Scope | Duration
Single-intent booking line | 3–4 weeks
Multi-intent qualification + CRM | 4–6 weeks
Multi-language or multi-line | 6–10 weeks
Enterprise voice platform (many lines, many integrations) | 10–16 weeks

Common failure modes we've seen

  • The voice sounds wrong for the brand. Default voices are too neutral or too enthusiastic. We tune voice + pacing during the build, not after deployment.
  • The agent answers questions it shouldn't. For example, a booking agent is suddenly asked for medical advice. We define scope explicitly and add refusal patterns.
  • Cold transfer to a human, who restarts from scratch. We always pass conversation context with the transfer.
  • No recording. Cannot evaluate. Cannot tune. Cannot debug. The agent silently degrades over months. We refuse engagements without recording.
  • Vendor lock-in to Twilio's "Studio" or similar low-code IVR builders. Easy to start, painful to scale. We default to code from day one.
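Passing context with a transfer, and choosing between a warm transfer and a callback, can be sketched as one hand-off builder. Everything here is an assumption for illustration: the `Handoff` shape, the Monday-to-Friday opening hours, and the choice to forward the last six turns are not taken from a specific implementation.

```typescript
interface Turn {
  speaker: "agent" | "caller";
  text: string;
}

interface BusinessHours {
  openHour: number; // local time, 24h clock
  closeHour: number; // exclusive
}

interface Handoff {
  route: "warm_transfer" | "callback_queue";
  callerNumber: string;
  reason: string;
  collected: Record<string, string>; // details the caller already gave
  lastTurns: Turn[];
}

// Assumes a Monday-to-Friday schedule in the server's local time zone.
function isOpen(now: Date, hours: BusinessHours): boolean {
  const day = now.getDay(); // 0 = Sunday, 6 = Saturday
  const hour = now.getHours();
  return day >= 1 && day <= 5 && hour >= hours.openHour && hour < hours.closeHour;
}

// Builds the payload a human receives, so they do not restart from scratch.
function buildHandoff(
  callerNumber: string,
  reason: string,
  collected: Record<string, string>,
  transcript: Turn[],
  now: Date,
  hours: BusinessHours
): Handoff {
  return {
    route: isOpen(now, hours) ? "warm_transfer" : "callback_queue",
    callerNumber,
    reason,
    collected,
    lastTurns: transcript.slice(-6), // keep only the most recent exchange
  };
}
```

The same payload works for both routes: during opening hours it can be shown to the person taking the transfer, and outside them it can be stored with the callback request.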

Where it pairs

See Voice Concierge for a full end-to-end build, or drop us a note with the call surface you'd like to automate.
