Voice Concierge
AI phone agent for after-hours bookings
At a glance
| Field | Detail |
|---|---|
| Client | Demo build (template for service-business engagements) |
| Industry | Service businesses (clinics, dental, salons, repair, professional services) |
| Engagement | Demo / template — full client builds typically 3–6 weeks |
| Stack | Twilio Voice, GPT-4o Realtime, Cloud Run, calendar/CRM integrations |
| Status | Template, ready to instantiate for client engagements |
The challenge
Service businesses lose a meaningful percentage of after-hours inbound calls. Voicemail-to-callback has brutal drop-off rates — by the time someone calls back the next morning, the prospect has booked elsewhere.
Hiring a 24/7 receptionist is uneconomic for the volume. Generic IVR ("press 1 for bookings, press 2 for...") feels worse than voicemail and converts worse too. Most "AI phone" products are still demo-stage — clunky pacing, can't be interrupted, can't actually book a real slot in a real calendar.
The demo target: a phone agent that picks up every call, books appointments into the actual calendar, qualifies new leads into the CRM, and hands off to a human during business hours when that's the right move.
What we built
A Twilio Voice number routed to a Node.js service on Cloud Run that bridges Twilio's audio streams to OpenAI's GPT-4o Realtime model. The model has function-calling tools for the actual actions — find availability, book, qualify, transfer.
Voice loop
[Caller dials] → [Twilio Voice answers]
↓
[Twilio Stream (audio in/out) → Cloud Run WebSocket]
↓
[GPT-4o Realtime session with system prompt + tools]
↓
[Audio out streamed back to Twilio → caller]
↓
[Tool calls executed when model invokes them]
↓
[Call end → transcript stored, outcome logged]
End-to-end latency from caller speech to agent response: under 2 seconds in our test environment. The model also supports proper barge-in (the agent stops speaking when the caller starts), which is the single biggest UX upgrade over 2020-era IVRs.
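The loop above can be sketched as two pure mapping functions, one per direction. This is a minimal sketch under stated assumptions, not the production bridge: the message shapes follow Twilio Media Streams and the beta Realtime API event names as publicly documented at the time of writing (`input_audio_buffer.append`, `response.audio.delta`, `input_audio_buffer.speech_started`), and they should be checked against the current docs. `BridgeState`, `twilioToRealtime` and `realtimeToTwilio` are hypothetical names.

```typescript
// Subset of the JSON messages Twilio Media Streams sends over the WebSocket.
interface TwilioMessage {
  event: string; // "start" | "media" | "stop" | ...
  start?: { streamSid: string };
  media?: { payload: string }; // base64 audio (G.711 mu-law, 8 kHz)
}

// Subset of the events received from the Realtime session (names assumed from the beta API).
interface RealtimeEvent {
  type: string;
  delta?: string; // base64 audio chunk on "response.audio.delta"
}

interface BridgeState {
  streamSid: string | null; // set when Twilio sends "start"
}

// Caller audio -> model: forward each media frame as an audio-buffer append.
function twilioToRealtime(msg: TwilioMessage, state: BridgeState): Record<string, unknown>[] {
  if (msg.event === "start" && msg.start) {
    state.streamSid = msg.start.streamSid;
    return [];
  }
  if (msg.event === "media" && msg.media) {
    return [{ type: "input_audio_buffer.append", audio: msg.media.payload }];
  }
  return [];
}

// Model audio -> caller, plus barge-in handling.
function realtimeToTwilio(evt: RealtimeEvent, state: BridgeState): Record<string, unknown>[] {
  if (state.streamSid === null) return []; // nothing to address yet
  if (evt.type === "response.audio.delta" && evt.delta) {
    return [{ event: "media", streamSid: state.streamSid, media: { payload: evt.delta } }];
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    // Barge-in: the caller started talking, so drop audio Twilio has buffered but not yet played.
    return [{ event: "clear", streamSid: state.streamSid }];
  }
  return [];
}
```

Keeping the mapping free of socket code makes it easy to unit-test. A fuller bridge would also tell the model to stop or truncate the response it was speaking when barge-in fires, which this sketch leaves out.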
Tools (function calling)
The agent has five tools that map intents to real actions:
| Tool | Purpose |
|---|---|
| findAvailability(date, durationMinutes) | Query the calendar for open slots |
| bookAppointment(slotId, customer) | Create the calendar event, send confirmation |
| qualifyLead(name, contact, intent, qualifyingNotes) | Write lead to the CRM with structured fields |
| transferToHuman(reason) | Warm transfer during business hours; queue callback outside |
| logCallOutcome(outcome, notes) | Final tool, called at the end of every call |
Tool implementations are typed (Zod), idempotent (re-runs don't duplicate), and observable (every tool invocation logged with arguments, result, latency, cost).
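A minimal sketch of how the three properties could fit together in one wrapper. The names `ToolRunner`, `ToolDef` and `parseBookArgs` are hypothetical, and the hand-written validator stands in for a Zod schema so the example has no dependencies. Handlers are synchronous here for brevity (real ones would be async), the idempotency store is in memory, and cost logging is omitted.

```typescript
// One log entry per invocation: arguments, result, latency, and whether it was a replay.
interface ToolLog {
  tool: string;
  idempotencyKey: string;
  args: unknown;
  result: unknown;
  latencyMs: number;
  replayed: boolean;
}

interface ToolDef<A, R> {
  name: string;
  parse: (raw: unknown) => A; // throws on invalid input; a Zod schema's parse would slot in here
  run: (args: A) => R;
}

class ToolRunner {
  private results = new Map<string, unknown>();
  readonly logs: ToolLog[] = [];

  // callId + tool name + the model's tool-call id form the idempotency key (an assumed scheme).
  invoke<A, R>(def: ToolDef<A, R>, callId: string, toolCallId: string, raw: unknown): R {
    const key = `${callId}:${def.name}:${toolCallId}`;
    const started = Date.now();
    const args = def.parse(raw); // validate before doing anything else
    const replayed = this.results.has(key);
    // On a re-run, return the stored result instead of executing the side effect again.
    const result = replayed ? (this.results.get(key) as R) : def.run(args);
    this.results.set(key, result);
    this.logs.push({
      tool: def.name,
      idempotencyKey: key,
      args,
      result,
      latencyMs: Date.now() - started,
      replayed,
    });
    return result;
  }
}

// Example argument type and validator for bookAppointment (field names are illustrative).
interface BookArgs {
  slotId: string;
  customer: { name: string; phone: string };
}

function parseBookArgs(raw: unknown): BookArgs {
  const r = raw as { slotId?: unknown; customer?: { name?: unknown; phone?: unknown } } | null;
  if (
    !r ||
    typeof r.slotId !== "string" ||
    !r.customer ||
    typeof r.customer.name !== "string" ||
    typeof r.customer.phone !== "string"
  ) {
    throw new Error("bookAppointment: invalid arguments");
  }
  return { slotId: r.slotId, customer: { name: r.customer.name, phone: r.customer.phone } };
}
```

In a deployed service the result store would need to outlive the process (for example a database row keyed by the same string), since an in-memory map does not survive a restart or a second instance.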
System prompt
The system prompt defines:
- Persona — friendly, brief, professional, never pretends to be human.
- Opening line — discloses AI at call start.
- Scope — what the agent handles vs. what it escalates.
- Escalation triggers — explicit human request, repeated misunderstanding, sensitive topics, anything off-scope.
- Brevity — no rambling. Short turns. Confirm critical details by reading back.
- Confirmation — read back time/date/name before booking.
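The points above could be assembled into a session configuration along these lines. This is an illustrative sketch only: the prompt wording is invented for the example and is not the prompt used in the build, `PromptConfig`, `buildInstructions` and `buildSessionUpdate` are hypothetical names, and the `session.update` field names follow the beta Realtime API as publicly documented, so verify them against the current reference. Only one of the five tools is declared; the others would follow the same shape.

```typescript
interface PromptConfig {
  businessName: string;
  inScope: string[];            // what the agent handles itself
  escalationTriggers: string[]; // when to call transferToHuman
}

// Build the instruction text from the persona, disclosure, scope, escalation and brevity rules.
function buildInstructions(cfg: PromptConfig): string {
  return [
    `You are the AI phone assistant for ${cfg.businessName}. Be friendly, brief and professional. Never claim to be human.`,
    "Open every call by saying that you are an AI assistant and that the call is recorded.",
    `You handle: ${cfg.inScope.join(", ")}. Escalate anything else.`,
    `Call transferToHuman when: ${cfg.escalationTriggers.join("; ")}.`,
    "Keep turns short. Read back the date, time and name, and wait for a clear yes before calling bookAppointment.",
  ].join("\n");
}

// Payload sent to the Realtime session once the WebSocket is open (field names assumed).
function buildSessionUpdate(cfg: PromptConfig) {
  return {
    type: "session.update",
    session: {
      instructions: buildInstructions(cfg),
      voice: "alloy",
      input_audio_format: "g711_ulaw",  // matches the mu-law audio Twilio streams
      output_audio_format: "g711_ulaw",
      turn_detection: { type: "server_vad" },
      tools: [
        {
          type: "function",
          name: "transferToHuman",
          description: "Hand the call to a human, or queue a callback outside business hours.",
          parameters: {
            type: "object",
            properties: { reason: { type: "string" } },
            required: ["reason"],
          },
        },
      ],
    },
  };
}
```

Generating the instructions from a config object keeps the per-client differences (name, scope, triggers) out of the prompt template itself.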
Recording + transcript
Every call is recorded, with disclosure at call start. The recording is uploaded to Cloud Storage, and a transcript is generated post-call with Whisper in cases where the realtime transcript isn't sufficient. Both are tied to the call ID in Firestore.
Observability dashboard
A one-page dashboard for the operator:
- Calls per day, week, month
- Outcome distribution (booked / qualified / transferred / failed)
- Average call duration
- Cost per call (Twilio + GPT-4o + tool latency)
- Caller sentiment (sampled from transcripts)
- Escalation rate trend
- Latency p50 / p95 — time to first agent word, time to resolution
Plus per-call drilldown: full transcript, audio playback, tool calls made, outcome.
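Several of the dashboard numbers reduce to simple aggregations over per-call records. A minimal sketch, assuming a hypothetical `CallRecord` shape and using the nearest-rank method for percentiles (one of several common definitions):

```typescript
type Outcome = "booked" | "qualified" | "transferred" | "failed";

// Hypothetical per-call record; field names are illustrative.
interface CallRecord {
  outcome: Outcome;
  durationSec: number;
  firstWordMs: number; // time from call answer to first agent audio
}

// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(calls: CallRecord[]) {
  const outcomes: Record<Outcome, number> = { booked: 0, qualified: 0, transferred: 0, failed: 0 };
  let totalDuration = 0;
  for (const c of calls) {
    outcomes[c.outcome] += 1;
    totalDuration += c.durationSec;
  }
  const n = calls.length;
  const firstWord = calls.map((c) => c.firstWordMs);
  return {
    calls: n,
    outcomes,
    avgDurationSec: n > 0 ? totalDuration / n : 0,
    escalationRate: n > 0 ? outcomes.transferred / n : 0,
    firstWordP50Ms: n > 0 ? percentile(firstWord, 50) : 0,
    firstWordP95Ms: n > 0 ? percentile(firstWord, 95) : 0,
  };
}
```

With small daily volumes, p95 is dominated by one or two slow calls, so it is usually worth computing over a weekly window as well.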
Key features shipped
- Twilio voice integration with WebSocket streaming.
- GPT-4o Realtime as the conversation engine with sub-2s latency.
- Five typed function-calling tools wired to real actions.
- Calendar write-back on booking (Google Calendar, Outlook, Cal.com).
- CRM logging of qualified leads (HubSpot, Salesforce, Pipedrive).
- Warm transfer to humans during business hours; callback queue outside.
- Recording + transcript on every call, stored in client cloud.
- Eval harness that replays historic calls against new prompts/models.
- Per-call cost attribution on the dashboard.
- PII redaction in transcripts where required (configurable).
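The eval harness mentioned in the list can be pictured as a loop that replays the caller's side of each historic call and compares the tool calls and outcome against what was expected. A minimal sketch with hypothetical names (`EvalCase`, `Agent`, `runEvals`); the agent is a synchronous function here so the example stays self-contained, whereas a real harness would call the model asynchronously.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// One historic call: the caller's turns plus the behaviour we expect from the agent.
interface EvalCase {
  callId: string;
  callerTurns: string[];
  expectedTools: string[]; // tool names in order
  expectedOutcome: string;
}

interface AgentRun {
  toolCalls: ToolCall[];
  outcome: string;
}

// The prompt/model combination under test, reduced to a function for this sketch.
type Agent = (callerTurns: string[]) => AgentRun;

interface EvalFailure {
  callId: string;
  reason: string;
}

function runEvals(cases: EvalCase[], agent: Agent) {
  const failures: EvalFailure[] = [];
  for (const c of cases) {
    const run = agent(c.callerTurns);
    const actual = run.toolCalls.map((t) => t.name).join(",");
    const expected = c.expectedTools.join(",");
    if (actual !== expected) {
      failures.push({ callId: c.callId, reason: `tools: expected [${expected}] but got [${actual}]` });
    } else if (run.outcome !== c.expectedOutcome) {
      failures.push({ callId: c.callId, reason: `outcome: expected ${c.expectedOutcome} but got ${run.outcome}` });
    }
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}
```

Comparing tool names in order is a deliberately strict check. Argument-level comparison (for example, the booked slot) would be a natural next step.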
Target outcomes for a typical engagement
| Metric | Target |
|---|---|
| Coverage | 24/7 — every call answered |
| Time to first agent word | <2 seconds |
| Booking conversion (after-hours) | ~70–80% of qualified callers book |
| Cost per call | €0.10–€0.40 depending on length |
| Escalation rate (target) | <15% of calls warrant transfer |
| Recording + audit | 100% of calls |
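The cost-per-call figure depends mostly on call length. A rough way to attribute it per call is sketched below. Everything here is an assumption for illustration: the names are hypothetical, the model is priced per minute of audio as an approximation of token-based billing, telephony is assumed to bill in whole minutes rounded up, and the rates must be filled in from the providers' current price sheets.

```typescript
// Per-minute rates in the billing currency. No defaults are provided on purpose.
interface Rates {
  telephonyPerMinute: number;
  audioInPerMinute: number;  // caller speech sent to the model
  audioOutPerMinute: number; // agent speech generated by the model
}

interface CallUsage {
  durationSec: number;     // total connected time
  callerSpeechSec: number; // seconds of caller audio processed
  agentSpeechSec: number;  // seconds of agent audio generated
}

interface CostBreakdown {
  telephony: number;
  model: number;
  total: number;
}

function costPerCall(usage: CallUsage, rates: Rates): CostBreakdown {
  // Assumption: the carrier bills whole minutes, rounded up.
  const billedMinutes = Math.ceil(usage.durationSec / 60);
  const telephony = billedMinutes * rates.telephonyPerMinute;
  const model =
    (usage.callerSpeechSec / 60) * rates.audioInPerMinute +
    (usage.agentSpeechSec / 60) * rates.audioOutPerMinute;
  return { telephony, model, total: telephony + model };
}
```

Splitting the result into telephony and model components makes it easier to see which side drives cost as calls get longer.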
What we learned
Disclosure builds trust. Pretending the AI is human breaks trust the moment something goes wrong. Disclosing at the start (briefly, naturally) actually improves caller comfort — most people prefer knowing what they're talking to.
Voice and pacing matter more than the model. GPT-4o Realtime gets us to "uncanny good" with the right voice tuning. The wrong voice (too neutral, too enthusiastic, wrong age/tone for the brand) breaks the spell instantly.
Read back the critical details. "I'm booking you in for Tuesday the 26th at 10am, is that correct?" — the cheapest insurance against costly mistakes.
Calendar integration is harder than the LLM part. Edge cases: holidays, business closures, double-booking, recurring slots, timezone conversions. The calendar tool is where most of the engineering goes.
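To make two of those edge cases concrete, here is a minimal sketch of slot finding and a double-booking guard. It assumes all times are UTC instants in epoch milliseconds (conversion to the business's timezone happens at the edges), and that holidays and closures are modeled as extra busy intervals. `Interval`, `findOpenSlots` and `tryBook` are hypothetical names, and recurring slots are not handled.

```typescript
// Half-open interval [start, end) in epoch milliseconds (UTC).
interface Interval {
  start: number;
  end: number;
}

// Two half-open intervals overlap only if each starts before the other ends.
function overlaps(a: Interval, b: Interval): boolean {
  return a.start < b.end && b.start < a.end;
}

// Walk the opening window in fixed steps and keep candidates that hit no busy interval.
function findOpenSlots(
  window: Interval,
  busy: Interval[],
  durationMinutes: number,
  stepMinutes: number = 15
): Interval[] {
  const duration = durationMinutes * 60000;
  const step = stepMinutes * 60000;
  const slots: Interval[] = [];
  for (let start = window.start; start + duration <= window.end; start += step) {
    const candidate = { start, end: start + duration };
    if (!busy.some((b) => overlaps(candidate, b))) slots.push(candidate);
  }
  return slots;
}

// Re-check availability at booking time so a slot offered earlier cannot be taken twice.
function tryBook(slot: Interval, busy: Interval[]): boolean {
  if (busy.some((b) => overlaps(slot, b))) return false;
  busy.push(slot);
  return true;
}
```

Against a real calendar API the re-check in `tryBook` would have to be atomic on the provider side (or guarded by a lock), because two calls can race between the check and the write.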
Recording is non-negotiable for ongoing tuning. Without recordings, the eval suite can't grow and the prompt can't improve. We refuse engagements where recording isn't permitted.
Stack rationale
Twilio Voice because it has the cleanest streaming API for realtime audio we have worked with, and it offers local numbers in nearly every country we ship into.
GPT-4o Realtime because, at the time of this build, it was the only production-ready realtime voice model we found that handles barge-in correctly with reasonable latency and a wide voice library.
Cloud Run for the bridge service because it scales to zero (cheap when idle), handles WebSockets natively, and runs in the customer's cloud account where required.
Firestore + Cloud Storage for state, recordings, transcripts.
Where to go next
For the full technical walkthrough of how to wire Twilio + GPT-4o together, see our Building a phone agent post. For the buyer's perspective on voice AI, see our Voice AI buyer's guide.
If you have a call surface you'd like to automate, drop us a note. We'll come back within a business day.