Case study · Service businesses

Voice Concierge

AI phone agent for after-hours bookings

Client: Coming soon (demo case)
Timeline: Demo build
Stack: Twilio Voice, GPT-4o Realtime, Calendar API, CRM webhooks, Cloud Run

At a glance

Client: Demo build (template for service-business engagements)
Industry: Service businesses (clinics, dental, salons, repair, professional services)
Engagement: Demo / template; full client builds typically take 3–6 weeks
Stack: Twilio Voice, GPT-4o Realtime, Cloud Run, calendar/CRM integrations
Status: Template, ready to instantiate for client engagements

The challenge

Service businesses lose a meaningful percentage of after-hours inbound calls. Voicemail-to-callback has brutal drop-off rates — by the time someone calls back the next morning, the prospect has booked elsewhere.

Hiring a 24/7 receptionist is uneconomic for the volume. Generic IVR ("press 1 for bookings, press 2 for...") feels worse than voicemail and converts worse too. Most "AI phone" products are still demo-stage — clunky pacing, can't be interrupted, can't actually book a real slot in a real calendar.

The demo target: a phone agent that picks up every call, books appointments into the actual calendar, qualifies new leads into the CRM, and hands off to a human during business hours when that's the right move.

What we built

A Twilio Voice number routed to a Node.js service on Cloud Run that bridges Twilio's audio streams to OpenAI's GPT-4o Realtime model. The model has function-calling tools for the actual actions — find availability, book, qualify, transfer.

Voice loop

[Caller dials] → [Twilio Voice answers]
                       ↓
[Twilio Stream (audio in/out) → Cloud Run WebSocket]
                       ↓
[GPT-4o Realtime session with system prompt + tools]
                       ↓
[Audio out streamed back to Twilio → caller]
                       ↓
[Tool calls executed when model invokes them]
                       ↓
[Call end → transcript stored, outcome logged]

End-to-end latency from caller speech to agent response: under 2 seconds in our test environment. The model also supports proper barge-in (the agent stops speaking when the caller starts) which is the single biggest UX upgrade over 2020-era IVRs.
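
The bridge step in the loop above can be sketched as two pure translation functions. This is a minimal sketch, not the production bridge. It assumes the Realtime session is configured for G.711 mu-law audio in both directions, so Twilio's base64 payloads pass through without transcoding. The event names (`input_audio_buffer.append`, `response.audio.delta`, `input_audio_buffer.speech_started`, and Twilio's `media` and `clear`) are taken from the public Twilio Media Streams and OpenAI Realtime documentation and may have changed since; the type and function names are illustrative.

```typescript
// Subset of Twilio Media Streams messages the bridge cares about (assumed shapes).
type TwilioInbound =
  | { event: "start"; start: { streamSid: string; callSid: string } }
  | { event: "media"; media: { payload: string } } // base64 mu-law audio
  | { event: "stop" };

type TwilioOutbound =
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "clear"; streamSid: string };

// Subset of Realtime events (assumed names, verify against current docs).
interface RealtimeClientEvent {
  type: "input_audio_buffer.append";
  audio: string;
}
interface RealtimeServerEvent {
  type: string;
  delta?: string;
}

// Caller audio: forward each Twilio media frame to the model's input buffer.
function twilioToRealtime(msg: TwilioInbound): RealtimeClientEvent | null {
  if (msg.event !== "media") return null;
  return { type: "input_audio_buffer.append", audio: msg.media.payload };
}

// Agent audio: forward model audio to Twilio; on caller speech, ask Twilio to
// drop any audio it has buffered so the agent stops talking (barge-in).
function realtimeToTwilio(
  evt: RealtimeServerEvent,
  streamSid: string,
): TwilioOutbound | null {
  if (evt.type === "response.audio.delta" && evt.delta) {
    return { event: "media", streamSid, media: { payload: evt.delta } };
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    return { event: "clear", streamSid };
  }
  return null;
}
```

In a real service these functions would sit between two WebSocket connections, with each non-null result serialized via `JSON.stringify` and sent to the other side.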

Tools (function calling)

The agent has five tools that map intents to real actions:

| Tool | Purpose |
| --- | --- |
| findAvailability(date, durationMinutes) | Query the calendar for open slots |
| bookAppointment(slotId, customer) | Create the calendar event, send confirmation |
| qualifyLead(name, contact, intent, qualifyingNotes) | Write lead to the CRM with structured fields |
| transferToHuman(reason) | Warm transfer during business hours; queue callback outside |
| logCallOutcome(outcome, notes) | Final tool, called at the end of every call |

Tool implementations are typed (Zod), idempotent (re-runs don't duplicate), and observable (every tool invocation logged with arguments, result, latency, cost).
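
One way to get the idempotent and observable properties is a small wrapper around each tool handler. The sketch below is an assumption about how that could look, not the build's actual code: `makeIdempotentTool`, `ToolLogEntry`, and the key function are hypothetical names, and the Zod validation step is left out to keep the example dependency-free. It uses an in-memory map, which only deduplicates within one process; a multi-instance deployment would need a shared store.

```typescript
interface ToolLogEntry {
  tool: string;
  args: unknown;
  result: unknown;
  latencyMs: number;
  replayed: boolean; // true when the result came from an earlier identical call
}

type ToolHandler<A, R> = (args: A) => Promise<R>;

function makeIdempotentTool<A, R>(
  name: string,
  keyOf: (args: A) => string, // derives the idempotency key from the arguments
  handler: ToolHandler<A, R>,
  log: (entry: ToolLogEntry) => void,
): ToolHandler<A, R> {
  // Stores promises rather than results so concurrent duplicates share one run.
  const runs = new Map<string, Promise<R>>();

  return async (args: A): Promise<R> => {
    const started = Date.now();
    const key = keyOf(args);
    let pending = runs.get(key);
    const replayed = pending !== undefined;
    if (pending === undefined) {
      pending = handler(args);
      runs.set(key, pending);
      // Forget failed runs so the model can retry the same call later.
      pending.catch(() => runs.delete(key));
    }
    const result = await pending;
    log({ tool: name, args, result, latencyMs: Date.now() - started, replayed });
    return result;
  };
}
```

Wrapping `bookAppointment` this way, with a key such as `slotId` plus customer contact, means a repeated tool call from the model returns the first booking instead of creating a second calendar event.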

System prompt

The system prompt defines:

  • Persona — friendly, brief, professional, never pretends to be human.
  • Opening line — discloses AI at call start.
  • Scope — what the agent handles vs. what it escalates.
  • Escalation triggers — explicit human request, repeated misunderstanding, sensitive topics, anything off-scope.
  • Brevity — no rambling. Short turns. Confirm critical details by reading back.
  • Confirmation — read back time/date/name before booking.
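
As an illustration of how those six elements might be assembled per client, here is a hedged sketch of a prompt builder. The `AgentPromptConfig` shape, the `buildSystemPrompt` name, and the exact wording of each line are assumptions made for the example; the real prompt is longer and tuned per engagement.

```typescript
interface AgentPromptConfig {
  businessName: string;
  inScope: string[]; // what the agent handles itself
  escalationTriggers: string[]; // when it hands off to a human
}

function buildSystemPrompt(cfg: AgentPromptConfig): string {
  return [
    // Persona
    `You are the AI phone assistant for ${cfg.businessName}. Be friendly, brief, and professional. Never claim to be human.`,
    // Opening line
    `Open every call by saying that you are an AI assistant.`,
    // Scope
    `You handle: ${cfg.inScope.join(", ")}. Escalate anything else.`,
    // Escalation triggers
    `Transfer to a human when: ${cfg.escalationTriggers.join("; ")}.`,
    // Brevity and confirmation
    `Keep turns short. Before booking, read back the date, time, and name, and wait for the caller to confirm.`,
  ].join("\n");
}
```

Keeping scope and escalation triggers as data rather than hand-edited prose makes it easier to diff prompt versions when replaying calls against a new prompt.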

Recording + transcript

Every call is recorded (with disclosure at call start). The recording is uploaded to Cloud Storage, and where the realtime transcript isn't sufficient, a transcript is generated post-call by Whisper. Both are stored in Firestore, tied to the call ID.

Observability dashboard

A one-page dashboard for the operator:

  • Calls per day, week, month
  • Outcome distribution (booked / qualified / transferred / failed)
  • Average call duration
  • Cost per call (Twilio + GPT-4o + tool calls)
  • Caller sentiment (sampled from transcripts)
  • Escalation rate trend
  • Latency p50 / p95 — time to first agent word, time to resolution

Plus per-call drilldown: full transcript, audio playback, tool calls made, outcome.
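
Two of the dashboard figures, the latency percentiles and the outcome distribution, reduce to short aggregations over per-call records. The sketch below assumes a hypothetical `CallRecord` shape and uses the nearest-rank percentile definition; the dashboard itself may compute these differently.

```typescript
type Outcome = "booked" | "qualified" | "transferred" | "failed";

interface CallRecord {
  outcome: Outcome;
  firstWordMs: number; // time to first agent word
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of calls per outcome, as fractions that sum to 1.
function outcomeShare(calls: CallRecord[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = {
    booked: 0,
    qualified: 0,
    transferred: 0,
    failed: 0,
  };
  for (const call of calls) counts[call.outcome] += 1;
  const total = Math.max(1, calls.length); // avoid dividing by zero on an empty day
  return {
    booked: counts.booked / total,
    qualified: counts.qualified / total,
    transferred: counts.transferred / total,
    failed: counts.failed / total,
  };
}
```

For example, `percentile(calls.map((c) => c.firstWordMs), 95)` gives the p95 time to first agent word for a set of calls.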

Key features shipped

  • Twilio voice integration with WebSocket streaming.
  • GPT-4o Realtime as the conversation engine with sub-2s latency.
  • Five typed function-calling tools wired to real actions.
  • Calendar write-back on booking (Google Calendar, Outlook, Cal.com).
  • CRM logging of qualified leads (HubSpot, Salesforce, Pipedrive).
  • Warm transfer to humans during business hours; callback queue outside.
  • Recording + transcript on every call, stored in client cloud.
  • Eval harness that replays historic calls against new prompts/models.
  • Per-call cost attribution on the dashboard.
  • PII redaction in transcripts where required (configurable).
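
To make the eval harness item concrete, here is a minimal sketch of the replay idea: feed the caller's side of a recorded call to a candidate agent and compare the tools it invokes with the tools the original call needed. The `RecordedCall` and `EvalResult` shapes and the `AgentUnderTest` signature are assumptions for illustration; a real harness would also score wording, latency, and argument values.

```typescript
interface RecordedCall {
  id: string;
  callerTurns: string[]; // caller utterances from the stored transcript
  expectedTools: string[]; // tools a correct handling should invoke
}

// Hypothetical adapter: runs a prompt/model combination over the caller turns
// and returns the names of the tools it invoked.
type AgentUnderTest = (callerTurns: string[]) => Promise<string[]>;

interface EvalResult {
  id: string;
  passed: boolean;
  missing: string[]; // expected but not invoked
  unexpected: string[]; // invoked but not expected
}

async function replayCalls(
  calls: RecordedCall[],
  agent: AgentUnderTest,
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const call of calls) {
    const invoked = await agent(call.callerTurns);
    const missing = call.expectedTools.filter((t) => !invoked.includes(t));
    const unexpected = invoked.filter((t) => !call.expectedTools.includes(t));
    results.push({
      id: call.id,
      passed: missing.length === 0 && unexpected.length === 0,
      missing,
      unexpected,
    });
  }
  return results;
}
```

Running the same recorded set against the current and the candidate prompt and comparing pass counts gives a quick regression signal before a prompt change ships.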

Target outcomes for a typical engagement

| Metric | Result |
| --- | --- |
| Coverage | 24/7 (every call answered) |
| Time to first agent word | <2 seconds |
| Booking conversion (after-hours) | ~70–80% of qualified callers book |
| Cost per call | €0.10–€0.40 depending on length |
| Escalation rate (target) | <15% of calls warrant transfer |
| Recording + audit | 100% of calls |

What we learned

Disclosure builds trust. Pretending the AI is human breaks trust the moment something goes wrong. Disclosing at the start (briefly, naturally) actually improves caller comfort — most people prefer knowing what they're talking to.

Voice and pacing matter more than the model. GPT-4o Realtime gets us to "uncanny good" with the right voice tuning. The wrong voice (too neutral, too enthusiastic, wrong age/tone for the brand) breaks the spell instantly.

Read back the critical details. "I'm booking you in for Tuesday the 26th at 10am, is that correct?" — the cheapest insurance against costly mistakes.

Calendar integration is harder than the LLM part. Edge cases: holidays, business closures, double-booking, recurring slots, timezone conversions. The calendar tool is where most of the engineering goes.
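
The double-booking case is a good example of why. A hedged sketch of the core check follows: treat every appointment as a half-open interval and only offer slots that overlap no busy interval. The `Interval` type and the `overlaps` and `openSlots` names are illustrative. The sketch works on absolute instants, so timezone conversion, holidays, and recurring events still have to be resolved before this step.

```typescript
// Half-open interval [start, end): a slot ending at 10:00 does not clash
// with one starting at 10:00.
interface Interval {
  start: Date;
  end: Date;
}

function overlaps(a: Interval, b: Interval): boolean {
  return (
    a.start.getTime() < b.end.getTime() && b.start.getTime() < a.end.getTime()
  );
}

// Walks the working window in fixed steps and keeps candidates that are free.
function openSlots(
  window: Interval,
  busy: Interval[],
  durationMinutes: number,
): Interval[] {
  const stepMs = durationMinutes * 60_000;
  const slots: Interval[] = [];
  for (
    let t = window.start.getTime();
    t + stepMs <= window.end.getTime();
    t += stepMs
  ) {
    const candidate: Interval = { start: new Date(t), end: new Date(t + stepMs) };
    if (!busy.some((b) => overlaps(candidate, b))) slots.push(candidate);
  }
  return slots;
}
```

The same `overlaps` check is worth repeating inside the booking step itself, since a slot that was free when it was offered may have been taken by the time the caller confirms.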

Recording is non-negotiable for ongoing tuning. Without recordings, the eval suite can't grow and the prompt can't improve. We refuse engagements where recording isn't permitted.

Stack rationale

Twilio Voice because it has the cleanest streaming API for realtime audio and offers phone numbers in nearly every country we ship into.

GPT-4o Realtime because, at the time of the build, it was the only realtime voice model we found production-ready: it handles barge-in correctly, with reasonable latency and a wide voice library.

Cloud Run for the bridge service because it scales to zero (cheap when idle), handles WebSockets natively, and runs in the customer's cloud account where required.

Firestore + Cloud Storage for state, recordings, transcripts.

Where to go next

For the full technical walkthrough of how to wire Twilio + GPT-4o together, see our "Building a phone agent" post. For the buyer's perspective on voice AI, see our "Voice AI buyer's guide".

If you have a call surface you'd like to automate, drop us a note. We'll come back within a business day.

