Case study · Service businesses

Voice Concierge

AI phone agent for after-hours bookings

Client

Coming soon — demo case

Timeline

Demo build

Stack

Twilio Voice, GPT-4o Realtime, Calendar API, CRM webhooks, Cloud Run

At a glance


Client	Demo build (template for service-business engagements)
Industry	Service businesses (clinics, dental, salons, repair, professional services)
Engagement	Demo / template — full client builds typically 3–6 weeks
Stack	Twilio Voice, GPT-4o Realtime, Cloud Run, calendar/CRM integrations
Status	Template, ready to instantiate for client engagements

The challenge

Service businesses lose a meaningful percentage of after-hours inbound calls. Voicemail-to-callback has brutal drop-off rates: by the time someone calls back the next morning, the prospect has often booked elsewhere.

Hiring a 24/7 receptionist is uneconomic for the volume. Generic IVR ("press 1 for bookings, press 2 for...") feels worse than voicemail and converts worse too. Most "AI phone" products are still demo-stage — clunky pacing, can't be interrupted, can't actually book a real slot in a real calendar.

The demo target: a phone agent that picks up every call, books appointments into the actual calendar, qualifies new leads into the CRM, and hands off to a human during business hours when that's the right move.

What we built

A Twilio Voice number is routed to a Node.js service on Cloud Run that bridges Twilio's audio streams to OpenAI's GPT-4o Realtime model. The model has function-calling tools for the actual actions: find availability, book, qualify, transfer.

Voice loop

[Caller dials] → [Twilio Voice answers]
                       ↓
[Twilio Stream (audio in/out) → Cloud Run WebSocket]
                       ↓
[GPT-4o Realtime session with system prompt + tools]
                       ↓
[Audio out streamed back to Twilio → caller]
                       ↓
[Tool calls executed when model invokes them]
                       ↓
[Call end → transcript stored, outcome logged]

End-to-end latency from caller speech to agent response: under 2 seconds in our test environment. The model also supports proper barge-in (the agent stops speaking when the caller starts), which is the single biggest UX upgrade over 2020-era IVRs.
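The bridge step in the loop above is mostly message translation. The sketch below shows that translation as pure functions, with no sockets, so the mapping is easy to see. It is a minimal illustration, not the production bridge. The event names follow the Twilio Media Streams and OpenAI Realtime docs as we understand them and should be checked against the API version in use (newer Realtime versions may name the audio delta event differently). `BridgeState`, `toRealtime`, and `toTwilio` are hypothetical names. The sketch also assumes the Realtime session is configured for the same audio format Twilio streams, so payloads pass through untouched.

```typescript
// Messages Twilio sends over the media stream WebSocket (subset).
type TwilioInbound =
  | { event: "start"; start: { streamSid: string } }
  | { event: "media"; media: { payload: string } } // base64 audio
  | { event: "stop" };

// Events the Realtime session sends back (subset; names are assumptions to verify).
type RealtimeServerEvent =
  | { type: "response.audio.delta"; delta: string } // base64 audio
  | { type: "input_audio_buffer.speech_started" }
  | { type: "response.done" };

type RealtimeClientEvent = { type: "input_audio_buffer.append"; audio: string };

type TwilioOutbound =
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "clear"; streamSid: string };

// Hypothetical per-call state held by the bridge.
interface BridgeState {
  streamSid: string | null;
}

// Caller audio -> model.
function toRealtime(msg: TwilioInbound, state: BridgeState): RealtimeClientEvent[] {
  if (msg.event === "start") {
    state.streamSid = msg.start.streamSid; // needed to address audio back to this call
    return [];
  }
  if (msg.event === "media") {
    return [{ type: "input_audio_buffer.append", audio: msg.media.payload }];
  }
  return []; // "stop": nothing to forward
}

// Model events -> caller.
function toTwilio(evt: RealtimeServerEvent, state: BridgeState): TwilioOutbound[] {
  if (state.streamSid === null) return []; // stream not started yet
  if (evt.type === "response.audio.delta") {
    return [{ event: "media", streamSid: state.streamSid, media: { payload: evt.delta } }];
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    // Barge-in: tell Twilio to drop agent audio it has buffered but not yet played.
    return [{ event: "clear", streamSid: state.streamSid }];
  }
  return [];
}
```

A real bridge would wrap these functions in two WebSocket handlers and would likely also truncate or cancel the in-progress model response on barge-in; only the Twilio-side `clear` is shown here.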

Tools (function calling)

The agent has five tools that map intents to real actions:

Tool	Purpose
`findAvailability(date, durationMinutes)`	Query the calendar for open slots
`bookAppointment(slotId, customer)`	Create the calendar event, send confirmation
`qualifyLead(name, contact, intent, qualifyingNotes)`	Write lead to the CRM with structured fields
`transferToHuman(reason)`	Warm transfer during business hours; queue callback outside
`logCallOutcome(outcome, notes)`	Final tool — called at end of every call

Tool implementations are typed (Zod), idempotent (re-runs don't duplicate), and observable (every tool invocation logged with arguments, result, latency, cost).
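One way to get the three properties above in a single place is a wrapper around each tool. The sketch below is an assumption-laden illustration, not the shipped code: `makeIdempotentTool`, `parseBookingArgs`, the `Customer` shape, and the idempotency key are all hypothetical. A hand-written validator stands in for the Zod schema so the example has no dependencies, the cache is in-memory (a real service would persist it), and the cost field from the log is left out.

```typescript
interface ToolLogEntry {
  tool: string;
  args: unknown;
  result: unknown;
  latencyMs: number;
  replayed: boolean; // true when an earlier identical call was reused
}

interface Customer { name: string; phone: string }
interface BookingArgs { slotId: string; customer: Customer }
interface Booking { bookingId: string; slotId: string }

// Stand-in for a Zod schema: reject anything that is not a well-formed argument object.
function parseBookingArgs(raw: unknown): BookingArgs {
  const r = raw as { slotId?: unknown; customer?: { name?: unknown; phone?: unknown } } | null;
  if (!r || typeof r.slotId !== "string" || !r.customer ||
      typeof r.customer.name !== "string" || typeof r.customer.phone !== "string") {
    throw new Error("invalid bookAppointment arguments");
  }
  return { slotId: r.slotId, customer: { name: r.customer.name, phone: r.customer.phone } };
}

function makeIdempotentTool<A, R>(
  name: string,
  parse: (raw: unknown) => A,
  keyOf: (args: A) => string,       // idempotency key derived from the arguments
  run: (args: A) => Promise<R>,     // the real side effect (calendar write, CRM write, ...)
  log: ToolLogEntry[],
): (raw: unknown) => Promise<R> {
  const seen = new Map<string, Promise<R>>(); // in-memory only in this sketch
  return async (raw: unknown): Promise<R> => {
    const args = parse(raw);
    const key = keyOf(args);
    const started = Date.now();
    const existing = seen.get(key);
    const replayed = existing !== undefined;
    const pending = existing !== undefined ? existing : run(args);
    if (!replayed) {
      seen.set(key, pending);
      // Forget failed attempts so a retry can run the side effect again.
      pending.catch(() => seen.delete(key));
    }
    const result = await pending;
    log.push({ tool: name, args, result, latencyMs: Date.now() - started, replayed });
    return result;
  };
}
```

Storing the pending promise rather than the finished result means two overlapping invocations with the same key also share one side effect, which matters when the model retries a tool call mid-turn.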

System prompt

The system prompt defines:

Persona — friendly, brief, professional, never pretends to be human.
Opening line — discloses AI at call start.
Scope — what the agent handles vs. what it escalates.
Escalation triggers — explicit human request, repeated misunderstanding, sensitive topics, anything off-scope.
Brevity — no rambling. Short turns. Confirm critical details by reading back.
Confirmation — read back time/date/name before booking.
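The six elements above can be assembled from a small per-client config, which keeps the prompt reviewable and diffable between engagements. The following is a minimal sketch under that assumption: `PromptConfig` and `buildSystemPrompt` are hypothetical names, and the wording of each line is illustrative rather than the prompt used in the demo. Only the tool names come from the tool table.

```typescript
// Hypothetical per-client configuration.
interface PromptConfig {
  businessName: string;
  inScope: string[];            // what the agent handles itself
  escalationTriggers: string[]; // when to hand off to a human
}

function buildSystemPrompt(cfg: PromptConfig): string {
  return [
    // Persona
    `You are the AI phone assistant for ${cfg.businessName}. Be friendly, brief, and professional. Never claim to be human.`,
    // Opening line
    `Start every call by greeting the caller and saying that you are an AI assistant.`,
    // Scope
    `You handle: ${cfg.inScope.join(", ")}. Anything else is out of scope.`,
    // Escalation triggers
    `Call transferToHuman when: ${cfg.escalationTriggers.join("; ")}.`,
    // Brevity
    `Keep each turn short. Do not ramble.`,
    // Confirmation
    `Before calling bookAppointment, read back the date, time, and name, and wait for the caller to confirm.`,
    // Final tool
    `Call logCallOutcome at the end of every call.`,
  ].join("\n");
}

// Example with placeholder values.
const examplePrompt = buildSystemPrompt({
  businessName: "Example Dental",
  inScope: ["booking appointments", "opening hours"],
  escalationTriggers: ["the caller asks for a human", "repeated misunderstanding", "sensitive topics"],
});
```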

Recording + transcript

Every call is recorded (with disclosure at call start). Recordings are uploaded to Cloud Storage, and a transcript is generated post-call by Whisper in cases where the realtime transcript is insufficient. Both are stored in Firestore, keyed by call ID.

Observability dashboard

A one-page dashboard for the operator:

Calls per day, week, month
Outcome distribution (booked / qualified / transferred / failed)
Average call duration
Cost per call (Twilio + GPT-4o + tool calls)
Caller sentiment (sampled from transcripts)
Escalation rate trend
Latency p50 / p95 — time to first agent word, time to resolution

Plus per-call drilldown: full transcript, audio playback, tool calls made, outcome.
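Most of the dashboard numbers above are simple aggregates over per-call records. The sketch below shows one way to compute the outcome distribution, escalation rate, average duration, and latency percentiles. The `CallRecord` shape and function names are assumptions for illustration, and the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```typescript
type Outcome = "booked" | "qualified" | "transferred" | "failed";

// Hypothetical per-call record; field names are illustrative.
interface CallRecord {
  outcome: Outcome;
  firstWordMs: number; // time to first agent word
  durationSec: number;
}

interface DashboardSummary {
  outcomes: Record<Outcome, number>;
  escalationRate: number; // share of calls transferred to a human
  avgDurationSec: number;
  firstWordP50Ms: number;
  firstWordP95Ms: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

function summarize(calls: CallRecord[]): DashboardSummary {
  const outcomes: Record<Outcome, number> = { booked: 0, qualified: 0, transferred: 0, failed: 0 };
  let totalDuration = 0;
  for (const c of calls) {
    outcomes[c.outcome] += 1;
    totalDuration += c.durationSec;
  }
  const firstWord = calls.map((c) => c.firstWordMs);
  return {
    outcomes,
    escalationRate: calls.length > 0 ? outcomes.transferred / calls.length : 0,
    avgDurationSec: calls.length > 0 ? totalDuration / calls.length : 0,
    firstWordP50Ms: percentile(firstWord, 50),
    firstWordP95Ms: percentile(firstWord, 95),
  };
}
```

Reporting p95 alongside p50 matters for voice: a median under two seconds can hide a tail of calls where the caller waits long enough to hang up.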

Key features shipped

Twilio voice integration with WebSocket streaming.
GPT-4o Realtime as the conversation engine with sub-2s latency.
Five typed function-calling tools wired to real actions.
Calendar write-back on booking (Google Calendar, Outlook, Cal.com).
CRM logging of qualified leads (HubSpot, Salesforce, Pipedrive).
Warm transfer to humans during business hours; callback queue outside.
Recording + transcript on every call, stored in client cloud.
Eval harness that replays historic calls against new prompts/models.
Per-call cost attribution on the dashboard.
PII redaction in transcripts where required (configurable).
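The transfer rule in the list above (warm transfer during business hours, callback queue outside) comes down to evaluating the current time in the business's own timezone, not the server's. The sketch below is one possible shape for that decision. `BusinessHours`, `routeTransfer`, and the action names are hypothetical, and holidays and ad-hoc closures are not handled.

```typescript
// Hypothetical business-hours config; weekdays use the short English names Intl returns.
interface BusinessHours {
  timeZone: string;    // IANA zone, e.g. "Europe/Berlin"
  openHour: number;    // inclusive, local time
  closeHour: number;   // exclusive, local time
  openDays: string[];  // e.g. ["Mon", "Tue", "Wed", "Thu", "Fri"]
}

type TransferAction = "warm_transfer" | "queue_callback";

// Resolve the weekday and hour as seen on a wall clock in the given zone.
function localWeekdayAndHour(at: Date, timeZone: string): { weekday: string; hour: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    weekday: "short",
    hour: "2-digit",
    hour12: false,
  }).formatToParts(at);
  const get = (type: string): string => {
    const part = parts.find((p) => p.type === type);
    return part ? part.value : "";
  };
  // Some runtimes format midnight as "24" with hour12: false; normalise to 0.
  return { weekday: get("weekday"), hour: Number(get("hour")) % 24 };
}

function routeTransfer(at: Date, hours: BusinessHours): TransferAction {
  const { weekday, hour } = localWeekdayAndHour(at, hours.timeZone);
  const open = hours.openDays.includes(weekday) && hour >= hours.openHour && hour < hours.closeHour;
  return open ? "warm_transfer" : "queue_callback";
}
```

For example, with hours of 09:00 to 17:00 on weekdays in `Europe/Berlin`, a call at 09:30 UTC on a Tuesday in June lands at 11:30 local time and is transferred, while a call at 22:00 UTC the same day is queued for callback.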

Target outcomes for a typical engagement

Metric	Target
Coverage	24/7 — every call answered
Time to first agent word	<2 seconds
Booking conversion (after-hours)	~70–80% of qualified callers book
Cost per call	€0.10–€0.40 depending on length
Escalation rate (target)	<15% of calls warrant transfer
Recording + audit	100% of calls

What we learned

Disclosure builds trust. Pretending the AI is human breaks trust the moment something goes wrong. Disclosing at the start (briefly, naturally) actually improves caller comfort — most people prefer knowing what they're talking to.

Voice and pacing matter more than the model. GPT-4o Realtime gets us to "uncanny good" with the right voice tuning. The wrong voice (too neutral, too enthusiastic, wrong age/tone for the brand) breaks the spell instantly.

Read back the critical details. "I'm booking you in for Tuesday the 26th at 10am, is that correct?" — the cheapest insurance against costly mistakes.

Calendar integration is harder than the LLM part. Edge cases: holidays, business closures, double-booking, recurring slots, timezone conversions. The calendar tool is where most of the engineering goes.
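The core of the availability check can be sketched as interval arithmetic once everything is normalised to UTC. In the hypothetical example below, existing bookings, holidays, and closures are all represented as busy intervals, and a candidate slot is offered only if it overlaps none of them. `Interval` and `findOpenSlots` are illustrative names. Recurring slots and the conversion from local business hours to the UTC window are assumed to happen before this function is called.

```typescript
// Half-open interval [start, end) in epoch milliseconds (UTC).
interface Interval {
  start: number;
  end: number;
}

function findOpenSlots(
  window: Interval,        // bookable window, e.g. one day's opening hours in UTC
  busy: Interval[],        // existing events, holidays, closures
  durationMinutes: number,
  stepMinutes: number = 15,
): Interval[] {
  const duration = durationMinutes * 60000;
  const step = stepMinutes * 60000;
  const slots: Interval[] = [];
  for (let start = window.start; start + duration <= window.end; start += step) {
    const end = start + duration;
    // Two half-open intervals overlap when each starts before the other ends.
    const clash = busy.some((b) => start < b.end && b.start < end);
    if (!clash) slots.push({ start, end });
  }
  return slots;
}
```

Half-open intervals let back-to-back appointments touch without counting as a clash. The same overlap test should be repeated at booking time, since a slot that was free when offered may have been taken by the time the caller confirms.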

Recording is non-negotiable for ongoing tuning. Without recordings, the eval suite can't grow and the prompt can't improve. We refuse engagements where recording isn't permitted.
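To make the link between recordings and tuning concrete, here is a minimal sketch of a replay harness: each recorded call becomes a test case with the caller's turns and the outcome a reviewer marked as correct, and a candidate prompt or model is scored by how often it reaches that outcome. Everything here is an assumption for illustration. `RecordedCall`, `AgentUnderTest`, and `replayCalls` are hypothetical names, and the agent is modelled as a synchronous function, whereas a real replay would drive the model asynchronously.

```typescript
// One historic call turned into a test case.
interface RecordedCall {
  callId: string;
  callerTurns: string[];   // what the caller said, taken from the transcript
  expectedOutcome: string; // outcome a reviewer marked as correct
}

// Stand-in for "run this prompt/model against the caller's turns and return the logged outcome".
type AgentUnderTest = (callerTurns: string[]) => string;

interface ReplayReport {
  total: number;
  passed: number;
  passRate: number;
  failures: { callId: string; expected: string; actual: string }[];
}

function replayCalls(calls: RecordedCall[], agent: AgentUnderTest): ReplayReport {
  const failures: ReplayReport["failures"] = [];
  for (const call of calls) {
    const actual = agent(call.callerTurns);
    if (actual !== call.expectedOutcome) {
      failures.push({ callId: call.callId, expected: call.expectedOutcome, actual });
    }
  }
  const passed = calls.length - failures.length;
  return {
    total: calls.length,
    passed,
    passRate: calls.length > 0 ? passed / calls.length : 0,
    failures,
  };
}
```

Comparing the pass rate and the list of failing call IDs between the current prompt and a candidate gives a regression check before any change reaches live callers.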

Stack rationale

Twilio Voice because it has the cleanest streaming API for realtime audio and offers phone numbers in nearly every country we ship into.

GPT-4o Realtime because it's currently the only production-ready realtime voice model that handles barge-in correctly with reasonable latency and a wide voice library.

Cloud Run for the bridge service because it scales to zero (cheap when idle), handles WebSockets natively, and runs in the customer's cloud account where required.

Firestore + Cloud Storage for state, recordings, transcripts.

Where to go next

For the full technical walkthrough of how to wire Twilio + GPT-4o together, see our Building a phone agent post. For the buyer's perspective on voice AI, see our Voice AI buyer's guide.

If you have a call surface you'd like to automate, drop us a note. We'll come back within a business day.

Service

Voice & Phone AI Agents

AI receptionists, booking lines, and qualification calls — wired to your calendar, CRM, and ticketing.

Service

AI Agents Development

Custom agents that read documents, hold conversations, take phone calls, and execute multi-step workflows — wired into the systems you already run.

Agent type

Voice & Phone Agent

After-hours bookings, lead qualification, customer service overflow, FAQ lines

Article

Building a phone agent with Twilio + GPT-4o: a complete walkthrough

Build a phone agent: Twilio provisions the number and streams audio, a Node.js bridge on Cloud Run pipes the audio to GPT-4o Realtime, function-calling tools execute real actions (book appointment, log lead, transfer). Recording, transcript, and observability on every call. Production deployment in 3-6 weeks.

Article

Voice AI for service businesses: a buyer's guide

Voice AI works for service businesses with predictable call patterns and meaningful inbound volume. Booking, qualification, status, FAQ. Real cost ~€0.10-0.40/call. Real build cost €15-50k for a single-line deployment. Evaluate vendors on recording, escalation paths, and CRM integration — not on the demo.

Article

Why your AI chatbot fails (and what to fix)

Most chatbots that fail in production fail for one of six reasons: no retrieval, bad retrieval, no evals, no escalation, no observability, no scope. Tuning the prompt won't fix any of them. The fix is engineering — and the engineering is well-understood by now.

Have a similar problem?

A 30-minute call will tell us if there's a fit. No prep needed — just bring the messy version of the workflow.

Book a call
