Building an AI Voice Calling Agent: A Complete 2026 Walkthrough
How to build an AI voice calling agent that holds real phone conversations — the STT to LLM to TTS pipeline, sub-second latency, interruption handling, and clean human handoff. Built from a live production system.

There is a wide gap between a voice demo that works once on a quiet network and a voice agent that handles live calls all day without looping, crashing, or sounding like a 2019 phone menu. I have shipped the second kind — an AI voice calling agent that handles full inbound and outbound conversations end to end with sub-second latency and has not missed a call since it went live. This is how that system is actually built.
Quick answer: what an AI voice calling agent is
An AI voice calling agent is a real-time pipeline that turns a phone call into a conversation a machine can hold. Three stages run in a loop: speech-to-text (STT) transcribes the caller, an LLM decides what to say and what to do, and text-to-speech (TTS) speaks the reply. Wrapped around that loop are telephony (the actual phone connection), interruption handling, and a human-handoff path. The entire job is making those stages fast enough and reliable enough that a stranger on the phone never realizes how much is happening between their sentence ending and the reply starting.
The production stack
Here is the stack I build on and why each piece is there. None of it is exotic — the value is in how it is wired together.
| Layer | Tool | Job |
|---|---|---|
| Telephony | Twilio | Connects the actual phone call, streams audio both ways |
| Orchestration | Vapi / Retell | Manages the real-time media loop and turn-taking |
| Speech-to-text | Deepgram | Streaming transcription with low latency and interim results |
| Conversation logic | GPT-4o | Decides replies, calls functions (booking, lookups) |
| Text-to-speech | ElevenLabs | Natural-sounding streamed voice output |
| Backend | FastAPI | Function-calling endpoints, calendar/CRM integration, logging |
The pipeline, stage by stage
Stage 1: telephony and the media stream
Twilio answers the call and opens a bidirectional audio stream. The critical detail is streaming — you do not wait for the caller to finish and upload a file. Audio flows in continuously, and you process it as it arrives. This is the single decision that makes sub-second response possible.
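To make "process it as it arrives" concrete, here is a minimal sketch of pulling audio out of one incoming stream message. It assumes the JSON message shape used by Twilio Media Streams (an `event` field, with base64 audio under `media.payload`); the function name `audio_from_twilio_message` is hypothetical, and if an orchestration layer such as Vapi or Retell owns the media loop you may never touch this level yourself.

```python
import base64
import json
from typing import Optional


def audio_from_twilio_message(raw: str) -> Optional[bytes]:
    """Return the decoded audio bytes carried by one stream message.

    Returns None for non-audio events such as "connected", "start" or "stop".
    """
    message = json.loads(raw)
    if message.get("event") != "media":
        return None
    # Assumption: each chunk arrives base64-encoded (8 kHz mu-law, mono).
    return base64.b64decode(message["media"]["payload"])


# Example: one small frame, handled the moment it arrives.
frame = json.dumps(
    {"event": "media", "media": {"payload": base64.b64encode(b"\xff\x7f").decode()}}
)
audio = audio_from_twilio_message(frame)
```

Each decoded chunk would be forwarded to the transcriber immediately rather than buffered into a file, which is the point of the streaming design.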
Stage 2: streaming speech-to-text
Deepgram receives the audio stream and returns interim transcripts — partial guesses that update as the caller speaks — and a final transcript when it detects the end of an utterance. You use interim results to detect that the caller has started talking (for interruption handling) and the final transcript to trigger the LLM.
End-of-speech detection is where naive builds fail. Wait too long and the agent feels slow; trigger too early and you cut the caller off. I tune an endpointing window — typically 300–500ms of silence — and adjust it per use case. A booking flow where people pause to check a calendar needs a longer window than a fast Q&A line.
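The endpointing trade-off can be sketched as a small state machine. This is an illustration of the logic only, not how Deepgram implements it: the `Endpointer` class, the 20 ms frame size, and the 400 ms default are all assumptions, and in practice a hosted STT service usually exposes the window as a setting.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Fires once the caller has been silent for `silence_ms` after speaking."""

    silence_ms: int = 400  # assumed default, inside the 300-500 ms range
    _heard_speech: bool = False
    _silent_for_ms: int = 0

    def feed(self, is_speech: bool, frame_ms: int = 20) -> bool:
        """Consume one audio frame; return True when the utterance has ended."""
        if is_speech:
            self._heard_speech = True
            self._silent_for_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence: there is no utterance to end yet
        self._silent_for_ms += frame_ms
        if self._silent_for_ms >= self.silence_ms:
            # Reset so the next utterance starts from a clean state.
            self._heard_speech = False
            self._silent_for_ms = 0
            return True
        return False


# A booking flow might use a longer window than a fast Q&A line.
booking = Endpointer(silence_ms=500)
faq = Endpointer(silence_ms=300)
```

Raising `silence_ms` makes the agent more patient with pauses at the cost of a slower reply; lowering it does the opposite.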
Stage 3: the LLM with function calling
The transcript hits GPT-4o with a system prompt that defines the call flow, plus a set of functions the model can call: check calendar availability, qualify a lead against criteria, look up an order, escalate to a human. This is what separates a chatbot read aloud from an agent that does things.
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book a slot against live calendar availability.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "time": {"type": "string"},
                    "name": {"type": "string"},
                },
                "required": ["date", "time", "name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Hand off when the request is outside scope.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
            },
        },
    },
]
```
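When the model decides to call one of these functions, the backend has to route the call to real code. The following is a minimal sketch of that routing, assuming the model hands back a function name plus its arguments as a JSON string. The handlers are hypothetical stubs, and `dispatch_tool_call` is an illustrative name rather than part of any SDK.

```python
import json
from typing import Any, Callable, Dict


def book_appointment(date: str, time: str, name: str) -> Dict[str, Any]:
    # Hypothetical stub: a real handler would call the calendar backend.
    return {"status": "booked", "date": date, "time": time, "name": name}


def escalate_to_human(reason: str = "unspecified") -> Dict[str, Any]:
    # Hypothetical stub: a real handler would start the transfer.
    return {"status": "escalated", "reason": reason}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "book_appointment": book_appointment,
    "escalate_to_human": escalate_to_human,
}


def dispatch_tool_call(name: str, arguments_json: str) -> Dict[str, Any]:
    """Run the handler for one tool call and return a JSON-serializable result."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"status": "error", "error": f"unknown tool: {name}"}
    try:
        arguments = json.loads(arguments_json or "{}")
        return handler(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or missing/unexpected arguments: report the problem
        # back to the model instead of crashing the call.
        return {"status": "error", "error": str(exc)}


result = dispatch_tool_call(
    "book_appointment", '{"date": "2026-03-02", "time": "10:00", "name": "Sam"}'
)
```

Returning an error object instead of raising lets the model recover mid-call, for example by asking the caller for the missing detail.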
The LLM response is streamed token by token, not awaited in full. The moment the first sentence is complete, it goes to TTS. You never wait for the whole reply before speaking the start of it.
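One way to hand off "the moment the first sentence is complete" is to buffer streamed tokens and emit a sentence whenever a boundary appears. This is a sketch under simple assumptions: a sentence ends at `.`, `!` or `?` followed by whitespace, and `sentences_from_tokens` is a hypothetical helper name. Abbreviations such as "Dr. Smith" would be split too early, so a production splitter needs more care.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream finishes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


streamed = ["Sure", ", I can", " help.", " What", " day", " works?"]
sentences = list(sentences_from_tokens(streamed))
```

Here the first sentence is released as soon as the token after it arrives, so TTS can start on "Sure, I can help." while the model is still generating the question that follows.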
Stage 4: streaming text-to-speech
ElevenLabs receives the streamed text and returns streamed audio, which Twilio plays back to the caller. Because both the LLM output and the TTS are streamed, the agent starts speaking while it is still finishing the sentence in its head — exactly like a person does.
Hitting the sub-second latency budget
Natural conversation lives under one second of turn-taking delay. Here is the budget I work to and how each piece is kept small:
| Stage | Target | How |
|---|---|---|
| End-of-speech detection | 300–500ms | Tuned endpointing on interim transcripts |
| LLM time-to-first-token | 200–400ms | Streaming, tight prompts, fast model |
| TTS time-to-first-audio | 100–200ms | Streamed synthesis, no full-reply wait |
| Total perceived gap | ~700–900ms | Overlap stages instead of running them in series |
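A quick arithmetic check on the table shows why overlap matters. The figures below are copied from the stage rows above; treating them as strictly serial is my assumption for the sake of comparison, not a measurement.

```python
# Per-stage targets in milliseconds (low, high), copied from the table.
BUDGET_MS = {
    "end_of_speech_detection": (300, 500),
    "llm_time_to_first_token": (200, 400),
    "tts_time_to_first_audio": (100, 200),
}


def serial_gap_ms(budget):
    """Sum the low and high ends as if every stage ran strictly in series."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = serial_gap_ms(BUDGET_MS)
print(f"serial sum: {low}-{high} ms")  # prints "serial sum: 600-1100 ms"
```

Run in series at the upper end of every target, the stages add up to 1100 ms, which is over the one-second line. Landing in the ~700–900ms range therefore depends on stages starting before the previous one has fully finished.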
The core trick is overlap, not sequence. STT, LLM, and TTS are not three steps you wait through one after another — they are three streams running concurrently, each starting the moment it has enough input. Building it as a sequential request/response chain is the most common reason a voice agent feels two seconds slow.
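The difference between a chain and concurrent streams can be sketched with `asyncio` queues. Both stages here are stand-ins with simulated delays, not real LLM or TTS clients, and every name (`llm_stage`, `tts_stage`, `run_turn`) is hypothetical.

```python
import asyncio

DONE = object()  # sentinel marking the end of a stream


async def llm_stage(transcript: str, sentences: asyncio.Queue) -> None:
    """Stand-in for a streaming LLM: emits one sentence at a time."""
    for sentence in (f"You said: {transcript}.", "Let me check that."):
        await asyncio.sleep(0.01)  # simulated generation time
        await sentences.put(sentence)
    await sentences.put(DONE)


async def tts_stage(sentences: asyncio.Queue, audio: asyncio.Queue) -> None:
    """Stand-in for streaming TTS: synthesizes each sentence as it arrives."""
    while (sentence := await sentences.get()) is not DONE:
        await asyncio.sleep(0.01)  # simulated synthesis time
        await audio.put(f"<audio:{sentence}>")
    await audio.put(DONE)


async def run_turn(transcript: str) -> list:
    sentences: asyncio.Queue = asyncio.Queue()
    audio: asyncio.Queue = asyncio.Queue()
    # Both stages run at once: TTS starts on sentence one while the
    # LLM stand-in is still producing sentence two.
    stages = asyncio.gather(
        llm_stage(transcript, sentences), tts_stage(sentences, audio)
    )
    played = []
    while (chunk := await audio.get()) is not DONE:
        played.append(chunk)  # a real system would stream this to the call
    await stages
    return played


chunks = asyncio.run(run_turn("book me for Tuesday"))
```

The first audio chunk is available after one generation step plus one synthesis step, not after the whole reply has been generated and then synthesized.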
Handling the messy parts of real calls
Demos handle the happy path. Production handles the rest.
Interruptions (barge-in). When the caller starts talking while the agent is speaking, you detect it from interim transcripts, stop TTS playback immediately, and feed the new input to the LLM. An agent that talks over the caller is instantly unbearable.
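A minimal sketch of barge-in, assuming playback runs as an `asyncio` task that can be cancelled when a non-empty interim transcript arrives. The `Playback` class and its timings are hypothetical stand-ins for the real audio path.

```python
import asyncio


class Playback:
    """Tracks the agent's current speech so it can be cut off mid-sentence."""

    def __init__(self) -> None:
        self._task = None
        self.spoken = []

    async def _speak(self, chunks) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # simulated time to play one audio chunk
            self.spoken.append(chunk)

    def start(self, chunks) -> None:
        self._task = asyncio.create_task(self._speak(chunks))

    async def on_interim_transcript(self, text: str) -> bool:
        """Stop playback if the caller said anything; return True if it was cut."""
        if text.strip() and self._task and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected: playback was interrupted on purpose
            return True
        return False


async def demo() -> tuple:
    playback = Playback()
    playback.start(["Your", "appointment", "is", "on", "Tuesday"])
    await asyncio.sleep(0.025)  # caller starts talking mid-reply
    cut = await playback.on_interim_transcript("actually, wait")
    return cut, playback.spoken


cut, spoken = asyncio.run(demo())
```

After the cut, the new caller input goes to the LLM as the next turn, and the unspoken remainder of the old reply is discarded.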
Silence and dead air. If the caller goes quiet, the agent should prompt gently ("Are you still there?") rather than hang in silence or hang up.
Off-script questions. The system prompt defines the flow, but callers ask anything. The LLM either answers from its configured knowledge, asks a clarifying question, or escalates. It must never loop — the cardinal sin of voice agents.
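One simple guard against looping is to track recent replies and escalate when the agent is about to say the same thing again. This is a sketch, not a description of the shipped system: the `LoopGuard` class, the repeat threshold, and the window size are all assumptions.

```python
from collections import deque


class LoopGuard:
    """Flags when the agent keeps producing the same reply."""

    def __init__(self, max_repeats: int = 2, window: int = 6) -> None:
        self.max_repeats = max_repeats  # assumed threshold
        self._recent = deque(maxlen=window)  # last few normalized replies

    def should_escalate(self, reply: str) -> bool:
        """Record a reply; return True once it has been said too many times."""
        normalized = " ".join(reply.lower().split())
        self._recent.append(normalized)
        return self._recent.count(normalized) > self.max_repeats


guard = LoopGuard()
flags = [guard.should_escalate("Sorry, could you repeat that?") for _ in range(3)]
```

With the defaults, the third identical reply trips the guard, at which point the agent would escalate instead of asking the same question again.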
Clean human handoff. When the agent escalates, it passes the full conversation context to the human so the caller never has to repeat themselves. A handoff that drops context is worse than no handoff.
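What "full conversation context" might look like as a payload is sketched below. Every field name here is hypothetical; the real shape depends on what the receiving CRM or agent desktop expects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class HandoffContext:
    """Everything a human needs to pick up the call without re-asking."""

    call_id: str
    reason: str
    transcript: List[Dict[str, str]] = field(default_factory=list)
    collected: Dict[str, str] = field(default_factory=dict)  # e.g. name, date

    def add_turn(self, speaker: str, text: str) -> None:
        self.transcript.append({"speaker": speaker, "text": text})

    def to_json(self) -> str:
        return json.dumps(asdict(self))


context = HandoffContext(call_id="call-001", reason="billing dispute")
context.add_turn("caller", "I was charged twice for my appointment.")
context.add_turn("agent", "I am connecting you with someone who can fix that.")
context.collected["name"] = "Sam"
payload = context.to_json()
```

Sending the structured fields alongside the raw transcript lets the human see the key facts at a glance without reading the whole exchange.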
A note on disclosure and jurisdiction
Whether the agent tells callers it is AI is a real decision, not a technical afterthought. Some jurisdictions require disclosure; some use cases benefit from it; some flow better without making a point of it. Decide this deliberately and check the rules where your calls land — I advise clients on this as part of scoping, because getting it wrong is a legal problem, not a UX one.
What the result looks like in practice
The agent I shipped greets callers, asks qualifying questions from the configured flow, books appointments against live calendar availability, and routes to a human only when the conversation genuinely needs one. Sub-second latency keeps it feeling natural, and it handles 70–90% of routine call volume without a person on the line. The bar was never "does the demo work" — it was "does it survive a full day of real callers." It does.
The takeaway
Building an AI voice calling agent is an exercise in latency and failure handling, not model selection. Stream every stage, overlap them so the perceived gap stays under a second, and spend most of your engineering on the unhappy paths — interruptions, silence, off-script questions, and a handoff that keeps context. Get those right and the conversation feels human. Get them wrong and no amount of model quality saves you.
Ayaan Motiwala
AI Specialist in Surat. I ship multi-LLM systems, voice agents, and automations that survive real users — and write about what breaks along the way.