What is an AI voice calling agent?

An AI voice calling agent is a system that answers or makes real phone calls and holds a spoken conversation end to end. It chains speech-to-text, an LLM for conversation logic, and text-to-speech into a real-time loop that handles interruptions, silence, and off-script questions, then hands off to a human when the call goes outside its scope.

How low does the latency need to be for a voice agent to feel natural?

End-to-end latency from the caller finishing a sentence to the agent starting its reply should stay under one second, ideally around 700–900ms. Past roughly 1.2 seconds the pause feels robotic and callers start talking over the agent. Streaming every stage and overlapping them is how you hit that budget.

Which platforms do you use to build voice agents?

I use Twilio for telephony, Vapi or Retell for orchestration, Deepgram for low-latency speech-to-text, an LLM such as GPT-4o for the conversation logic, and ElevenLabs for natural text-to-speech. The exact mix depends on call volume and use case, but that is the production stack I reach for.

What happens when a caller says something the agent can't handle?

A production agent has fallback handling built in — clarification prompts, graceful redirects, and a clean escalation to a human with full conversation context attached. It should never loop or crash; it either moves the call forward or hands off appropriately.


Building an AI Voice Calling Agent: A Complete 2026 Walkthrough

How to build an AI voice calling agent that holds real phone conversations — the STT to LLM to TTS pipeline, sub-second latency, interruption handling, and clean human handoff. Built from a live production system.

June 8, 2026 · 7 min read


There is a wide gap between a voice demo that works once on a quiet network and a voice agent that handles live calls all day without looping, crashing, or sounding like a 2019 phone menu. I have shipped the second kind — an AI voice calling agent that handles full inbound and outbound conversations end to end with sub-second latency and has not missed a call since it went live. This is how that system is actually built.

Quick answer: what an AI voice calling agent is

An AI voice calling agent is a real-time pipeline that turns a phone call into a conversation a machine can hold. Three stages run in a loop: speech-to-text (STT) transcribes the caller, an LLM decides what to say and what to do, and text-to-speech (TTS) speaks the reply. Wrapped around that loop are telephony (the actual phone connection), interruption handling, and a human-handoff path. The entire job is making those stages fast enough and reliable enough that a stranger on the phone never realizes how much is happening between their sentence ending and the reply starting.

The production stack

Here is the stack I build on and why each piece is there. None of it is exotic — the value is in how it is wired together.

Layer	Tool	Job
Telephony	Twilio	Connects the actual phone call, streams audio both ways
Orchestration	Vapi / Retell	Manages the real-time media loop and turn-taking
Speech-to-text	Deepgram	Streaming transcription with low latency and interim results
Conversation logic	GPT-4o	Decides replies, calls functions (booking, lookups)
Text-to-speech	ElevenLabs	Natural-sounding streamed voice output
Backend	FastAPI	Function-calling endpoints, calendar/CRM integration, logging

The pipeline, stage by stage

Stage 1: telephony and the media stream

Twilio answers the call and opens a bidirectional audio stream. The critical detail is streaming — you do not wait for the caller to finish and upload a file. Audio flows in continuously, and you process it as it arrives. This is the single decision that makes sub-second response possible.
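To make the streaming point concrete, here is a minimal sketch of handling one inbound message from a Twilio Media Streams WebSocket. It assumes you run the media loop yourself rather than through an orchestration platform, and that the stream uses Twilio's JSON message format with an "event" field and base64 audio under media.payload. The function name and the on_audio callback are hypothetical.

```python
import base64
import json


def handle_twilio_message(raw: str, on_audio) -> str:
    """Dispatch one Media Streams message and return its event type."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "media":
        # The payload is base64-encoded audio (8 kHz mu-law by default).
        # Forward it immediately instead of buffering the whole utterance.
        on_audio(base64.b64decode(msg["media"]["payload"]))
    return event
```

Each audio chunk is forwarded the moment it arrives, so the STT stage can start transcribing while the caller is still mid-sentence.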

Stage 2: streaming speech-to-text

Deepgram receives the audio stream and returns interim transcripts — partial guesses that update as the caller speaks — and a final transcript when it detects the end of an utterance. You use interim results to detect that the caller has started talking (for interruption handling) and the final transcript to trigger the LLM.
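One way to separate "the caller is talking" from "the caller has finished" is sketched below. The field names (is_final, speech_final, channel.alternatives[0].transcript) are assumptions modeled on Deepgram's streaming result shape, so check them against the current API reference. The UtteranceAssembler class is hypothetical.

```python
class UtteranceAssembler:
    """Collects finalized segments and emits one utterance at end of speech."""

    def __init__(self):
        self._parts = []

    def feed(self, result: dict):
        """Return ("interim", text), ("utterance", full_text), or None."""
        alternatives = result.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "").strip() if alternatives else ""
        if not result.get("is_final"):
            # Partial guess: useful as a "caller is speaking" signal only.
            return ("interim", text) if text else None
        if text:
            self._parts.append(text)
        if result.get("speech_final") and self._parts:
            # End of utterance: hand the joined text to the LLM.
            full_text = " ".join(self._parts)
            self._parts = []
            return ("utterance", full_text)
        return None
```

Interim events feed interruption handling; only the "utterance" event should trigger an LLM request.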

End-of-speech detection is where naive builds fail. Wait too long and the agent feels slow; trigger too early and you cut the caller off. I tune an endpointing window — typically 300–500ms of silence — and adjust it per use case. A booking flow where people pause to check a calendar needs a longer window than a fast Q&A line.
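The trade-off in the endpointing window is easier to see in code. This is a simplified sketch, not the production detector: it assumes something upstream labels each audio frame as speech or silence, and the Endpointer class and its parameter names are hypothetical. Hosted STT providers usually expose the same idea as a configuration setting.

```python
class Endpointer:
    """Fires once when silence following speech reaches the window."""

    def __init__(self, silence_window_ms: int = 400):
        self.silence_window_ms = silence_window_ms
        self._heard_speech = False
        self._last_speech_ms = 0

    def feed(self, now_ms: int, is_speech: bool) -> bool:
        """Return True exactly once per utterance, at end of speech."""
        if is_speech:
            self._heard_speech = True
            self._last_speech_ms = now_ms
            return False
        silent_for = now_ms - self._last_speech_ms
        if self._heard_speech and silent_for >= self.silence_window_ms:
            self._heard_speech = False  # re-arm for the next utterance
            return True
        return False
```

A booking flow might construct Endpointer(silence_window_ms=500) so a caller checking a calendar is not cut off, while a fast Q&A line could use 300.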

Stage 3: the LLM with function calling

The transcript hits GPT-4o with a system prompt that defines the call flow, plus a set of functions the model can call: check calendar availability, qualify a lead against criteria, look up an order, escalate to a human. This is what separates a chatbot read aloud from an agent that does things.

# Tool schemas passed to the model alongside the system prompt.
tools = [
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book a slot against live calendar availability.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "time": {"type": "string"},
                    "name": {"type": "string"},
                },
                "required": ["date", "time", "name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Hand off when the request is outside scope.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                # Require a reason so the human picking up knows why.
                "required": ["reason"],
            },
        },
    },
]
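On the backend side, each tool call has to be routed to real code. The sketch below shows one way to do that, assuming the model returns a tool name plus its arguments as a JSON string, which is how the Chat Completions tool-call format delivers them. The handler bodies are placeholders and dispatch_tool_call is a hypothetical name.

```python
import json


def book_appointment(date: str, time: str, name: str) -> dict:
    # Placeholder: a real handler would write to the calendar system.
    return {"status": "booked", "date": date, "time": time, "name": name}


def escalate_to_human(reason: str = "unspecified") -> dict:
    # Placeholder: a real handler would trigger the transfer.
    return {"status": "escalated", "reason": reason}


HANDLERS = {
    "book_appointment": book_appointment,
    "escalate_to_human": escalate_to_human,
}


def dispatch_tool_call(name: str, arguments_json: str) -> dict:
    """Run the named tool; return an error dict instead of raising."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"status": "error", "error": f"unknown tool: {name}"}
    try:
        return handler(**json.loads(arguments_json or "{}"))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or missing/unexpected arguments from the model.
        return {"status": "error", "error": str(exc)}
```

Returning an error dict rather than raising lets the model see what went wrong and re-ask the caller, which matters on a live call where an unhandled exception means dead air.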

The LLM response is streamed token by token, not awaited in full. The moment the first sentence is complete, it goes to TTS. You never wait for the whole reply before speaking the start of it.
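A minimal sketch of that sentence-level hand-off follows. It assumes the token stream is any iterable of text fragments; the function name and the punctuation-based boundary rule are illustrative choices, not the production logic.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream produces them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_BREAK.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


tokens = ["Sure", ",", " I can", " help", ".", " What", " day", " works", "?", " Morning", " or afternoon"]
chunks = list(sentences_from_tokens(tokens))
# chunks == ["Sure, I can help.", "What day works?", "Morning or afternoon"]
```

This simple rule splits early on abbreviations such as "Dr. Smith", so a real system would need a more careful boundary check before sending text to TTS.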

Stage 4: streaming text-to-speech

ElevenLabs receives the streamed text and returns streamed audio, which Twilio plays back to the caller. Because both the LLM output and the TTS are streamed, the agent starts speaking while it is still finishing the sentence in its head — exactly like a person does.
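If you handle playback yourself instead of through an orchestration platform, the return path looks roughly like the sketch below. It assumes the TTS audio is already 8 kHz mu-law, and that outbound audio goes back over the Twilio Media Streams WebSocket as "media" messages carrying base64 payloads. The function name and the 20 ms frame size are illustrative choices.

```python
import base64
import json
from typing import Iterator

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio


def twilio_media_messages(stream_sid: str, mulaw_audio: bytes) -> Iterator[str]:
    """Split synthesized audio into small frames, one JSON message each."""
    for start in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[start:start + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
```

Sending small frames as they are synthesized means playback can begin with the first chunk, and there is less queued audio to discard if the caller interrupts.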

Hitting the sub-second latency budget

Natural conversation lives under one second of turn-taking delay. Here is the budget I work to and how each piece is kept small:

Stage	Target	How
End-of-speech detection	300–500ms	Tuned endpointing on interim transcripts
LLM time-to-first-token	200–400ms	Streaming, tight prompts, fast model
TTS time-to-first-audio	100–200ms	Streamed synthesis, no full-reply wait
Total perceived gap	~700–900ms	Overlap stages instead of running them in series

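The core trick is overlap, not sequence. STT, LLM, and TTS are not three steps you wait through one after another; they are three streams running concurrently, each starting the moment it has enough input. The per-stage targets are ranges: a turn that hits the slow end of all three (500 + 400 + 200 = 1,100ms) blows the budget, so the ~700–900ms total assumes most turns land mid-range or better. Building it as a sequential request/response chain is the most common reason a voice agent feels two seconds slow.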
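The overlap can be demonstrated with a toy pipeline. In this sketch both stages are simulated with short sleeps rather than real API calls, and all names are hypothetical. The point is only the shape: the stages are connected by a queue and run concurrently, so TTS starts on the first sentence while the LLM is still generating the rest.

```python
import asyncio


async def llm_stage(transcript: str, out: asyncio.Queue, events: list) -> None:
    # Stand-in for a streaming LLM that emits one sentence at a time.
    for sentence in (f"You asked about {transcript}.", "Let me check.", "Tuesday is open."):
        await asyncio.sleep(0.01)  # simulated generation time
        events.append(("llm", sentence))
        await out.put(sentence)
    await out.put(None)  # end-of-stream marker


async def tts_stage(inp: asyncio.Queue, events: list) -> None:
    # Stand-in for streaming TTS: starts on each sentence as it arrives.
    while True:
        sentence = await inp.get()
        if sentence is None:
            break
        events.append(("tts", sentence))
        await asyncio.sleep(0.01)  # simulated synthesis time


async def run_turn(transcript: str) -> list:
    events: list = []
    queue: asyncio.Queue = asyncio.Queue()
    # Both stages run at once; the queue is the only coupling between them.
    await asyncio.gather(llm_stage(transcript, queue, events), tts_stage(queue, events))
    return events


events = asyncio.run(run_turn("Tuesday"))
```

In the recorded events, the first "tts" entry appears before the last "llm" entry. A sequential chain would only show "tts" entries after every "llm" entry.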

Handling the messy parts of real calls

Demos handle the happy path. Production handles the rest.

Interruptions (barge-in). When the caller starts talking while the agent is speaking, you detect it from interim transcripts, stop TTS playback immediately, and feed the new input to the LLM. An agent that talks over the caller is instantly unbearable.
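A barge-in handler can be sketched as a small controller around the playback task. This is an illustration under assumptions: playback is modeled as an asyncio task, send_clear is a hypothetical callback that flushes audio already queued on the telephony side (with Twilio Media Streams that would likely be a "clear" message), and the class name is made up.

```python
import asyncio


class BargeInController:
    """Stops agent playback as soon as the caller starts speaking."""

    def __init__(self, send_clear):
        self._send_clear = send_clear  # flushes audio queued for playback
        self._playback = None

    def start_speaking(self, playback_coro) -> None:
        # Must be called from inside a running event loop.
        self._playback = asyncio.create_task(playback_coro)

    def on_interim_transcript(self, text: str) -> bool:
        """Return True if this transcript interrupted the agent."""
        speaking = self._playback is not None and not self._playback.done()
        if text.strip() and speaking:
            self._playback.cancel()  # stop generating more audio
            self._send_clear()       # drop audio already sent but not yet played
            return True
        return False
```

Both steps matter: cancelling the task stops new audio, while the clear call discards what is already buffered downstream. The caller's new input then goes to the LLM as the next turn.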

Silence and dead air. If the caller goes quiet, the agent should prompt gently ("Are you still there?") rather than hang in silence or hang up.
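Dead-air handling reduces to a timer with a retry limit. The sketch below is one possible shape. The 6-second threshold, the two-prompt limit, and the "wrap_up" outcome (closing politely or handing off after repeated silence) are assumptions for illustration, not values from the production system.

```python
class SilenceMonitor:
    """Decides when to prompt a quiet caller and when to stop prompting."""

    def __init__(self, prompt_after_s: float = 6.0, max_prompts: int = 2):
        self.prompt_after_s = prompt_after_s
        self.max_prompts = max_prompts
        self._last_activity_s = 0.0
        self._prompts_sent = 0

    def activity(self, now_s: float) -> None:
        # Call whenever the caller speaks; resets the timer and the count.
        self._last_activity_s = now_s
        self._prompts_sent = 0

    def check(self, now_s: float) -> str:
        """Return "wait", "prompt", or "wrap_up"."""
        if now_s - self._last_activity_s < self.prompt_after_s:
            return "wait"
        if self._prompts_sent >= self.max_prompts:
            return "wrap_up"
        self._prompts_sent += 1
        self._last_activity_s = now_s  # restart the timer after prompting
        return "prompt"
```

On "prompt" the agent says something like "Are you still there?"; the retry limit keeps it from repeating that line indefinitely.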

Off-script questions. The system prompt defines the flow, but callers ask anything. The LLM either answers from its configured knowledge, asks a clarifying question, or escalates. It must never loop — the cardinal sin of voice agents.

Clean human handoff. When the agent escalates, it passes the full conversation context to the human so the caller never has to repeat themselves. A handoff that drops context is worse than no handoff.
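What "full conversation context" might contain is sketched below as a serializable payload. Every field name here is hypothetical; the real shape depends on the CRM or helpdesk the human agent works in.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Turn:
    speaker: str  # "caller" or "agent"
    text: str


@dataclass
class HandoffContext:
    call_id: str
    reason: str                                      # why the agent escalated
    collected: dict = field(default_factory=dict)    # details already gathered
    transcript: list = field(default_factory=list)   # list of Turn objects

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, including Turns in the list.
        return json.dumps(asdict(self))


context = HandoffContext(
    call_id="example-call-1",
    reason="billing dispute",
    collected={"name": "Ana"},
    transcript=[Turn("caller", "I was charged twice."), Turn("agent", "Let me get someone to help.")],
)
payload = context.to_json()
```

Sending the reason and the collected details alongside the transcript lets the human open with the caller's actual problem instead of asking them to start over.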

A note on disclosure and jurisdiction

Whether the agent tells callers it is AI is a real decision, not a technical afterthought. Some jurisdictions require disclosure; some use cases benefit from it; some flow better without making a point of it. Decide this deliberately and check the rules where your calls land — I advise clients on this as part of scoping, because getting it wrong is a legal problem, not a UX one.

What the result looks like in practice

The agent I shipped greets callers, asks qualifying questions from the configured flow, books appointments against live calendar availability, and routes to a human only when the conversation genuinely needs one. Sub-second latency keeps it feeling natural, and it handles 70–90% of routine call volume without a person on the line. The bar was never "does the demo work" — it was "does it survive a full day of real callers." It does.

The takeaway

Building an AI voice calling agent is an exercise in latency and failure handling, not model selection. Stream every stage, overlap them so the perceived gap stays under a second, and spend most of your engineering on the unhappy paths — interruptions, silence, off-script questions, and a handoff that keeps context. Get those right and the conversation feels human. Get them wrong and no amount of model quality saves you.

Got predictable call volume a machine could handle — booking, qualification, FAQs? That is what I build. See Voice AI Agents or book a scope call.


Ayaan Motiwala

AI Specialist in Surat. I ship multi-LLM systems, voice agents, and automations that survive real users — and write about what breaks along the way.