Voice Agent Latency: Making Conversations Feel Natural
Why Latency Matters
In normal human conversation, the gap between one person finishing and another beginning to speak averages 200 to 300 milliseconds. Humans are extremely sensitive to conversational timing because we have spent our entire lives calibrating expectations about how conversations flow. When a voice agent introduces pauses that exceed natural conversational norms, callers notice immediately.
The effects of high latency cascade through the conversation. Initial pauses create uncertainty about whether the agent heard the caller. Repeated pauses make the caller speak more slowly and carefully, trying to "help" the system understand. Long silences prompt callers to repeat themselves, creating overlapping speech that confuses the agent further. Eventually, latency degrades the experience enough that callers request a human agent or hang up entirely.
Research on conversational AI shows that user satisfaction drops sharply as latency increases. Satisfaction remains high below 500 milliseconds, drops moderately between 500 and 800 milliseconds, and drops significantly above 1,000 milliseconds. For enterprise deployments where call completion rates and customer satisfaction directly affect revenue, latency optimization is a business-critical concern.
The Latency Budget
Total voice agent latency is the sum of five components: endpointing delay (detecting that the caller has finished speaking), ASR processing time, LLM inference time, TTS generation time, and network transit time between components. Each component contributes a portion of the total budget.
Endpointing delay is the time the system waits after the last detected speech to confirm the caller has finished. This is typically 400 to 800 milliseconds of silence. Aggressive endpointing (shorter silence threshold) reduces latency but risks cutting off the caller. Conservative endpointing (longer threshold) avoids cutoffs but adds perceivable delay. Adaptive endpointing adjusts the threshold based on conversation context, using shorter thresholds for expected short responses (yes/no) and longer thresholds for open-ended questions.
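A minimal sketch of adaptive endpointing, assuming the dialog manager can label the kind of answer it expects. The labels, function names, and threshold values here are illustrative assumptions picked from within the 400 to 800 millisecond range above, not settings from any particular platform.

```python
# Hypothetical silence thresholds (ms) keyed by the kind of answer expected.
THRESHOLDS_MS = {
    "yes_no": 400,      # short, predictable answers: endpoint aggressively
    "open_ended": 800,  # callers pause to think: endpoint conservatively
}
DEFAULT_THRESHOLD_MS = 600  # assumed middle ground when context is unknown


def endpoint_threshold_ms(expected_answer: str) -> int:
    """Return how much trailing silence to wait for before ending the turn."""
    return THRESHOLDS_MS.get(expected_answer, DEFAULT_THRESHOLD_MS)


def turn_is_over(silence_ms: int, expected_answer: str) -> bool:
    """True once trailing silence reaches the threshold for this context."""
    return silence_ms >= endpoint_threshold_ms(expected_answer)
```

With these assumed values, 450 milliseconds of silence ends the turn after a yes/no question but not after an open-ended one.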
ASR processing adds 50 to 200 milliseconds depending on the provider and whether streaming mode is used. In streaming mode, most of the transcription happens while the caller is still speaking, so the additional delay after the endpoint is minimal. In batch mode, the entire utterance must be processed after the caller finishes, adding the full transcription time to the latency budget.
LLM inference is typically the largest processing contributor, adding 200 to 600 milliseconds depending on model size, prompt complexity, and infrastructure; only the endpointing wait is comparable or larger. Time-to-first-token (TTFT) is the critical metric because streaming output lets the TTS stage begin processing before the full response is generated. Smaller models on dedicated GPU instances achieve TTFT under 200 milliseconds. Larger models or shared infrastructure may exceed 500 milliseconds.
TTS generation adds 50 to 200 milliseconds before the first audio begins playing, measured as time to first byte (TTFB). Streaming TTS systems start producing audio from the first words of the response while later words are still being synthesized. The combination of streaming LLM output and streaming TTS synthesis creates a pipeline where audio begins playing very soon after the LLM starts generating its response.
Network transit between components adds 10 to 100 milliseconds depending on geographic distribution. Co-locating all components in the same data center or cloud region minimizes this to 10 to 20 milliseconds. Distributed deployments where ASR, LLM, and TTS run on different providers in different regions can add 50 to 100 milliseconds of cumulative network transit.
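To see how the component ranges add up, the short calculation below sums the low and high ends of each range quoted in this section. The stage keys are just labels chosen for the example.

```python
# (low, high) in milliseconds, taken from the ranges given in this section.
BUDGET_MS = {
    "endpointing": (400, 800),
    "asr": (50, 200),
    "llm_ttft": (200, 600),
    "tts_ttfb": (50, 200),
    "network": (10, 100),
}

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())

print(f"best case:  {best_case} ms")
print(f"worst case: {worst_case} ms")
for stage, (low, high) in BUDGET_MS.items():
    # Share of the total that each stage accounts for at either extreme.
    print(f"{stage:12s} {low / best_case:6.1%} of best, {high / worst_case:6.1%} of worst")
```

With these figures the best case totals 710 milliseconds and the worst case 1,900 milliseconds. Endpointing is the largest single share at both extremes, followed by LLM time-to-first-token.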
Optimization Strategies
Streaming everywhere is the single most impactful optimization. Using streaming ASR, streaming LLM output, and streaming TTS creates a pipeline where each stage begins processing as soon as the first data arrives from the previous stage, rather than waiting for complete output. This overlapping execution can reduce perceived latency by 500 milliseconds or more compared to sequential processing.
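One common way to overlap the LLM and TTS stages is to forward text to the synthesizer sentence by sentence instead of waiting for the full response. The sketch below assumes the LLM output arrives as an iterable of text fragments; `sentences_from_tokens` is a hypothetical helper, and its punctuation-based boundary rule is deliberately naive (it would split on a token such as "3." in a decimal number).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush trailing text that lacks final punctuation


# Stand-in for a streaming LLM response (hypothetical token stream).
llm_tokens = ["Your", " order", " shipped", ".", " It", " arrives", " Friday", "."]
for sentence in sentences_from_tokens(llm_tokens):
    print("send to TTS:", sentence)
```

Because the function is a generator, the first sentence can be handed to the TTS stage while the LLM is still producing the second.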
Model selection directly affects LLM inference latency. Smaller models optimized for conversation (as opposed to large general-purpose models) provide faster TTFT while maintaining adequate response quality for most voice agent use cases. Some platforms use model routing, sending simple requests to fast small models and complex requests to larger models, optimizing the latency-quality tradeoff dynamically.
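Model routing can be as simple as a heuristic applied to the transcribed utterance. The sketch below is one possible rule, not a description of how any specific platform routes; the model identifiers, word limit, and keyword list are all hypothetical placeholders.

```python
# Hypothetical model identifiers; substitute whatever your platform exposes.
FAST_MODEL = "small-conversational-model"
LARGE_MODEL = "large-general-model"

# Illustrative heuristic: long or explanatory requests go to the larger model.
MAX_SIMPLE_WORDS = 12
COMPLEX_MARKERS = ("why", "compare", "explain", "difference")


def choose_model(utterance: str) -> str:
    """Pick a model name based on utterance length and a few keywords."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    if len(words) > MAX_SIMPLE_WORDS:
        return LARGE_MODEL
    if any(marker in words for marker in COMPLEX_MARKERS):
        return LARGE_MODEL
    return FAST_MODEL
```

A production router would more likely use an intent classifier or conversation state, but the shape is the same: a cheap decision made before the expensive call.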
Response length control reduces both LLM generation time and TTS synthesis time. Voice agent responses should be concise by design because listeners cannot re-read spoken content. Instructing the LLM to keep responses under 2 to 3 sentences for routine interactions reduces the total audio duration and the time needed to generate it. Longer responses should be broken into interactive segments where the agent pauses for confirmation before continuing.
Infrastructure optimization includes using dedicated GPU instances for LLM inference (eliminating cold start and queueing delays), co-locating all pipeline components in the same region, maintaining persistent connections between components (avoiding TCP handshake overhead), and using edge deployment to place telephony and audio processing close to callers geographically.
Prefetching and caching can reduce latency for predictable interactions. If the agent knows the next question it will ask, it can pre-generate the TTS audio before the caller finishes their current response. Caching common responses (greetings, closings, frequently repeated phrases) eliminates LLM and TTS latency for those specific outputs.
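The caching idea can be sketched as a small wrapper around whatever synthesis function the deployment uses. `TTSCache` and the `synthesize` callable are hypothetical names for this example; the stand-in synthesizer simply returns the text as bytes so the behavior is easy to follow.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


class TTSCache:
    """Cache synthesized audio for phrases that repeat across calls."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._audio: Dict[str, bytes] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace so trivially different strings share one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def warm(self, phrases: Iterable[str]) -> None:
        """Pre-generate audio for known phrases before any call needs them."""
        for phrase in phrases:
            self.get(phrase)

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._audio:
            self._audio[key] = self._synthesize(text)  # cache miss: synthesize once
        return self._audio[key]


# Stand-in synthesizer that records how often it is called.
calls: List[str] = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)
    return text.encode("utf-8")


cache = TTSCache(fake_synthesize)
cache.warm(["Thanks for calling.", "Is there anything else I can help with?"])
audio = cache.get("Thanks for calling.")  # served from cache, no new synthesis
```

After warming, repeated requests for the same greeting or closing skip synthesis entirely; only the two warm-up phrases ever reach the synthesizer in this example.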
Measuring Latency in Production
Production latency monitoring should track each pipeline stage independently, not just total round-trip time. Measuring ASR latency, LLM TTFT, TTS TTFB, and network transit separately allows teams to identify bottlenecks and target optimization efforts effectively. Dashboards should show percentile distributions (p50, p95, p99) rather than averages because latency spikes at the tail affect a meaningful number of callers.
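The sketch below shows why percentiles are more informative than averages, using a nearest-rank percentile over per-stage samples. The function names and the sample data are made up for illustration; a real deployment would pull samples from its metrics store.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(stage_samples_ms: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute p50, p95, and p99 independently for every pipeline stage."""
    return {
        stage: {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
        for stage, samples in stage_samples_ms.items()
    }


# Illustrative data: 90 fast LLM responses and 10 slow ones.
samples = {"llm_ttft": [180.0] * 90 + [900.0] * 10}
summary = summarize(samples)
mean = sum(samples["llm_ttft"]) / len(samples["llm_ttft"])
print(summary["llm_ttft"], "mean:", mean)
```

In this made-up data set the mean is 252 milliseconds, which looks healthy, while p95 and p99 are both 900 milliseconds: one caller turn in ten is waiting far longer than the average suggests.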
End-to-end latency should be measured from the caller's perspective, not the server's. The time from when the caller stops speaking to when they hear audio includes all processing time plus the network latency between the caller and the platform in both directions. This caller-perspective latency is what determines conversation quality.
Voice agent latency must stay under 800 milliseconds for natural conversation. Streaming at every pipeline stage, careful model selection, concise response design, and co-located infrastructure are the primary optimization levers. Monitor each stage independently to identify and address bottlenecks.