Text-to-Speech: Making AI Agents Talk
The Evolution of TTS Technology
Text-to-speech has undergone three generations of technology. First-generation concatenative systems stitched together pre-recorded audio segments to form words and sentences. They sounded mechanical and unnatural, with audible seams between segments. Second-generation parametric systems used statistical models to generate speech parameters, producing smoother but still clearly synthetic output. Third-generation neural TTS systems use deep learning to generate speech waveforms directly, producing audio quality that rivals human recordings.
The neural TTS revolution was driven by models like WaveNet (DeepMind), Tacotron (Google), and VITS (Kakao Enterprise, released as open source). These models learn the complex acoustic patterns of human speech from thousands of hours of recordings, capturing subtleties like intonation, rhythm, emphasis, and breathing that earlier systems missed. The result is speech that sounds natural, expressive, and appropriate for extended conversation.
Current state-of-the-art TTS systems go beyond simple naturalness. They support emotional expression, adjusting tone and delivery to match the content. They handle prosodic emphasis, stressing important words and de-emphasizing fillers. They produce appropriate pauses at sentence and clause boundaries. And they support multiple speaking styles, from professional and authoritative to warm and conversational, within the same voice.
Voice Quality and Selection
TTS providers offer libraries of pre-built voices in different genders, ages, accents, and speaking styles. The selection has expanded dramatically, with major providers offering dozens to hundreds of voices. Quality varies across providers and specific voices, so evaluation against the specific use case is important before committing to a voice for production deployment.
Voice cloning allows businesses to create custom voices from recorded audio samples. The amount of audio needed depends on the provider: a high-quality clone typically requires 30 minutes to several hours of clean audio. Once created, the cloned voice can be used for all TTS synthesis, ensuring the agent sounds consistent with the brand identity. This is particularly valuable for businesses that want their AI agent to match an existing brand voice or maintain consistency with other audio content.
Voice consistency is important for customer experience. The agent should sound the same across all interactions, with consistent pitch, pace, and tonal characteristics. Variations between calls create an uncanny feeling and reduce trust. Production deployments should lock in a specific voice and regularly verify that updates to the TTS model have not subtly changed the voice characteristics.
Latency Optimization
For voice agents, TTS latency is measured primarily by time-to-first-byte (TTFB), the delay between receiving text input and beginning audio output. This metric directly affects conversation flow because it contributes to the total response time the caller experiences. The best current systems achieve TTFB under 150 milliseconds, with some achieving under 100 milliseconds.
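As a rough illustration of how TTFB might be measured from the client side, the sketch below times the gap between issuing a request and receiving the first non-empty audio chunk. The names measure_ttfb and fake_tts are hypothetical, and fake_tts is only a stand-in for a provider's streaming call, not any real API.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttfb(
    synthesize: Callable[[str], Iterable[bytes]],
    text: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, all audio bytes)."""
    start = clock()  # start timing before the request is issued
    ttfb = None
    audio = bytearray()
    for chunk in synthesize(text):
        if chunk and ttfb is None:
            ttfb = clock() - start  # first real audio has arrived
        audio.extend(chunk)
    if ttfb is None:
        raise ValueError("stream ended without producing audio")
    return ttfb, bytes(audio)


def fake_tts(text: str) -> Iterable[bytes]:
    """Hypothetical stand-in for a streaming TTS call (assumption)."""
    time.sleep(0.05)  # simulated model and network delay
    for word in text.split():
        yield word.encode("utf-8")


if __name__ == "__main__":
    ttfb, audio = measure_ttfb(fake_tts, "Thanks for calling")
    print(f"TTFB: {ttfb * 1000:.0f} ms, {len(audio)} bytes received")
```

A measurement like this captures when the first bytes reach the client. What the caller hears also depends on telephony and playback buffering, so it is best read as a lower bound on perceived delay.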
Streaming synthesis is the primary latency optimization technique. Instead of generating the complete audio for the entire response before beginning playback, the system generates and streams audio word by word or phrase by phrase. The first words begin playing while the rest of the response is still being synthesized. This approach can cut perceived latency from the time needed to synthesize the entire response, which might be several seconds for a long answer, to roughly the time needed to synthesize the first chunk.
Text chunking works alongside streaming to further reduce latency. The language model output is split into small chunks (sentences or clauses) and each chunk is sent to TTS independently. The TTS begins processing the first chunk immediately while subsequent chunks are still being generated by the LLM. This pipelining overlaps LLM generation and TTS synthesis, reducing the total wait time.
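The chunking step described above can be sketched as a small generator that buffers LLM output and emits each sentence as soon as it is complete. The function name chunk_sentences and the boundary rule (sentence punctuation followed by whitespace) are assumptions for illustration; production systems usually need extra handling for abbreviations such as "Dr." and for clause-level splitting.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ".", "!" or "?" followed by whitespace.
# Decimals like "3.5" are not split, but abbreviations like "Dr. " would be.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last ends at a sentence boundary.
        for sentence in parts[:-1]:
            sentence = sentence.strip()
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished remainder
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the LLM stops


if __name__ == "__main__":
    llm_tokens = ["Your order ", "has shipped. ", "It arrives", " Friday. Anything", " else?"]
    for sentence in chunk_sentences(llm_tokens):
        print(sentence)  # each sentence could be sent to TTS at this point
```

Because the generator yields the first sentence before the later tokens have been consumed, TTS can start on it while the LLM is still producing the rest of the response.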
Infrastructure placement also affects TTS latency. Locating the TTS service in the same data center or cloud region as other pipeline components minimizes network transit time. Some platforms run TTS on edge servers closer to callers for further latency reduction. Dedicated GPU instances provide more consistent inference times than shared compute, reducing latency variance that can make some responses feel noticeably slower than others.
Pronunciation and Specialized Content
TTS systems occasionally mispronounce proper names, technical terms, abbreviations, and numbers. Voice agent platforms address this through pronunciation dictionaries that define the correct pronunciation for specific words. When the agent needs to say a company name, product name, or technical term, the pronunciation dictionary ensures it sounds correct.
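One simple way to approximate a pronunciation dictionary is to substitute respelled forms into the text before it is sent to TTS. The sketch below assumes that approach. The function apply_pronunciations and the lexicon entries are hypothetical, and real platforms often accept phonetic entries or lexicon files instead of plain respellings.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, lexicon: Dict[str, str]) -> str:
    """Replace whole-word lexicon entries with their spoken respelling."""
    if not lexicon:
        return text
    # Longest entries first so multi-word terms win over their prefixes.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        re.IGNORECASE,
    )
    lowered = {k.lower(): v for k, v in lexicon.items()}
    return pattern.sub(lambda m: lowered[m.group(1).lower()], text)


if __name__ == "__main__":
    # Illustrative entries only (assumption).
    lexicon = {"LMNT": "element", "SQL": "sequel"}
    print(apply_pronunciations("Ask LMNT about SQL, not MySQL.", lexicon))
```

The word-boundary checks keep an entry like "SQL" from rewriting the inside of a longer word such as "MySQL".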
SSML (Speech Synthesis Markup Language) provides fine-grained control over speech output. It allows specifying pronunciation, emphasis, pauses, speaking rate, and pitch for specific parts of the text. For voice agents, SSML is useful for spelling out account numbers, emphasizing important information, and inserting natural pauses at appropriate points in the response.
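As a minimal sketch of those three uses, the helper below builds an SSML string that emphasizes a phrase, inserts a pause, and spells out an account number. The function names are hypothetical. The elements used (speak, emphasis, break, say-as) come from the SSML standard, but support for individual tags and for values such as interpret-as="characters" varies by provider, so this should be checked against the chosen provider's documentation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def account_reminder_ssml(
    message: str, key_phrase: str, account_number: str, pause_ms: int = 300
) -> str:
    """Build SSML with an emphasized phrase, a pause, and a spelled-out number."""
    return (
        "<speak>"
        f"{escape(message)} "
        f'<emphasis level="strong">{escape(key_phrase)}</emphasis>'
        f'<break time="{pause_ms}ms"/>'
        "Your account number is "
        # Ask the engine to read the value character by character.
        f'<say-as interpret-as="characters">{escape(account_number)}</say-as>.'
        "</speak>"
    )


def is_well_formed(ssml: str) -> bool:
    """Cheap check that the markup parses as XML before it is sent to TTS."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    ssml = account_reminder_ssml("Your payment is due", "by Friday", "4417")
    print(ssml)
    print(is_well_formed(ssml))
```

Escaping the dynamic text matters because LLM output can contain characters like "&" or "<" that would otherwise produce invalid markup.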
Number handling requires special attention. Phone numbers, dates, currency amounts, and addresses need to be spoken the way a person would say them rather than read out symbol by symbol. "$1,234.56" should be spoken as "one thousand two hundred thirty-four dollars and fifty-six cents," not "dollar sign one comma two three four period five six." Good TTS systems handle these conversions automatically, but edge cases often require explicit formatting when the agent's response is generated.
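Where explicit formatting is needed, the agent can expand amounts into words before the text reaches TTS. The following is a minimal sketch for US dollar amounts only. The helpers int_to_words and speak_usd are hypothetical names, and the sketch ignores other currencies, negative amounts, and values of one billion or more.

```python
from decimal import Decimal

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
_SCALES = [(1_000_000, "million"), (1_000, "thousand")]


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0 to 999,999,999."""
    if not 0 <= n < 1_000_000_000:
        raise ValueError("supported range is 0 to 999,999,999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + (f"-{_ONES[ones]}" if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return f"{_ONES[hundreds]} hundred" + (f" {int_to_words(rest)}" if rest else "")
    for value, name in _SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            return f"{int_to_words(head)} {name}" + (f" {int_to_words(rest)}" if rest else "")


def speak_usd(amount: str) -> str:
    """Convert a string like "$1,234.56" into spoken-style words."""
    value = Decimal(amount.replace("$", "").replace(",", ""))
    if value < 0:
        raise ValueError("negative amounts are not handled in this sketch")
    # Work in whole cents; extra decimal places are rounded.
    dollars, cents = divmod(int((value * 100).to_integral_value()), 100)
    spoken = f"{int_to_words(dollars)} {'dollar' if dollars == 1 else 'dollars'}"
    if cents:
        spoken += f" and {int_to_words(cents)} {'cent' if cents == 1 else 'cents'}"
    return spoken


if __name__ == "__main__":
    print(speak_usd("$1,234.56"))
    # prints: one thousand two hundred thirty-four dollars and fifty-six cents
```

Parsing with Decimal instead of float avoids binary rounding surprises when splitting an amount into dollars and cents.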
Leading TTS Providers
ElevenLabs has established itself as a leader in TTS quality, offering voices with exceptional naturalness and emotional range. Their voice cloning capabilities are among the best in the industry, requiring relatively short audio samples to produce high-quality custom voices. Their Turbo model offers low latency suitable for real-time conversation. Pricing is higher than some competitors but reflects the premium voice quality.
PlayHT offers a balance of quality and affordability, with a large voice library and good streaming performance. Their API is straightforward to integrate and they support SSML for fine-grained pronunciation control. They have been improving their latency metrics and now offer competitive TTFB for voice agent applications.
Cartesia focuses specifically on ultra-low-latency TTS for real-time applications. Their Sonic model is designed from the ground up for conversational AI, with TTFB consistently under 100 milliseconds. Voice quality is strong, and their streaming architecture is optimized for the word-by-word delivery that voice agent pipelines require.
LMNT (pronounced "element") offers high-quality voices with a focus on conversational naturalness. Their voices handle the informal, back-and-forth style of phone conversations well, avoiding the overly polished quality that some TTS voices have when trained primarily on audiobook or narration data.
Modern neural TTS produces speech that rivals human recordings, and TTFB under 150 milliseconds keeps conversation fluid. Voice selection, streaming synthesis, and pronunciation control are the key factors in voice agent speech quality.