How to Build an AI Phone Agent
This guide assumes you have development experience and want to build a phone agent with more control than managed platforms provide. The steps cover both developer platform (Vapi, Retell AI) and open source (LiveKit, Pipecat) approaches.
Step 1: Set Up the Telephony Layer
Create an account with a SIP trunking provider. Twilio is among the most widely supported, with extensive documentation and integrations with most voice agent platforms. Telnyx and Vonage are competitive alternatives that may offer lower pricing for high-volume deployments. Provision one or more phone numbers in your target area codes.
Configure the SIP trunk to forward incoming calls to your voice agent endpoint. On developer platforms, this means pointing the trunk to the platform's SIP URI. For open source deployments, configure the trunk to connect to your SIP gateway (LiveKit provides a built-in SIP bridge). Set up failover routing so calls go to a backup destination if your primary agent is unreachable.
For outbound calling, configure the trunk with caller ID settings and ensure compliance with local regulations regarding automated outbound calls. Set up a webhook for call status events (answered, completed, failed) to enable monitoring and logging.
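As a rough illustration of the status webhook, the sketch below turns a form-encoded callback body into a flat log record. The field names (CallSid, CallStatus, CallDuration) and status values are modeled on Twilio's status callbacks and are assumptions here; check your provider's documentation for the exact payload. The function name parse_status_event is hypothetical.

```python
from urllib.parse import parse_qs

# Statuses treated as "call is over". Values modeled on Twilio's CallStatus
# field (an assumption; other providers use different names).
TERMINAL_STATUSES = {"completed", "busy", "failed", "no-answer", "canceled"}


def parse_status_event(body: str) -> dict:
    """Turn a form-encoded status callback body into a flat log record."""
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    status = fields.get("CallStatus", "unknown")
    return {
        "call_id": fields.get("CallSid", ""),
        "status": status,
        "is_terminal": status in TERMINAL_STATUSES,
        # Duration may be absent on non-terminal events, so default to 0.
        "duration_s": int(fields.get("CallDuration", "0")),
    }


event = parse_status_event("CallSid=CA123&CallStatus=completed&CallDuration=42")
print(event)
```

In a real deployment this function would sit behind an HTTP endpoint that also verifies the provider's request signature before trusting the payload.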
Test the telephony layer independently before connecting it to your conversation pipeline. Make test calls to verify that audio flows in both directions, that call quality is acceptable, and that the signaling (call start, call end, transfer) works correctly. Telephony issues are easier to diagnose in isolation than when combined with ASR, LLM, and TTS problems.
Step 2: Build the Conversation Pipeline
On developer platforms, configure the pipeline through the platform API. Specify your ASR provider and settings (language, model, endpointing sensitivity), your LLM provider and model (with system instructions), and your TTS provider and voice selection. The platform handles the orchestration, streaming, and turn-taking logic.
For open source builds, assemble the pipeline using your chosen framework. In Pipecat, define a pipeline of processors: audio input, VAD (voice activity detection), ASR, LLM, TTS, and audio output. In LiveKit Agents, create an agent class that handles the conversation loop, using the built-in ASR, LLM, and TTS integrations. Configure streaming at every stage to minimize latency.
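One place where streaming pays off is the handoff from LLM to TTS: rather than waiting for the full response, the pipeline can forward each sentence to TTS as soon as it is complete. The sketch below is framework-agnostic and does not use Pipecat's or LiveKit's actual APIs; chunk_sentences is a hypothetical helper.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the LLM stream ends mid-sentence.
        yield buffer.strip()


# Simulated LLM token stream; a real one would arrive over the network.
llm_tokens = ["Your ", "order ", "shipped.", " It ", "arrives ", "Friday", "."]
sentences = list(chunk_sentences(llm_tokens))
print(sentences)
```

This naive splitter will also break on abbreviations such as "Dr.", so a production version would need a smarter boundary rule.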
Test the pipeline offline before connecting it to real phone lines. Feed it synthetic or previously recorded call audio to validate ASR accuracy, response quality, and TTS naturalness. Measure end-to-end latency, from the moment the caller stops speaking to the first audio of the agent's reply, and ensure it stays below 800 milliseconds.
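To check the 800 millisecond target, it helps to record a timestamp at each stage boundary and break every turn into per-stage latencies. The sketch below assumes four hypothetical timestamp names (in milliseconds) and treats end-to-end latency as the time from end of caller speech to the first byte of agent audio.

```python
import math

LATENCY_BUDGET_MS = 800  # target from the text above


def stage_latencies(ts: dict) -> dict:
    """Break one conversational turn into per-stage latencies (ms)."""
    return {
        "asr_ms": ts["transcript_final"] - ts["speech_end"],
        "llm_ms": ts["llm_first_token"] - ts["transcript_final"],
        "tts_ms": ts["audio_first_byte"] - ts["llm_first_token"],
        "total_ms": ts["audio_first_byte"] - ts["speech_end"],
    }


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


turn = {"speech_end": 0, "transcript_final": 200,
        "llm_first_token": 550, "audio_first_byte": 700}
breakdown = stage_latencies(turn)
within_budget = breakdown["total_ms"] <= LATENCY_BUDGET_MS
```

Tracking a high percentile across many turns, rather than the average, gives a better picture of what callers actually experience on slow turns.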
Implement audio preprocessing to improve ASR accuracy. Phone audio is typically narrowband (8 kHz sample rate) with compression artifacts, which degrades transcription accuracy compared to wideband audio. Some ASR providers offer models specifically trained on telephony audio. If your provider does not, test whether upsampling or noise reduction preprocessing improves accuracy for your specific audio conditions.
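As a minimal illustration of upsampling, the sketch below doubles the sample rate of 8 kHz PCM samples to 16 kHz by linear interpolation. This is a simplification: a production resampler would normally apply a proper low-pass filter, and upsampling cannot restore frequencies the phone network already discarded. The function name is hypothetical.

```python
def upsample_2x(samples: list[int]) -> list[int]:
    """Double the sample rate by inserting the midpoint between neighbors."""
    if not samples:
        return []
    out = []
    for current, nxt in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + nxt) // 2)  # interpolated sample
    # Repeat the final sample so the output is exactly twice as long.
    out.extend([samples[-1], samples[-1]])
    return out


narrowband = [0, 100, 200]           # three samples at 8 kHz
wideband = upsample_2x(narrowband)   # six samples at 16 kHz
```

Whether this helps depends on the ASR model, so compare transcription accuracy with and without the step on your own recordings.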
Step 3: Implement Conversation Logic and Tools
Write system instructions that define the agent's behavior comprehensively. Include the agent's role, the greeting script, the types of requests it handles, how to collect and confirm information, when to escalate, and how to close calls. System instructions are the primary lever for conversation quality, and iterating on them based on test results is the most impactful optimization you can make.
Implement tool functions for each external system the agent needs to access. Common tools include customer lookup (query CRM by phone number or account ID), appointment scheduling (check availability and book slots), order status (retrieve shipping and delivery information), and ticket creation (log issues in your support system). Each tool should handle errors gracefully and return clear, concise results that the LLM can incorporate into its spoken response.
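A tool along these lines might look like the sketch below, which returns a short, speakable message whether the lookup succeeds or fails. The order data, fetch_order, and ToolResult are hypothetical stand-ins for a real backend and whatever result type your framework expects.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for a real order system.
_ORDERS = {"A100": {"status": "shipped", "eta": "Friday"}}


@dataclass
class ToolResult:
    ok: bool
    message: str  # short text the LLM can turn into speech


def fetch_order(order_id: str) -> dict:
    """Stand-in for a real API call; raises on unknown IDs or outages."""
    if order_id == "DOWN":
        raise ConnectionError("order system unreachable")
    return _ORDERS[order_id]


def get_order_status(order_id: str) -> ToolResult:
    """Look up an order and always return something the agent can say."""
    order_id = order_id.strip().upper()
    try:
        order = fetch_order(order_id)
    except KeyError:
        return ToolResult(False, f"I could not find an order with ID {order_id}.")
    except ConnectionError:
        return ToolResult(False, "The order system is unavailable right now.")
    return ToolResult(
        True, f"Order {order_id} is {order['status']}, arriving {order['eta']}."
    )
```

Returning a structured result instead of raising keeps error wording under your control rather than leaving the LLM to improvise around a stack trace.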
Build a conversation state machine that tracks where the interaction is in its lifecycle. At minimum, track whether the conversation is in the greeting, information collection, action execution, confirmation, or closing phase. The state machine helps the agent recover gracefully from interruptions and prevents it from repeating steps or skipping critical confirmations.
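A minimal version of such a state machine can be sketched as an enum of phases plus a table of allowed transitions. The specific transitions below are illustrative assumptions; adjust them to match your own call flows.

```python
from enum import Enum, auto


class Phase(Enum):
    GREETING = auto()
    COLLECTING = auto()
    EXECUTING = auto()
    CONFIRMING = auto()
    CLOSING = auto()


# Which phases may follow each phase. Note that EXECUTING cannot jump
# straight to CLOSING, so the confirmation step cannot be skipped.
ALLOWED = {
    Phase.GREETING: {Phase.COLLECTING, Phase.CLOSING},
    Phase.COLLECTING: {Phase.EXECUTING, Phase.CLOSING},
    Phase.EXECUTING: {Phase.CONFIRMING, Phase.COLLECTING},
    Phase.CONFIRMING: {Phase.COLLECTING, Phase.CLOSING},
    Phase.CLOSING: set(),
}


class Conversation:
    def __init__(self) -> None:
        self.phase = Phase.GREETING

    def advance(self, target: Phase) -> bool:
        """Move to target if the transition is allowed; report success."""
        if target not in ALLOWED[self.phase]:
            return False
        self.phase = target
        return True
```

The agent loop can call advance() before acting on an LLM decision and fall back to a clarifying question whenever it returns False.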
Implement tool response latency management. When a tool call takes more than 2 seconds (a slow database query or external API call), the agent should provide a verbal filler such as "Let me look that up for you" rather than leaving dead silence on the line. Configure timeout handling for each tool so the agent can inform the caller and offer alternatives when a system is unavailable rather than hanging indefinitely.
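The filler-and-timeout behavior can be sketched with asyncio as follows. The 2 second filler threshold comes from the text; the 8 second overall timeout, the say callback, and the function name are assumptions made for illustration.

```python
import asyncio

FILLER_AFTER_S = 2.0  # threshold from the text above
TOOL_TIMEOUT_S = 8.0  # assumed upper bound; tune per tool


async def call_with_filler(tool_coro, say,
                           filler_after=FILLER_AFTER_S,
                           timeout=TOOL_TIMEOUT_S):
    """Run a tool call, speak a filler if it is slow, give up on timeout.

    tool_coro is a coroutine object; say is an async callable that speaks text.
    Returns the tool result, or None if the tool timed out.
    """
    task = asyncio.create_task(tool_coro)
    done, _ = await asyncio.wait({task}, timeout=filler_after)
    if not done:
        await say("Let me look that up for you.")
        remaining = max(0.0, timeout - filler_after)
        done, _ = await asyncio.wait({task}, timeout=remaining)
    if not done:
        task.cancel()
        await say("That system is not responding. Can I help another way?")
        return None
    return task.result()  # re-raises if the tool itself failed
```

Because asyncio.wait does not cancel the task when its timeout expires, the tool keeps running while the filler is spoken, which is the behavior wanted here.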
Step 4: Handle Phone-Specific Behaviors
Implement call transfer capability so the agent can connect callers to human agents when needed. Warm transfers are preferred, where the agent briefs the human on the conversation context before connecting the caller. Configure transfer destinations for different departments or escalation types.
Add voicemail detection for outbound calls so the agent can distinguish between a human answering and a voicemail greeting. When voicemail is detected, the agent should leave an appropriate message and log the attempt for callback scheduling. Implement DTMF input handling for situations where callers may need to enter account numbers or PINs via their keypad.
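For DTMF entry, the core logic is usually a small collector that accumulates key presses until the caller presses a terminator or reaches the expected length. The sketch below assumes key presses arrive one at a time as single-character strings; how they are delivered depends on your telephony provider or framework, and DtmfCollector is a hypothetical name.

```python
from typing import Optional


class DtmfCollector:
    """Accumulate keypad digits until a terminator or a fixed length."""

    def __init__(self, max_digits: int, terminator: str = "#") -> None:
        self.max_digits = max_digits
        self.terminator = terminator
        self._digits = []

    def feed(self, key: str) -> Optional[str]:
        """Return the completed entry, or None while still collecting."""
        if key == self.terminator:
            return self._finish()
        if key.isdigit():
            self._digits.append(key)
            if len(self._digits) == self.max_digits:
                return self._finish()
        return None  # ignore "*" and anything else unexpected

    def _finish(self) -> str:
        entry = "".join(self._digits)
        self._digits = []  # reset so the collector can be reused
        return entry
```

An empty string returned from feed() means the caller pressed the terminator without entering anything, which is a good cue to reprompt. Digits collected this way for PINs or card numbers should be kept out of logs.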
Configure hold behavior for moments when the agent needs to perform a lengthy operation (like a complex database query). Play appropriate hold audio or provide verbal status updates so the caller knows the agent is still working on their request.
Implement SMS integration for sending confirmation details, links, or reference numbers to the caller during or after the conversation. Many interactions benefit from a follow-up text message that summarizes the actions taken, provides a confirmation number, or includes a link the caller can reference later. The agent can ask the caller for permission to send an SMS and use the caller's phone number from the call metadata.
Step 5: Secure, Scale, and Deploy to Production
Implement security controls appropriate for the data your agent handles. Encrypt call recordings and transcripts at rest and in transit. Implement access controls so only authorized personnel can review call recordings. If the agent handles sensitive data (credit card numbers, Social Security numbers, health information), ensure compliance with relevant regulations (PCI DSS, HIPAA, GDPR) and implement data masking in transcripts and logs.
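Data masking in transcripts can start with pattern-based redaction, sketched below for card numbers and US Social Security numbers. The patterns are illustrative assumptions and only catch numbers written as digits; spoken-out numbers ("four one one one") need handling at the ASR or LLM layer. A Luhn checksum is used to reduce false positives on long order or tracking numbers.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def mask_transcript(text: str) -> str:
    """Replace likely SSNs and card numbers with placeholders."""
    text = SSN_RE.sub("[SSN]", text)

    def _mask_card(match: re.Match) -> str:
        return "[CARD]" if luhn_valid(match.group()) else match.group()

    return CARD_RE.sub(_mask_card, text)
```

Apply the mask before transcripts are written to storage or sent to logging, not as a cleanup step afterwards.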
Configure scaling for concurrent call capacity. Each active call requires dedicated compute resources for ASR, LLM inference, and TTS generation. On developer platforms, the platform handles scaling automatically. For open source deployments, configure auto-scaling rules that add compute capacity as concurrent call count increases and reduce it during low-volume periods. Test the scaling behavior under load to verify that new calls are handled without latency degradation.
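An auto-scaling rule for open source deployments often reduces to a small calculation like the one sketched below. The per-worker capacity, 20 percent headroom, and worker limits are placeholder assumptions; measure real capacity under load before relying on any of them.

```python
def desired_workers(active_calls: int,
                    calls_per_worker: int,
                    headroom_pct: int = 20,
                    min_workers: int = 1,
                    max_workers: int = 50) -> int:
    """Workers needed for the current load plus headroom, within limits."""
    numerator = active_calls * (100 + headroom_pct)
    denominator = 100 * calls_per_worker
    target = -(-numerator // denominator)  # integer ceiling division
    return max(min_workers, min(max_workers, target))


# 95 active calls at 10 calls per worker, with 20% headroom.
workers_now = desired_workers(active_calls=95, calls_per_worker=10)
```

The headroom matters because new workers take time to start, and calls that arrive in the meantime still need capacity.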
Deploy with comprehensive monitoring and alerting. Track per-call metrics (latency at each pipeline stage, ASR confidence scores, tool call success rates, resolution outcomes) and aggregate metrics (concurrent calls, error rates, average handle time, escalation rate). Set up alerts for anomalies like latency spikes, elevated error rates, or unusual escalation patterns. Build dashboards that give operations teams real-time visibility into system health.
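As one example of an anomaly alert, the sketch below tracks the error rate over a rolling window of recent calls and fires when it crosses a threshold. The window size, threshold, and class name are illustrative assumptions; most teams would implement the same rule in their existing monitoring stack rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the failure rate over the last N calls exceeds a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.1) -> None:
        self.outcomes = deque(maxlen=window)  # True means the call failed
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one call outcome; return True if the alert should fire."""
        self.outcomes.append(failed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold
```

The same pattern works for latency spikes or escalation rates by recording whether each call breached the relevant limit.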
Run comprehensive test calls covering every conversation path, edge case, and error scenario. Test with varied accents, speaking speeds, and background noise conditions. Test telephony functions including transfers, DTMF, and voicemail. Deploy to a pilot group first, starting with a small percentage of call volume. Monitor key metrics in real time and review flagged calls daily during the pilot period.
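Routing a small percentage of calls to the pilot can be done deterministically by hashing the caller's number, so the same caller always gets the same experience. The sketch below is one way to do this; the function name and the choice to bucket by caller number are assumptions.

```python
import hashlib


def in_pilot(caller_number: str, pilot_percent: int) -> bool:
    """Deterministically assign a caller to the pilot group."""
    digest = hashlib.sha256(caller_number.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < pilot_percent
```

Raising pilot_percent over time widens the rollout without moving anyone who was already in the pilot back out of it.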
Iterate based on production data. The most valuable improvements come from analyzing real call recordings, identifying common failure patterns, and adjusting system instructions, tool implementations, and pipeline configuration. Plan for continuous improvement as an ongoing process, not a one-time setup.
Debugging Voice Agent Issues
When a voice agent produces poor results, the first step is identifying which pipeline stage is causing the problem. Listen to the call recording while reading the transcript side by side. If the transcript does not match what the caller said, the issue is in ASR. If the transcript is correct but the agent response is wrong, the issue is in the LLM logic or system instructions. If the response text is correct but sounds unnatural, the issue is in TTS pronunciation or pacing.
ASR errors are often systematic, meaning the same words or phrases are consistently misrecognized. Build a list of frequently misrecognized terms and address them through custom vocabulary, pronunciation hints, or post-processing rules that correct known error patterns before passing the transcript to the LLM.
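Post-processing rules of this kind can be as simple as a lookup table applied to the transcript before it reaches the LLM. The sketch below builds a single case-insensitive pattern from a corrections table; the example entries are made up for illustration, and build_corrector is a hypothetical helper.

```python
import re

# Misrecognized phrase -> intended term. Example entries are illustrative.
CORRECTIONS = {
    "live kit": "LiveKit",
    "pipe cat": "Pipecat",
}


def build_corrector(corrections: dict):
    """Return a function that rewrites known ASR errors in a transcript."""
    # Longest phrases first so they win over shorter overlapping ones.
    phrases = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): fixed for phrase, fixed in corrections.items()}

    def correct(text: str) -> str:
        return pattern.sub(lambda m: lookup[m.group(1).lower()], text)

    return correct


correct = build_corrector(CORRECTIONS)
fixed = correct("We use Live Kit and pipe cat today")
```

Where the ASR provider supports custom vocabulary, prefer that first and keep rules like these for the errors that remain.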
LLM behavior issues usually trace back to system instructions. If the agent provides incorrect information, check whether the system instructions are ambiguous about that topic. If the agent handles a conversation path awkwardly, write more explicit instructions for that specific scenario. Refining system instructions is an iterative process that benefits from analyzing a large volume of real conversation examples.
Building an AI phone agent requires telephony setup, pipeline assembly with streaming at every stage, comprehensive conversation logic with tool integrations, phone-specific features like call transfer and voicemail detection, and production-grade security, scaling, and monitoring. Start with a pilot and iterate based on real call data.