Voice: Speech In, Speech Out

Updated June 2026
Auto Learning Agents listens and speaks. Talk to the master agent through the microphone in the web UI and have answers read back aloud, or send voice notes to the chatbot on any connected platform and get spoken replies. The transcription and synthesis engines can run entirely on your own hardware.

Voice is not a separate subsystem bolted onto chat; it is a pair of conversions at the edge of the same conversation machinery. Speech becomes text on the way in, text becomes speech on the way out, and everything in between (context, memory, tools, agents) works exactly as it does for typed conversation. That design has a useful consequence: every spoken exchange is stored, searchable, and topic-classified just like a typed one, because by the time it reaches the conversation system, it is text.
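
The edge-conversion idea can be sketched in a few lines. This is an illustration only, not the platform's actual code: the names Turn and handle_turn are hypothetical, and the three callables stand in for whatever transcription, conversation, and synthesis functions are really in use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, Union


@dataclass
class Turn:
    text: str     # what the conversation layer stores and searches
    spoken: bool  # True when the turn arrived as audio


def handle_turn(
    payload: Union[bytes, str],
    transcribe: Callable[[bytes], str],
    answer: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[Turn, str, Optional[bytes]]:
    """Convert at the edges; the middle only ever sees text."""
    spoken = isinstance(payload, bytes)
    # Inbound edge: audio becomes text, typed input passes through untouched.
    text = transcribe(payload) if spoken else payload
    # Middle: the same text-only conversation step for both kinds of input.
    reply = answer(text)
    # Outbound edge: synthesize audio only when the turn came in as speech.
    audio = synthesize(reply) if spoken else None
    return Turn(text=text, spoken=spoken), reply, audio
```

Because answer receives a plain string either way, nothing in the middle needs to know whether the user typed or spoke.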

Speaking to the Web UI

The Chat tab carries a microphone button: press, speak, and your words arrive in the conversation as text for the master agent to answer. On the way back, text-to-speech playback reads responses aloud, with an autoplay toggle for fully hands-free sessions: ask a question while you make coffee and hear the answer across the room. The combination turns the master agent into something you can talk to while doing other things, which suits its role as the system's always-available front desk.

Voice Notes in Every Channel

The chatbot extends the same treatment to Discord, Slack, WhatsApp, and Telegram. An incoming voice note is transcribed and handled exactly like a typed message, with the same memory, the same knowledge, and the same escalation rules. Where you configure it, the reply comes back as synthesized speech: a voice answer to a voice question. WhatsApp deserves special mention because voice notes are the native dialect there; a system that handles them well meets people where they actually communicate.

The Engines

Transcription runs on faster-whisper, an optimized implementation of Whisper, locally on your hardware. It is accurate across accents and background noise, and no recorded audio leaves your machine. Speech synthesis offers two paths. Piper runs locally and produces fast, natural neural voices with the same privacy property: your system can speak without a single external call. Amazon Polly is the cloud option, chosen for its larger voice catalog and enabled by adding AWS credentials in settings. The tts settings in settings.txt pick the engine and tune the voice, and the System panel in the Config tab exposes the same choices in the UI.
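
For a sense of what local transcription looks like in code, here is a minimal sketch using faster-whisper's public WhisperModel class. The helper name transcribe_file, the "small" model size, and the CPU int8 settings are assumptions for illustration; how the platform itself wires the engine is not shown here.

```python
from typing import Any, Optional


def transcribe_file(path: str, model: Optional[Any] = None, model_size: str = "small") -> str:
    """Return the transcript of one audio file as a single string."""
    if model is None:
        # Imported lazily so this module still loads where faster-whisper is absent.
        from faster_whisper import WhisperModel

        # Assumed settings: int8 on CPU keeps memory use low; adjust for a GPU host.
        model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy iterable of segments plus an info object;
    # decoding actually runs as the segments are iterated.
    segments, _info = model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)
```

Calling transcribe_file("note.wav") with no model argument loads the model on first use. Passing an already loaded model avoids reloading it for every file.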

The local-first arrangement matters more for voice than for most features. Audio is the most personal data a system handles, and the default path here keeps both directions, listening and speaking, entirely on hardware you own. The Docker image ships ready for this; on a bare server, faster-whisper and piper-tts are the two optional Python packages from the install guide.
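
On a bare server, a quick check like the one below can confirm whether the two optional packages are importable. It is a sketch under assumptions: the helper name is illustrative, and it assumes the packages install under the module names faster_whisper and piper.

```python
from importlib.util import find_spec

# pip package name -> module name it is assumed to install.
OPTIONAL_VOICE_PACKAGES = {
    "faster-whisper": "faster_whisper",
    "piper-tts": "piper",
}


def missing_voice_packages(packages=OPTIONAL_VOICE_PACKAGES):
    """List the pip names whose modules cannot be found in this environment."""
    return [pip_name for pip_name, module in packages.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_voice_packages()
    if missing:
        print("Missing optional voice packages:", ", ".join(missing))
    else:
        print("Both local voice engines are importable.")
```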

Transcription as a Tool

The speech engines are also ordinary members of the tool layer, which means agents can apply them to any audio, not just live conversation. Hand an agent a recording (a meeting, a voicemail export, an interview) and it can transcribe it into text that flows into the same searchable record as everything else, summarize it, or save the substance to the memory bank. The reverse works too: any agent can synthesize speech from text where a spoken artifact is the right output. Voice capability, once present, belongs to the whole system.
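
For longer recordings, timestamps make the resulting text easier to search and quote. The sketch below formats transcription segments into timestamped lines. It relies only on the start and text attributes that faster-whisper segments carry; the function name and the output layout are illustrative choices, not the platform's stored format.

```python
def format_transcript(segments) -> str:
    """Render segments as one '[MM:SS] text' line each."""
    lines = []
    for segment in segments:
        # segment.start is the offset in seconds from the start of the audio.
        minutes, seconds = divmod(int(segment.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {segment.text.strip()}")
    return "\n".join(lines)
```

An agent could pass the result to a summarizer or store it as-is; each line remains traceable to a moment in the original audio.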

Choosing between the synthesis engines is a one-question decision. If keeping audio fully local matters, or you want zero external dependencies, Piper is excellent and free. If you want the widest choice of voices and accents for customer-facing speech, Polly's catalog is the draw, one set of AWS keys away. Switching later is a settings change; nothing downstream cares which engine spoke.
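
One way to picture why nothing downstream cares is a small dispatch table in which both engines share one signature. Everything platform-specific here is an assumption: the tts_engine key is hypothetical (the real names in settings.txt are not documented in this page), the default voice and model path are placeholders, and the Piper flags should be checked against your installed version. The Polly call uses boto3's synthesize_speech.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Dict

Synthesizer = Callable[[str], bytes]


def piper_speak(text: str, model_path: str = "voice.onnx") -> bytes:
    """Local path: run the piper command-line tool and return WAV bytes."""
    # Assumes `piper` is on PATH and model_path points at a downloaded voice.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "reply.wav"
        subprocess.run(
            ["piper", "--model", model_path, "--output_file", str(out)],
            input=text.encode("utf-8"),
            check=True,
        )
        return out.read_bytes()


def polly_speak(text: str, voice_id: str = "Joanna") -> bytes:
    """Cloud path: call Amazon Polly through boto3 and return MP3 bytes."""
    import boto3  # lazy import; needs AWS credentials and a region configured

    response = boto3.client("polly").synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id
    )
    return response["AudioStream"].read()


ENGINES: Dict[str, Synthesizer] = {"piper": piper_speak, "polly": polly_speak}


def pick_engine(settings: Dict[str, str], engines: Dict[str, Synthesizer] = ENGINES) -> Synthesizer:
    """Resolve the configured engine name to a text-to-audio function."""
    # "tts_engine" is a hypothetical key; the local engine is the default.
    name = settings.get("tts_engine", "piper").strip().lower()
    if name not in engines:
        raise ValueError(f"unknown TTS engine {name!r}; expected one of {sorted(engines)}")
    return engines[name]
```

Callers only ever hold a function from text to audio bytes, so changing the setting changes which function they receive and nothing else.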

What Voice Unlocks

Some patterns owners settle into quickly. The spoken status check: ask the master agent how things are running while you are away from the desk, and let autoplay read the morning's activity back to you. The walking memo: voice-note the chatbot on WhatsApp with an idea or an instruction; it lands in the conversation record, transcribed and searchable, and anything worth keeping gets saved to the memory bank. The accessible install: for anyone who finds typing slower than speaking, voice makes the whole platform's depth available at conversation speed. None of these requires setup beyond the engines being present. Voice is a property of the conversation layer, so every agent behind it simply works.

Key Takeaway

The UI offers a microphone and spoken replies, every chat platform gets transcribed voice notes with voice answers, and faster-whisper provides the ears while Piper or Polly provides the voice, local by default. Speech converts at the edges, so everything else in the platform works unchanged.