The audio pipeline of a voice OS is the chain of stages that turns raw microphone samples into spoken AI output. There are five stages: capture, voice activity detection, streaming speech-to-text, language and memory processing, and streaming speech synthesis. Each stage runs on its own time budget. If any stage stalls, the whole pipeline stalls, which is why pipeline design matters more than any single component's quality. The best STT in the world is useless if the TTS keeps the user waiting.
WHAT TO LOOK FOR
Microphone capture and Opus encoding
The browser captures audio at 48 kilohertz and encodes it into Opus frames at typically 24 kilobits per second. Opus is preferred over raw PCM because it cuts upload bandwidth by 90 percent without losing speech quality. The frame size of 20 milliseconds is the standard tradeoff between latency and packet overhead.
Voice activity detection
VAD runs on the audio gateway and decides when speech is happening, when it has stopped, and when a thinking pause is just a pause. Modern VAD uses a small neural network that is robust to background noise, music, and other speakers, which keeps the pipeline from triggering on the wrong things.
Streaming speech-to-text
The STT layer receives the audio stream and emits two kinds of transcript: interim results that update word by word, and finalized results once a phrase is committed. Interim results let the UI show what the user is saying in real time; finals are what get sent to the LLM.
TLDR:Lucy OS1 runs a fully streaming audio pipeline. The browser captures 20 millisecond Opus frames and sends them over WebRTC to the audio gateway. Deepgram nova-3 receives the frames and emits partial transcripts within 200 milliseconds. The LLM begins generating tokens as soon as endpointing fires. Cartesia Sonic-2 streams the first audio byte back to the browser within 200 milliseconds of the first LLM token. The user hears Lucy begin speaking before the full response is generated, which is why exchanges feel like talking to a person rather than waiting on a server.
The browser captures audio at 48 kilohertz and encodes it into Opus frames at typically 24 kilobits per second. Opus is preferred over raw PCM because it cuts upload bandwidth by 90 percent without losing speech quality. The frame size of 20 milliseconds is the standard tradeoff between latency and packet overhead.
VAD runs on the audio gateway and decides when speech is happening, when it has stopped, and when a thinking pause is just a pause. Modern VAD uses a small neural network that is robust to background noise, music, and other speakers, which keeps the pipeline from triggering on the wrong things.
The STT layer receives the audio stream and emits two kinds of transcript: interim results that update word by word, and finalized results once a phrase is committed. Interim results let the UI show what the user is saying in real time; finals are what get sent to the LLM.
Once an utterance is endpointed, the LLM receives the final transcript along with the injected context window. If a tool call is required, the tool router executes it and feeds the result back. The LLM emits its response as a token stream, not a full string.
The TTS layer accepts the LLM token stream and begins synthesizing audio for completed sentences immediately. Cartesia Sonic-2 begins streaming audio within 200 milliseconds of receiving the first sentence boundary, which is what eliminates the awkward pause many older voice AIs have between question and answer.
The browser plays the streaming TTS audio while the microphone stays open. If the user starts speaking again, VAD detects it, the TTS playback halts immediately, and the new utterance enters the pipeline. This is how natural interruption works.
QUICK COMPARISON
| Capability | Lucy OS1 | Most AI tools |
|---|---|---|
| Memory across sessions | ✓ Permanent, never resets | ✗ Resets after every session |
| Voice quality | ✓ Lucy OS1 Natural Voice (best-in-class) | ✗ Basic STT, struggles with noise |
| Calendar awareness | ✓ Reads Google Calendar in real time | ✗ No calendar access |
| Available 24/7 | Always on, any device | Available but stateless each time |
| Gets personal over time | ✓ Builds your context continuously | ✗ Starts from zero every session |
Voice-first AI with memory and calendar integration. Free to try.
Start TalkingFree tier available. No credit card required.
GET STARTED
Create your free account
No credit card required. Sign in with your Google account and you're inside in under a minute.
Connect your Google Calendar
Lucy reads your upcoming events before every conversation, so it already knows your day before you say a word.
Start talking about the voice ai audio pipeline
Speak naturally. Lucy listens, responds by voice, and begins building context from your very first exchange. The more you use it, the better it gets.
Welcome