Voice AI Audio Pipeline: Mic to Speaker Explained (2026)

WHAT TO LOOK FOR

The three things that actually matter

Microphone capture and Opus encoding

The browser captures audio at 48 kilohertz and encodes it into Opus frames at typically 24 kilobits per second. Opus is preferred over raw PCM because it cuts upload bandwidth by 90 percent without losing speech quality. The frame size of 20 milliseconds is the standard tradeoff between latency and packet overhead.

Voice activity detection

VAD runs on the audio gateway and decides when speech is happening, when it has stopped, and when a thinking pause is just a pause. Modern VAD uses a small neural network that is robust to background noise, music, and other speakers, which keeps the pipeline from triggering on the wrong things.

Streaming speech-to-text

The STT layer receives the audio stream and emits two kinds of transcript: interim results that update word by word, and finalized results once a phrase is committed. Interim results let the UI show what the user is saying in real time; finals are what get sent to the LLM.

TLDR:Lucy OS1 runs a fully streaming audio pipeline. The browser captures 20 millisecond Opus frames and sends them over WebRTC to the audio gateway. Deepgram nova-3 receives the frames and emits partial transcripts within 200 milliseconds. The LLM begins generating tokens as soon as endpointing fires. Cartesia Sonic-2 streams the first audio byte back to the browser within 200 milliseconds of the first LLM token. The user hears Lucy begin speaking before the full response is generated, which is why exchanges feel like talking to a person rather than waiting on a server.

Why Lucy OS1

Microphone capture and Opus encoding

Voice activity detection

Streaming speech-to-text

Language model and tool routing

Once an utterance is endpointed, the LLM receives the final transcript along with the injected context window. If a tool call is required, the tool router executes it and feeds the result back. The LLM emits its response as a token stream, not a full string.

Streaming text-to-speech

The TTS layer accepts the LLM token stream and begins synthesizing audio for completed sentences immediately. Cartesia Sonic-2 begins streaming audio within 200 milliseconds of receiving the first sentence boundary, which is what eliminates the awkward pause many older voice AIs have between question and answer.

Audio playback and barge-in

The browser plays the streaming TTS audio while the microphone stays open. If the user starts speaking again, VAD detects it, the TTS playback halts immediately, and the new utterance enters the pipeline. This is how natural interruption works.

QUICK COMPARISON

Lucy OS1 vs most AI tools

Capability	Lucy OS1	Most AI tools
Memory across sessions	✓ Permanent, never resets	✗ Resets after every session
Voice quality	✓ Lucy OS1 Natural Voice (best-in-class)	✗ Basic STT, struggles with noise
Calendar awareness	✓ Reads Google Calendar in real time	✗ No calendar access
Available 24/7	Always on, any device	Available but stateless each time
Gets personal over time	✓ Builds your context continuously	✗ Starts from zero every session

Try Lucy OS1, setup takes 30 seconds

Voice-first AI with memory and calendar integration. Free to try.

Start Talking

Free tier available. No credit card required.

GET STARTED

How to use Lucy OS1

Create your free account

No credit card required. Sign in with your Google account and you're inside in under a minute.

Connect your Google Calendar

Lucy reads your upcoming events before every conversation, so it already knows your day before you say a word.

Start talking about the voice ai audio pipeline

Speak naturally. Lucy listens, responds by voice, and begins building context from your very first exchange. The more you use it, the better it gets.

Start for free → Free tier available. No credit card.

Frequently Asked Questions

Why use Opus instead of raw PCM for the upstream audio?

Opus delivers transparent speech quality at 24 kilobits per second versus 768 kilobits per second for raw 16-bit PCM at 48 kilohertz. The bandwidth saving matters on mobile networks and reduces latency on poor links.

What is the time budget for each pipeline stage?

A 500 millisecond end-to-end target typically allocates 50 milliseconds to capture and upload, 150 milliseconds to STT finalization, 150 milliseconds to LLM first token, and 150 milliseconds to TTS first audio. The budgets compound, so any stage that overruns its share blows the whole exchange.

Does WebRTC matter or can WebSockets do this?

WebSockets work for streaming audio but lack jitter buffering, FEC, and packet loss concealment. WebRTC handles all three natively, which keeps voice AI usable on weak Wi-Fi and mobile networks.

How is background noise handled in the pipeline?

Modern STT models like Deepgram nova-3 are trained on noisy data and handle most household and office noise without preprocessing. For very noisy environments, optional noise suppression runs before encoding to improve transcript accuracy.

Can the pipeline be paused and resumed?

Yes. The session manager can pause the audio path while preserving the LLM context, then resume when the user comes back. This is how Lucy OS1 supports leaving a session running while you change rooms or pick up the phone.

How does the pipeline recover from a network drop?

The browser holds the audio session open for up to 30 seconds during reconnection. Buffered audio is replayed once the connection restores. If reconnection fails, the session is gracefully ended and the user is shown a reconnect button.

The Voice AI Audio Pipeline

The three things that actually matter

Why Lucy OS1

Microphone capture and Opus encoding

Voice activity detection

Streaming speech-to-text

Language model and tool routing

Streaming text-to-speech

Audio playback and barge-in

Lucy OS1 vs most AI tools

Try Lucy OS1, setup takes 30 seconds

How to use Lucy OS1

Frequently Asked Questions