Voice AI lives or dies on latency. Above 500ms, users start talking over the agent. Above 800ms, they hang up. Our north star has been to keep the round-trip from "user stops speaking" to "agent starts speaking" under 200ms, end-to-end, across STT, the LLM, and TTS. This post walks through how we got there.

What "200ms" actually means

The clock starts when our VAD detects end-of-utterance. It stops when the first audio packet of the agent's response leaves our edge. That window includes: VAD finalization, STT post-processing, LLM time-to-first-token, TTS first-chunk synthesis, and network egress to the carrier.
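To make that window concrete, here is a minimal sketch of per-turn instrumentation. It is not our production tracing code: the `TurnTimeline` class and the stage marker names are hypothetical, and it assumes each stage reports a single completion timestamp in milliseconds from a monotonic clock.

```python
from dataclasses import dataclass, field

# Stage order follows the window described above; the names are illustrative.
STAGES = [
    "vad_finalization",
    "stt_post_processing",
    "llm_first_token",
    "tts_first_chunk",
    "network_egress",
]


@dataclass
class TurnTimeline:
    """Completion timestamps (ms, monotonic clock) for one conversational turn."""

    start_ms: float  # moment the VAD detects end-of-utterance
    marks: dict = field(default_factory=dict)  # stage name -> completion time

    def mark(self, stage: str, at_ms: float) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.marks[stage] = at_ms

    def breakdown(self) -> dict:
        """Time attributed to each stage, measured from the previous completion."""
        out, prev = {}, self.start_ms
        for stage in STAGES:
            out[stage] = self.marks[stage] - prev
            prev = self.marks[stage]
        return out

    def total_ms(self) -> float:
        """End-of-utterance to first audio packet leaving the edge."""
        return self.marks[STAGES[-1]] - self.start_ms
```

Because the per-stage durations are differences between consecutive marks, they always sum to the total, which makes it easy to see which stage is eating the budget on a given turn.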

Three things matter more than anything else: pipelining the stages, minimizing buffers between them, and never blocking on slow tails.

Pipelining the stages

The naive pipeline runs STT → LLM → TTS sequentially, with each stage waiting for the previous one to finish. We don't wait for STT to finalize before kicking off the LLM. Instead, we send partial transcripts to a "thinker" stage that drafts the likely response while the user is still finishing their sentence. By the time end-of-utterance fires, the LLM has usually already produced 30+ tokens of a candidate response, which we either keep or discard based on the final transcript.
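The keep-or-discard step can be sketched roughly as follows. This is a simplified, synchronous illustration rather than the actual thinker: the `Thinker` and `Draft` names are hypothetical, `generate` stands in for whatever LLM call produces tokens, and the acceptance rule (the final transcript must match the partial the draft was based on, ignoring case and whitespace) is an assumption chosen for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Draft:
    based_on: str      # partial transcript the draft was generated from
    tokens: List[str]  # candidate response tokens produced so far


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


class Thinker:
    def __init__(self, generate: Callable[[str], List[str]]):
        # `generate` is a hypothetical callable: transcript -> response tokens.
        self.generate = generate
        self.draft: Optional[Draft] = None

    def on_partial(self, partial: str) -> None:
        """Draft a response speculatively while the user is still speaking."""
        self.draft = Draft(based_on=partial, tokens=self.generate(partial))

    def on_final(self, final: str) -> List[str]:
        """Keep the draft if the final transcript matches it, else regenerate."""
        draft, self.draft = self.draft, None
        if draft is not None and normalize(draft.based_on) == normalize(final):
            return draft.tokens      # keep: the head start pays off
        return self.generate(final)  # discard: fall back to the normal path
```

A stricter or looser match (for example, a semantic-similarity check) would trade wasted drafts against the risk of answering a question the user did not quite ask.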

Buffers are the enemy

Every TCP buffer between stages is latency you pay for nothing. We replaced the gRPC streams between STT and the orchestrator with shared-memory ring buffers, and between the orchestrator and TTS with a single Unix domain socket. That alone shaved 40ms off p50.
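For the orchestrator-to-TTS hop, a Unix domain socket needs some framing so the receiver knows where each message ends. Here is a minimal sketch, assuming a 4-byte length prefix; the framing and the helper names are illustrative choices, not a description of our wire format.

```python
import socket
import struct

# Assumed framing: 4-byte big-endian length, then the payload.
HEADER = struct.Struct("!I")


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Write one length-prefixed frame."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock: socket.socket) -> bytes:
    """Read one length-prefixed frame."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


if __name__ == "__main__":
    # socketpair gives two connected AF_UNIX endpoints in one process,
    # which is enough to exercise the framing without a listening server.
    orchestrator, tts = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    send_frame(orchestrator, b"Hello, how can I help?")
    print(recv_frame(tts))
    orchestrator.close()
    tts.close()
```

Both endpoints live on the same host, so the data never touches the TCP/IP stack.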

The slow-tail problem

Average latency is a lie. What you care about is the worst 5% of calls. Our biggest wins came from killing slow tails: aggressive timeouts on TTS, parallel speculative decoding on the LLM, and warm pools of pre-initialized GPU workers.
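As an illustration of the timeout idea, the sketch below caps how long the first TTS chunk may take and switches to a fallback when the cap is hit. What happens after a timeout is an assumption here (the fallback could be a second TTS worker or a cached filler phrase), and `first_chunk_or_fallback`, `primary`, and `fallback` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def first_chunk_or_fallback(
    primary: Callable[[], Awaitable[bytes]],
    fallback: Callable[[], Awaitable[bytes]],
    timeout_s: float,
) -> bytes:
    """Return the primary's first audio chunk unless it exceeds timeout_s."""
    try:
        # wait_for cancels the primary coroutine when the timeout expires,
        # so a stuck worker does not keep holding resources.
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumed policy: serve something quickly rather than wait out the tail.
        return await fallback()
```

The timeout only helps the tail if the fallback is reliably fast, so it is worth measuring the fallback's own latency distribution as well.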

Where we are now

p50 sits at 180ms. p95 at 290ms. p99 at 450ms. That p99 is what we're working on next.
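For readers who want to compute the same kind of figures from their own call logs, here is a small sketch using the nearest-rank definition of a percentile. The choice of nearest-rank (rather than an interpolating method) is an assumption; different methods give slightly different numbers on small samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> dict:
    """p50, p95 and p99 for a batch of per-turn latencies in milliseconds."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With only a few hundred samples, p99 is decided by a handful of calls, so tail figures are best computed over a large window.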


Alok
