xAI’s December announcement of the Grok Voice Agent API prompted me to revisit the state of realtime speech-to-speech models. I spent a few weeks testing all three major platforms against production CX requirements. Here’s an honest assessment.
Platforms evaluated:
- OpenAI Realtime API (gpt-realtime, GA since August 2025)
- Google Gemini Live API (gemini-2.5-flash-native-audio-preview)
- xAI Grok Voice Agent API (launched December 17, 2025)
What Are Native Audio Models, Really?
Before diving into specifics, it’s worth being precise about the architecture.
Native audio models are not “STT → LLM → TTS in one box.” They’re end-to-end models operating directly on learned audio representations — spectrogram-like features processed by the neural network. There’s no explicit transcription step.
These models still learn internal linguistic representations similar to text — implicitly, during training. The difference is architectural: audio in, audio out, with no intermediate text layer exposed to developers.
This explains both the advantages (preserved prosody, lower latency, better emotional context) and the limitations (less control, harder debugging, language-specific issues).
The Promise
Traditional voice AI uses a cascaded pipeline:
Audio → ASR → Text → LLM → Text → TTS → Audio
Each handoff adds latency. Each text conversion loses information — prosody, emotion, emphasis, speaker style. By the time you’ve transcribed, processed, and synthesized, you’ve accumulated 800–1500 ms of delay and flattened the conversational nuance.
Realtime models collapse this:
Audio → Native Audio Model → Audio
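The accumulated delay in the cascade is easy to see with a back-of-envelope sum. All per-stage figures below are illustrative assumptions, not measurements from any specific vendor:

```python
# Illustrative per-stage latencies in milliseconds -- assumed figures
# for a typical cascaded deployment, not vendor benchmarks.
CASCADED_STAGES = {
    "ASR (streaming finalisation)": 300,
    "LLM (time to first token)": 400,
    "TTS (time to first audio)": 250,
    "network hops between stages": 150,
}

NATIVE_STAGES = {
    "native audio model (time to first audio)": 650,
}

def total_latency(stages: dict[str, int]) -> int:
    """Sum the per-stage delays; every handoff in a cascade adds up."""
    return sum(stages.values())

print(f"cascaded: {total_latency(CASCADED_STAGES)} ms")
print(f"native:   {total_latency(NATIVE_STAGES)} ms")
```

With these assumed numbers the cascade lands at 1100 ms, squarely inside the 800–1500 ms range above, while the single native hop stays under 700 ms.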
OpenAI claims 82.8% on Big Bench Audio reasoning. xAI claims sub-700 ms time-to-first-audio. The benchmarks are impressive.
On paper, this is the future. In practice, the platforms aren’t ready for production CX.
The Issues: Control & Flexibility
Dynamic Context Is Limited
The gpt-realtime API doesn’t support passing dynamic variables to saved prompts.
In CX, every call needs context:
"Hello {customer_name}, I see your order {order_id} is scheduled for {date}..."
"Your account balance is {balance}. Would you like to make a payment?"
Without dynamic prompts, you’re constructing full prompts client-side or building workarounds with tool calls. It’s solvable — but it’s friction that shouldn’t exist in a production API.
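The client-side workaround looks something like this: render the template yourself before opening the session and send the full instructions each time. A minimal sketch, with hypothetical field names:

```python
from string import Template

# Workaround sketch: since dynamic variables can't be passed to saved
# prompts, interpolate the per-call context client-side and ship the
# fully rendered instructions with each session. Field names here are
# hypothetical, not any vendor's schema.
GREETING = Template(
    "Hello $customer_name, I see your order $order_id "
    "is scheduled for $date. How can I help?"
)

def build_instructions(customer: dict) -> str:
    """Render the per-call system prompt before the session opens."""
    return GREETING.substitute(
        customer_name=customer["name"],
        order_id=customer["order_id"],
        date=customer["delivery_date"],
    )

print(build_instructions({
    "name": "Anna",
    "order_id": "PL-10432",
    "delivery_date": "18 March",
}))
```

It works, but now every caller of the API owns templating, escaping, and prompt versioning that a saved-prompt mechanism should handle server-side.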
Tool Calling Is Unreliable or Missing
For CX, voice AI needs to take actions — check orders, book appointments, update records, process returns.
OpenAI: MCP tools support exists but is buggy. In testing, tool calls fail unpredictably. Sometimes they work. Sometimes the model doesn’t invoke them. Sometimes it hallucinates a response instead of waiting for the function result.
Gemini Live: Doesn’t support MCP at all. You’re limited to native function calling, which has its own reliability issues.
In pipelines, tool calling happens at the LLM layer — text-based, debuggable, mature. You control the orchestration.
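Until tool calling stabilises, a defensive guard on the event stream helps catch the hallucinated-answer failure mode. The sketch below uses hypothetical event type names ("function_call", "function_result", "response"), not any vendor's actual event schema:

```python
from dataclasses import dataclass

# Defensive check sketch: before surfacing spoken output, verify the
# model actually invoked the tool and a real function result came back.
# Event types here are hypothetical stand-ins for a vendor's events.
@dataclass
class Event:
    type: str
    payload: dict

def guard_response(events: list[Event]) -> str:
    """Only surface output that follows a completed tool call."""
    saw_call = any(e.type == "function_call" for e in events)
    saw_result = any(e.type == "function_result" for e in events)
    if saw_call and not saw_result:
        return "escalate: tool invoked but no result returned"
    if not saw_call:
        return "escalate: model answered without checking the system"
    return next(e.payload["text"] for e in events if e.type == "response")
```

This catches the "answered without waiting for the function result" case, but it's a patch over unreliability that the platform should fix.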
Language Support Varies
Good news: Gemini Live offers more voice options than OpenAI, some genuinely impressive for professional CX — natural prosody, appropriate pacing.
Bad news: Those voices can’t correctly pronounce Polish phonemes. Nasal vowels get skipped. Soft consonants get anglicised. Consonant clusters come out garbled. A voice that sounds professional in English sounds broken in Polish.
If you’re building for European markets, test your target language extensively before committing. With pipelines, you choose a TTS provider optimised for your language.
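A cheap way to start that testing is a pronunciation smoke test over phrases that exercise the problem phonemes. A minimal sketch; the Polish phrases below were chosen to cover nasal vowels, soft consonants, and consonant clusters:

```python
# Minimal pronunciation smoke-test sketch. Each phrase targets a Polish
# feature the voices mishandled. Ratings still need a native speaker --
# this just generates the review sheet.
TEST_PHRASES = [
    ("nasal vowels", "Idą gęsi wąską dróżką."),
    ("soft consonants", "Właśnie dzisiaj cieszę się."),
    ("consonant clusters", "W Szczebrzeszynie chrząszcz brzmi w trzcinie."),
]

def make_checklist(phrases: list[tuple[str, str]]) -> list[str]:
    """Emit one review line per phrase for a native speaker to rate."""
    return [f"[{feature}] {text}  -> rating: __/5"
            for feature, text in phrases]

for row in make_checklist(TEST_PHRASES):
    print(row)
```

Synthesise each phrase on every candidate voice and have a native speaker score the results before you commit to a platform.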
The Issues: Observability & Debugging
Black Box Infrastructure
gpt-realtime offers managed SIP ingress and a media server — connect directly to phone systems without building your own telephony layer. Convenient for getting started.
The problem: No access to logs from those components.
You can’t monitor call quality (jitter, packet loss, MOS). You can’t debug dropping calls. You can’t trace SIP handshakes. You can’t audit for compliance. When something breaks, you can’t tell if it’s your configuration, the network, or OpenAI’s infrastructure.
This isn’t just a production blocker — it makes PoC development frustrating. Debugging requires visibility.
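With your own telephony layer, those metrics are straightforward to compute. Interarrival jitter, for instance, follows the RFC 3550 estimator, where each new packet pair contributes J += (|D| - J) / 16:

```python
# RFC 3550 interarrival jitter: compare consecutive packets' transit
# times (arrival time minus RTP timestamp, same clock units) and fold
# the difference into a running estimate with gain 1/16.
def update_jitter(jitter: float, transit_prev: float, transit_cur: float) -> float:
    """One RFC 3550 jitter update from consecutive packet transit times."""
    d = abs(transit_cur - transit_prev)
    return jitter + (d - jitter) / 16.0

# Illustrative transit times in milliseconds, not real capture data.
transits = [10.0, 12.0, 11.0, 15.0, 10.5]
jitter = 0.0
for prev, cur in zip(transits, transits[1:]):
    jitter = update_jitter(jitter, prev, cur)
print(f"estimated jitter: {jitter:.2f} ms")
```

None of this is possible when the media server is a managed black box that exposes no packet-level data.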
You Can’t Isolate Failures
When an ASR-LLM-TTS pipeline fails, you know where: the transcription was wrong, the LLM hallucinated, or the TTS mispronounced something. You trace through each stage.
When a native audio model produces bad output, you can’t isolate the problem. The audio went in, different audio came out, and something in between went wrong.
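That stage-level visibility is cheap to build into a pipeline. A sketch with stub stages standing in for real ASR/LLM/TTS providers; every turn leaves a per-stage trace pointing at the component to blame:

```python
import time

# Tracing sketch: wrap each pipeline stage so every turn records what
# went in, what came out, and how long it took. The lambdas below are
# stubs -- a real deployment would call provider SDKs instead.
def traced(stage_name, fn, trace):
    def wrapper(payload):
        start = time.perf_counter()
        result = fn(payload)
        trace.append({
            "stage": stage_name,
            "ms": round((time.perf_counter() - start) * 1000, 1),
            "output_preview": str(result)[:60],
        })
        return result
    return wrapper

trace: list[dict] = []
asr = traced("asr", lambda audio: "where is my order", trace)
llm = traced("llm", lambda text: f"answer to: {text}", trace)
tts = traced("tts", lambda text: b"<synthesised audio>", trace)

tts(llm(asr(b"<caller audio>")))
for entry in trace:
    print(entry["stage"], entry["output_preview"])
```

When a call goes wrong, the trace tells you whether the transcript, the response, or the synthesis was at fault. A native audio model gives you no equivalent seam to instrument.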
The Issues: Operational Constraints
Session Limits
Gemini Live:
- Audio-only: 15 minutes max
- Audio + video: 2 minutes max
- Connection lifetime: ~10 minutes
Complex CX calls regularly exceed these. Claims disputes. Technical troubleshooting. Onboarding flows. When your session terminates mid-call, the customer experience breaks.
OpenAI’s 60-minute limit is better, but session management, context limits (128k tokens), and reconnection handling add complexity.
In pipelines, session architecture is yours. You control context, state, and graceful degradation.
Cost Compounds with Conversation Length
Realtime models maintain conversational context by reprocessing audio history each turn. Longer conversation = more accumulated context = higher cost per response. A 2-minute call might cost $0.10. A 10-minute call might cost $2+. The economics become unpredictable for longer interactions.
Pipelines process each turn relatively independently. The LLM sees a context window, but you control what’s in it. Costs are more linear and predictable — roughly the same per-minute rate whether it’s minute 2 or minute 12.
For CX with variable call lengths, predictable unit economics matter.
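A toy cost model makes the difference concrete. All rates below are assumptions for illustration, not vendor pricing; the point is the shape of the curves, not the absolute numbers:

```python
# Toy cost model -- rates and turn counts are assumed, not vendor
# pricing. Native: each turn reprocesses the full audio history, so
# per-turn cost grows with elapsed minutes (superlinear total).
# Pipeline: each turn costs roughly the same (linear total).
def native_cost(minutes: int, rate_per_ctx_min: float = 0.02,
                turns_per_min: int = 2) -> float:
    total = 0.0
    for minute in range(1, minutes + 1):
        # each turn this minute pays to reprocess ~`minute` min of context
        total += turns_per_min * minute * rate_per_ctx_min
    return total

def pipeline_cost(minutes: int, rate_per_min: float = 0.08) -> float:
    return minutes * rate_per_min

for m in (2, 10):
    print(f"{m:>2} min  native ${native_cost(m):.2f}"
          f"  pipeline ${pipeline_cost(m):.2f}")
```

Under these assumptions the native model goes from roughly $0.12 at 2 minutes to $2.20 at 10, while the pipeline scales linearly; that matches the compounding behaviour described above.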
Vendor Lock-in
With realtime models, you’re committed to a single vendor’s audio processing, voice generation, and (often) telephony stack. If OpenAI’s voices don’t fit your brand, or Gemini’s pronunciation doesn’t work for your language, or xAI’s latency degrades — you’re stuck.
With pipelines, each component is independent. Swap ASR providers. Switch LLMs. Try different TTS voices. Upgrade as better options emerge. The modular architecture protects your investment.
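That modular boundary can be made explicit as a set of minimal interfaces, so anything behind them is swappable. A sketch using `typing.Protocol`; the method names are my own, not any SDK's:

```python
from typing import Protocol

# Interface sketch: each pipeline stage behind a minimal protocol, so
# providers can be swapped without touching the rest of the stack.
# Method names here are illustrative, not a real SDK surface.
class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesise(self, text: str) -> bytes: ...

class VoicePipeline:
    def __init__(self, asr: ASR, llm: LLM, tts: TTS):
        self.asr, self.llm, self.tts = asr, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        """One conversational turn; every component is replaceable."""
        text = self.asr.transcribe(audio_in)
        reply = self.llm.respond(text)
        return self.tts.synthesise(reply)
```

Swapping the TTS for a Polish-optimised voice, or the LLM for a cheaper model, is a one-line change at construction time rather than a platform migration.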
What’s Genuinely Impressive
Despite the issues, there’s real progress worth acknowledging:
Latency is better. Sub-700 ms response times in ideal conditions make a noticeable difference to conversational flow.
Prosody is preserved. The models respond to tone and emphasis in ways text-mediated pipelines can’t. Emotional context carries through.
Voice variety is expanding. Gemini’s selection shows where this is heading — more options for matching brand personality.
Pricing models are evolving. xAI’s $0.05/min flat rate is simpler than token-based pricing. Predictable costs matter for business planning.
The Architecture Tradeoff
| Dimension | Native Audio (Realtime) | ASR-LLM-TTS Pipeline |
|---|---|---|
| Latency | Better (sub-700 ms) | Higher (800–1500 ms) |
| Prosody / Emotion | Preserved | Lost in transcription |
| Dynamic context | Limited | Full control |
| Tool calling | Unreliable | Mature, debuggable |
| Observability | Black box | Component-level |
| Language support | Vendor-dependent | Mix best-in-class |
| Session management | Platform constraints | Your architecture |
| Cost model | Compounds with length | Predictable |
| Vendor lock-in | High | Low |
When to Use Which
Realtime models make sense for:
- Demos and prototypes
- Pure conversation (no tool calls)
- English-only deployment
- Short interactions (under 5 minutes)
- Latency-critical, low-complexity use cases
Pipelines make sense for:
- Production CX deployments
- Tool calling and system integration
- Multi-language support
- Complex, variable-length interactions
- Environments requiring observability
- Teams wanting vendor flexibility
What Would Change This Assessment?
I’m watching for:
- Dynamic prompt support — basic feature, shouldn’t be hard
- Reliable MCP / tool calling — across all platforms
- Observability on managed infrastructure — logs, metrics, traces
- Better non-English voice quality — especially European languages
- Linear cost scaling — or at least predictable pricing
These aren’t exotic requirements. They’re table stakes for production CX.
The Bottom Line
Realtime speech-to-speech models represent the future of voice AI. The architecture is elegant. The user experience is more natural. The direction is right.
But “right direction” isn’t “production ready.”
For customer experience deployments today, the ASR-LLM-TTS pipeline remains the architecture to ship:
- You control each component
- You can debug what breaks
- You can integrate with your systems
- You can optimise for your language
- You can swap providers as better options emerge
I’ll revisit this in Q2. The platforms are improving fast. But today, if you need to ship a production voice agent — build the pipeline.
Testing these platforms yourself? Get in touch — I’d like to compare notes on what you’re finding.