xAI’s December announcement of the Grok Voice Agent API prompted me to revisit the state of realtime speech-to-speech models. I spent a few weeks testing all three major platforms against production CX requirements. Here’s an honest assessment.
Platforms evaluated:
- OpenAI Realtime API (gpt-realtime, GA since August 2025)
- Google Gemini Live API (gemini-2.5-flash-native-audio-preview)
- xAI Grok Voice Agent API (launched December 17, 2025)
What Are Native Audio Models, Really?
Before diving into specifics, it’s worth being precise about the architecture.
Native audio models are not “STT → LLM → TTS in one box.” They’re end-to-end models operating directly on learned audio representations — spectrogram-like features processed by the neural network. There’s no explicit transcription step.
These models still learn internal linguistic representations similar to text — implicitly, during training. The difference is architectural: audio in, audio out, with no intermediate text layer exposed to developers.
This explains both the advantages (preserved prosody, lower latency, better emotional context) and the limitations (less control, harder debugging, language-specific issues).
The Promise
Traditional voice AI uses a cascaded pipeline:
Audio → ASR → Text → LLM → Text → TTS → Audio
Each handoff adds latency. Each text conversion loses information — prosody, emotion, emphasis, speaker style. By the time you’ve transcribed, processed, and synthesized, you’ve accumulated 800–1500 ms of delay and flattened the conversational nuance.
Realtime models collapse this:
Audio → Native Audio Model → Audio
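The accumulated delay in the cascade is easy to see with a back-of-envelope sum. All per-stage figures below are illustrative assumptions, not measurements from any specific vendor:

```python
# Illustrative per-stage latencies in milliseconds -- assumed figures
# for a typical cascaded deployment, not vendor benchmarks.
CASCADED_STAGES = {
    "ASR (streaming finalisation)": 300,
    "LLM (time to first token)": 400,
    "TTS (time to first audio)": 250,
    "network hops between stages": 150,
}

NATIVE_STAGES = {
    "native audio model (time to first audio)": 650,
}

def total_latency(stages: dict[str, int]) -> int:
    """Sum the per-stage delays; every handoff in a cascade adds up."""
    return sum(stages.values())

print(f"cascaded: {total_latency(CASCADED_STAGES)} ms")
print(f"native:   {total_latency(NATIVE_STAGES)} ms")
```

With these assumed numbers the cascade lands at 1100 ms, squarely inside the 800–1500 ms range above, while the single native hop stays under 700 ms.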
OpenAI claims 82.8% on Big Bench Audio reasoning. xAI claims sub-700 ms time-to-first-audio. The benchmarks are impressive.
On paper, this is the future. In practice, the platforms aren’t ready for production CX.
The Issues: Control & Flexibility
Dynamic Context Is Limited
The gpt-realtime API doesn’t support passing dynamic variables to saved prompts.
In CX, every call needs context:
"Hello {customer_name}, I see your order {order_id} is scheduled for {date}..."
"Your account balance is {balance}. Would you like to make a payment?"
Without dynamic prompts, you’re constructing full prompts client-side or building workarounds with tool calls. It’s solvable — but it’s friction that shouldn’t exist in a production API.
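The client-side workaround looks something like this: render the template yourself before opening the session and send the full instructions each time. A minimal sketch, with hypothetical field names:

```python
from string import Template

# Workaround sketch: since dynamic variables can't be passed to saved
# prompts, interpolate the per-call context client-side and ship the
# fully rendered instructions with each session. Field names here are
# hypothetical, not any vendor's schema.
GREETING = Template(
    "Hello $customer_name, I see your order $order_id "
    "is scheduled for $date. How can I help?"
)

def build_instructions(customer: dict) -> str:
    """Render the per-call system prompt before the session opens."""
    return GREETING.substitute(
        customer_name=customer["name"],
        order_id=customer["order_id"],
        date=customer["delivery_date"],
    )

print(build_instructions({
    "name": "Anna",
    "order_id": "PL-10432",
    "delivery_date": "18 March",
}))
```

It works, but now every caller of the API owns templating, escaping, and prompt versioning that a saved-prompt mechanism should handle server-side.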
Tool Calling Is Unreliable or Missing
For CX, voice AI needs to take actions — check orders, book appointments, update records, process returns.
OpenAI: MCP tools support exists but is buggy. In testing, tool calls fail unpredictably. Sometimes they work. Sometimes the model doesn’t invoke them. Sometimes it hallucinates a response instead of waiting for the function result.
Gemini Live: Doesn’t support MCP at all. You’re limited to native function calling, which has its own reliability issues.
In pipelines, tool calling happens at the LLM layer — text-based, debuggable, mature. You control the orchestration.
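Until tool calling stabilises, a defensive guard on the event stream helps catch the hallucinated-answer failure mode. The sketch below uses hypothetical event type names ("function_call", "function_result", "response"), not any vendor's actual event schema:

```python
from dataclasses import dataclass

# Defensive check sketch: before surfacing spoken output, verify the
# model actually invoked the tool and a real function result came back.
# Event types here are hypothetical stand-ins for a vendor's events.
@dataclass
class Event:
    type: str
    payload: dict

def guard_response(events: list[Event]) -> str:
    """Only surface output that follows a completed tool call."""
    saw_call = any(e.type == "function_call" for e in events)
    saw_result = any(e.type == "function_result" for e in events)
    if saw_call and not saw_result:
        return "escalate: tool invoked but no result returned"
    if not saw_call:
        return "escalate: model answered without checking the system"
    return next(e.payload["text"] for e in events if e.type == "response")
```

This catches the "answered without waiting for the function result" case, but it's a patch over unreliability that the platform should fix.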
Language Support Varies
Good news: Gemini Live offers more voice options than OpenAI, some genuinely impressive for professional CX — natural prosody, appropriate pacing.
Bad news: Those voices can’t correctly pronounce Polish phonemes. Nasal vowels get skipped. Soft consonants get anglicised. Consonant clusters come out garbled. A voice that sounds professional in English sounds broken in Polish.
If you’re building for European markets, test your target language extensively before committing. With pipelines, you choose a TTS provider optimised for your language.
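A cheap way to start that testing is a pronunciation smoke test over phrases that exercise the problem phonemes. A minimal sketch; the Polish phrases below were chosen to cover nasal vowels, soft consonants, and consonant clusters:

```python
# Minimal pronunciation smoke-test sketch. Each phrase targets a Polish
# feature the voices mishandled. Ratings still need a native speaker --
# this just generates the review sheet.
TEST_PHRASES = [
    ("nasal vowels", "Idą gęsi wąską dróżką."),
    ("soft consonants", "Właśnie dzisiaj cieszę się."),
    ("consonant clusters", "W Szczebrzeszynie chrząszcz brzmi w trzcinie."),
]

def make_checklist(phrases: list[tuple[str, str]]) -> list[str]:
    """Emit one review line per phrase for a native speaker to rate."""
    return [f"[{feature}] {text}  -> rating: __/5"
            for feature, text in phrases]

for row in make_checklist(TEST_PHRASES):
    print(row)
```

Synthesise each phrase on every candidate voice and have a native speaker score the results before you commit to a platform.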
The Issues: Observability & Debugging
Black Box Infrastructure
gpt-realtime offers managed SIP ingress and a media server — connect directly to phone systems without building your own telephony layer. Convenient for getting started.
The problem: No access to logs from those components.
You can’t monitor call quality (jitter, packet loss, MOS). You can’t debug dropping calls. You can’t trace SIP handshakes. You can’t audit for compliance. When something breaks, you can’t tell if it’s your configuration, the network, or OpenAI’s infrastructure.
This isn’t just a production blocker — it makes PoC development frustrating. Debugging requires visibility.
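With your own telephony layer, those metrics are straightforward to compute. Interarrival jitter, for instance, follows the RFC 3550 estimator, where each new packet pair contributes J += (|D| - J) / 16:

```python
# RFC 3550 interarrival jitter: compare consecutive packets' transit
# times (arrival time minus RTP timestamp, same clock units) and fold
# the difference into a running estimate with gain 1/16.
def update_jitter(jitter: float, transit_prev: float, transit_cur: float) -> float:
    """One RFC 3550 jitter update from consecutive packet transit times."""
    d = abs(transit_cur - transit_prev)
    return jitter + (d - jitter) / 16.0

# Illustrative transit times in milliseconds, not real capture data.
transits = [10.0, 12.0, 11.0, 15.0, 10.5]
jitter = 0.0
for prev, cur in zip(transits, transits[1:]):
    jitter = update_jitter(jitter, prev, cur)
print(f"estimated jitter: {jitter:.2f} ms")
```

None of this is possible when the media server is a managed black box that exposes no packet-level data.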
You Can’t Isolate Failures
When an ASR-LLM-TTS pipeline fails, you know where: the transcription was wrong, the LLM hallucinated, or the TTS mispronounced something. You trace through each stage.
When a native audio model produces bad output, you can’t isolate the problem. The audio went in, different audio came out, and something in between went wrong.
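That stage-level visibility is cheap to build into a pipeline. A sketch with stub stages standing in for real ASR/LLM/TTS providers; every turn leaves a per-stage trace pointing at the component to blame:

```python
import time

# Tracing sketch: wrap each pipeline stage so every turn records what
# went in, what came out, and how long it took. The lambdas below are
# stubs -- a real deployment would call provider SDKs instead.
def traced(stage_name, fn, trace):
    def wrapper(payload):
        start = time.perf_counter()
        result = fn(payload)
        trace.append({
            "stage": stage_name,
            "ms": round((time.perf_counter() - start) * 1000, 1),
            "output_preview": str(result)[:60],
        })
        return result
    return wrapper

trace: list[dict] = []
asr = traced("asr", lambda audio: "where is my order", trace)
llm = traced("llm", lambda text: f"answer to: {text}", trace)
tts = traced("tts", lambda text: b"<synthesised audio>", trace)

tts(llm(asr(b"<caller audio>")))
for entry in trace:
    print(entry["stage"], entry["output_preview"])
```

When a call goes wrong, the trace tells you whether the transcript, the response, or the synthesis was at fault. A native audio model gives you no equivalent seam to instrument.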
The Issues: Operational Constraints
Session Limits
Gemini Live:
- Audio-only: 15 minutes max
- Audio + video: 2 minutes max
- Connection lifetime: ~10 minutes
Complex CX calls regularly exceed these. Claims disputes. Technical troubleshooting. Onboarding flows. When your session terminates mid-call, the customer experience breaks.
OpenAI’s 60-minute limit is better, but session management, context limits (128k tokens), and reconnection handling add complexity.
In pipelines, session architecture is yours. You control context, state, and graceful degradation.
Cost Compounds with Conversation Length
Realtime models maintain conversational context by reprocessing audio history each turn. Longer conversation = more accumulated context = higher cost per response. A 2-minute call might cost $0.10. A 10-minute call might cost $2+. The economics become unpredictable for longer interactions.
Pipelines process each turn relatively independently. The LLM sees a context window, but you control what’s in it. Costs are more linear and predictable — roughly the same per-minute rate whether it’s minute 2 or minute 12.
For CX with variable call lengths, predictable unit economics matter.
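A toy cost model makes the difference concrete. All rates below are assumptions for illustration, not vendor pricing; the point is the shape of the curves, not the absolute numbers:

```python
# Toy cost model -- rates and turn counts are assumed, not vendor
# pricing. Native: each turn reprocesses the full audio history, so
# per-turn cost grows with elapsed minutes (superlinear total).
# Pipeline: each turn costs roughly the same (linear total).
def native_cost(minutes: int, rate_per_ctx_min: float = 0.02,
                turns_per_min: int = 2) -> float:
    total = 0.0
    for minute in range(1, minutes + 1):
        # each turn this minute pays to reprocess ~`minute` min of context
        total += turns_per_min * minute * rate_per_ctx_min
    return total

def pipeline_cost(minutes: int, rate_per_min: float = 0.08) -> float:
    return minutes * rate_per_min

for m in (2, 10):
    print(f"{m:>2} min  native ${native_cost(m):.2f}"
          f"  pipeline ${pipeline_cost(m):.2f}")
```

Under these assumptions the native model goes from roughly $0.12 at 2 minutes to $2.20 at 10, while the pipeline scales linearly; that matches the compounding behaviour described above.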
Vendor Lock-in
With realtime models, you’re committed to a single vendor’s audio processing, voice generation, and (often) telephony stack. If OpenAI’s voices don’t fit your brand, or Gemini’s pronunciation doesn’t work for your language, or xAI’s latency degrades — you’re stuck.
With pipelines, each component is independent. Swap ASR providers. Switch LLMs. Try different TTS voices. Upgrade as better options emerge. The modular architecture protects your investment.
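That modular boundary can be made explicit as a set of minimal interfaces, so anything behind them is swappable. A sketch using `typing.Protocol`; the method names are my own, not any SDK's:

```python
from typing import Protocol

# Interface sketch: each pipeline stage behind a minimal protocol, so
# providers can be swapped without touching the rest of the stack.
# Method names here are illustrative, not a real SDK surface.
class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def respond(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesise(self, text: str) -> bytes: ...

class VoicePipeline:
    def __init__(self, asr: ASR, llm: LLM, tts: TTS):
        self.asr, self.llm, self.tts = asr, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        """One conversational turn; every component is replaceable."""
        text = self.asr.transcribe(audio_in)
        reply = self.llm.respond(text)
        return self.tts.synthesise(reply)
```

Swapping the TTS for a Polish-optimised voice, or the LLM for a cheaper model, is a one-line change at construction time rather than a platform migration.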
What’s Genuinely Impressive
Despite the issues, there’s real progress worth acknowledging:
Latency is better. Sub-700 ms response times in ideal conditions make a noticeable difference to conversational flow.
Prosody is preserved. The models respond to tone and emphasis in ways text-mediated pipelines can’t. Emotional context carries through.
Voice variety is expanding. Gemini’s selection shows where this is heading — more options for matching brand personality.
Pricing models are evolving. xAI’s $0.05/min flat rate is simpler than token-based pricing. Predictable costs matter for business planning.
The Architecture Tradeoff
| Dimension | Native Audio (Realtime) | ASR-LLM-TTS Pipeline |
|---|---|---|
| Latency | Better (sub-700 ms) | Higher (800–1500 ms) |
| Prosody / Emotion | Preserved | Lost in transcription |
| Dynamic context | Limited | Full control |
| Tool calling | Unreliable | Mature, debuggable |
| Observability | Black box | Component-level |
| Language support | Vendor-dependent | Mix best-in-class |
| Session management | Platform constraints | Your architecture |
| Cost model | Compounds with length | Predictable |
| Vendor lock-in | High | Low |
When to Use Which
Realtime models make sense for:
- Demos and prototypes
- Pure conversation (no tool calls)
- English-only deployment
- Short interactions (under 5 minutes)
- Latency-critical, low-complexity use cases
Pipelines make sense for:
- Production CX deployments
- Tool calling and system integration
- Multi-language support
- Complex, variable-length interactions
- Environments requiring observability
- Teams wanting vendor flexibility
What Would Change This Assessment?
I’m watching for:
- Dynamic prompt support — basic feature, shouldn’t be hard
- Reliable MCP / tool calling — across all platforms
- Observability on managed infrastructure — logs, metrics, traces
- Better non-English voice quality — especially European languages
- Linear cost scaling — or at least predictable pricing
These aren’t exotic requirements. They’re table stakes for production CX.
The Bottom Line
Realtime speech-to-speech models represent the future of voice AI. The architecture is elegant. The user experience is more natural. The direction is right.
But “right direction” isn’t “production ready.”
For customer experience deployments today, the ASR-LLM-TTS pipeline remains the architecture to ship:
- You control each component
- You can debug what breaks
- You can integrate with your systems
- You can optimise for your language
- You can swap providers as better options emerge
I’ll revisit this in Q2. The platforms are improving fast. But today, if you need to ship a production voice agent — build the pipeline.
Testing these platforms yourself? Get in touch — I’d like to compare notes on what you’re finding.