Beyond Words

Jun 09, 2026

Voice AI has gotten really good, really fast. Google’s Gemini 3.1 Flash Live and OpenAI’s GPT-Realtime-2 can now reason over a live conversation and pick up on tone, not just words. AI is starting to hear not only what we say but how we say it.

But only up to a point. Three things are still in the way. The first is what these systems do with how you sound: they hear it, then mostly go with your words. The second is who gets heard at all: the models that come closest are closed, and work best in English and a few high-resource languages, leaving most of the world behind. And third, when AI clones a voice, even the best cloning models we have today flatten vocal identity rather than preserving it, pulling every voice toward the same center.

Martijn Bartelds, a speech researcher at Together AI who did his postdoc at Stanford, has spent most of his career studying voice AI, across multilingual models, endangered-language ASR, and voice synthesis. “The human voice is so rich,” he told me. “And I think that makes human-to-human communication also so special.” I sat down with him this week to talk through where voice AI falls short and what it will take to fix it.

The architectural problem

The standard way voice AI is built is itself the first place it falls short. Traditionally, voice AI models transcribe speech to text first, then reason over the text. This architecture keeps the words and discards nearly everything else. “If you do the transcription process, of course the only thing that you keep is just an orthographic representation. So the text only. You abstract away from everything else,” Martijn explained. “But if you want to let the model deeply understand paralinguistic information – everything that is captured in the speech signal that makes up for all the things other than just the words – you need a vastly different approach.”

In our conversation, Martijn referenced a 2025 ACL survey of speech language models that also discusses the problem with current voice AI model architecture. The survey describes the problems as threefold:

Information loss during modality conversion
Latency from chaining three systems (speech-to-text (ASR) → language model (LLM) → text-to-speech (TTS))
Errors that accumulate across them
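To make those three problems concrete, here is a minimal sketch of a cascaded pipeline. Everything in it is an illustrative assumption rather than anything taken from the survey: the `Utterance` fields, the stage functions, the per-stage latencies, and the simplifying idea that each stage's errors are independent.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    words: str          # what was said
    pitch_hz: float     # hypothetical paralinguistic cue
    hesitation: float   # hypothetical cue: 0.0 = fluent, 1.0 = very hesitant


# Each toy stage returns (output, latency in milliseconds).
# The latency numbers are made up for illustration.
def asr(utt: Utterance) -> tuple[str, int]:
    # Transcription keeps only the words; pitch and hesitation are dropped here.
    return utt.words, 300


def llm(text: str) -> tuple[str, int]:
    return f"Reply to: {text}", 500


def tts(text: str) -> tuple[str, int]:
    return f"<audio:{text}>", 200


def cascade(utt: Utterance) -> tuple[str, int]:
    """Chain ASR -> LLM -> TTS and add up the latency of every stage."""
    text, t_asr = asr(utt)
    reply, t_llm = llm(text)
    audio, t_tts = tts(reply)
    return audio, t_asr + t_llm + t_tts


def compounded_accuracy(stage_accuracies: list[float]) -> float:
    """Chance that every stage is right, assuming independent errors."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


if __name__ == "__main__":
    confident = Utterance("I'm fine", pitch_hz=220.0, hesitation=0.0)
    hesitant = Utterance("I'm fine", pitch_hz=180.0, hesitation=0.9)
    # Same words, different delivery, identical response.
    print(cascade(confident) == cascade(hesitant))
    print(compounded_accuracy([0.95, 0.95, 0.95]))
```

Two speakers who say the same words with very different delivery get the same reply, the latencies of all three stages add up, and three stages that are each 95% accurate are jointly right only about 86% of the time under the independence assumption.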

Another architectural path some voice AI models choose to take is to use an audio-encoder-plus-LLM. Here a speech encoder is attached to an LLM so the model consumes a representation of the audio. But this approach has an alignment problem: “You have to somehow align the representations of the speech encoder model and the large language model.”
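One common way to approach that alignment is a learned projection that maps each encoder frame into the LLM's embedding space, so audio can sit in the same input sequence as text. The sketch below shows only the shape of that idea. The dimensions, function names, and random (untrained) weights are all assumptions for illustration, not a description of any particular model.

```python
import random

ENC_DIM = 4   # hypothetical speech-encoder output size
LLM_DIM = 6   # hypothetical LLM embedding size


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> list[list[float]]:
    """Random stand-in for a projection matrix that would normally be trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(frame: list[float], weights: list[list[float]]) -> list[float]:
    """Map one encoder frame (length in_dim) to the LLM space (length out_dim)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def build_llm_input(
    audio_frames: list[list[float]],
    text_embeddings: list[list[float]],
    weights: list[list[float]],
) -> list[list[float]]:
    """Project the audio frames, then place them ahead of the text embeddings."""
    audio_tokens = [project(frame, weights) for frame in audio_frames]
    return audio_tokens + text_embeddings


if __name__ == "__main__":
    weights = make_projection(ENC_DIM, LLM_DIM)
    audio = [[0.1] * ENC_DIM for _ in range(3)]   # 3 toy audio frames
    text = [[0.0] * LLM_DIM for _ in range(2)]    # 2 toy text embeddings
    sequence = build_llm_input(audio, text, weights)
    print(len(sequence), len(sequence[0]))
```

Matching the dimensions is the easy part. The hard part Martijn describes is training the projection so that the projected audio actually means something to the LLM, which this sketch does not attempt.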

The richness of a voice, visualized. Transcription reduces all of this to a line of text.

With both paths the audio is treated as a second-class citizen, either discarded for text or bolted on after the fact. Martijn wants a third option: a single model that treats speech and text as equals from the start, so nothing has to be translated away or stitched together. “Having the ability to really reason about the audio is something that seems crucial to me,” he said.

Building toward a unified model

A single model that keeps meaning and paralinguistic content together is also the architecture the 2025 ACL survey proposes. With no conversion to text, nothing is lost in translation. And collapsing three chained systems (ASR, LLM, and TTS) into one removes the hand-offs that add latency and let errors accumulate.

“I see the model being one complete engine, so to speak,” Martijn told me. “It should handle the text and the audio… they should be of equal importance.” In other words, he’s imagining one embedding space where it no longer matters whether a piece of understanding arrived as speech or as text, where reasoning about the audio is as native to the model as reasoning about text.

This is the architecture the frontier has now largely adopted. Google’s Gemini 3.1 Flash Live processes raw audio natively in a single real-time model, which lets it read tone and emotion along with the words. OpenAI’s GPT-Realtime-2, released in May 2026, works the same way, reasoning through a live conversation without the round trip to a separate text model older pipelines required.

However, while these models are now very good at expressive output and voice generation, they are still weak at understanding and acting on the information in the input, that is, the details of the user’s incoming voice.

The data problem

LLMs got good fast in large part because they had a lot of data from the web to train on. Audio has no readily available equivalent of the web. High-quality speech data has to be manufactured, which means cleaning recordings, aligning words to audio, labeling speakers and languages, and filtering noise. Of these steps, Martijn says alignment is probably the hardest part: “For some approaches, you need a careful alignment between the words and the actual text. So you need the data to be transcribed in the first place.”

For well-resourced languages, commercial incentives justify the work needed to collect and clean data, and there’s more raw data available in the first place. But for many of the world’s languages, such incentives often don’t exist. “For some of the digitally underrepresented languages I worked with, like the Dutch dialects or Australian Aboriginal languages, only a couple of hours are available. There is nothing else there,” he said. “This means you have to be creative. It’s about figuring out how we can get the most out of the data.”

And for Martijn, being creative means working on the model and the data together, rather than treating them as separate problems solved in sequence. During his postdoc, Martijn built a multilingual training algorithm that left the dataset fixed but made the model aware, mid-training, of which languages were lagging. This let it shift more weight toward them and lifted performance across the set without a single new hour of audio. The data question and the modeling question, as he puts it, go hand in hand.
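The article doesn't spell out how that algorithm works, so the following is not Martijn's method. It is a generic sketch of the underlying idea, in the spirit of group-robust training: keep one weight per language, and after each round of training raise the weight of languages whose loss is high. The language names, loss values, and `step_size` are hypothetical.

```python
import math


def update_language_weights(
    weights: dict[str, float],
    losses: dict[str, float],
    step_size: float = 0.5,
) -> dict[str, float]:
    """Exponentiated update: languages with higher loss gain weight.

    The result is renormalized so the weights always sum to 1.
    """
    scaled = {
        lang: weights[lang] * math.exp(step_size * losses[lang])
        for lang in weights
    }
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


def weighted_loss(weights: dict[str, float], losses: dict[str, float]) -> float:
    """Training objective: per-language losses combined using the weights."""
    return sum(weights[lang] * losses[lang] for lang in weights)


if __name__ == "__main__":
    # Hypothetical per-language losses measured partway through training.
    losses = {"high_resource_a": 0.5, "high_resource_b": 0.6, "low_resource": 2.0}
    weights = {lang: 1.0 / len(losses) for lang in losses}  # start uniform

    weights = update_language_weights(weights, losses)
    print(max(weights, key=weights.get))
    print(weighted_loss(weights, losses))
```

The dataset never changes. Only the share of the training signal each language receives does, which is the sense in which the model and the data are worked on together.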

The voice cloning distortion

Toward the end of our conversation, Martijn and I talked about the third problem: voice cloning, an increasingly prominent use of voice AI, where even with the best models we have today, the richness of a voice is getting lost on the way out.

“Voice cloning” implies fidelity, and it’s easy to assume the output of this technology is an exact copy of a speaker’s voice. But in their study, Voice “Cloning” is Style Transfer, Martijn and his collaborators found the opposite. Listeners consistently rated the cloned voices as more customer-service-like, authoritative, and warm than the originals. “These models don’t really faithfully clone someone’s voice, but more or less transfer this style of how someone speaks,” Martijn said.

The usual fear with cloning is impersonation (deepfakes, fraud, etc.). But Martijn’s work sheds light on another risk: that voice AI actually reshapes how we sound. His study found that cloning flattens vocal identity rather than preserving it, nudging every voice toward the same optimized center.

The same pattern is showing up in text. In How LLMs Distort Our Written Language, Natasha Jaques and her collaborators found that LLM edits move essays farther than human edits do (even when asked for minimal changes) and in a consistent direction.

When AI mediates human expression, it optimizes away the irregularities that make communication personal. Whether the medium is writing or voice, AI pulls us toward a common mean such that we all begin to sound and write the same.

What it means to be heard

What these systems can hear and who gets heard both come down to whether we treat voice as something richer than text. The frontier models are finally starting to hear the how and not just the what. But the systems that do it best are closed, and they still work best in English and a handful of high-resource languages. “I would just love to see more people working on trying to create these multimodal audio-text language models end to end and making them open source,” Martijn said. “Having a very strong open-source competitor in that space would be fantastic,” especially, he notes, one that also serves speakers of digitally underrepresented languages.

Martijn envisions a model that understands us more fully when we speak, and leaves us sounding like ourselves. “If we say the exact same content, but you have more hesitation in your voice, you should get a different answer than me,” he said.

To get there, voice has to be treated as something fundamentally different from text. Language is not its transcript. And being heard is not the same as being transcribed. “The words, your pitch, your tone. This is so broad and so rich, but it’s everything,” Martijn said. “The model should go beyond the words.”

Author’s note: An LLM was used for light copy editing only (spelling, grammar, and clarity). Content, meaning, tone, and structure remain unchanged.
