
Text-to-speech (TTS) moved fast over the past year. The line between synthetic and human speech narrowed. Latency dropped below 100 milliseconds for some real-time systems. Emotional control became a standard feature rather than a research demo. This guide reviews the models that matter most in 2026. It is written for AI professionals choosing a model for production.
How to read TTS benchmarks in 2026
Two benchmarks dominate community discussion. The first is the Artificial Analysis Speech Arena Leaderboard. It ranks models by blind human preference using an ELO rating. As of 2026 it evaluates dozens of production APIs. The second is the community-run TTS Arena on Hugging Face. It uses the same blind A/B voting method.
These leaderboards measure perceived quality, not accuracy. They also change continuously. As of May 30, 2026, the Artificial Analysis Speech Arena lists Gemini 3.1 Flash TTS, Realtime TTS-2 (Research Preview), Sonic 3.5, Realtime TTS 1.5 Max, and Fun-Realtime-TTS-Preview as its top five by ELO. Those positions shifted within the prior weeks, and they will shift again. Treat any single number as a point-in-time reading, not a fixed truth.
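To make an ELO gap concrete, the standard Elo formula converts a rating difference into an expected head-to-head preference rate. The sketch below assumes the classic logistic Elo model with a 400-point scale. The arenas may fit their ratings differently, so treat the result as an approximation. The two ratings are vendor-reported figures quoted elsewhere in this guide.

```python
def expected_preference(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that model A is preferred over model B under classic Elo.

    Assumption: logistic Elo with a 400-point scale. A leaderboard may use a
    different fitting procedure, so this is only an approximation.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Vendor-reported ratings quoted in this guide (point-in-time readings).
gemini_elo = 1211
simba_elo = 1159

p = expected_preference(gemini_elo, simba_elo)
print(f"Expected preference for the higher-rated model: {p:.1%}")
```

Under that assumption, a 52-point gap corresponds to roughly a 57 percent preference rate. Listeners would still pick the lower-rated model in about four of ten blind comparisons.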
Accuracy needs separate measurement. Trelis Research tested ten models using a round-trip character error rate, or CER. The method transcribes generated audio with an ASR model, then compares it to the input text. Mean opinion score, or MOS, captures perceived naturalness. Both metrics have limits. Round-trip CER depends on the ASR model’s own accuracy. The UTMOS quality estimator was trained on audio up to ten seconds, so longer samples show less score spread.
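The round-trip CER calculation can be sketched in a few lines. This is a minimal illustration, not the Trelis pipeline. The lowercase and whitespace normalization is an assumption, and the transcript is a hardcoded placeholder standing in for real ASR output.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def round_trip_cer(input_text: str, asr_transcript: str) -> float:
    """CER = edit distance / number of characters in the reference text."""
    ref = normalize(input_text)
    hyp = normalize(asr_transcript)
    if not ref:
        raise ValueError("reference text is empty")
    return levenshtein(ref, hyp) / len(ref)


# Placeholder transcript; in practice this comes from an ASR model.
print(round_trip_cer("Call 555 now", "call 555 no"))
```

In this example the transcript drops one character out of twelve. An ASR mistake would produce the same score as a TTS mistake, which is the limit noted above.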
Latency is the third axis. The relevant figure for voice agents is time-to-first-audio, or TTFA. Time-to-first-byte, or TTFB, can be misleading, since container headers carry no audio. Consistency matters as much as the median. A Gradium benchmark from May 2026 measured the interquartile range across providers. Tail latency, not the average, determines user experience at scale.
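A short sketch shows why the median alone can mislead. It summarizes time-to-first-audio samples with the median, the P90, and the interquartile range. The sample values and provider names are hypothetical, not measurements from any benchmark.

```python
import statistics


def latency_summary(samples_ms):
    """Median, P90, and interquartile range of TTFA samples in milliseconds."""
    # quantiles(n=100) returns the 99 percentile cut points.
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p25, p50, p75, p90 = pct[24], pct[49], pct[74], pct[89]
    return {"median": p50, "p90": p90, "iqr": p75 - p25}


# Hypothetical samples: similar medians, very different tails.
provider_a = [80, 82, 85, 88, 90, 92, 95, 98, 100, 105]
provider_b = [60, 65, 70, 75, 90, 92, 150, 220, 300, 400]

for name, samples in [("provider_a", provider_a), ("provider_b", provider_b)]:
    print(name, latency_summary(samples))
```

Both hypothetical providers land near the same median. The second shows a far wider spread, which is what users notice at scale.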
In short, no benchmark is complete. Quality, accuracy, latency, language coverage, and price all trade off. The right model depends on which axis your application cannot compromise.
Commercial leaders
#1 Inworld TTS-1.5 and Realtime TTS-2
Inworld AI is a research lab founded by a team from Google and DeepMind. It released TTS-1.5 on January 21, 2026. The model targets real-time, consumer-scale applications. Inworld reports roughly 30 percent more expressive range than TTS-1. It also reports about 40 percent better stability, measured through word error rate and output consistency.
TTS-1.5 ships in two tiers. The Mini tier is tuned for latency-sensitive workloads such as voice agents and gaming. The Max tier balances higher stability with low latency. Inworld reports P90 time-to-first-audio under 130 milliseconds for Mini and under 250 milliseconds for Max. The model supports 15 languages and offers both instant and professional voice cloning.
Pricing is tiered by plan, not a single rate. On the On-Demand and Creator plans, Inworld lists $25 per million characters for TTS 1.5 Mini and $35 for Realtime TTS-2 and TTS 1.5 Max. The Developer and Growth plans cut those rates; Growth reaches $15 for Mini and $25 for Max and TTS-2. Enterprise pricing goes as low as $5 and $10 respectively. Note that TTS 1.5 covers 15 languages, while TTS-2 covers over 100.
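The tiers above translate into monthly spend with simple arithmetic. The sketch below uses only the rates quoted in this section. The plan keys, the function name, and the 50-million-character volume are hypothetical, and the enterprise figures are the quoted floor, not a guaranteed rate.

```python
# USD per million characters, as quoted in this guide. Confirm current
# rates with Inworld before budgeting.
RATES_USD_PER_MILLION_CHARS = {
    "on_demand_or_creator": {"mini": 25, "max_or_tts2": 35},
    "growth": {"mini": 15, "max_or_tts2": 25},
    "enterprise_floor": {"mini": 5, "max_or_tts2": 10},
}


def monthly_cost(plan: str, tier: str, chars_per_month: int) -> float:
    """Estimated monthly spend in USD for a given plan and model tier."""
    rate = RATES_USD_PER_MILLION_CHARS[plan][tier]
    return rate * chars_per_month / 1_000_000


volume = 50_000_000  # hypothetical monthly character volume
for plan in RATES_USD_PER_MILLION_CHARS:
    print(plan, monthly_cost(plan, "mini", volume))
```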
Inworld later added Realtime TTS-2 in 2026. It is described as a closed-loop voice model with stronger steering and expressiveness. Across several leaderboard snapshots, Inworld reported holding three of the top five spots on the Artificial Analysis Speech Arena.
Inworld suits developers building voice agents at consumer scale. The combination of low latency and aggressive pricing is its main draw.
#2 Google Gemini 3.1 Flash TTS
Google DeepMind released Gemini 3.1 Flash TTS on April 15, 2026. It is a preview model available through the Gemini API, Google AI Studio, Vertex AI, and Google Vids. The model introduces more than 200 audio tags. These tags steer style, tone, pacing, accent, and scene direction.
By Google’s own report, the model reached an ELO of 1,211 on the Artificial Analysis leaderboard. It supports 70-plus languages and native multi-speaker dialogue. Google built it on the Gemini family rather than a standalone speech stack. The model treats generation as a language task: it decides not only what to say, but how to say it.
The model has documented limitations that matter for deployment. A TTS session has a 32,000-token context window, and Google’s docs state that Gemini TTS does not support streaming. It is built for controlled text recitation, not interactive voice agents; the separate Live API is Google’s real-time path. Output quality can drift on generations longer than a few minutes, so Google recommends chunking. The model offers 30 prebuilt voices. All generated audio carries a SynthID watermark for AI-content identification.
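Chunking a long script can be done before any API call. The sketch below splits on sentence boundaries and packs sentences into chunks under a character budget. The budget is a hypothetical stand-in, since the real limits are expressed in tokens and audio duration. The function name is illustrative, and no Gemini API call is made.

```python
import re


def chunk_script(text: str, max_chars: int = 2000) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence, so that chunk can exceed the budget.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = "Chapter one. The rain had stopped. Nobody noticed at first."
for i, chunk in enumerate(chunk_script(script, max_chars=35), start=1):
    print(i, chunk)
```

Splitting on sentence boundaries keeps each request self-contained. Each chunk is then synthesized separately and the audio is joined afterward.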
Gemini 3.1 Flash TTS fits podcast and audiobook generation with fine-grained control. It is a strong default for teams already on Google Cloud.
#3 ElevenLabs v3
ElevenLabs released Eleven v3 in alpha on June 5, 2025. It reached general availability in early 2026, per the company’s announcement. ElevenLabs describes it as its most expressive model. It introduced inline audio tags formatted in lowercase square brackets. Examples include [whispers], [laughs], [sighs], and scene cues like [interrupting]. The model supports more than 70 languages.
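Because the tags are plain lowercase text in square brackets, a script can be inspected or cleaned locally. The sketch below lists the tags in a script and produces a tag-free version, for example as a fallback for a model that does not support tags. It is local preprocessing only, not an ElevenLabs API call, and the pattern is an assumption based on the examples above.

```python
import re

# Assumed tag shape: lowercase words inside square brackets, e.g. [whispers].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def split_tags(script: str) -> tuple[list[str], str]:
    """Return the tags found in a script and the script with tags removed."""
    tags = TAG_PATTERN.findall(script)
    plain = TAG_PATTERN.sub("", script)
    plain = " ".join(plain.split())  # collapse the gaps left by removed tags
    return tags, plain


tags, plain = split_tags("[whispers] I never told anyone. [sighs] Until now.")
print(tags)
print(plain)
```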
The GA release refined the alpha. ElevenLabs reports users preferred the new version about 72 percent of the time. It also improved how the model handles numbers, symbols, and specialized notation.
A key feature is Text to Dialogue. It weaves multiple voices into one generation pass. The model matches prosody and emotional range across speakers. It can handle interruptions and shifting moods with limited prompting.
Eleven v3 still requires more prompt engineering than earlier models. It is not built for real-time use. ElevenLabs states the larger model and higher-fidelity codec take longer to run. For real-time and conversational use, the company recommends Flash v2.5 instead. That model streams with low latency, around 75 milliseconds in vendor figures.
ElevenLabs v3 fits narrative content, audiobooks, and character work where quality outweighs speed. It remains a common starting point for high-quality voice production.
#4 MiniMax Speech 2.6 HD and later
MiniMax built a competitive line of speech models that has drawn limited attention in English-speaking markets. Speech 2.6 HD offers strong expressiveness and support for 40-plus languages. It sits high on several leaderboard snapshots. One January 2026 reading placed Speech 2.6 HD near the top on Artificial Analysis.
The Turbo variant targets agents, keeping latency under 250 milliseconds. MiniMax’s appeal is its price-to-performance ratio. It delivers emotion control that competes with more expensive flagships. Later HD versions, such as Speech 2.8 HD, appear in 2026 leaderboard snapshots at premium pricing.
MiniMax fits multilingual applications that need expressiveness without flagship pricing.
#5 Hume Octave 2
Hume AI takes a different design approach. Octave 2 is a speech-language model that reads for meaning before generating audio. It produces emotionally calibrated speech rather than applying fixed pronunciation rules. The model shifts delivery on its own as a script moves from calm to urgent. It does this without explicit tags or instructions.
The trade-offs are real. Language coverage is narrow compared to multilingual flagships. Building cloned voices into a production API requires a sales process. Reported pricing varies widely by source and tier, from under $10 to over $100 per million characters. Confirm the current rate with Hume before budgeting.
Octave 2 fits applications where tone carries weight. Examples include companion agents, mental-health tools, and customer interactions where flat delivery breaks the experience.
#6 Cartesia Sonic 3 and Sonic 3.5
Cartesia optimizes for speed. Sonic uses a State Space Model, or SSM, architecture instead of transformers. SSM inference scales linearly rather than quadratically with sequence length. This keeps latency low under load. Cartesia reports model latency under 100 milliseconds, and an end-to-end time-to-first-audio near 82 milliseconds on Sonic 3.5.
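The linear-scaling claim can be illustrated with a toy state space recurrence. This is a scalar textbook sketch, not Cartesia’s architecture. The coefficients a, b, and c are arbitrary illustrative values.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Toy scalar state space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Each step touches only the fixed-size state, so a sequence of length n
    costs n updates. Full self-attention instead compares every pair of
    positions, which grows as n squared.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x
        outputs.append(c * state)
    return outputs


signal = [1.0, 0.0, 0.0, 0.0]
print(ssm_scan(signal))
for n in (1_000, 2_000, 4_000):
    print(n, "state updates vs", n * n, "attention pairs")
```

Doubling the sequence length doubles the number of state updates. The pairwise count for attention quadruples.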
Sonic 3 was released in late 2025. Sonic 3.5 followed in May 2026 and is now the recommended stable model. Both support 42 languages, including nine Indian languages, with more than 500 voices. Cartesia briefly held the number-one spot on the Artificial Analysis leaderboard with Sonic 3.5 before others overtook it. The models add refined prosody, wider emotional range, real-time laughter, and voice cloning from short samples.
Sonic 3 fits real-time conversational agents where latency is the hard constraint. It is a TTS-only system, so teams bring their own speech-to-text and language model.
#7 Speechify SIMBA 3.0
Speechify positions SIMBA 3.0 as a cost-efficient flagship. The company reported a number-seven rank on the Artificial Analysis leaderboard in May 2026. Its reported ELO was about 1,159, at a list price near $10 per million characters. That made it the lowest-priced model in the reported top ten.
These figures come from Speechify’s own announcement, so verify them independently before committing. SIMBA 3.0 fits teams seeking benchmark-competitive quality at lower cost than premium flagships.
#8 OpenAI gpt-4o-mini-tts and the Realtime line
OpenAI announced gpt-4o-mini-tts in March 2025. It is built on the GPT-4o-mini architecture. Its main feature is steerability through natural-language instructions. Developers can instruct the model on how to say something, not just what. An example instruction is “speak in a calm, empathetic tone.” OpenAI also released a playground for testing at OpenAI.fm.
OpenAI shipped an updated snapshot, gpt-4o-mini-tts-2025-12-15, in December 2025. It reports roughly 35 percent lower word error rate on the Common Voice and FLEURS benchmarks. The update also improved Custom Voices, which let organizations build a branded voice from a reference sample. The endpoint exposes 13 built-in voices and covers 50-plus languages. OpenAI prices it at $0.60 per million text input tokens and $12 per million audio output tokens, which works out to roughly $0.015 per minute of audio. OpenAI calls it its newest and most reliable TTS model; the older tts-1 and tts-1-hd remain available.
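The per-minute figure follows from the token prices. Dividing $0.015 per minute by $12 per million audio tokens implies roughly 1,250 audio tokens per minute. That rate is derived from the two figures quoted here, not an official number. The job size in the sketch is hypothetical.

```python
TEXT_INPUT_USD_PER_MILLION_TOKENS = 0.60
AUDIO_OUTPUT_USD_PER_MILLION_TOKENS = 12.00
QUOTED_USD_PER_MINUTE = 0.015

# Derived assumption: audio tokens per minute implied by the quoted prices.
implied_tokens_per_minute = QUOTED_USD_PER_MINUTE / (
    AUDIO_OUTPUT_USD_PER_MILLION_TOKENS / 1_000_000
)


def estimate_cost(text_tokens: int, audio_minutes: float) -> float:
    """Estimated USD cost for one job: text input plus audio output."""
    text_cost = text_tokens * TEXT_INPUT_USD_PER_MILLION_TOKENS / 1_000_000
    audio_tokens = audio_minutes * implied_tokens_per_minute
    audio_cost = audio_tokens * AUDIO_OUTPUT_USD_PER_MILLION_TOKENS / 1_000_000
    return text_cost + audio_cost


print(round(implied_tokens_per_minute))
# Hypothetical job: 10,000 input tokens producing 60 minutes of audio.
print(round(estimate_cost(10_000, 60), 3))
```

Audio output dominates the bill. In this hypothetical job the text input adds less than one cent.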
For conversational agents, OpenAI’s Realtime line advanced further. The Realtime API reached general availability in August 2025. In May 2026, OpenAI launched GPT-Realtime-2, its first voice model with GPT-5-class reasoning. It handles tool calls, interruptions, and corrections during live speech-to-speech. OpenAI also added GPT-Realtime-Translate and GPT-Realtime-Whisper for live translation and transcription.
gpt-4o-mini-tts fits teams already on the OpenAI platform that need low-cost, instructable speech. The Realtime models suit full speech-to-speech agents.
Open-weight models
As of late May 2026, the overall top tier of the Artificial Analysis leaderboard remained closed-source. Open weights still matter. They allow self-hosting, customization, on-device deployment, and control over data. They can remove per-character API costs, replaced by your own compute. But licenses vary. Some weights are permissive, while others are research-only and require a separate license for commercial use. Check the license before building on any of them.
#01 Kokoro 82M
Kokoro is one of the most efficient open-weight models available. It no longer leads the open-weight rankings; on the current Artificial Analysis leaderboard it sits around an ELO of 1,058, behind Fish Audio S2 Pro, Step Audio EditX, and Voxtral TTS. It has just 82 million parameters. The architecture builds on StyleTTS2 and ISTFTNet. It avoids diffusion and encoder stages, which speeds generation.
In the Trelis “Tricky TTS” test, Kokoro reached a 4.5 MOS and a 17 percent CER. That was the highest quality score among the models tested there. It runs efficiently on modest hardware, including CPU. Hosted API rates run under $1 per million characters of input, around $0.65 in one current listing. Its weights were released in late December 2024, with v1.0 following in 2025. It covers about 15 languages and is distributed under the Apache 2.0 license.
Kokoro fits cost-sensitive or edge deployments where compact size and speed matter. Emotion-markup and cross-lingual features remain experimental and are best supported in English.
#02 Fish Audio S2 Pro
Fish Audio S2 Pro is the highest-ranked open-weight model on the current Artificial Analysis leaderboard, at an ELO near 1,123. Fish Audio reports training on more than 10 million hours of audio across 80-plus languages. The 5-billion-parameter model uses a Dual-Autoregressive architecture with an RVQ audio codec. It supports open-domain emotion tags, native multi-speaker output, and latency under 150 milliseconds.
There is an important license caveat. S2 Pro ships under the Fish Audio Research License, not a permissive open license. Research and non-commercial use are free. Commercial use requires a separate license from Fish Audio. The weights, fine-tuning code, and a streaming inference engine are all published. Self-hosting still needs real GPU resources.
Fish Audio fits teams that want top open-weight quality, provided they secure a commercial license before shipping.
#03 IndexTTS-2
IndexTTS-2, from IndexTeam, advances zero-shot TTS. Its standout feature is precise duration control. That makes it useful for video dubbing, where audio must fit a fixed time window. The model also separates timbre from emotion. Developers can control voice identity and emotional tone independently.
The architecture incorporates GPT latent representations and a three-stage training process. A soft instruction mechanism, built by fine-tuning Qwen3, guides emotional tone through text descriptions. Its authors report that IndexTTS-2 beats prior zero-shot systems on word error rate, speaker similarity, and emotional fidelity across several datasets.
IndexTTS-2 fits professional dubbing and expressive synthesis where timing and control are critical. Its dual-mode operation adds configuration complexity.
#04 CosyVoice 2
CosyVoice2-0.5B comes from the FunAudioLLM project. It has 0.5 billion parameters. Its focus is ultra-low-latency streaming synthesis. It supports zero-shot voice cloning. The small footprint makes it practical for real-time, self-hosted pipelines.
CosyVoice 2 fits real-time applications where teams want an open streaming model.
#05 VibeVoice
VibeVoice, from Microsoft, targets long-form generation. The 1.5-billion-parameter model supports context lengths up to 64,000 tokens. It can produce roughly 90 minutes of continuous speech. That suits podcasts and long narration.
It has clear constraints. It is trained on English and Chinese only. It generates multi-speaker audio sequentially, with no overlapping speech. VibeVoice fits long-form, two-language projects that need extended continuity.
Other notable current models
The field is wider than just the ranked list. Several models appear on current leaderboards and deserve a place on a shortlist. xAI shipped its own Text to Speech model in 2026. StepAudio 2.5 TTS appears among premium-priced top entries. Voxtral TTS, a 4-billion-parameter model from Mistral announced in March 2026, uses character-based pricing near $0.016 per 1,000 characters. Step Audio EditX and Magpie-Multilingual rank among the stronger open-weight options. Alibaba’s Qwen3-TTS and Maya1 add further open and multilingual choices. None of these is a default, but each can win a specific brief.
Choosing a model by use case
The market is no longer a single-winner race. Start with the job, then pick the tool.
Real-time voice agents: Latency is the binding constraint. Users will not wait. Cartesia Sonic 3.5 leads on raw speed with its SSM architecture, near 82 milliseconds end-to-end. Inworld’s realtime tiers pair low latency with low cost. Deepgram Aura-2 is another low-latency option, reported under 90 milliseconds. ElevenLabs Flash v2.5 lets teams keep the same voice library they use for offline workloads. For full speech-to-speech, consider OpenAI’s GPT-Realtime-2.
Long-form audiobooks and narration: Quality dominates and latency is irrelevant. ElevenLabs v3 sets a high realism bar for narrative content. Gemini 3.1 Flash TTS offers strong control, with chunking for long scripts. Among open weights, VibeVoice handles extended continuity in English and Chinese.
Multilingual content: Coverage and consistency matter most. Gemini 3.1 Flash TTS and ElevenLabs v3 both support 70-plus languages. MiniMax Speech covers 40-plus at lower cost. Fish Audio S2 Pro leads the open tier with 80-plus languages, but commercial use needs a paid license.
Character and dialogue work: Expressiveness and multi-speaker control lead. ElevenLabs v3 Text to Dialogue handles interruptions and overlapping turns. Gemini 3.1 Flash TTS adds scene direction and per-speaker control. Inworld targets game characters specifically.
Emotional fidelity: Hume Octave 2 reads for meaning and adapts delivery without tags. It fits companion agents and sensitive interactions.
On-device and cost control: Open weights remove API fees. Kokoro runs on CPU with a small footprint. CosyVoice 2 streams at low latency. Both trade some quality for control.
Dubbing: IndexTTS-2 offers duration control to match audio to video timing. That capability is rare among general-purpose models.
Key Takeaways
- No single model wins; pick by your binding constraint — latency, quality, language coverage, or cost.
- Current leaderboard top tier: Gemini 3.1 Flash TTS, Inworld Realtime TTS-2, Cartesia Sonic 3.5, ElevenLabs v3.
- Rankings shift weekly, so treat any ELO snapshot as dated, not fixed.
- Cartesia Sonic 3.5 owns real-time latency at ~82ms end-to-end; Deepgram Aura-2 is a close second.
- ElevenLabs v3 went generally available in early 2026 and leads expressive, multi-speaker narration.
- Gemini 3.1 Flash TTS has no streaming and a 32k-token limit — it’s recitation, not a live agent.
- Fish Audio S2 Pro is the top open-weight model but research-licensed; commercial use needs a paid license.
- Kokoro is the most efficient open option, but no longer the highest-ranked open weight.
- Inworld pricing is tiered: $25/$35 on-demand, dropping to $5/$10 at enterprise volume.
- Public benchmarks narrow the field; your own test on your own text makes the call.





