Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes at more than 1000 tokens per second on a 1-trillion-parameter model, which the Xiaomi team describes as a first at trillion-parameter scale. Demos show generation peaking near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability: it changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system, an approach Xiaomi calls extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion-parameter scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights take less time to read from memory, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, which TileRT reports as FP8. Experts hold most of the parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
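
A rough back-of-the-envelope sketch of why bit width matters at this scale. It assumes MXFP4 as laid out in the OCP Microscaling format (blocks of 32 four-bit values sharing one 8-bit scale, so about 4.25 bits per weight). The expert share of parameters, `EXPERT_FRACTION`, is a hypothetical placeholder and not a figure published by Xiaomi or TileRT.

```python
# Back-of-the-envelope weight footprint for a 1-trillion-parameter MoE model.
# ASSUMPTIONS (not from the article): EXPERT_FRACTION is a made-up placeholder,
# and MXFP4 is costed as 32 four-bit values sharing one 8-bit scale.

TOTAL_PARAMS = 1e12
EXPERT_FRACTION = 0.95  # hypothetical share of parameters held by MoE experts


def mxfp4_bits_per_weight(block_size: int = 32, scale_bits: int = 8) -> float:
    """Effective bits per weight: a 4-bit element plus its share of the block scale."""
    return 4 + scale_bits / block_size


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at the given bit width."""
    return params * bits_per_weight / 8


def mixed_footprint_bytes(expert_fraction: float = EXPERT_FRACTION) -> float:
    """Experts in MXFP4, all other modules in FP8."""
    experts = weight_bytes(TOTAL_PARAMS * expert_fraction, mxfp4_bits_per_weight())
    others = weight_bytes(TOTAL_PARAMS * (1 - expert_fraction), 8)
    return experts + others


if __name__ == "__main__":
    gb = 1e9
    print(f"FP16 everywhere:     {weight_bytes(TOTAL_PARAMS, 16) / gb:,.0f} GB")
    print(f"FP8 everywhere:      {weight_bytes(TOTAL_PARAMS, 8) / gb:,.0f} GB")
    print(f"MXFP4 experts + FP8: {mixed_footprint_bytes() / gb:,.0f} GB")
```

This counts stored weights only. An MoE model activates a subset of experts per token, so the bytes actually read per decode step are lower than the totals printed here, but the ratio between formats is what drives the bandwidth argument.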

The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps the output distribution identical to that of normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass.
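
The verification step described above can be sketched with the standard speculative-sampling acceptance rule. This is a toy illustration over explicit probability lists, not DFlash's or Xiaomi's actual code; the function names `sample` and `verify_draft` and the input layout are assumptions made for the example.

```python
import random
from typing import List, Sequence


def sample(weights: Sequence[float], rng: random.Random) -> int:
    """Draw one token id with probability proportional to `weights`."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Accept or reject a block of drafted tokens (hypothetical helper).

    draft_probs[i]  : draft distribution used to propose draft_tokens[i]
    target_probs[i] : target distribution at the same position; one extra
                      entry at the end supplies the bonus token.
    """
    output: List[int] = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]  # > 0, since the draft proposed this token
        p = target_probs[i][token]
        if rng.random() < min(1.0, p / q):
            output.append(token)  # accepted, move on to the next position
            continue
        # Rejected: resample from the residual max(0, p - q) and
        # discard the rest of the block.
        residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs[i], draft_probs[i])]
        output.append(sample(residual, rng))
        return output
    # Every draft token survived: take one bonus token from the target model.
    output.append(sample(target_probs[len(draft_tokens)], rng))
    return output


if __name__ == "__main__":
    rng = random.Random(0)
    target = [[0.7, 0.2, 0.1]] * 3  # toy 3-token vocabulary
    draft = [[0.6, 0.3, 0.1]] * 2
    print(verify_draft([0, 1], draft, target, rng))
```

Accepting with probability min(1, p/q) and resampling from the residual on rejection is what makes the emitted tokens follow the target model's distribution, regardless of how good the draft is. A better draft only changes how many tokens survive per round.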

Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This keeps per-prediction attention compute bounded by the window size rather than growing with context length. Block size is capped at 8 to limit verification cost and raise concurrency.
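
A small sketch of why sliding-window attention keeps draft cost flat as the context grows. The window size `WINDOW` is a hypothetical value picked for illustration; the article does not state the draft model's actual window.

```python
# Keys each new query must attend to: full causal attention vs. sliding window.
# ASSUMPTION: WINDOW is a hypothetical size chosen for illustration only.

WINDOW = 4096


def full_attention_keys(position: int) -> int:
    """Causal attention: the token at `position` (0-based) sees all earlier tokens and itself."""
    return position + 1


def sliding_window_keys(position: int, window: int = WINDOW) -> int:
    """Sliding window: the same token sees at most the last `window` tokens."""
    return min(position + 1, window)


if __name__ == "__main__":
    for position in (1_000, 10_000, 100_000):
        print(position, full_attention_keys(position), sliding_window_keys(position))
```

Once the context is longer than the window, the sliding-window count stops growing, while the full-attention count keeps rising with every token.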

Acceptance length measures how many draft tokens survive verification each round.

Scenario	Acceptance Length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14.
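
The acceptance figures above translate into a rough count of large-model verification passes. This sketch assumes every round drafts a full block of 8 and emits exactly the average acceptance length in tokens; whether the reported numbers include the extra token from the verifier is not stated in the article, so treat the results as approximate.

```python
# Rough arithmetic on the reported acceptance lengths.
# ASSUMPTION: each verification round emits exactly the average acceptance
# length in tokens, and every round drafts a full block of 8.

BLOCK_SIZE = 8
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def accepted_fraction(acceptance_length: float, block_size: int = BLOCK_SIZE) -> float:
    """Share of a drafted block that survives verification."""
    return acceptance_length / block_size


def verify_passes_per_1000_tokens(acceptance_length: float) -> float:
    """Large-model forward passes needed to emit 1000 tokens."""
    return 1000 / acceptance_length


if __name__ == "__main__":
    for scenario, length in ACCEPTANCE_LENGTH.items():
        print(
            f"{scenario}: {accepted_fraction(length):.0%} of block accepted, "
            f"about {verify_passes_per_1000_tokens(length):.0f} verify passes per 1000 tokens"
        )
```

Plain autoregressive decoding would need 1000 large-model passes for the same 1000 tokens, which is where the speculative speedup comes from.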

TileRT: Squeezing the Microseconds

At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward.
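
To see why launch gaps matter at this speed, here is a hedged time-budget sketch. The layer count, kernels per layer, and per-launch cost are hypothetical placeholders, not measurements of MiMo or TileRT; the tokens-per-round value reuses the coding acceptance length reported above.

```python
# Time budget per decode round vs. cumulative kernel-launch overhead.
# ASSUMPTIONS: LAYERS, KERNELS_PER_LAYER and LAUNCH_COST_US are hypothetical
# placeholders, not measurements of MiMo-V2.5-Pro or TileRT.

LAYERS = 60
KERNELS_PER_LAYER = 20
LAUNCH_COST_US = 5.0


def round_budget_us(tokens_per_second: float, tokens_per_round: float) -> float:
    """Wall-clock microseconds available for one draft-and-verify round."""
    return tokens_per_round / tokens_per_second * 1e6


def launch_overhead_us(layers: int, kernels_per_layer: int, launch_cost_us: float) -> float:
    """Total launch cost if every operator is launched separately."""
    return layers * kernels_per_layer * launch_cost_us


if __name__ == "__main__":
    budget = round_budget_us(1000, 6.30)
    overhead = launch_overhead_us(LAYERS, KERNELS_PER_LAYER, LAUNCH_COST_US)
    print(f"budget per round: {budget:.0f} us")
    print(f"launch overhead:  {overhead:.0f} us ({overhead / budget:.0%} of budget)")
```

Under these placeholder numbers, per-operator launches alone would consume most of the round budget. That is the kind of pressure a kernel that stays resident on the GPU is meant to remove.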

Use Cases

The release targets latency-sensitive work where waiting breaks the loop:

Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
Coding agents: faster code generation cuts the wait between agent steps.
Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

These are latency-bound workloads where raw token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to extreme decode speed: custom silicon and commodity GPUs with model-system codesign.

Approach	Hardware	How speed is achieved
Cerebras	Wafer-Scale integration (custom)	Scale on a single custom wafer
Groq	Custom architecture	Pure on-chip SRAM
MiMo × TileRT	Commodity GPUs (8-GPU node)	Model-system codesign: FP4 + DFlash + TileRT

The second table compares the standard model with the UltraSpeed mode.

Dimension	MiMo-V2.5-Pro	MiMo-V2.5-Pro-UltraSpeed
Decode speed	Baseline	~10× faster (1000+ TPS)
Price	1×	3×
Weight precision	Standard	FP4 MoE Experts via QAT
Decoding	Standard autoregressive	DFlash speculative decoding
Access	Standard model plans	API only, application-based trial
Token Plan	Supported	Not supported

Access, Pricing, and Open Source

UltraSpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.

Strengths and Limitations

Strengths

1000+ TPS on a 1T model without custom silicon.
Lossless decoding through rejection sampling in DFlash.
FP4 applied only where tolerance is highest, preserving quality.
An open checkpoint lets the community test the claims.

Limitations

Access is gated, short, and approval-based at launch.
Pricing triples per token versus the standard model.
Acceptance length drops in open-ended conversation.
Independent third-party speed verification is not yet public.

Key Takeaways

Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.

Marktechpost’s Visual Explainer

What It Is

Xiaomi’s MiMo team built it with the TileRT systems group.
It decodes over 1000 tokens/s on a 1-trillion-parameter model.
Demos show generation peaks near 1200 tokens/s.
It runs on commodity GPUs, a single standard 8-GPU node.
Released June 8, 2026.

1000+ tokens / second

1T parameters (MoE)

8 commodity GPUs

Three Layers Working Together

FP4 quantization shrinks weights and eases bandwidth pressure.
DFlash speculative decoding predicts many tokens in parallel.
TileRT executes the whole pipeline at microsecond scale.
Xiaomi calls this approach extreme model-system codesign.
No single technique is enough; all three must align.

Layer 1 — FP4 Quantization

Uses the MXFP4 format to lower memory and bandwidth cost.
Applied selectively to the MoE Experts only.
Other modules keep higher precision (FP8, per TileRT).
Experts hold most parameters and tolerate quantization best.
QAT keeps capability essentially on par with the original.

Layer 2 — DFlash Speculative Decoding

A research-community method using block-level masked parallel prediction.
The draft model fills a whole block in one forward pass.
It uses Sliding Window Attention; block size capped at 8.
Rejection sampling keeps the output lossless.

Scenario	Acceptance Length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

Layer 3 — TileRT Runtime

At 1000 TPS, each operator runs for only microseconds.
A Persistent Engine Kernel stays resident on the GPU.
Warp Specialization splits data movement, compute, and communication.
Small ops like RMSNorm and RoPE become bottlenecks here.
The runtime was co-designed with the FP4 and DFlash choices.

Where It Fits

Parallel reasoning: many Best-of-N or tree-search paths at once.
Coding agents: less wait between agent steps.
Real-time loops: trading signals, fraud interception, live dialogue.
Interactive prototyping: a Snake game in about 10 seconds.

Standard vs UltraSpeed

Dimension	MiMo-V2.5-Pro	UltraSpeed
Decode speed	Baseline	~10× (1000+ TPS)
Price	1×	3×
Weights	Standard	FP4 MoE Experts (QAT)
Decoding	Autoregressive	DFlash speculative
Access	Standard plans	API only, by application

Access, Pricing & Open Source

API trial runs June 9 to June 23, 2026 (Beijing time).
Pricing is 3× the standard rate for roughly 10× speed.
API only; the Token Plan is not supported.
Checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
TileRT has open-sourced select modules on GitHub.
