Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs


Inference speed is becoming a competitive metric for large language models. Xiaomi’s MiMo team has released MiMo-V2.5-Pro-UltraSpeed, built in collaboration with the TileRT systems group. It decodes at more than 1000 tokens per second on a 1-trillion-parameter model, which the Xiaomi team describes as a first at trillion-parameter scale. Demos show generation peaking near 1200 tokens per second. The notable part is the hardware: it runs on commodity GPUs, not custom silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed serving mode for the existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability. It changes how fast the model produces output tokens. The speedup comes from three coordinated techniques across the model and the serving system. Xiaomi calls this approach extreme model-system codesign. Crucially, the entire stack runs on a single standard 8-GPU commodity node.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion scale, FP8 or FP16 weights create heavy memory and bandwidth pressure. Lower bit-width weights move through memory faster, which directly lifts decode speed. Xiaomi uses the MXFP4 format, applied selectively to the MoE Experts only. Other modules keep higher precision, reported as FP8 by TileRT. Experts hold most parameters and tolerate quantization best, so the tradeoff is favorable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
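To make the format concrete, here is a minimal sketch of MXFP4-style block quantization as the OCP Microscaling format defines it: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is an illustration only, not Xiaomi's or TileRT's kernel. The function names are hypothetical, and ties are rounded toward the smaller magnitude for simplicity.

```python
import math

# E2M1 (FP4) magnitudes as defined in the OCP Microscaling (MX) format.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # elements that share one scale in MXFP4
FP4_EMAX = 2     # largest E2M1 exponent: 6.0 = 1.5 * 2**2


def quantize_block(values):
    """Return (shared_exponent, fp4_values) for one block of up to 32 floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Power-of-two scale chosen so the largest element lands in FP4 range.
    shared_exp = math.floor(math.log2(amax)) - FP4_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        magnitude = min(abs(v) / scale, FP4_MAGNITUDES[-1])  # clamp to 6.0
        # Simplification: on a tie, min() keeps the smaller magnitude.
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - magnitude))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Rebuild approximate floats from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def bits_per_weight():
    """4 bits per element plus one 8-bit scale amortized over 32 elements."""
    return 4 + 8 / BLOCK_SIZE
```

At 4.25 bits per weight, an expert tensor moves roughly half the bytes of its FP8 equivalent, which is the bandwidth saving the paragraph above describes.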

The second layer is DFlash speculative decoding, covered in detail below. The third layer is TileRT, the system that executes everything on the GPU. Each technique alone is not enough. The 1000 TPS result needs all three aligned tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a small draft model to guess upcoming tokens. The large model then verifies those guesses in parallel. Rejection sampling keeps the output distribution identical to that of normal decoding, so quality is lossless. The problem is that the draft model still generates tokens one at a time. DFlash, a method from the research community, removes that constraint. It uses block-level masked parallel prediction. The draft model fills a whole block of masked positions in one forward pass.
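The verification step can be sketched with the textbook accept/reject rule from standard speculative decoding. This is a toy illustration, not DFlash's or MiMo's verification code: distributions are plain dicts, `verify_block` is a hypothetical name, and the extra token that the target model normally contributes when every draft is accepted is omitted for brevity.

```python
import random


def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Accept a prefix of the drafted block; resample at the first rejection.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    """
    output = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(token) / q(token)).
        accept_prob = min(1.0, p.get(token, 0.0) / q[token])
        if rng.random() < accept_prob:
            output.append(token)
            continue
        # Rejected: draw a replacement from the residual max(0, p - q).
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        output.append(rng.choices(tokens, weights=weights, k=1)[0])
        break  # everything after the first rejection is discarded
    return output


rng = random.Random(0)
same = [{"a": 0.5, "b": 0.5}] * 3
print(verify_block(["a", "b", "a"], same, same, rng))
```

When the draft and target distributions agree, every draft token is accepted. The more they diverge, the earlier the block is cut, which is what acceptance length measures.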

Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This makes per-prediction compute constant rather than growing with context length. Block size is capped at 8 to limit verification cost and raise concurrency.
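The constant-compute claim follows from how many past positions each query attends to. A minimal sketch, assuming a hypothetical window size (the article does not state the draft model's window):

```python
def attended_keys(context_len, window=None):
    """Number of past positions one query attends to."""
    if window is None:  # full attention: cost grows with the context
        return context_len
    return min(context_len, window)  # sliding window: cost is capped


WINDOW = 4096  # hypothetical value, not taken from the MiMo release

for context_len in (1_000, 32_000, 256_000):
    full = attended_keys(context_len)
    swa = attended_keys(context_len, WINDOW)
    print(f"context={context_len}: full attention={full}, sliding window={swa}")
```

Once the context exceeds the window, the sliding-window count stops growing, so the draft model's per-prediction attention cost stays flat on long prompts.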

Acceptance length measures how many draft tokens survive verification each round.

Scenario | Acceptance Length
Coding | 6.30
Math / Reasoning | 5.56
Agent | 4.29

In coding, six to seven of eight draft tokens are accepted per round. Some samples reach a maximum of 7.14.
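Acceptance length also sets the time budget for each draft-and-verify round. The arithmetic below is a rough sketch under two assumptions that the article does not confirm: each round emits exactly the average acceptance length in tokens, and the stream holds 1000 tokens per second in every scenario.

```python
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}
BLOCK_SIZE = 8  # draft block size cap reported for DFlash


def rounds_per_second(tokens_per_second, acceptance_length):
    """Draft-and-verify rounds needed each second to sustain the token rate."""
    return tokens_per_second / acceptance_length


def round_budget_ms(tokens_per_second, acceptance_length):
    """Wall-clock time available for one round, in milliseconds."""
    return 1000.0 / rounds_per_second(tokens_per_second, acceptance_length)


for scenario, length in ACCEPTANCE_LENGTH.items():
    share = length / BLOCK_SIZE
    budget = round_budget_ms(1000, length)
    print(f"{scenario}: {share:.0%} of the block kept, "
          f"{budget:.2f} ms per round at 1000 TPS")
```

A lower acceptance length means more rounds per second for the same token rate, so the Agent scenario leaves less time per round than Coding.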

TileRT: Squeezing the Microseconds

At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. Those gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added afterward.
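A back-of-the-envelope model shows why launch gaps matter at this speed. Every input below is a hypothetical placeholder chosen for illustration, not a measured TileRT or GPU figure.

```python
def launch_overhead_fraction(kernels_per_step, launch_gap_us, step_budget_us):
    """Share of one decode step spent in gaps between kernel launches."""
    return kernels_per_step * launch_gap_us / step_budget_us


# Hypothetical inputs (assumptions, not numbers from the article):
KERNELS_PER_STEP = 500   # separate operator launches in one forward pass
LAUNCH_GAP_US = 5.0      # idle gap per launch, in microseconds
STEP_BUDGET_US = 6300.0  # about 6.3 ms: 6.30 tokens per round at 1000 TPS

per_op = launch_overhead_fraction(KERNELS_PER_STEP, LAUNCH_GAP_US, STEP_BUDGET_US)
persistent = launch_overhead_fraction(1, LAUNCH_GAP_US, STEP_BUDGET_US)
print(f"one launch per operator: {per_op:.1%} of the step is launch gaps")
print(f"single persistent kernel: {persistent:.2%} of the step is launch gaps")
```

Under these assumed numbers, launch gaps alone would consume a large share of each step, which is the cost a kernel that stays resident on the GPU is meant to avoid.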

Use Cases

The release targets latency-sensitive work where waiting breaks the loop:

  • Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
  • Coding agents: faster code generation cuts the wait between agent steps.
  • Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
  • Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

In each of these workloads, raw token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to extreme decode speed.

Approach | Hardware | How speed is achieved
Cerebras | Wafer-scale integration (custom) | Scale on a single custom wafer
Groq | Custom architecture | Pure on-chip SRAM
MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT

The second table compares the standard model with the UltraSpeed mode.

Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed
Decode speed | Baseline | ~10× faster (1000+ TPS)
Price | Standard rate | 3× the standard rate
Weight precision | Standard | FP4 MoE Experts via QAT
Decoding | Standard autoregressive | DFlash speculative decoding
Access | Standard model plans | API only, application-based trial
Token Plan | Supported | Not supported

Access, Pricing, and Open Source

UltraSpeed ships through a limited, application-based window. The API trial runs June 9 to June 23, 2026. Pricing is 3× the standard MiMo-V2.5-Pro rate, for roughly 10× the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.
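The price-for-speed trade can be worked through with simple arithmetic. This sketch assumes a baseline of about 100 tokens per second, inferred from the "roughly 10×" figure and the 1000 TPS result. The job size is hypothetical, and cost is expressed in units of the standard per-token price because the article gives no absolute prices.

```python
def job_time_s(output_tokens, tokens_per_second):
    """Seconds needed to generate the output at a steady decode rate."""
    return output_tokens / tokens_per_second


def relative_cost(output_tokens, price_multiplier):
    """Cost in units of the standard per-token price."""
    return output_tokens * price_multiplier


TOKENS = 50_000     # hypothetical job size
STANDARD_TPS = 100  # assumption: inferred from "roughly 10x" and 1000 TPS
ULTRA_TPS = 1000

print(f"standard:   {job_time_s(TOKENS, STANDARD_TPS):.0f} s, "
      f"cost {relative_cost(TOKENS, 1)} units")
print(f"UltraSpeed: {job_time_s(TOKENS, ULTRA_TPS):.0f} s, "
      f"cost {relative_cost(TOKENS, 3)} units")
```

The same output costs three times as much but arrives in about a tenth of the time, so the mode pays off only where waiting is the expensive part.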

Strengths and Limitations

Strengths

  • 1000+ TPS on a 1T model without custom silicon.
  • Lossless decoding through rejection sampling in DFlash.
  • FP4 applied only where tolerance is highest, preserving quality.
  • An open checkpoint lets the community test the claims.

Limitations

  • Access is gated, short, and approval-based at launch.
  • Pricing triples per token versus the standard model.
  • Acceptance length drops in open-ended conversation.
  • Independent third-party speed verification is not yet public.

Key Takeaways

  • Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1000 tokens per second on commodity GPUs.
  • The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
  • FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
  • DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
  • UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.

Marktechpost’s Visual Explainer

What It Is

  • Xiaomi’s MiMo team built it with the TileRT systems group.
  • It decodes over 1000 tokens/s on a 1-trillion-parameter model.
  • Demos show generation peaks near 1200 tokens/s.
  • It runs on commodity GPUs, a single standard 8-GPU node.
  • Released June 8, 2026.

1000+ tokens / second

1T parameters (MoE)

8 commodity GPUs

Three Layers Working Together

  • FP4 quantization shrinks weights and eases bandwidth pressure.
  • DFlash speculative decoding predicts many tokens in parallel.
  • TileRT executes the whole pipeline at microsecond scale.
  • Xiaomi calls this approach extreme model-system codesign.
  • No single technique is enough; all three must align.

Layer 1 — FP4 Quantization

  • Uses the MXFP4 format to lower memory and bandwidth cost.
  • Applied selectively to the MoE Experts only.
  • Other modules keep higher precision (FP8, per TileRT).
  • Experts hold most parameters and tolerate quantization best.
  • QAT keeps capability essentially on par with the original.

Layer 2 — DFlash Speculative Decoding

  • A research-community method using block-level masked parallel prediction.
  • The draft model fills a whole block in one forward pass.
  • It uses Sliding Window Attention; block size capped at 8.
  • Rejection sampling keeps the output lossless.

Scenario | Acceptance Length
Coding | 6.30
Math / Reasoning | 5.56
Agent | 4.29

Layer 3 — TileRT Runtime

  • At 1000 TPS, each operator runs for only microseconds.
  • A Persistent Engine Kernel stays resident on the GPU.
  • Warp Specialization splits data movement, compute, and communication.
  • Small ops like RMSNorm and RoPE become bottlenecks here.
  • The runtime was co-designed with the FP4 and DFlash choices.

Where It Fits

  • Parallel reasoning: many Best-of-N or tree-search paths at once.
  • Coding agents: less wait between agent steps.
  • Real-time loops: trading signals, fraud interception, live dialogue.
  • Interactive prototyping: a Snake game in about 10 seconds.

Standard vs UltraSpeed

Dimension | MiMo-V2.5-Pro | UltraSpeed
Decode speed | Baseline | ~10× (1000+ TPS)
Price | Standard rate | 3× the standard rate
Weights | Standard | FP4 MoE Experts (QAT)
Decoding | Autoregressive | DFlash speculative
Access | Standard plans | API only, by application

Access, Pricing & Open Source

  • API trial runs June 9 to June 23, 2026 (Beijing time).
  • Pricing is 3× the standard rate for roughly 10× speed.
  • API only; the Token Plan is not supported.
  • Checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
  • TileRT has open-sourced select modules on GitHub.
