BREAKING: Xiaomi and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

The question is no longer whether trillion-parameter models can run fast—it's whether custom silicon can survive.

Xiaomi's MiMo team, in collaboration with TileRT, has demonstrated inference speeds exceeding 1000 tokens per second on a 1-trillion-parameter Mixture-of-Experts model—using a single 8-GPU commodity node. This is not a lab curiosity. It is a direct challenge to the wafer-scale and custom-architecture approaches of Cerebras and Groq.

For executives, this means the cost-performance frontier of AI inference just shifted. UltraSpeed is priced at 3× the standard MiMo-V2.5-Pro rate for roughly 10× the generation speed, a trade-off that will reshape deployment decisions for latency-sensitive applications.

Context: What Happened

On June 8, 2026, Xiaomi released MiMo-V2.5-Pro-UltraSpeed, a high-speed serving mode for its existing MiMo-V2.5-Pro model. The base model uses a Mixture-of-Experts (MoE) architecture at trillion-parameter scale. UltraSpeed targets generation speed rather than model capability, achieving over 1000 tokens per second—with peaks near 1200 TPS—on commodity GPUs.

The speedup comes from three coordinated techniques: FP4 quantization, DFlash speculative decoding, and the TileRT runtime. Xiaomi calls this approach extreme model-system codesign. The entire stack runs on a single standard 8-GPU node.

Strategic Analysis: The Three-Layer Attack on Inference Bottlenecks

Layer 1: FP4 Quantization – Selective Precision for Maximum Gain

At trillion-parameter scale, memory bandwidth is the binding constraint. FP4 quantization using the MXFP4 format reduces weight size by roughly 4× compared to FP16 and roughly 2× compared to FP8. Critically, Xiaomi applies FP4 only to the MoE experts, which hold most of the parameters and tolerate quantization best, while keeping other modules at FP8. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original model.
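To make the bandwidth argument concrete, here is a minimal Python sketch of block-scaled 4-bit quantization in the style of MXFP4. It assumes the published microscaling layout: 32 elements share one 8-bit power-of-two scale, and each element is a 4-bit E2M1 value. The function names, the tie-breaking rule, and the footprint figures are illustrative assumptions, not Xiaomi's QAT pipeline or kernel code.

```python
import math

# Magnitudes a 4-bit E2M1 element can represent (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32        # elements that share one scale in MXFP4
ELEMENT_BITS = 4       # bits per E2M1 element
SCALE_BITS = 8         # bits for the shared power-of-two scale
E2M1_MAX_EXPONENT = 2  # largest magnitude is 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Quantize one block; return (scale_exponent, dequantized_values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Pick the shared scale so the largest element lands near the top of
    # the E2M1 range.
    scale_exp = math.floor(math.log2(max_abs)) - E2M1_MAX_EXPONENT
    scale = 2.0 ** scale_exp
    result = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to 6.0
        # Nearest representable value; ties go to the smaller magnitude
        # (a simplification, not the rounding mode of any real kernel).
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - magnitude))
        result.append(math.copysign(nearest, x) * scale)
    return scale_exp, result


def effective_bits_per_weight():
    """Average storage per weight, including the shared scale."""
    return (BLOCK_SIZE * ELEMENT_BITS + SCALE_BITS) / BLOCK_SIZE


def weight_gigabytes(num_params, bits_per_weight):
    """Weight storage in gigabytes (10**9 bytes) for a given precision."""
    return num_params * bits_per_weight / 8 / 1e9
```

Under these assumptions the shared scale adds a quarter of a bit per weight, so a 1-trillion-parameter model would need about 2000 GB at FP16, 1000 GB at FP8, and roughly 531 GB if every weight were MXFP4. Because only the experts are quantized to FP4, the real footprint sits somewhere between the last two figures; the article does not give the exact split.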

Strategic implication: This selective approach suggests that aggressive quantization can be applied with little loss of model capability, provided the architecture (MoE) allows it. Expect competitors to rush similar mixed-precision strategies.

Layer 2: DFlash Speculative Decoding – Parallel Drafting Without Serial Bottlenecks

Standard speculative decoding uses a small draft model to guess tokens, but the draft model still generates one token at a time. DFlash removes that constraint by using block-level masked parallel prediction: the draft model fills a whole block of masked positions in one forward pass. Xiaomi tuned DFlash with the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, keeping per-prediction compute constant. Block size is capped at 8 to limit verification cost.
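The draft-then-verify loop can be illustrated with a toy Python sketch. Everything here is hypothetical: the "target" simply counts upward, the "drafter" proposes a whole block at once and is deliberately wrong on multiples of 5, and verification is greedy. In a real system the target checks all block positions in one batched forward pass; the per-position calls below only simulate that. This is not the DFlash implementation.

```python
def target_next(tokens):
    """Toy target model: always continues the sequence by counting up."""
    return tokens[-1] + 1


def draft_block(tokens, block_size):
    """Toy drafter: proposes a whole block at once, wrong on multiples of 5."""
    block = []
    for offset in range(1, block_size + 1):
        guess = tokens[-1] + offset
        block.append(-1 if guess % 5 == 0 else guess)
    return block


def verify_block(tokens, block):
    """Keep draft tokens until the first mismatch, then take the target's token."""
    committed = []
    for guess in block:
        expected = target_next(tokens + committed)
        committed.append(expected)  # equals the guess whenever the guess is right
        if guess != expected:
            break                   # everything after a mismatch is discarded
    return committed


def generate(prompt, max_new_tokens, block_size=8):
    """Run draft/verify rounds; return (new_tokens, number_of_rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(verify_block(tokens, draft_block(tokens, block_size)))
        rounds += 1
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, rounds
```

Calling generate([0], 20) in this toy setup commits five tokens per round, so 20 tokens take four rounds instead of 20 target steps, and the output is identical to what the target would have produced on its own. That second property is the point of verification: the drafter affects speed, not the result.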

Acceptance lengths are impressive: 6.30 in coding, 5.56 in math/reasoning, 4.29 in agent tasks. In coding, that is an average of more than six tokens committed per verification round against a block cap of 8.
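A rough way to read these figures is to convert them into verification rounds per second. The sketch below assumes that the 1000 TPS figure describes a single generation stream and that acceptance length counts every token committed per round; neither detail is stated in the source, so treat the outputs as back-of-the-envelope estimates.

```python
# Acceptance lengths reported in the article, by task type.
ACCEPTANCE_LENGTH = {"coding": 6.30, "math/reasoning": 5.56, "agent": 4.29}


def rounds_per_second(tokens_per_second, acceptance_length):
    """Draft/verify rounds needed each second to sustain the token rate."""
    return tokens_per_second / acceptance_length


def round_budget_ms(tokens_per_second, acceptance_length):
    """Wall-clock time available for one draft/verify round, in milliseconds."""
    return 1000.0 / rounds_per_second(tokens_per_second, acceptance_length)


def budgets(tokens_per_second=1000.0):
    """Per-task round budget in milliseconds at the given token rate."""
    return {
        task: round_budget_ms(tokens_per_second, length)
        for task, length in ACCEPTANCE_LENGTH.items()
    }
```

At 1000 TPS this gives about 6.3 ms per round for coding but only about 4.3 ms for agent tasks, which is one way to see why lower acceptance lengths make the same speed target harder to hit.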

Strategic implication: DFlash turns speculative decoding from a niche trick into a production-ready technique. The open-source release of the checkpoint on Hugging Face will accelerate adoption and further optimization.

Layer 3: TileRT – Microsecond-Scale Execution

At 1000 TPS, each operator runs for only microseconds. Traditional systems launch operators one by one, and each launch costs time. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU, using Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes—normally negligible—become bottlenecks at this scale.
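The launch-overhead problem can be sized with simple arithmetic. In the Python sketch below, the layer count, operators per layer, and per-launch cost are hypothetical placeholders (a few microseconds per kernel launch is a commonly quoted order of magnitude, not a TileRT measurement), and the per-round budget assumes roughly 6.3 tokens are committed per forward pass at 1000 TPS.

```python
def launch_overhead_us(num_layers, ops_per_layer, launch_cost_us):
    """Total launch cost for one forward pass if every operator is launched separately."""
    return num_layers * ops_per_layer * launch_cost_us


def overhead_fraction(num_layers, ops_per_layer, launch_cost_us, budget_us):
    """Share of the per-pass time budget consumed by launches alone."""
    return launch_overhead_us(num_layers, ops_per_layer, launch_cost_us) / budget_us


# Hypothetical model shape and launch cost, for illustration only.
EXAMPLE_LAYERS = 60
EXAMPLE_OPS_PER_LAYER = 20
EXAMPLE_LAUNCH_COST_US = 5.0
# Assumed budget: about 6.3 tokens per pass at 1000 tokens per second.
EXAMPLE_BUDGET_US = 6300.0

example_fraction = overhead_fraction(
    EXAMPLE_LAYERS, EXAMPLE_OPS_PER_LAYER, EXAMPLE_LAUNCH_COST_US, EXAMPLE_BUDGET_US
)
```

With these placeholder numbers, separate launches alone would consume about 95% of the time available for a pass, before any computation happens. The exact figure is not meaningful, but the order of magnitude shows why a kernel that stays resident and avoids per-operator launches matters at this speed.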

Strategic implication: TileRT's co-design with the model architecture (FP4 and DFlash) is a blueprint for future inference engines. Expect hyperscalers to invest heavily in similar runtime-level optimizations.

Winners & Losers

Winners

  • Xiaomi: Positions itself as a leader in ultra-fast inference, gaining brand prestige and potential API revenue.
  • TileRT: Showcases its inference engine's capability, attracting customers and potential acquisition interest.
  • Latency-sensitive AI applications: Real-time coding assistants, autonomous agents, and trading systems gain access to extremely fast inference on trillion-parameter models.
  • Open-source community: Receives high-quality code and models to build upon.

Losers

  • Custom silicon AI chip startups (Groq, Cerebras): Their hardware advantage is challenged by commodity GPU solutions achieving comparable speed.
  • Cloud inference providers with slower offerings: May lose customers seeking speed unless they match performance or lower prices.
  • Niche inference optimization companies: Their proprietary optimizations may be overshadowed by open-source TileRT and DFlash.

Second-Order Effects

1. Commoditization of high-performance inference: The combination of open-source release and extreme speed on standard GPUs may commoditize high-performance inference, shifting the competitive landscape from hardware specialization to software optimization and ecosystem integration.

2. Pressure on pricing models: At 3× the standard rate for 10× speed, the price-performance ratio is favorable for latency-sensitive workloads. Expect competitors to adjust pricing or risk losing high-value customers.

3. Accelerated adoption of MoE architectures: The success of FP4 quantization on MoE experts will encourage more organizations to adopt MoE models, knowing that inference can be both fast and cost-effective.

4. Increased focus on speculative decoding: DFlash's open-source availability will spur further research into block-level prediction and acceptance optimization.

Market / Industry Impact

The inference market is bifurcating: one track for throughput-optimized batch processing, another for latency-critical real-time applications. UltraSpeed targets the latter, and its success will force cloud providers to offer tiered inference services with speed guarantees. The open-source release of key components means that any startup with GPU access can potentially replicate this performance, democratizing ultra-fast inference.

Executive Action

  • Evaluate latency-sensitive workloads: Identify use cases where 10× faster inference justifies 3× cost—e.g., real-time coding agents, trading signals, interactive prototyping.
  • Monitor open-source developments: The MiMo checkpoint and TileRT modules are available now. Consider integrating them into your inference stack to stay competitive.
  • Reassess hardware procurement: If commodity GPUs can deliver custom-silicon-level speed, reconsider investments in specialized hardware.
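The first evaluation step above can be sketched as a small calculation: how much extra does a request cost, and how many seconds does it save? The 3× price and 10× speed multipliers come from the article; the base speed, base price, and response length in the example are hypothetical inputs you would replace with your own numbers.

```python
def request_profile(output_tokens, tokens_per_second, price_per_million_tokens):
    """Return (seconds_to_generate, cost) for one response."""
    seconds = output_tokens / tokens_per_second
    cost = output_tokens * price_per_million_tokens / 1_000_000
    return seconds, cost


def premium_per_second_saved(output_tokens, base_tps, base_price,
                             speed_multiplier=10.0, price_multiplier=3.0):
    """Extra spend per second of waiting removed by the faster tier."""
    base_seconds, base_cost = request_profile(output_tokens, base_tps, base_price)
    fast_seconds, fast_cost = request_profile(
        output_tokens, base_tps * speed_multiplier, base_price * price_multiplier
    )
    return (fast_cost - base_cost) / (base_seconds - fast_seconds)


# Hypothetical workload: 2000 output tokens, 100 TPS base speed,
# and a base price of 1.00 (any currency) per million output tokens.
example_premium = premium_per_second_saved(2000, 100.0, 1.00)
```

In this hypothetical case the response time drops from 20 seconds to 2 seconds while the cost per response triples, so the premium works out to a small fraction of a cent per second saved. Whether that is worth paying depends on what a second of user or agent waiting time costs in your workload.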

Why This Matters

This is not a marginal improvement. It is a structural shift in what is possible with commodity hardware. A trillion-parameter model can now generate more than 1000 tokens per second, far faster than anyone can read, on hardware available today. The implications for real-time AI applications are profound. Executives who ignore this risk being outpaced by competitors who adopt ultra-fast inference.

Final Take

Xiaomi and TileRT have proven that software and system co-design can match—and potentially exceed—the performance of custom silicon. The era of hardware-only inference advantages is ending. The winners will be those who optimize the full stack, from quantization to runtime. The losers will be those who bet on proprietary hardware without a software moat.

Source: MarkTechPost

Intelligence FAQ

UltraSpeed reaches more than 1000 tokens per second through three coordinated techniques: FP4 quantization on MoE experts, DFlash speculative decoding with block-level parallel prediction, and the TileRT runtime with persistent kernels and warp specialization.

UltraSpeed pricing is 3× the standard MiMo-V2.5-Pro rate for roughly 10× speed. It targets latency-sensitive workloads like real-time coding agents, trading systems, and interactive prototyping.