Meet Mamba-3: A New State Space Model Frontier with 2x Smaller States and Enhanced MIMO Decoding Hardware Efficiency


The scaling of inference-time compute has become a primary driver for Large Language Model (LLM) performance, shifting architectural focus toward inference efficiency alongside model quality. While Transformer-based architectures remain the standard, their quadratic computational complexity and linear memory requirements create significant deployment bottlenecks. A team of researchers from Carnegie Mellon University (CMU), Princeton University, Together AI, and Cartesia AI have introduced Mamba-3, a model that addresses these constraints through an ‘inference-first’ design.

Mamba-3 builds upon the State Space Model (SSM) framework, introducing three core methodological updates: exponential-trapezoidal discretization, complex-valued state updates, and a Multi-Input Multi-Output (MIMO) formulation.

1. Exponential-Trapezoidal Discretization

State space models are continuous-time systems that must be discretized to process discrete sequences. Previous iterations like Mamba-1 and Mamba-2 utilized a first-order heuristic known as ‘exponential-Euler’ discretization. Mamba-3 replaces this with exponential-trapezoidal discretization, which provides a second-order accurate approximation of the state-input integral.

Technically, this update changes the discrete recurrence from a two-term update to a three-term update:

$$h_{t}=e^{\Delta_{t}A_{t}}h_{t-1}+(1-\lambda_{t})\Delta_{t}e^{\Delta_{t}A_{t}}B_{t-1}x_{t-1}+\lambda_{t}\Delta_{t}B_{t}x_{t}$$

This formula is equivalent to applying a data-dependent, width-2 convolution on the state input $B_t x_t$ within the core recurrence. In empirical testing, this implicit convolution, combined with learnable B and C biases, allows Mamba-3 to function effectively without the external short causal convolutions typically required by recurrent models.
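As a concrete illustration, here is a minimal NumPy sketch of the three-term recurrence above, written as a sequential loop. All names and shapes are illustrative (the actual kernels are fused Triton/CuTe DSL implementations, not Python loops); note that setting $\lambda_t = 1$ recovers the simpler two-term exponential-Euler update used by earlier Mamba versions.

```python
import numpy as np

def trapezoidal_scan(A, B, x, dt, lam):
    """Sketch of the exponential-trapezoidal (three-term) recurrence.

    Illustrative shapes: A, dt, lam are (T,); B is (T, N); x is (T,).
    With lam = 1 this reduces to the exponential-Euler (two-term) update.
    """
    T, N = B.shape
    h = np.zeros(N)
    states = []
    for t in range(T):
        decay = np.exp(dt[t] * A[t])              # e^{Δ_t A_t}
        h = decay * h                             # decay previous state
        if t > 0:
            # trapezoidal "left endpoint": (1-λ_t) Δ_t e^{Δ_t A_t} B_{t-1} x_{t-1}
            h = h + (1 - lam[t]) * dt[t] * decay * B[t - 1] * x[t - 1]
        # trapezoidal "right endpoint": λ_t Δ_t B_t x_t
        h = h + lam[t] * dt[t] * B[t] * x[t]
        states.append(h.copy())
    return np.stack(states)
```

The width-2 convolution interpretation is visible in the loop body: each step mixes the current input $B_t x_t$ with the previous one, weighted by the data-dependent $\lambda_t$.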

2. Complex-Valued State Space Models and the ‘RoPE Trick’

A limitation of real-valued linear models is their inability to solve ‘state-tracking’ tasks, such as determining the parity of bit sequences. This failure stems from restricting the eigenvalues of the transition matrix to real numbers, which cannot represent the ‘rotational’ dynamics required for such tasks.

Mamba-3 incorporates complex-valued SSMs to resolve this. The research team established a theoretical equivalence between discretized complex SSMs and real-valued SSMs that utilize data-dependent Rotary Positional Embeddings (RoPE) on the B and C projections.

By using the ‘RoPE trick,’ the model applies aggregated data-dependent rotations across time steps. This enables Mamba-3 to solve synthetic tasks like Parity and Modular Arithmetic, where Mamba-2 and real-valued variants perform no better than random guessing.
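A minimal NumPy sketch of the idea, assuming pairwise channel rotations as in standard RoPE (function names and the exact phase convention are illustrative, not the paper's API): both B and C are rotated by the cumulative data-dependent angle, so their inner product depends only on the relative accumulated phase between time steps, which mimics unit-modulus complex eigenvalues.

```python
import numpy as np

def rotate_pairs(v, theta):
    """Rotate consecutive channel pairs of v (..., N), N even, by theta."""
    v1, v2 = v[..., 0::2], v[..., 1::2]
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(v)
    out[..., 0::2] = c * v1 - s * v2
    out[..., 1::2] = s * v1 + c * v2
    return out

def data_dependent_rope(B, C, theta):
    """Sketch of the 'RoPE trick': apply the *aggregated* data-dependent
    rotation sum_{s<=t} theta_s to B_t and C_t.  Because rotations are
    orthogonal, only the relative phase between two time steps survives
    in the C^T h readout, emulating a complex-valued state transition."""
    phase = np.cumsum(theta)[:, None]   # (T, 1) cumulative rotation angle
    return rotate_pairs(B, phase), rotate_pairs(C, phase)
```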

3. Multi-Input, Multi-Output (MIMO) Formulation

To address the hardware inefficiency of memory-bound decoding, Mamba-3 transitions from a Single-Input Single-Output (SISO) recurrence to a Multi-Input, Multi-Output (MIMO) structure.

In standard SSM decoding, the arithmetic intensity is approximately 2.5 ops per byte, far below the compute-bound regime of modern GPUs like the H100. MIMO increases the rank R of the input and output projections ($B_t \in \mathbb{R}^{N \times R}$ and $x_t \in \mathbb{R}^{P \times R}$), transforming the state update from a rank-1 outer product into a matrix-matrix multiplication.

This shift increases decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size. Because the additional computation is overlaid with the existing memory I/O required for the state update, MIMO improves modeling quality and perplexity while maintaining similar wall-clock decode latency.
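The SISO-to-MIMO change can be sketched in a few lines of NumPy. With a fixed $N \times P$ state, the SISO update is a rank-1 outer product, while the MIMO update is a rank-R matmul doing roughly R times the FLOPs over the same state bytes read and written; shapes and names below are illustrative.

```python
import numpy as np

N, P, R = 64, 64, 4
rng = np.random.default_rng(0)
H = rng.standard_normal((N, P))   # recurrent state (same size in both cases)

# SISO: b_t in R^N, x_t in R^P -> rank-1 outer-product update (~2NP FLOPs)
b = rng.standard_normal(N)
x = rng.standard_normal(P)
H_siso = H + np.outer(b, x)

# MIMO: B_t in R^{N x R}, x_t in R^{P x R} -> rank-R matmul update (~2NPR FLOPs)
B = rng.standard_normal((N, R))
X = rng.standard_normal((P, R))
H_mimo = H + B @ X.T
```

Since decode latency is dominated by moving the state H through memory, the extra multiply-adds of the rank-R update are effectively free until the step becomes compute-bound, which is the basis of the "up to 4x more FLOPs at similar wall-clock latency" claim for R=4.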

Architecture and Normalization

The Mamba-3 block follows the Llama-style layout, alternating with SwiGLU blocks. Key refinements include:

  • BC/QK Normalization: RMS normalization is applied to the B and C projections, mirroring QKNorm in Transformers. This stabilizes training and enables the removal of the post-gate RMSNorm used in previous versions.
  • Head-Specific Biases: Learnable, channel-wise biases are added to B and C components after normalization to induce convolution-like behavior.
  • Hybrid Integration: When used in hybrid architectures—interleaving linear layers with self-attention—the addition of a pre-gate, grouped RMSNorm was found to improve length generalization in retrieval tasks.
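The first two refinements can be sketched together in NumPy. This is a hedged illustration only (function names, bias shapes, and the epsilon value are assumptions, and the real implementation fuses this into the projection kernels): RMS-normalize the B and C projections per time step, then add learnable channel-wise biases after normalization.

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """RMS normalization over the last (channel) dimension."""
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def normalize_bc(B, C, b_bias, c_bias):
    """Sketch of BC normalization (mirroring QKNorm) with post-norm
    learnable channel-wise biases.  B, C: (T, N); biases: scalar or (N,)."""
    return rms_norm(B) + b_bias, rms_norm(C) + c_bias
```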

Results and Efficiency

Evaluations were conducted on the FineWeb-Edu dataset across four model scales (180M to 1.5B).

  • Downstream Performance: At the 1.5B scale, the Mamba-3 SISO variant outperforms Mamba-2 and Gated DeltaNet (GDN). The MIMO variant (R=4) further improves average downstream accuracy by 1.2 points over the SISO baseline.
  • Pareto Frontier: Mamba-3 achieves comparable pretraining perplexity to Mamba-2 while using only half the state size (e.g., Mamba-3 with state size 64 matches Mamba-2 with 128).
  • Kernel Performance: Optimized Triton (for prefill) and CuTe DSL (for decode) kernels ensure that the additional mathematical components remain lightweight. SISO Mamba-3 kernels demonstrate lower latency than released Mamba-2 and GDN kernels at standard BF16 settings.
| Model (1.5B) | Avg. Downstream Acc % ↑ | FW-Edu Ppl ↓ |
| --- | --- | --- |
| Transformer | 55.4 | 10.51 |
| Mamba-2 | 55.7 | 10.47 |
| Mamba-3 SISO | 56.4 | 10.35 |
| Mamba-3 MIMO (R=4) | 57.6 | 10.24 |

Mamba-3 demonstrates that fundamental adjustments to the state space model viewpoint can bridge the gap between theoretical sub-quadratic efficiency and practical modeling capability.


Check out the Paper and GitHub page for technical details.

