Meet Mamba-3: A New State Space Model Frontier with 2x Smaller States and Enhanced MIMO Decoding Hardware Efficiency


The scaling of inference-time compute has become a primary driver for Large Language Model (LLM) performance, shifting architectural focus toward inference efficiency alongside model quality. While Transformer-based architectures remain the standard, their quadratic computational complexity and linear memory requirements create significant deployment bottlenecks. A team of researchers from Carnegie Mellon University (CMU), Princeton University, Together AI, and Cartesia AI have introduced Mamba-3, a model that addresses these constraints through an ‘inference-first’ design.

Mamba-3 builds upon the State Space Model (SSM) framework, introducing three core methodological updates: exponential-trapezoidal discretization, complex-valued state updates, and a Multi-Input Multi-Output (MIMO) formulation.

1. Exponential-Trapezoidal Discretization

State space models are continuous-time systems that must be discretized to process discrete sequences. Previous iterations like Mamba-1 and Mamba-2 utilized a first-order heuristic known as ‘exponential-Euler’ discretization. Mamba-3 replaces this with exponential-trapezoidal discretization, which provides a second-order accurate approximation of the state-input integral.

Technically, this update changes the discrete recurrence from a two-term update to a three-term update:

$$h_{t}=e^{\Delta_{t}A_{t}}h_{t-1}+(1-\lambda_{t})\Delta_{t}e^{\Delta_{t}A_{t}}B_{t-1}x_{t-1}+\lambda_{t}\Delta_{t}B_{t}x_{t}$$

This formula is equivalent to applying a data-dependent, width-2 convolution on the state input $B_t x_t$ within the core recurrence. In empirical testing, this implicit convolution, combined with learnable B and C biases, allows Mamba-3 to function effectively without the external short causal convolutions typically required by recurrent models.
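As a concrete illustration, here is a minimal NumPy sketch of the three-term recurrence above, written as a sequential loop. All names and shapes are illustrative (the actual kernels are fused Triton/CuTe DSL implementations, not Python loops); note that setting $\lambda_t = 1$ recovers the simpler two-term exponential-Euler update used by earlier Mamba versions.

```python
import numpy as np

def trapezoidal_scan(A, B, x, dt, lam):
    """Sketch of the exponential-trapezoidal (three-term) recurrence.

    Illustrative shapes: A, dt, lam are (T,); B is (T, N); x is (T,).
    With lam = 1 this reduces to the exponential-Euler (two-term) update.
    """
    T, N = B.shape
    h = np.zeros(N)
    states = []
    for t in range(T):
        decay = np.exp(dt[t] * A[t])              # e^{Δ_t A_t}
        h = decay * h                             # decay previous state
        if t > 0:
            # trapezoidal "left endpoint": (1-λ_t) Δ_t e^{Δ_t A_t} B_{t-1} x_{t-1}
            h = h + (1 - lam[t]) * dt[t] * decay * B[t - 1] * x[t - 1]
        # trapezoidal "right endpoint": λ_t Δ_t B_t x_t
        h = h + lam[t] * dt[t] * B[t] * x[t]
        states.append(h.copy())
    return np.stack(states)
```

The width-2 convolution interpretation is visible in the loop body: each step mixes the current input $B_t x_t$ with the previous one, weighted by the data-dependent $\lambda_t$.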

2. Complex-Valued State Space Models and the ‘RoPE Trick’

A limitation of real-valued linear models is their inability to solve ‘state-tracking’ tasks, such as determining the parity of bit sequences. This failure stems from restricting the eigenvalues of the transition matrix to real numbers, which cannot represent the ‘rotational’ dynamics required for such tasks.

Mamba-3 incorporates complex-valued SSMs to resolve this. The research team established a theoretical equivalence between discretized complex SSMs and real-valued SSMs that utilize data-dependent Rotary Positional Embeddings (RoPE) on the B and C projections.

By using the ‘RoPE trick,’ the model applies aggregated data-dependent rotations across time steps. This enables Mamba-3 to solve synthetic tasks like Parity and Modular Arithmetic, where Mamba-2 and real-valued variants perform no better than random guessing.
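A minimal NumPy sketch of the idea, assuming pairwise channel rotations as in standard RoPE (function names and the exact phase convention are illustrative, not the paper's API): both B and C are rotated by the cumulative data-dependent angle, so their inner product depends only on the relative accumulated phase between time steps, which mimics unit-modulus complex eigenvalues.

```python
import numpy as np

def rotate_pairs(v, theta):
    """Rotate consecutive channel pairs of v (..., N), N even, by theta."""
    v1, v2 = v[..., 0::2], v[..., 1::2]
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(v)
    out[..., 0::2] = c * v1 - s * v2
    out[..., 1::2] = s * v1 + c * v2
    return out

def data_dependent_rope(B, C, theta):
    """Sketch of the 'RoPE trick': apply the *aggregated* data-dependent
    rotation sum_{s<=t} theta_s to B_t and C_t.  Because rotations are
    orthogonal, only the relative phase between two time steps survives
    in the C^T h readout, emulating a complex-valued state transition."""
    phase = np.cumsum(theta)[:, None]   # (T, 1) cumulative rotation angle
    return rotate_pairs(B, phase), rotate_pairs(C, phase)
```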

3. Multi-Input, Multi-Output (MIMO) Formulation

To address the hardware inefficiency of memory-bound decoding, Mamba-3 transitions from a Single-Input Single-Output (SISO) recurrence to a Multi-Input, Multi-Output (MIMO) structure.

In standard SSM decoding, the arithmetic intensity is approximately 2.5 ops per byte, far below the compute-bound regime of modern GPUs like the H100. MIMO increases the rank R of the input and output projections ($B_t \in \mathbb{R}^{N \times R}$ and $x_t \in \mathbb{R}^{P \times R}$), transforming the state update from a rank-1 outer product into a matrix-matrix multiplication.

This shift increases decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size. Because the additional computation is overlaid with the existing memory I/O required for the state update, MIMO improves modeling quality and perplexity while maintaining similar wall-clock decode latency.
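The SISO-to-MIMO change can be sketched in a few lines of NumPy. With a fixed $N \times P$ state, the SISO update is a rank-1 outer product, while the MIMO update is a rank-R matmul doing roughly R times the FLOPs over the same state bytes read and written; shapes and names below are illustrative.

```python
import numpy as np

N, P, R = 64, 64, 4
rng = np.random.default_rng(0)
H = rng.standard_normal((N, P))   # recurrent state (same size in both cases)

# SISO: b_t in R^N, x_t in R^P -> rank-1 outer-product update (~2NP FLOPs)
b = rng.standard_normal(N)
x = rng.standard_normal(P)
H_siso = H + np.outer(b, x)

# MIMO: B_t in R^{N x R}, x_t in R^{P x R} -> rank-R matmul update (~2NPR FLOPs)
B = rng.standard_normal((N, R))
X = rng.standard_normal((P, R))
H_mimo = H + B @ X.T
```

Since decode latency is dominated by moving the state H through memory, the extra multiply-adds of the rank-R update are effectively free until the step becomes compute-bound, which is the basis of the "up to 4x more FLOPs at similar wall-clock latency" claim for R=4.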

Architecture and Normalization

The Mamba-3 block follows the Llama-style layout, alternating with SwiGLU blocks. Key refinements include:

  • BC/QK Normalization: RMS normalization is applied to the B and C projections, mirroring QKNorm in Transformers. This stabilizes training and enables the removal of the post-gate RMSNorm used in previous versions.
  • Head-Specific Biases: Learnable, channel-wise biases are added to B and C components after normalization to induce convolution-like behavior.
  • Hybrid Integration: When used in hybrid architectures—interleaving linear layers with self-attention—the addition of a pre-gate, grouped RMSNorm was found to improve length generalization in retrieval tasks.
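The first two refinements can be sketched together in NumPy. This is a hedged illustration only (function names, bias shapes, and the epsilon value are assumptions, and the real implementation fuses this into the projection kernels): RMS-normalize the B and C projections per time step, then add learnable channel-wise biases after normalization.

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """RMS normalization over the last (channel) dimension."""
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def normalize_bc(B, C, b_bias, c_bias):
    """Sketch of BC normalization (mirroring QKNorm) with post-norm
    learnable channel-wise biases.  B, C: (T, N); biases: scalar or (N,)."""
    return rms_norm(B) + b_bias, rms_norm(C) + c_bias
```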

Results and Efficiency

Evaluations were conducted on the FineWeb-Edu dataset across four model scales (180M to 1.5B).

  • Downstream Performance: At the 1.5B scale, the Mamba-3 SISO variant outperforms Mamba-2 and Gated DeltaNet (GDN). The MIMO variant (R=4) further improves average downstream accuracy by 1.2 points over the SISO baseline.
  • Pareto Frontier: Mamba-3 achieves comparable pretraining perplexity to Mamba-2 while using only half the state size (e.g., Mamba-3 with state size 64 matches Mamba-2 with 128).
  • Kernel Performance: Optimized Triton (for prefill) and CuTe DSL (for decode) kernels ensure that the additional mathematical components remain lightweight. SISO Mamba-3 kernels demonstrate lower latency than released Mamba-2 and GDN kernels at standard BF16 settings.
| Model (1.5B) | Avg. Downstream Acc % ↑ | FW-Edu Ppl ↓ |
| --- | --- | --- |
| Transformer | 55.4 | 10.51 |
| Mamba-2 | 55.7 | 10.47 |
| Mamba-3 SISO | 56.4 | 10.35 |
| Mamba-3 MIMO (R=4) | 57.6 | 10.24 |

Mamba-3 demonstrates that fundamental adjustments to the state space model viewpoint can bridge the gap between theoretical sub-quadratic efficiency and practical modeling capability.


Check out the Paper and GitHub page for technical details.

