NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has released Nemotron 3 Ultra, the largest model in its Nemotron 3 family. It targets a specific problem: long-running agents that plan, call tools, and reason across many turns. As agents run longer, token counts grow and inference cost climbs. Nemotron 3 Ultra is designed to keep accuracy high while making that inference faster and cheaper.

What is Nemotron 3 Ultra

Nemotron 3 Ultra is a Mixture-of-Experts (MoE) model with 550 billion total parameters, of which only 55 billion are active per token. The MoE design improves accuracy per active parameter.

It uses a hybrid Mamba-Attention architecture instead of a pure Transformer. Mamba layers handle long sequences with sub-quadratic scaling. A few Attention layers are kept for precise recall over large contexts.

The model was pre-trained on 20 trillion text tokens. Context was then extended to 1 million tokens. It was post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).

NVIDIA reports up to roughly 6x higher inference throughput than comparable open LLMs at on-par accuracy.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf

The Architecture

The model has 108 layers and a model dimension of 8,192. It uses 64 query heads and only 2 key-value heads, which keeps the KV cache small. Each MoE layer holds 512 experts, with the top 22 activated per token.
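The top-22-of-512 selection can be pictured with a generic top-k gating sketch. This is not NVIDIA's LatentMoE router: the softmax over the selected logits, the random logits, and the function name `route_token` are assumptions made for illustration. Only the expert count and the top-k value come from the article.

```python
import math
import random

NUM_EXPERTS = 512  # from the article
TOP_K = 22         # from the article


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    That normalization is an assumption, not something the report confirms.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_token(logits)
print(len(selected), "of", NUM_EXPERTS, "experts active for this token")
print("fraction of experts used per token:", TOP_K / NUM_EXPERTS)
```

Only the selected experts run for a given token, which is how a 550B-parameter model can keep its per-token compute close to that of a much smaller dense model.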

Three design choices stand out:

LatentMoE routes experts more efficiently. It buys more routed experts at fixed inference cost by trading away hidden-dimension width. NVIDIA reports better accuracy per parameter than standard granular MoEs.
Multi-Token Prediction (MTP) predicts several future tokens in one forward pass. It enables native speculative decoding for faster generation. Two MTP heads share parameters during training.
NVFP4 pre-training uses the E2M1 4-bit datatype with two-dimensional block quantization on weights. NVIDIA calls this the largest-scale demonstration of stable, accurate NVFP4 training to date.
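To make the E2M1 and block-quantization ideas concrete, here is a simplified sketch. E2M1 (one sign bit, two exponent bits, one mantissa bit) can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The rest is an assumption for illustration: the tiny 2x2 tiles, the scale kept as a plain Python float, the tie-breaking rule, and the function names are not taken from NVIDIA's recipe.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0


def round_to_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties resolve to the smaller magnitude here; real hardware may differ.
    """
    mag = min(abs(x), E2M1_MAX)
    nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
    return -nearest if x < 0 else nearest


def quantize_block(block):
    """Quantize one block with a shared scale; return (scale, dequantized values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAX  # map the largest magnitude onto 6.0
    return scale, [round_to_e2m1(v / scale) * scale for v in block]


def quantize_2d(weights, block=2):
    """Apply quantize_block to each block x block tile of a 2D list of floats."""
    rows, cols = len(weights), len(weights[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            _, deq = quantize_block([weights[r][c] for r, c in cells])
            for (r, c), v in zip(cells, deq):
                out[r][c] = v
    return out


w = [[12.0, 5.5, 0.75, 1.5],
     [3.0, 0.0, 0.375, 0.0]]
print(quantize_2d(w))
```

Each tile gets its own scale, so a tile of small weights is not crushed by a large outlier elsewhere in the matrix. In the left tile, 5.5 is rounded to 6.0 because only eight magnitudes are available after scaling.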

The hybrid Mamba-Attention stack matters for agents. Mamba’s per-step decode cost stays constant as sequence length grows, whereas an attention layer must read a cache that grows with the context. That is why throughput gains widen on long, decode-heavy workloads.
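A toy memory model shows why this matters at long context. The layer counts, the per-layer state size, the head dimension of 128, and the 2-byte values are all hypothetical placeholders, since the article does not give them. Only the 2 key-value heads and 64 query heads are from the article.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads, head_dim=128, bytes_per_value=2):
    """K and V cache for one sequence: grows linearly with seq_len."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def recurrent_state_bytes(mamba_layers, state_values_per_layer, bytes_per_value=2):
    """Fixed-size recurrent state for one sequence: no seq_len term at all."""
    return mamba_layers * state_values_per_layer * bytes_per_value


ATTN_LAYERS = 12          # hypothetical; the article only says "a few"
MAMBA_LAYERS = 48         # hypothetical
STATE_VALUES = 1_000_000  # hypothetical per-layer state size

for seq_len in (8_000, 64_000, 1_000_000):
    kv = kv_cache_bytes(seq_len, ATTN_LAYERS, kv_heads=2)
    state = recurrent_state_bytes(MAMBA_LAYERS, STATE_VALUES)
    print(f"{seq_len:>9} tokens: KV cache {kv / 1e9:.2f} GB, "
          f"recurrent state {state / 1e9:.2f} GB")

# Effect of caching 2 KV heads instead of one per query head (64).
saving = kv_cache_bytes(1, 1, kv_heads=64) / kv_cache_bytes(1, 1, kv_heads=2)
print("KV cache reduction from 64 to 2 KV heads:", saving)
```

Whatever the real constants are, the shape is the same: the attention cache scales with sequence length while the recurrent state does not, and using 2 key-value heads instead of 64 shrinks the cache per attention layer by a factor of 32.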

Pretraining and the Data Release

Pretraining used a Warmup-Stable-Decay learning rate schedule over 20 trillion tokens, split into two phases. The first 15 trillion tokens were biased toward diversity. The final 5 trillion were biased toward high-quality data.
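A Warmup-Stable-Decay schedule has three segments: a ramp up to the peak rate, a long plateau, and a final decay. The sketch below uses linear warmup and linear decay. The decay shape, the step counts, and the function name `wsd_lr` are assumptions, as the article does not state the actual hyperparameters.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Learning rate at a given step under a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical run: 1,000 steps, 100 warmup steps, 200 decay steps.
for step in (0, 99, 500, 800, 900, 1000):
    print(step, wsd_lr(step, total_steps=1000, peak_lr=1.0,
                       warmup_steps=100, decay_steps=200))
```

One practical property of this shape is that the decay phase can be started from any checkpoint on the plateau, which fits a recipe where the total token horizon is decided late.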

NVIDIA also released new domain-specific pretraining datasets. These include 173 billion refreshed GitHub code tokens. In a Nemotron 3 Nano ablation, a synthetic legal set raised a proxy LegalBench average from 64.6 to 74.7. In a similar ablation, a Wiki-based fact-seeking set raised proxy SimpleQA from 40.2 to 50.2.

The post-training release is also large. NVIDIA adds 10 million new SFT samples and 1 million new RL tasks. It adds 15 new RL environments. Cumulative Nemotron open totals reach 50M SFT samples, 2M RL tasks, and 55 RL environments.

Training was not entirely smooth. NVIDIA documents two loss divergences and treats them as a useful engineering record. The first, near 8 trillion tokens, traced to moving output-layer gradient reduction from FP32 to BF16. The MTP gradient contribution was effectively lost in BF16’s 7 mantissa bits. Reverting to FP32 gradient reduction re-stabilized training.
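The general mechanism is easy to reproduce: with only 7 stored mantissa bits, BF16 drops any addend that is small relative to the running total. The sketch below emulates BF16 rounding in pure Python. It illustrates the numeric effect only and does not reconstruct NVIDIA's training setup. The helper names are hypothetical, and NaN and infinity handling is omitted.

```python
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, plus 1 when the kept LSB is odd.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_add(a, b):
    """Add two numbers with the inputs and the result rounded to bfloat16."""
    return to_bf16(to_bf16(a) + to_bf16(b))


large = 1.0       # stands in for a dominant gradient term
small = 2.0 ** -9  # stands in for a much smaller contribution

acc_bf16 = large
acc_wide = large
for _ in range(1000):
    acc_bf16 = bf16_add(acc_bf16, small)  # every addition rounds back to 1.0
    acc_wide = acc_wide + small           # wider accumulator keeps the sum

print("bf16 accumulator:", acc_bf16)
print("wide accumulator:", acc_wide)
```

Near 1.0 the spacing between adjacent BF16 values is 2^-7, so a contribution of 2^-9 rounds away on every single addition, no matter how many times it is added.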

The second divergence, near 16 trillion tokens, had no confirmed root cause. NVIDIA mitigated it by annealing the learning rate early. It then cut the total token horizon to 20 trillion tokens.

Post-Training: SFT, RLVR, and MOPD

The post-training pipeline runs SFT, then unified RLVR, then MOPD warmup, MOPD, and MTP Boosting. The whole loop can repeat for several cycles.

RLVR stands for Reinforcement Learning with Verifiable Reward. It trains across many environments at once: terminal use, software engineering, search, math, code, safety, and more. The reward in these settings is often sparse and environment-dependent.

MOPD is the main new post-training method. Mixed-environment RLVR dilutes the learning signal as the number of environments grows. To address this, NVIDIA trained more than ten domain-specialized teacher models. Each teacher has its own training pipeline.

During MOPD, the student model generates its own rollouts across domains. Each rollout is scored by the matching teacher with dense, token-level guidance. This is a denser signal than RLVR’s sparse rewards. The process runs asynchronously, with rollout generation, teacher scoring, and student updates pipelined.
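The difference between a sparse reward and dense token-level guidance can be sketched as follows. The per-token signal used here, teacher log-probability minus student log-probability on the tokens the student sampled, is a common formulation of on-policy distillation and is an assumption. The article does not state NVIDIA's exact objective, and the probabilities and function name are made up for illustration.

```python
import math


def token_level_signal(student_logprobs, teacher_logprobs):
    """Per-token signal on a student rollout: teacher log-prob minus student log-prob.

    Positive values mean the teacher likes the sampled token more than the
    student does; negative values mean the teacher would rarely produce it.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both models must score the same sampled tokens")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


# Hypothetical probabilities each model assigns to three tokens the student sampled.
student_lp = [math.log(0.5), math.log(0.2), math.log(0.9)]
teacher_lp = [math.log(0.6), math.log(0.05), math.log(0.9)]

dense = token_level_signal(student_lp, teacher_lp)
print("dense per-token signal:", [round(v, 3) for v in dense])

# A sparse verifiable reward, by contrast, is one number for the whole rollout.
sparse_reward = 0.0  # e.g. the final answer failed verification
print("sparse rollout reward:", sparse_reward)
```

With the dense signal, the second token is singled out as the step the teacher disagrees with, while the sparse reward only says the rollout as a whole failed.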

MOPD is also iterative. After one MOPD checkpoint, new teachers are initialized from the improved student. Their gains merge back into the next round. NVIDIA ran two MOPD iterations for Nemotron 3 Ultra.

One practical caveat is worth noting. MOPD works best when student rollouts stay within the teacher’s support. A brief SFT warmup aligns the two distributions first. NVIDIA found that gains are smaller on self-contained reasoning tasks the student rarely samples.

Reasoning Effort Control

Nemotron 3 Ultra supports three reasoning modes: reasoning-off, regular, and medium-effort. The regular and medium modes also accept an inference-time budget control.

Medium-effort is the efficiency lever. NVIDIA reports it uses about 2.5x fewer tokens than regular mode, at the cost of roughly a 7% drop in accuracy. For high-volume agent steps, that trade can lower spend meaningfully.
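As a rough worked example of that trade, the sketch below compares daily output-token spend in the two modes. The step volume, tokens per step, and price are hypothetical placeholders. Only the 2.5x token reduction comes from the article.

```python
def daily_output_cost(steps_per_day, tokens_per_step, price_per_million_tokens):
    """Output-token spend per day, in the same currency as the price."""
    return steps_per_day * tokens_per_step * price_per_million_tokens / 1_000_000


STEPS_PER_DAY = 100_000              # hypothetical agent volume
REGULAR_TOKENS = 5_000               # hypothetical tokens per step in regular mode
MEDIUM_TOKENS = REGULAR_TOKENS / 2.5  # 2.5x fewer tokens, per the article
PRICE_PER_MILLION = 2.0              # hypothetical price

regular = daily_output_cost(STEPS_PER_DAY, REGULAR_TOKENS, PRICE_PER_MILLION)
medium = daily_output_cost(STEPS_PER_DAY, MEDIUM_TOKENS, PRICE_PER_MILLION)
print("regular mode:", regular)
print("medium-effort mode:", medium)
print("fraction of regular spend:", medium / regular)
```

A 2.5x reduction means medium-effort mode spends 40% of the output tokens of regular mode. Whether that is worth the reported accuracy drop depends on how costly a failed step is for the workload.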

The Benchmark Case

The comparisons in NVIDIA’s technical report use GLM-5.1 (754B), Kimi-K2.6 (1T), and Qwen-3.5 (397B), among others. The picture is competitive rather than dominant.

On agentic tasks, Nemotron 3 Ultra posts 90.0 on PinchBench and 56.0 on ProfBench (Search). NVIDIA reserved both as held-out generalization gates, scored only once on the final model. The model scores 71.9 on SWE-Bench Verified and 56.4 on Terminal Bench 2.1. On Terminal Bench, Kimi-K2.6 leads at 67.2.

On reasoning, it scores 570.0 on IOI 2025. NVIDIA frames this as top-3-human-level competitive programming. On AA-Omniscience, it records the highest non-hallucination score in the set at 78.7. That suggests a lower tendency to answer when uncertain.

Long context holds up at scale. The model scores 94.7 on RULER at 1 million tokens. Several larger comparison models top out at 256K context.

On an 8K input / 64K output setting at NVFP4 on GB200, Nemotron 3 Ultra reaches 5.9x the throughput of GLM-5.1, 4.8x that of Kimi-K2.6, and 1.6x that of Qwen-3.5. These figures are not measured on a common serving stack: Nemotron’s numbers use TRT-LLM, while the others use vLLM.

The trade-off is visible on prefill-heavy work. On a 50K input / 2K output setting, it trails Qwen-3.5, because prefill cost tracks active parameters. NVIDIA also reports up to 30% lower cost to task completion, from fewer tokens per turn on SWE-Bench and Terminal Bench.

NVIDIA also stresses harness robustness. The model is trained under multiple agent harnesses per task type, not one. SWE-Bench Verified scores stay between 65% and 70.4% across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent. The goal is consistent behavior regardless of deployment framework.

Quantization and Deployment

NVIDIA ships a single NVFP4 checkpoint. On Blackwell it runs with native FP4 math. On Hopper it runs as W4A16 (4-bit weights, 16-bit activations), since Hopper lacks native FP4 tensor cores.

The final checkpoint averages 5.03 bits per element. It mixes NVFP4 routed experts with FP8 layers for shared experts and Mamba linears. Attention layers stay in BF16. NVIDIA found accuracy saturated below this budget, so higher precision added no measurable gain.

The reduced weight footprint has a deployment benefit. The W4A16 path leaves room to fit MTP weights on a single 8-GPU H100 node. An FP8 checkpoint could not, without spanning two nodes.
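A back-of-the-envelope calculation shows the scale of the difference. It assumes 80 GB of memory per H100 and applies the bit widths to the 550B total. Whether that total includes the MTP weights is not stated in the article. It counts weights only, so KV cache, recurrent state, activations, and runtime overhead are not modeled.

```python
def weight_gigabytes(num_params, bits_per_element):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_element / 8 / 1e9


TOTAL_PARAMS = 550e9   # from the article
NODE_HBM_GB = 8 * 80   # assumption: 8 GPUs with 80 GB each

mixed = weight_gigabytes(TOTAL_PARAMS, 5.03)  # mixed-precision checkpoint
fp8 = weight_gigabytes(TOTAL_PARAMS, 8.0)     # hypothetical all-FP8 checkpoint

print(f"5.03-bit weights: {mixed:.1f} GB, headroom {NODE_HBM_GB - mixed:.1f} GB")
print(f"FP8 weights:      {fp8:.1f} GB, headroom {NODE_HBM_GB - fp8:.1f} GB")
```

Under these assumptions the mixed-precision weights take about 346 GB against 550 GB for FP8, leaving roughly 294 GB versus 90 GB on the node for everything else. That is consistent with the article's point that only the 4-bit path leaves comfortable room for MTP weights and serving state.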

Key Takeaways

Nemotron 3 Ultra is a 550B open MoE (55B active) using a hybrid Mamba-Attention design for long-running agents.
NVIDIA reports up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy (5.9x vs GLM-5.1 on 8K/64K).
It pairs a 1M-token context with the highest non-hallucination score in its comparison set (78.7 on AA-Omniscience).
Post-training centers on Multi-teacher On-Policy Distillation (MOPD), distilling 10+ specialized teachers into one student.
Weights, training data, and recipes ship openly under OpenMDW-1.1, with one NVFP4 checkpoint for Blackwell, Hopper, and Ampere.

Marktechpost’s Visual Explainer

NVIDIA Nemotron 3 Ultra

Open Model Release

Nemotron 3 Ultra: a 550B open MoE built for long-running agents

An open Mixture-of-Experts hybrid Mamba-Transformer for agentic reasoning, tool use, and long-context tasks.

Total / Active: 550B / 55B (sparse MoE, 55B active per token)
Context: 1M tokens (extended after 20T-token pretraining)
Throughput: ~6x (up to ~6x vs comparable open LLMs)
License: OpenMDW-1.1 (open weights, data, and recipes)

Pre-trained on 20T tokens, then post-trained with SFT, RLVR, and Multi-teacher On-Policy Distillation (MOPD).

What It Is

A hybrid Mamba-Attention MoE, not a pure Transformer

Hybrid stack: Mamba layers scale sub-quadratically; a few Attention layers preserve precise recall.
Sparse MoE: 550B total parameters, 55B active per token, improving accuracy per active parameter.
Long context: pretrained on 20T text tokens, then extended to a 1M-token window.
Open release: base, post-trained, and NVFP4 checkpoints, plus training data and recipes.

Throughput gains come mainly from the hybrid Mamba-Attention design, which bounds KV-cache footprint.

Architecture

108 layers, 512 experts per layer, top-22 routing

Layers: 108 (model dimension 8,192)
Attention: 64 / 2 (query heads / KV heads)
Experts: 512 (top-22 activated per token)
Precision: NVFP4 (E2M1, 2D block quantization)

Key techniques

LatentMoE: more routed experts at fixed inference cost by trading hidden-dimension width.
Multi-Token Prediction (MTP): predicts several tokens per pass; two heads share parameters.
NVFP4 pre-training: NVIDIA’s largest-scale stable, accurate FP4 training run to date.

Pretraining & Data

20T tokens in two phases, plus new open datasets

Two-phase curriculum: 15T tokens biased for diversity, then 5T biased for quality.
Code refresh: 173B new GitHub tokens with a September 30, 2025 cutoff.
Domain data (Nano ablations): legal lifted proxy LegalBench 64.6 to 74.7; Wiki lifted proxy SimpleQA 40.2 to 50.2.
Post-training data: +10M SFT samples and +1M RL tasks; totals reach 50M SFT, 2M RL tasks, 55 environments.

NVIDIA documents two loss divergences (near 8T and 16T tokens) and the fixes used to stabilize training.

Post-Training

MOPD: distilling 10+ specialized teachers into one student

SFT → RLVR → MOPD Warmup → MOPD → MTP Boosting

Why MOPD: mixed-environment RLVR dilutes the signal as the number of environments grows.
How it works: the student generates rollouts; each teacher scores them with dense token-level guidance.
Asynchronous: rollout generation, teacher scoring, and student updates run pipelined.
Iterative: NVIDIA ran two MOPD iterations, re-initializing teachers from the improved student.

A short SFT warmup keeps student rollouts within each teacher’s support before distillation.

Benchmarks

Competitive across agentic, reasoning, and long-context tasks

PinchBench (held-out): 90.0 (top tier of evaluated open models)
SWE-Bench Verified: 71.9 (software engineering agents)
IOI 2025: 570.0 (top-3-human-level, NVIDIA framing)
RULER @ 1M: 94.7 (long-context retrieval)

ProfBench (Search): 56.0, the second held-out generalization gate.
AA-Omniscience: highest non-hallucination score in the set at 78.7.
Terminal Bench 2.1: 56.4, where Kimi-K2.6 leads at 67.2.

Throughput & Efficiency

Faster on decode-heavy work; budget control for cost

vs GLM-5.1: 5.9x (8K in / 64K out, NVFP4 on GB200)
vs Kimi-K2.6: 4.8x (same decode-heavy setting)
vs Qwen-3.5: 1.6x (trails Qwen on prefill-heavy work)
Cost to complete: ~30% lower (from fewer tokens per turn)

Reasoning modes: reasoning-off, regular, and medium-effort, with inference-time budget control.
Medium-effort: about 2.5x fewer tokens for roughly a 7% accuracy drop.

Throughput is reported with TRT-LLM for Nemotron and vLLM for the other models.

Quantization, Licensing & Takeaways

One NVFP4 checkpoint, across NVIDIA GPU generations

Single checkpoint: native FP4 on Blackwell, W4A16 on Hopper, also runs on Ampere.
Operating point: 5.03 bits-per-element, mixing NVFP4 experts with FP8 and BF16 layers.
Footprint win: the W4A16 path fits MTP weights on a single 8-GPU H100 node.
Fully open: weights, data, and recipes under OpenMDW-1.1; fine-tune via LoRA, SFT, or RL.

Not the top scorer on every benchmark. The design favors throughput, long context, and reliability for agents.

Curated by Marktechpost — AI/ML research & dev news for engineers and data scientists

Sources: NVIDIA Nemotron 3 Ultra technical report & blog · Verified Jun 4, 2026
