
NVIDIA Drops Nemotron 3 Super With 5x Throughput Gains for AI Agents

2026/03/12 06:44
3 min read

Felix Pinkston Mar 11, 2026 22:44

NVIDIA releases Nemotron 3 Super, a 120B parameter open model delivering 5x higher throughput for agentic AI with a 1M-token context window.

On March 11, 2026, NVIDIA launched Nemotron 3 Super, a 120-billion-parameter open model that delivers 5x higher throughput than its predecessor while targeting the computational bottlenecks that have plagued multi-agent AI systems.

The model activates only 12 billion of its 120 billion parameters per inference call. This sparse activation pattern, powered by a hybrid Mamba-Transformer Mixture-of-Experts architecture, slashes the compute requirements that typically make large reasoning models impractical for continuous operation.
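
The sparse-activation idea can be sketched with a toy top-1 Mixture-of-Experts layer. The expert count, dimensions, and routing below are illustrative only, not Nemotron's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 10 experts, but each token is routed to only the
# top-1 expert, so ~1/10 of the layer's expert weights participate
# per inference call (mirroring the 12B-of-120B ratio described above).
n_experts, d_model = 10, 16
experts = rng.standard_normal((n_experts, d_model, d_model))  # expert weights
router = rng.standard_normal((d_model, n_experts))            # routing matrix

def moe_forward(x):
    logits = x @ router                # score each expert for this token
    chosen = int(np.argmax(logits))    # top-1 routing
    return x @ experts[chosen], chosen

x = rng.standard_normal(d_model)
y, expert_id = moe_forward(x)

active = experts[expert_id].size
total = experts.size
print(f"expert {expert_id}: {active}/{total} expert params active "
      f"({active/total:.0%})")
```

The compute saving is exactly the activation ratio: only one expert's weight matrix is multiplied per token, regardless of how many experts the model stores.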

Why Multi-Agent AI Has Been Stuck

Multi-agent systems generate up to 15x the tokens of standard chat applications. Every turn requires re-sending conversation history, tool outputs, and reasoning steps. NVIDIA calls this the "context explosion" problem—and it causes agents to gradually drift from their original objectives over extended tasks.
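
A toy calculation with assumed numbers (not NVIDIA's figures) shows how quickly re-sent history compounds: if each turn re-transmits everything before it, total tokens processed grow quadratically with turn count.

```python
# Illustrative arithmetic: a chat app processes each turn's tokens once,
# while an agent that re-sends full history processes turn t's context
# proportional to t. Both numbers below are hypothetical.
turn_tokens = 500   # assumed tokens produced per turn
turns = 30

chat_total = turn_tokens * turns                                  # each turn sent once
agent_total = sum(turn_tokens * t for t in range(1, turns + 1))   # history resent each turn

print(chat_total, agent_total, agent_total / chat_total)
```

With these assumed numbers, a 30-turn workflow processes about 15x the tokens of the equivalent chat session, which is the same order of magnitude as the figure quoted above.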

The second constraint? The "thinking tax." Running massive reasoning models for every subtask makes multi-agent applications too expensive and slow for production deployment.

Nemotron 3 Super attacks both problems simultaneously. Its native 1-million-token context window gives agents persistent memory across long workflows. The hybrid architecture keeps latency low enough for concurrent agent deployment at scale.

Technical Architecture Worth Noting

The model introduces several architectural innovations that separate it from standard transformer designs:

Latent MoE compresses token embeddings before routing to experts, enabling the model to consult 4x as many specialists for identical computational cost. This granularity matters when a single conversation spans tool calls, code generation, and data analysis within a few turns.
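
The cost logic behind latent routing can be sketched in a few lines. This is our illustration of the general idea, not NVIDIA's implementation; dimensions, the quadratic-form router, and expert counts are all made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "latent MoE": compress the token embedding before expert routing,
# so each expert operates in a narrower latent space and is cheaper.
d_model, d_latent = 64, 32                # 2x compression of the routed embedding
down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
experts = rng.standard_normal((8, d_latent, d_latent))  # experts act on latents

def latent_moe(x, k=4):
    z = x @ down                                   # compress before routing
    scores = np.array([z @ e @ z for e in experts])
    top = np.argsort(scores)[-k:]                  # consult top-k experts
    z_out = sum(z @ experts[i] for i in top) / k
    return z_out @ up                              # project back to d_model

x = rng.standard_normal(d_model)
y = latent_moe(x)

# Per-expert cost scales with the routed dimension squared, so a 2x
# narrower latent buys 4x as many expert consultations per unit FLOPs.
full_cost = d_model ** 2
latent_cost = d_latent ** 2
print(y.shape, full_cost // latent_cost)
```

The quadratic scaling is the point: halving the routed dimension quarters each expert's cost, which is where the "4x as many specialists for identical cost" headroom comes from.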

Multi-token prediction forecasts several future tokens in one forward pass. Beyond training benefits, this enables built-in speculative decoding—up to 3x wall-clock speedups for structured generation tasks like code without requiring a separate draft model.

Native NVFP4 pretraining runs the majority of operations in 4-bit precision from the first gradient update. The model learns accuracy within these constraints rather than suffering post-training quantization losses. NVIDIA claims 4x inference speedup on B200 GPUs compared to FP8 on H100.
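
The flavor of blockwise 4-bit quantization can be illustrated with a generic integer scheme. Note this is only an approximation for intuition: the actual NVFP4 format uses a floating-point E2M1 layout with shared block scales, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(2)

# Generic blockwise symmetric 4-bit quantization (illustrative only).
# Each block of 16 weights shares one scale; values are rounded to the
# signed 4-bit integer range -7..7.
def quantize_4bit(w, block=16):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).ravel()

w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()
print(q.dtype, err)
```

Training natively under this kind of constraint, rather than quantizing afterward, lets the optimizer route around the rounding error instead of absorbing it as a post-hoc accuracy loss.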

Benchmark Performance

On PinchBench—a benchmark measuring LLM performance as the "brain" of autonomous agents—Nemotron 3 Super scores 85.6% across the full test suite. NVIDIA claims this makes it the best open model in its class for agentic applications.

The model was post-trained with reinforcement learning across 21 environment configurations using NeMo Gym, generating over 1.2 million environment rollouts during training. This trajectory-based approach optimizes for reliable behavior across multi-step workflows rather than for polished single-turn responses.

Open Everything

NVIDIA released the complete package: weights on Hugging Face, 10 trillion curated pretraining tokens, 40 million post-training samples, and full training recipes. The NVIDIA Nemotron Open Model License allows enterprise deployment anywhere.

Deployment cookbooks cover vLLM, SGLang, and TensorRT-LLM. The model runs through Perplexity Pro, OpenRouter, and build.nvidia.com, with additional availability through Baseten, Cloudflare, DeepInfra, Fireworks AI, and Together AI.

NVIDIA positions Nemotron 3 Super alongside Nemotron 3 Nano (released December 2025) for tiered deployment—Nano handles targeted individual steps while Super manages complex multi-step planning. The upcoming Nemotron 3 Ultra will complete the family for expert-level tasks.

  • nvidia
  • nemotron
  • ai agents
  • open source ai
  • machine learning