NVIDIA Nemotron 3 Super 120B: What It Is and When to Use It

Released in March 2026, Nemotron 3 Super 120B is NVIDIA’s most capable open-weight model to date. It is not trying to be the smartest model in the room – it is trying to be the fastest open model that can reliably run multi-step AI agents, and it succeeds at that specific goal.


What It Is

Nemotron 3 Super is a 120.6 billion parameter model, but only 12.7 billion parameters are active on any given forward pass. This is the key to understanding it: it is a sparse Mixture-of-Experts (MoE) model, where most of the network is dormant at inference time. You get roughly the capacity of a 120B model at the per-token compute cost of a ~13B model.
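The sparse-MoE idea, a router selecting a few experts per token while the rest stay idle, can be sketched in a few lines. This is a toy illustration under the assumption of simple top-k routing over scores; real routers are learned and add load balancing:

```python
import random

def route_top_k(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

# Toy setup: 8 experts, but only k=2 run per token, so the expert-layer
# compute is roughly k/8 of a dense model with the same parameter count.
random.seed(0)
scores = [random.random() for _ in range(8)]
active = route_top_k(scores, k=2)
print(active)  # indices of the 2 experts that actually execute
```

The 120.6B-total / 12.7B-active split comes from exactly this mechanism: all experts exist in memory, but only the routed ones run.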

The architecture is a hybrid of three layer types:

- Mamba (state-space) layers, which process sequences in linear time
- standard Transformer attention layers
- sparse Mixture-of-Experts feed-forward layers

On top of this, NVIDIA added Multi-Token Prediction – the model predicts multiple tokens simultaneously rather than one at a time, delivering up to 3x speedup on structured generation without a separate speculative decoding model.
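The accept-or-reject step behind this kind of speedup can be shown with a toy sketch. This is a simplification under my own naming (`accept_prefix` is illustrative); the real mechanism uses extra prediction heads inside the model, but the payoff is the same, multiple tokens per step when the drafts agree:

```python
def accept_prefix(proposed, verified):
    """Keep the longest prefix of drafted tokens the verifier agrees with."""
    accepted = []
    for p, v in zip(proposed, verified):
        if p != v:
            break
        accepted.append(p)
    return accepted

# If the extra heads draft 4 tokens and the first 3 match the model's own
# next-token choices, one step emits 3 tokens instead of 1.
print(accept_prefix([5, 9, 2, 7], [5, 9, 2, 4]))  # -> [5, 9, 2]
```

Structured output (JSON, code) is highly predictable token-to-token, which is why the speedup is largest there.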

The result: 458 tokens per second, ranked #1 for speed among all open-weight models tested. Context window: 1 million tokens.


How It Compares

Model                    Intelligence score   Speed          Open weights
Gemini 3.1 Pro Preview   57                   High           No
GPT-5.4                  57                   High           No
Claude Opus 4.6          53                   Medium         No
Claude Sonnet 4.6        52                   Medium         No
Nemotron 3 Super 120B    36                   458 t/s (#1)   Yes
Llama 3.3 70B            ~30                  High           Yes

The intelligence score gap is real – Nemotron sits well below the frontier proprietary models on raw reasoning. But among open-weight models it ranks #2 out of 54, and no open model comes close to its throughput.

Benchmark highlights:


Why You Would Use It

1. Data never leaves your infrastructure

The model weights are fully open under the NVIDIA Nemotron license, free for commercial use. You can run it on your own GPUs on-premises, in a private cloud, or in a regulated environment. For healthcare, legal, financial services, or any company with data residency requirements, this is often non-negotiable – and something no API-only proprietary model can offer.

2. Agentic AI systems at scale

Nemotron was specifically designed and evaluated for multi-agent workflows: planning, tool calling, processing results, and executing across multiple steps. Multi-agent systems generate far more tokens than standard chat – long context and reliable tool calling matter more than single-turn accuracy. The 1M token context window means an agent can hold an entire codebase, financial report, or document history in memory without truncation.
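The loop such agents run can be sketched with a stubbed model standing in for a real endpoint. All names here (`run_agent`, `fake_model`) are illustrative rather than any SDK's API; the message shapes loosely follow the OpenAI chat convention:

```python
import json

def run_agent(call_model, tools, task, max_steps=5):
    """Minimal plan -> tool call -> observe loop for a tool-using agent."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)  # returns an OpenAI-style message dict
        if "tool_call" not in reply:
            return reply["content"]   # no tool requested: final answer
        call = reply["tool_call"]
        messages.append({"role": "assistant", "content": json.dumps(call)})
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None  # gave up after max_steps

# Stubbed model for illustration: asks for one tool call, then answers.
def fake_model(messages):
    if messages[-1]["role"] == "user":
        return {"tool_call": {"name": "add", "arguments": {"a": 2, "b": 3}}}
    return {"content": "The sum is 5."}

print(run_agent(fake_model, {"add": lambda a, b: a + b}, "What is 2+3?"))
```

Every iteration of this loop both consumes and produces tokens, which is why throughput and context length dominate the economics of agentic workloads.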

3. Throughput for high-volume workloads

At 458 tokens/second, it generates output more than 2x faster than GPT-OSS-120B and 5x faster than the previous Nemotron Super. If you are running dozens of concurrent agents or need low-latency responses at volume, the throughput advantage is significant. Proprietary APIs also impose rate limits; self-hosted infrastructure does not.

4. Cost

On inference APIs: OpenRouter offers it free, and DeepInfra charges $0.10 per 1M input tokens and $0.50 per 1M output tokens – an order of magnitude cheaper than GPT-4o or Claude Sonnet at equivalent throughput. Self-hosted, the only cost is your GPU compute.
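A quick worked comparison makes the gap concrete. The $3 / $15 per-1M frontier-API rates below are a hypothetical round figure for contrast, not a quote:

```python
def job_cost(in_tokens, out_tokens, in_rate, out_rate):
    """Dollar cost of a job at per-million-token rates."""
    return in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate

# A 10M-input / 2M-output agent workload at DeepInfra's listed Nemotron
# rates, versus a hypothetical $3 / $15 per-1M frontier API:
print(job_cost(10e6, 2e6, 0.10, 0.50))   # 2.0  (dollars)
print(job_cost(10e6, 2e6, 3.00, 15.00))  # 60.0 (dollars)
```

For the token volumes that multi-agent systems generate, this per-job difference compounds quickly.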


Shortcomings

Verbosity. Nemotron ranked last (54th of 54) for verbosity in benchmark testing, generating 110 million tokens during evaluation versus a median of 7.4 million. In practice, this means it tends to over-explain. This increases cost and latency in production and requires careful prompt engineering to control output length.

Reasoning mode is required for tool calling. Without reasoning mode enabled, tool-calling performance degrades significantly. Reasoning mode adds internal thinking tokens before every response – useful for accuracy, expensive in wall-clock time and token cost.

Accuracy gap vs. frontier models. On raw knowledge (MMLU-type tasks), Qwen3.5-122B leads. On math, GPT-OSS-120B edges ahead on some benchmarks. On coding, several models outperform it in pure function-level accuracy. If you need the most capable model for a single high-stakes query, Claude or GPT-4o is still ahead.

Code context coverage thins. SWE-Bench performance is strong on issues that are clear from the local code patch, but weakens when correct behavior requires deeper understanding of surrounding context. It stays close to the diff rather than reasoning broadly about system behavior.

Mamba long-context retrieval trade-off. Mamba-based layers process sequences in linear time but have historically been weaker than pure attention at precisely retrieving information from distant positions in context. NVIDIA’s Transformer layers mitigate this, but the architectural trade-off remains real compared to full-attention models.
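The scaling trade-off behind this is simple to state: full attention does quadratic work in sequence length, while a state-space layer does linear work. A toy comparison:

```python
def attn_pairs(n):
    """Full attention relates every token to every other token: O(n^2)."""
    return n * n

def ssm_steps(n):
    """A state-space (Mamba) layer makes one fixed-size update per token: O(n)."""
    return n

# At a 1M-token context the gap is a factor of n itself, which is why the
# hybrid keeps only some attention layers and lets Mamba carry the length.
n = 1_000_000
print(attn_pairs(n) // ssm_steps(n))  # 1000000
```

The flip side is exactly the retrieval weakness described above: the fixed-size recurrent state must compress everything it has seen, whereas attention can look back at any position directly.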


How to Access It

API (easiest):

- OpenRouter (free)
- DeepInfra ($0.10/1M input, $0.50/1M output tokens)

All use OpenAI-compatible endpoints – just change the base URL and model name.

Self-hosted via NVIDIA NIM:

NIM packages the model as a Docker container with an OpenAI-compatible API on port 8000:

docker run --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:latest

Then call it like any OpenAI endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/nemotron-3-super-120b-a12b", "messages": [{"role": "user", "content": "Hello"}]}'

Three precision options are available: BF16 (full), FP8 (quantized), and NVFP4 (ultra-quantized for Blackwell GPUs with 4x efficiency gains).
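A rough weight-only memory estimate shows why multiple high-VRAM GPUs are needed at BF16. This ignores KV cache and activations, so real deployments need additional headroom:

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate memory for the weights alone, in GB."""
    # 1e9 params * bytes / 1e9 bytes-per-GB: billions of params map to GB directly.
    return params_billion * bytes_per_param

# 120.6B parameters at each precision width:
for name, width in [("BF16", 2), ("FP8", 1), ("NVFP4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(120.6, width):.1f} GB")
```

At ~241 GB for BF16 weights alone, no single 80 GB A100/H100 can hold the model, hence the multi-GPU requirement; NVFP4's ~60 GB is what puts it within reach of far fewer devices.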

Hardware requirement: Multiple high-VRAM GPUs (A100/H100 class) for BF16. FP4 on Blackwell (DGX B200, GB200) runs it on significantly fewer GPUs.

Hugging Face / Ollama: Full weights available for local download and inference.


The Bottom Line

Nemotron 3 Super is the right choice when you need:

- high throughput for multi-step, multi-agent workloads
- open weights and full data residency on your own infrastructure
- a very long (1M-token) context window
- low per-token cost at volume

It is not the right choice when you need:

- frontier-level reasoning on single high-stakes queries
- top accuracy on raw knowledge, math, or coding benchmarks
- tightly controlled, concise output without prompt engineering

Think of it as the workhorse of open agentic AI: not the sharpest, but fast, efficient, open, and purpose-built for the work that enterprise AI systems actually do.