TL;DR: NVIDIA Nemotron 3 — fully open hybrid Mamba–Transformer MoE models (30B→500B)
Major Highlights:
- Truly open release (weights, data, recipes, RL stack): NVIDIA’s Nemotron 3 Nano (30B) lands as one of the most complete open model drops to date: model weights, pre- and post-training code, recipes, and all redistributable datasets are released, plus a full agent RL suite (NeMo Gym/NeMo-RL). Licensed under the NVIDIA Open Model License with commercial use allowed.
- Hybrid Mamba–Transformer + MoE with 1M context: The Nano model interleaves Mamba-2 state-space layers, sparse MoE, and selective self-attention to deliver a 1,000,000-token context window at high throughput. This sets a new open baseline for long-context, efficient inference.
- Competitive small-model performance: Nemotron 3 Nano posts best-in-class small-model results on SWE-Bench and scores 52 on the Artificial Analysis Intelligence Index (+6 vs Qwen3-30B A3B), while sustaining ~380 tokens/sec on DeepInfra, a strong quality/speed trade-off for a 30B-class model.
- Bigger models coming with LatentMoE + NVFP4: Super (~100–120B) and Ultra (~400–500B) models are “coming soon,” featuring NVFP4 pretraining and LatentMoE routing in a lower-dimensional latent subspace to cut all‑to‑all communication and expert compute.
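The hybrid layout described above can be pictured as a per-layer schedule that interleaves Mamba-2 sequence mixing, sparse-MoE FFNs, and occasional self-attention. The ratio and ordering in this sketch are purely illustrative assumptions; the source only states that the three layer types are interleaved:

```python
# Illustrative sketch (NOT NVIDIA's published config): a hybrid stack that
# interleaves Mamba-2 SSM layers, sparse-MoE FFN layers, and occasional
# self-attention layers. The interleaving ratio below is hypothetical.
def build_layer_schedule(n_layers: int, attn_every: int = 6) -> list[str]:
    """Return a per-layer type list: mostly Mamba-2 + MoE, sparse attention."""
    schedule = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            schedule.append("attention")  # selective global-dependency modeling
        elif i % 2 == 0:
            schedule.append("mamba2")     # linear-time, memory-efficient mixing
        else:
            schedule.append("moe_ffn")    # sparse expert FFN (few experts active)
    return schedule

print(build_layer_schedule(12))
```

The design intuition is that a small number of attention layers suffices for global dependencies, while the linear-time Mamba-2 layers carry most of the long-context work.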
Key Technical Details:
- Model lineup: Nano 30B total parameters (~3.6B active per token via MoE). Super and Ultra planned at ~100–120B and ~400–500B.
- Architecture: Hybrid stack with interleaved Mamba-2 (SSM) and MoE layers plus selective self-attention; 1M-token context support.
- Training/inference:
  - Nano released today; Super/Ultra to follow.
  - Super/Ultra will use NVFP4 pretraining; Nano ships the hybrid MoE/Mamba stack today (LatentMoE documented for the larger SKUs).
- Performance:
  - ~380 tok/s (DeepInfra); strong SWE-Bench; AAII: 52 (+6 vs Qwen3-30B A3B).
- Open assets: Weights; pre/post-training recipes; redistributable datasets (e.g., Nemotron‑Math, Nemotron‑Math‑Proofs, agentic data); NeMo Gym for multi-environment RL.
- Ecosystem (day‑0): vLLM, SGLang, llama.cpp, GGUF (Unsloth), Baseten, Together, Hugging Face collections.
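The "30B total / ~3.6B active per token" figures imply a rough compute picture. Only the two totals come from the source; the FLOPs line uses the standard ~2 FLOPs-per-active-parameter rule of thumb, so treat it as a back-of-envelope estimate:

```python
# Back-of-envelope: sparse MoE activates only a fraction of Nemotron 3 Nano's
# weights per token. The two parameter counts come from the summary; the
# derived numbers are rough approximations, not published figures.
total_params = 30e9      # total parameters (from the summary)
active_params = 3.6e9    # active parameters per token (from the summary)

active_fraction = active_params / total_params
# Common dense-compute rule of thumb: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * active_params

print(f"active fraction: {active_fraction:.0%}")     # ~12% of weights per token
print(f"approx FLOPs/token: {flops_per_token:.1e}")
```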
Community Response/Impact:
- Researchers praise the release for reproducibility and transparency (open data + recipes) and for elevating agent-focused R&D via NeMo Gym.
- Practitioners highlight immediate deployability due to wide inference stack support and strong throughput.
- Comparisons note that while Nemotron often lags SOTA on headline leaderboards, Nano’s openness and long-context efficiency make it a new reference checkpoint for training and agent workflows.
- Broader context: NVIDIA deepens its end-to-end AI stack (alongside moves like the SLURM acquisition), prompting debate on ecosystem dependency and portability.
First Principles Analysis:
- Why this works: Mamba (state-space models) provides linear-time, memory-efficient long-context handling; sparse MoE activates a small subset of parameters per token for high capacity at low compute. Selective self-attention fills gaps where global dependency modeling is needed.
- LatentMoE (for larger SKUs) reduces costly all-to-all routing by operating in a lower-dimensional latent space, addressing a key bottleneck in distributed MoE training.
- Opening the full pipeline (data → pre/post-training → RL environments) enables true replication, faster method validation, and fairer comparisons—crucial for scientific progress and enterprise adoption.
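The LatentMoE argument above can be made concrete: if expert dispatch happens at a latent width d_latent instead of the full model width d_model, the all-to-all payload shrinks by their ratio. The dimensions below are hypothetical; the source states only the latent-subspace idea, not these widths:

```python
# Hedged sketch of why LatentMoE cuts all-to-all cost: routing and dispatch
# in a latent subspace (d_latent < d_model) shrink the bytes shuffled between
# devices proportionally. All dimensions here are hypothetical examples.
def all_to_all_bytes(tokens: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Payload size for dispatching `tokens` activations of width `dim`."""
    return tokens * dim * bytes_per_elem

d_model, d_latent = 4096, 1024          # hypothetical widths
baseline = all_to_all_bytes(8192, d_model)
latent = all_to_all_bytes(8192, d_latent)
print(f"latent dispatch is {baseline // latent}x smaller")
```

Expert FFN compute shrinks for the same reason: experts operating on latent-width activations multiply smaller matrices.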