TL;DR: NVIDIA Nemotron 3 — fully open hybrid Mamba–Transformer MoE models (30B→500B)
Major Highlights:
- Truly open release (weights, data, recipes, RL stack): NVIDIA’s Nemotron 3 Nano (30B) lands as one of the most complete open model drops to date: model weights, pre- and post-training code, recipes, and all redistributable datasets are released, plus a full agent RL suite (NeMo Gym/NeMo-RL). Licensed under the NVIDIA Open Model License with commercial use allowed.
- Hybrid Mamba–Transformer + MoE with 1M context: The Nano model interleaves Mamba-2 state-space layers, sparse MoE, and selective self-attention to deliver a 1,000,000-token context window at high throughput. This sets a new open baseline for long-context, efficient inference.
- Competitive small-model performance: Nemotron 3 Nano posts best-in-class small-model results on SWE-Bench and scores 52 on the Artificial Analysis Intelligence Index (+6 vs Qwen3-30B A3B), while sustaining ~380 tokens/sec on DeepInfra, a strong quality/speed trade-off for a 30B-class model.
- Bigger models coming with LatentMoE + NVFP4: Super (~100–120B) and Ultra (~400–500B) models are “coming soon,” featuring NVFP4 pretraining and LatentMoE routing in a lower-dimensional latent subspace to cut all‑to‑all communication and expert compute.
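The hybrid layout described above can be pictured as a per-layer schedule that interleaves Mamba-2 sequence mixing, sparse-MoE FFNs, and occasional self-attention. The ratio and ordering in this sketch are purely illustrative assumptions; the source only states that the three layer types are interleaved:

```python
# Illustrative sketch (NOT NVIDIA's published config): a hybrid stack that
# interleaves Mamba-2 SSM layers, sparse-MoE FFN layers, and occasional
# self-attention layers. The interleaving ratio below is hypothetical.
def build_layer_schedule(n_layers: int, attn_every: int = 6) -> list[str]:
    """Return a per-layer type list: mostly Mamba-2 + MoE, sparse attention."""
    schedule = []
    for i in range(n_layers):
        if (i + 1) % attn_every == 0:
            schedule.append("attention")  # selective global-dependency modeling
        elif i % 2 == 0:
            schedule.append("mamba2")     # linear-time, memory-efficient mixing
        else:
            schedule.append("moe_ffn")    # sparse expert FFN (few experts active)
    return schedule

print(build_layer_schedule(12))
```

The design intuition is that a small number of attention layers suffices for global dependencies, while the linear-time Mamba-2 layers carry most of the long-context work.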
Key Technical Details:
- Model lineup: Nano 30B total parameters (~3.6B active per token via MoE). Super and Ultra planned at ~100–120B and ~400–500B.
- Architecture: Hybrid stack with interleaved Mamba-2 (SSM) and MoE layers plus selective self-attention; 1M-token context support.
- Training/inference:
  - Nano released today; Super/Ultra to follow.
  - Super/Ultra will use NVFP4 pretraining; Nano ships the hybrid MoE/Mamba stack today (LatentMoE documented for the larger SKUs).
- Performance:
  - ~380 tok/s (DeepInfra); strong SWE-Bench; AAII: 52 (+6 vs Qwen3-30B A3B).
- Open assets: Weights; pre/post-training recipes; redistributable datasets (e.g., Nemotron‑Math, Nemotron‑Math‑Proofs, agentic data); NeMo Gym for multi-environment RL.
- Ecosystem (day‑0): vLLM, SGLang, llama.cpp, GGUF (Unsloth), Baseten, Together, Hugging Face collections.
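The "30B total / ~3.6B active per token" figures imply a rough compute picture. Only the two totals come from the source; the FLOPs line uses the standard ~2 FLOPs-per-active-parameter rule of thumb, so treat it as a back-of-envelope estimate:

```python
# Back-of-envelope: sparse MoE activates only a fraction of Nemotron 3 Nano's
# weights per token. The two parameter counts come from the summary; the
# derived numbers are rough approximations, not published figures.
total_params = 30e9      # total parameters (from the summary)
active_params = 3.6e9    # active parameters per token (from the summary)

active_fraction = active_params / total_params
# Common dense-compute rule of thumb: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * active_params

print(f"active fraction: {active_fraction:.0%}")     # ~12% of weights per token
print(f"approx FLOPs/token: {flops_per_token:.1e}")
```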
Community Response/Impact:
- Researchers praise the release for reproducibility and transparency (open data + recipes) and for elevating agent-focused R&D via NeMo Gym.
- Practitioners highlight immediate deployability due to wide inference stack support and strong throughput.
- Comparisons note that while Nemotron often lags SOTA on headline leaderboards, Nano’s openness and long-context efficiency make it a new reference checkpoint for training and agent workflows.
- Broader context: NVIDIA deepens its end-to-end AI stack (alongside moves like the SLURM acquisition), prompting debate on ecosystem dependency and portability.
First Principles Analysis:
- Why this works: Mamba (state-space models) provides linear-time, memory-efficient long-context handling; sparse MoE activates a small subset of parameters per token for high capacity at low compute. Selective self-attention fills gaps where global dependency modeling is needed.
- LatentMoE (for larger SKUs) reduces costly all-to-all routing by operating in a lower-dimensional latent space, addressing a key bottleneck in distributed MoE training.
- Opening the full pipeline (data → pre/post-training → RL environments) enables true replication, faster method validation, and fairer comparisons—crucial for scientific progress and enterprise adoption.
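The LatentMoE argument above can be made concrete: if expert dispatch happens at a latent width d_latent instead of the full model width d_model, the all-to-all payload shrinks by their ratio. The dimensions below are hypothetical; the source states only the latent-subspace idea, not these widths:

```python
# Hedged sketch of why LatentMoE cuts all-to-all cost: routing and dispatch
# in a latent subspace (d_latent < d_model) shrink the bytes shuffled between
# devices proportionally. All dimensions here are hypothetical examples.
def all_to_all_bytes(tokens: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Payload size for dispatching `tokens` activations of width `dim`."""
    return tokens * dim * bytes_per_elem

d_model, d_latent = 4096, 1024          # hypothetical widths
baseline = all_to_all_bytes(8192, d_model)
latent = all_to_all_bytes(8192, d_latent)
print(f"latent dispatch is {baseline // latent}x smaller")
```

Expert FFN compute shrinks for the same reason: experts operating on latent-width activations multiply smaller matrices.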