Dec 15: NVIDIA Nemotron 3 — hybrid Mamba-Transformer, fully open-source models from 30B to 500B

news.smol.ai • 2 months ago

TL;DR: NVIDIA Nemotron 3 — fully open hybrid Mamba–Transformer MoE models (30B→500B)

Major Highlights:

  • Truly open release (weights, data, recipes, RL stack): NVIDIA’s Nemotron 3 Nano (30B) lands as one of the most complete open model drops to date: model weights, pre- and post-training code, recipes, and all redistributable datasets are released, plus a full agent RL suite (NeMo Gym/NeMo-RL). Licensed under the NVIDIA Open Model License with commercial use allowed.
  • Hybrid Mamba–Transformer + MoE with 1M context: The Nano model interleaves Mamba-2 state-space layers, sparse MoE, and selective self-attention to deliver 1,000,000-token context windows at high throughput. This sets a new open baseline for long-context, efficient inference.
  • Competitive small-model performance: Nemotron 3 Nano posts best-in-class small-model results on SWE-Bench and scores 52 on the Artificial Analysis Intelligence Index (+6 vs Qwen3-30B A3B), while sustaining ~380 tokens/sec on DeepInfra, a strong quality/speed trade-off for a 30B-class model.
  • Bigger models coming with LatentMoE + NVFP4: Super (~100–120B) and Ultra (~400–500B) models are “coming soon,” featuring NVFP4 pretraining and LatentMoE routing in a lower-dimensional latent subspace to cut all‑to‑all communication and expert compute.
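The interleaving described above (mostly Mamba-2 state-space mixing, with occasional self-attention, each block paired with a sparse-MoE feed-forward) can be sketched as a layer schedule. The ratio and block naming here are illustrative assumptions, not Nemotron 3's published configuration, which lives in NVIDIA's released training code.

```python
# Illustrative sketch of an interleaved hybrid layer stack.
# The attention-every-6-layers ratio is an assumption for illustration;
# the real Nemotron 3 layer schedule is defined in NVIDIA's open recipes.

def hybrid_layer_pattern(n_layers: int, attn_every: int = 6) -> list[str]:
    """Return a layer-type schedule: Mamba-2 mixing by default,
    self-attention every `attn_every`-th layer, each mixer followed
    by a sparse-MoE feed-forward block."""
    pattern = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attn_every == 0 else "mamba2"
        pattern.append(f"{mixer}+moe")
    return pattern

print(hybrid_layer_pattern(12))
```

The design intuition: linear-time Mamba-2 layers carry most of the sequence mixing cheaply at 1M-token context, while the sparse attention layers patch in global dependency modeling where the SSM falls short.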

Key Technical Details:

  • Model lineup: Nano 30B total parameters (~3.6B active per token via MoE). Super and Ultra planned at ~100–120B and ~400–500B.
  • Architecture: Hybrid stack with interleaved Mamba-2 (SSM) and MoE layers plus selective self-attention; 1M-token context support.
  • Training/inference:
    • Nano released today; Super/Ultra to follow.
    • Larger models use NVFP4; Nano uses the hybrid MoE/Mamba stack now (LatentMoE documented for larger SKUs).
  • Performance:
    • ~380 tok/s (DeepInfra), strong SWE-Bench; AAII: 52 (+6 vs Qwen3-30B A3B).
  • Open assets: Weights; pre/post-training recipes; redistributable datasets (e.g., Nemotron‑Math, Nemotron‑Math‑Proofs, agentic data); NeMo Gym for multi-environment RL.
  • Ecosystem (day‑0): vLLM, SGLang, llama.cpp, GGUF (Unsloth), Baseten, Together, Hugging Face collections.
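The "30B total, ~3.6B active per token" figure in the lineup above is a direct consequence of top-k expert routing: only the shared (dense) parameters plus k experts run per token. The expert count, expert size, and k below are hypothetical numbers chosen purely to show the arithmetic; only the 30B/~3.6B totals come from the announcement.

```python
# Sketch of how sparse top-k MoE routing shrinks per-token compute.
# shared_b / n_experts / expert_b / top_k are hypothetical; only the
# total (30B) and active (~3.6B) figures come from the Nemotron 3 post.

def active_params(shared_b: float, n_experts: int,
                  expert_b: float, top_k: int) -> tuple[float, float]:
    """Return (total, active) parameter counts in billions.
    Total counts every expert; active counts only the top_k routed ones."""
    total = shared_b + n_experts * expert_b
    active = shared_b + top_k * expert_b
    return total, active

total, active = active_params(shared_b=2.4, n_experts=69,
                              expert_b=0.4, top_k=3)
print(f"total ~ {total:.1f}B params, active ~ {active:.1f}B per token")
```

This is why a 30B-class MoE can sustain dense-3B-class throughput (~380 tok/s on DeepInfra) while retaining far more capacity than its per-token FLOPs suggest.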

Community Response/Impact:

  • Researchers praise the release for reproducibility and transparency (open data + recipes) and for elevating agent-focused R&D via NeMo Gym.
  • Practitioners highlight immediate deployability due to wide inference stack support and strong throughput.
  • Comparisons note that while Nemotron often lags SOTA on headline leaderboards, Nano’s openness and long-context efficiency make it a new reference checkpoint for training and agent workflows.
  • Broader context: NVIDIA deepens its end-to-end AI stack (alongside moves like the SLURM acquisition), prompting debate on ecosystem dependency and portability.

First Principles Analysis:

  • Why this works: Mamba (state-space models) provides linear-time, memory-efficient long-context handling; sparse MoE activates a small subset of parameters per token for high capacity at low compute. Selective self-attention fills gaps where global dependency modeling is needed.
  • LatentMoE (for larger SKUs) reduces costly all-to-all routing by operating in a lower-dimensional latent space, addressing a key bottleneck in distributed MoE training.
  • Opening the full pipeline (data → pre/post-training → RL environments) enables true replication, faster method validation, and fairer comparisons—crucial for scientific progress and enterprise adoption.
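The LatentMoE point above can be made concrete with a back-of-envelope traffic model: in distributed MoE, every token is dispatched to its experts via an all-to-all exchange whose volume scales with the token's width, so projecting tokens into a narrower latent space before dispatch cuts that volume proportionally. The dimensions below are assumptions for illustration; NVIDIA has not published Nemotron's actual widths.

```python
# Sketch of why LatentMoE-style routing reduces all-to-all traffic:
# dispatching tokens in a lower-dimensional latent space shrinks the
# bytes exchanged between devices. All widths here are hypothetical.

def all_to_all_bytes(tokens: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of activations dispatched to experts (one direction)."""
    return tokens * dim * bytes_per_elem

hidden, latent = 4096, 1024            # assumed model width vs. latent width
baseline = all_to_all_bytes(8192, hidden)
latent_moe = all_to_all_bytes(8192, latent)
print(f"dispatch traffic reduced {baseline / latent_moe:.0f}x")
```

The same projection also shrinks each expert's input width, which is why the summary credits LatentMoE with cutting both all-to-all communication and expert compute.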