NVIDIA Nemotron Nano 2: 6x Faster, 128K Context LLMs Released
NVIDIA has introduced the Nemotron Nano 2 family, a new suite of large language models (LLMs) engineered to deliver both cutting-edge reasoning accuracy and remarkable speed. These models, built on a novel hybrid Mamba-Transformer architecture, promise up to six times faster inference throughput compared to similarly sized counterparts. A defining characteristic of this release is NVIDIA’s commitment to unprecedented transparency, openly providing most of the training corpus, recipes, and model checkpoints to the broader AI community. Crucially, these models are designed to handle massive 128,000-token context lengths on a single midrange GPU, such as an NVIDIA A10G, significantly lowering the barriers for advanced long-context reasoning and practical real-world deployment.
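For readers who want to try the release, the sketch below shows one plausible way to load a Nemotron Nano 2 checkpoint with Hugging Face transformers and run a short generation. The repository id, dtype, and generation settings are assumptions for illustration, not details taken from the announcement.

```python
# A minimal sketch, assuming the 9B checkpoint is published on Hugging Face under
# an id like the one below (assumed; check the actual model card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps 9B parameters at roughly 18 GB of weights
    device_map="auto",           # place layers on the available GPU(s)
    trust_remote_code=True,      # older transformers versions may need custom model code
)

prompt = "Explain why state space models scale well to long sequences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```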
The Nemotron Nano 2 models post strong performance numbers. They generate tokens up to 6.3 times faster than similarly sized models such as Qwen3-8B in reasoning-intensive scenarios, without compromising accuracy. Across benchmarks they match or exceed competitive open models on complex reasoning, coding, and multilingual tasks, with particular strength in mathematical problem-solving, code generation, tool use, and long-context understanding. Handling a 128K context length on a single GPU, previously impractical on midrange hardware, follows directly from their efficient pruning and hybrid architectural design.
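Throughput claims like the 6.3x figure are easiest to sanity-check locally. The following sketch measures generated tokens per second for a single prompt, assuming a model and tokenizer loaded as in the previous example and a CUDA GPU; absolute numbers depend heavily on hardware, batch size, and serving stack.

```python
# A rough tokens-per-second probe for reasoning-style generation (long outputs).
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=1024):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure prior GPU work is not counted
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for generation to actually finish
    elapsed = time.perf_counter() - start
    generated = out.shape[-1] - inputs["input_ids"].shape[-1]
    return generated / elapsed
```

Comparing this number against a baseline model measured under identical settings is the fairest way to reproduce relative-throughput claims such as the 6.3x figure.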
At the heart of Nemotron Nano 2 lies its hybrid Mamba-Transformer backbone, which draws on the larger Nemotron-H architecture. The design replaces most traditional self-attention layers with highly efficient Mamba-2 layers; only about eight percent of the total layers retain self-attention. The 9-billion-parameter model uses 56 layers, a hidden size of 4480, and grouped-query attention, with Mamba-2 state space layers providing both scalability and robust long-sequence retention. These Mamba-2 layers, known for high-throughput sequence processing, are interleaved with sparse self-attention layers that preserve long-range dependencies, alongside large feed-forward networks. The structure is particularly advantageous for reasoning tasks that demand “thinking traces” (long generated outputs conditioned on extensive in-context inputs), where traditional transformer architectures often hit throughput or memory bottlenecks.
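To make the layer mix concrete, the short sketch below builds a 56-block schedule in which roughly eight percent of the blocks use self-attention and the rest are Mamba-2 blocks. The even spacing of the attention blocks is an assumption for illustration; the published model may place them differently.

```python
# Illustrative hybrid layer schedule: 56 blocks, ~8% self-attention, rest Mamba-2.
# Evenly spaced attention positions are an assumption, not the published layout.
NUM_LAYERS = 56
ATTENTION_FRACTION = 0.08

num_attention = round(NUM_LAYERS * ATTENTION_FRACTION)  # -> 4 attention blocks
attention_positions = {
    round((i + 1) * NUM_LAYERS / (num_attention + 1)) for i in range(num_attention)
}

layer_schedule = [
    "self_attention" if i in attention_positions else "mamba2"
    for i in range(NUM_LAYERS)
]
print(layer_schedule.count("mamba2"), "Mamba-2 blocks,",
      layer_schedule.count("self_attention"), "self-attention blocks")
```

With 56 layers this works out to 4 attention blocks and 52 Mamba-2 blocks, which is why the key-value cache stays small even at very long context lengths.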
NVIDIA’s training methodology for the Nemotron Nano 2 models is as noteworthy as the architecture. The models are trained on a meticulously curated corpus of 20 trillion tokens and distilled from a larger 12-billion-parameter teacher model. This pretraining data spans diverse domains, including web content, mathematics, code, multilingual text, academic papers, and STEM subjects. NVIDIA’s commitment to data transparency is further demonstrated by the release of major datasets under permissive licenses on Hugging Face. These include Nemotron-CC-v2, a multilingual web crawl with synthetic Q&A rephrasing; Nemotron-CC-Math, comprising 133 billion tokens of standardized LaTeX math content; Nemotron-Pretraining-Code, a quality-filtered GitHub source code collection; and Nemotron-Pretraining-SFT, synthetic instruction-following datasets across various domains. Additionally, over 80 billion tokens of post-training data, covering supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), tool-calling, and multilingual datasets, have been open-sourced for direct reproducibility.
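Because the datasets are released on Hugging Face, they can be inspected without downloading them in full. The sketch below streams a few records with the datasets library; the repository id and record schema are assumptions, so the relevant dataset card should be consulted for the actual configuration.

```python
# Stream a few records from one of the released corpora without a full download.
# The repo id below is an assumption; see the dataset card for the real id/config.
from datasets import load_dataset

ds = load_dataset(
    "nvidia/Nemotron-CC-Math",  # assumed id for the 133B-token LaTeX math corpus
    split="train",
    streaming=True,             # iterate lazily instead of materializing the corpus
)

for i, example in enumerate(ds):
    print(example)              # field names depend on the dataset card
    if i >= 2:
        break
```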
The efficiency and cost-effectiveness of Nemotron Nano 2 result from NVIDIA’s model compression process, built on the “Minitron” and Mamba pruning frameworks. The 12-billion-parameter teacher is compressed to 9 billion parameters by carefully pruning layers, feed-forward network dimensions, and embedding width, with knowledge distillation from the teacher used to recover accuracy. This is complemented by multi-stage SFT and reinforcement learning techniques, including tool-calling optimization, instruction-following, and “thinking budget” control for managing reasoning-token budgets during inference. Through memory-targeted neural architecture search, the pruned models are engineered so that both the weights and the key-value cache fit, and remain performant, within the memory of an A10G GPU even at a 128K context length. This holistic approach yields inference speeds up to six times faster than open competitors in scenarios with large input/output token counts, while maintaining uncompromised task accuracy.
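A back-of-the-envelope calculation shows why keeping only a handful of self-attention layers is what lets a 128K-token key-value cache sit next to the weights on a 24 GB A10G. The grouped-query head count and head dimension below are assumed values for illustration; only the layer counts follow the description above.

```python
# KV-cache sizing under assumed grouped-query attention shapes (8 KV heads,
# head_dim 128, bf16). Only the layer counts reflect the article's description.
def kv_cache_gib(num_attn_layers, context_len=128_000, num_kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    # 2x for keys and values, stored per attention layer and per token.
    total = 2 * num_attn_layers * context_len * num_kv_heads * head_dim * bytes_per_value
    return total / 2**30

print(f"hybrid, 4 attention layers: {kv_cache_gib(4):.1f} GiB")
print(f"full attention, 56 layers:  {kv_cache_gib(56):.1f} GiB")
```

Under these assumed shapes, a fully attention-based 56-layer stack would need roughly 27 GiB of KV cache at 128K tokens, more than the A10G offers before counting any weights, while the hybrid design keeps the attention cache near 2 GiB.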
In summary, NVIDIA’s Nemotron Nano 2 release marks a significant milestone in open LLM research. It redefines the capabilities achievable on a single, cost-effective GPU in terms of both speed and context capacity, simultaneously setting a new standard for data transparency and reproducibility. Its innovative hybrid architecture, superior throughput, and high-quality open datasets are poised to significantly accelerate innovation across the entire AI ecosystem.