Nvidia Unveils Nemotron-Nano-9B-v2: Small, Open AI Model with Reasoning Toggle

VentureBeat

Nvidia has entered the burgeoning field of small language models (SLMs) with the release of Nemotron-Nano-9B-v2, a compact yet powerful AI model designed to offer advanced reasoning capabilities while optimizing for deployment efficiency. This move follows a trend of increasingly smaller, more specialized AI models capable of running on less powerful hardware, such as those recently introduced by MIT spinoff Liquid AI and Google.

Nemotron-Nano-9B-v2 boasts nine billion parameters, a meaningful reduction from its initial 12-billion-parameter design. The pruning specifically targets deployment on a single Nvidia A10 GPU, a popular choice for enterprise applications. According to Oleksii Kuchiaev, Nvidia’s Director of AI Model Post-Training, it also allows for larger batch sizes and lets the model process information up to six times faster than similarly sized transformer models. For context, many leading large language models (LLMs) operate in the 70-billion-plus parameter range. Parameters are the internal settings that govern a model’s behavior; more of them generally means greater capability, but also higher computational demands. The push toward smaller, more efficient models like Nemotron-Nano-9B-v2 addresses growing concerns around power consumption, rising token costs, and inference delays that are reshaping the landscape of enterprise AI.

A significant architectural innovation underpinning Nemotron-Nano-9B-v2 is its hybrid design, which combines elements of the Transformer and Mamba architectures. Standard Transformer models mix information across tokens with self-attention, whose memory and compute costs grow quadratically as sequences lengthen. Nemotron-H models (the family to which Nano-9B-v2 belongs) replace most of those attention layers with selective state space models (SSMs) from the Mamba architecture, developed by researchers at Carnegie Mellon University and Princeton. SSMs excel at handling very long sequences by maintaining a running internal state, and they scale linearly with sequence length, processing long contexts without the overhead of full self-attention. The hybrid approach significantly reduces operational costs, achieving up to two to three times higher throughput on long contexts at comparable accuracy, a strategy other AI labs are also adopting.
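To make the division of labor concrete, here is a deliberately simplified sketch of a hybrid stack in PyTorch. It is illustrative only, not Nvidia's implementation: a toy gated recurrence stands in for Mamba's selective SSM, and self-attention appears in only a minority of layers. The class names, dimensions, and layer ratio are all assumptions made for the example.

```python
# Illustrative hybrid Mamba/Transformer-style stack (not Nvidia's code).
# Most layers use a linear-time state-space-style mixer; a few keep attention.
import torch
import torch.nn as nn

class ToySSMMixer(nn.Module):
    """Toy stand-in for a selective SSM: a learned gated recurrence.
    Cost grows linearly with sequence length, vs. quadratically for attention."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        u = self.in_proj(x)
        a = torch.sigmoid(self.gate(x))        # per-token decay in (0, 1)
        state = torch.zeros_like(u[:, 0])
        outputs = []
        for t in range(u.size(1)):             # single linear scan, O(seq_len)
            state = a[:, t] * state + (1 - a[:, t]) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))

class HybridBlock(nn.Module):
    def __init__(self, d_model: int, use_attention: bool):
        super().__init__()
        self.use_attention = use_attention
        self.norm = nn.LayerNorm(d_model)
        self.mixer = (
            nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
            if use_attention else ToySSMMixer(d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)
        else:
            h = self.mixer(h)
        return x + h                           # residual connection

# A mostly-SSM stack with occasional attention layers (ratio chosen arbitrarily).
layers = nn.ModuleList(
    HybridBlock(d_model=512, use_attention=(i % 6 == 5)) for i in range(12)
)
x = torch.randn(2, 1024, 512)                  # (batch, seq_len, d_model)
for layer in layers:
    x = layer(x)
print(x.shape)                                 # torch.Size([2, 1024, 512])
```

The design point is visible in the scan loop: doubling the sequence length doubles the SSM mixer's work, whereas each remaining attention layer's work quadruples, which is why keeping attention layers sparse pays off on long contexts.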

One of Nemotron-Nano-9B-v2’s standout features is its user-controllable AI “reasoning.” The model, positioned as a unified, text-only chat and reasoning system, defaults to generating an internal reasoning trace before producing a final answer. Users can toggle this behavior on or off using simple control tokens like /think or /no_think. Furthermore, developers can manage a “thinking budget” at runtime, capping the number of tokens the model dedicates to internal reasoning before completing a response. This mechanism is crucial for balancing accuracy with latency, particularly in time-sensitive applications such as customer support systems or autonomous agents.
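In practice, the toggle is just part of the prompt. The sketch below is a minimal, hedged example assuming the Hugging Face transformers chat API and the repo id shown; the exact prompt format and budget controls are defined by the model card, which should be treated as authoritative.

```python
# Hedged sketch: toggling reasoning via control tokens, assuming the chat
# template honors /think and /no_think in the system turn as described.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(question: str, reasoning: bool, max_new_tokens: int = 512) -> str:
    messages = [
        # The control token decides whether a reasoning trace is generated
        # before the final answer.
        {"role": "system", "content": "/think" if reasoning else "/no_think"},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # max_new_tokens is only a coarse overall cap; the finer-grained
    # "thinking budget" described in the article is a runtime control,
    # not a standard generate() argument.
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0, input_ids.shape[-1]:],
                            skip_special_tokens=True)

print(chat("What is 17 * 24?", reasoning=True))   # trace, then answer
print(chat("What is 17 * 24?", reasoning=False))  # direct answer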

Benchmark evaluations highlight Nemotron-Nano-9B-v2’s competitive accuracy against other open small-scale models. When tested in “reasoning on” mode using the NeMo-Skills suite, it achieved impressive scores: 72.1 percent on AIME25, 97.8 percent on MATH500, 64.0 percent on GPQA, and 71.1 percent on LiveCodeBench. Scores for instruction following and long-context benchmarks also demonstrate strong performance, with 90.3 percent on IFEval and 78.9 percent on the RULER 128K test. Across the board, Nano-9B-v2 shows higher accuracy than Qwen3-8B, a common point of comparison in its class. Nvidia illustrates these results with accuracy-versus-budget curves, demonstrating how performance scales with increased token allowance for reasoning, suggesting that careful budget control can optimize both quality and latency in real-world applications.
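A budget sweep of that kind can be approximated with a few lines on top of the chat() helper sketched earlier. The snippet below is a toy illustration with a stand-in eval set and a crude substring check, not a reproduction of Nvidia's benchmark harness.

```python
# Toy accuracy-versus-thinking-budget sweep, in the spirit of Nvidia's curves.
# Reuses the hypothetical chat() helper above; the two-item eval set is a
# stand-in, not a real benchmark.
eval_set = [("What is 17 * 24?", "408"), ("What is 512 / 8?", "64")]

for budget in (128, 512, 2048):                # token caps to compare
    correct = sum(
        reference in chat(question, reasoning=True, max_new_tokens=budget)
        for question, reference in eval_set    # substring match as the check
    )
    print(f"budget={budget:5d}  accuracy={correct / len(eval_set):.2f}")
```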

The model and its underlying Nemotron-H family were trained on a mix of curated, web-sourced, and synthetic datasets spanning general text, code, mathematics, science, and legal and financial documents, alongside alignment-style question-answering sets. Notably, Nvidia confirmed it used synthetic reasoning traces generated by other large models to bolster performance on complex benchmarks. The model is also designed for broad language support, handling English, German, Spanish, French, Italian, and Japanese, with extended coverage described for Korean, Portuguese, Russian, and Chinese, and it is positioned as suitable for both instruction following and code generation.

Nemotron-Nano-9B-v2 is immediately available on Hugging Face and through Nvidia’s model catalog, released under the Nvidia Open Model License Agreement. This permissive, enterprise-friendly license explicitly states that the models are commercially usable out of the box, allowing developers to freely create and distribute derivative models. Crucially, Nvidia does not claim ownership of any outputs generated by the model, placing responsibility and rights with the developer or organization using it. This means enterprises can integrate the model into production without negotiating separate commercial licenses or incurring fees tied to usage thresholds or revenue levels, unlike some tiered open licenses.

While highly permissive, the license stipulates several key conditions focused on responsible deployment. Users must not bypass built-in safety mechanisms without implementing comparable replacements, and any redistribution of the model or its derivatives must include the Nvidia Open Model License text and attribution. Compliance with trade regulations and restrictions is also mandatory, as is adherence to Nvidia’s Trustworthy AI guidelines for ethical use. Furthermore, a litigation clause automatically terminates the license if a user initiates copyright or patent litigation against another entity alleging infringement by the model. These conditions are geared toward ensuring legal and ethical use rather than imposing commercial restrictions, allowing enterprises to scale their products without royalty burdens, provided they respect safety, attribution, and compliance obligations.

With Nemotron-Nano-9B-v2, Nvidia is targeting developers who require a nuanced balance of reasoning capability and deployment efficiency at smaller scales. By combining hybrid architectures with advanced compression and training techniques, the company is providing tools that aim to maintain accuracy while significantly reducing costs and latency, underscoring its continued focus on efficient and controllable AI models.