NVIDIA ProRLv2: Extended RL Boosts LLM Reasoning & Performance

Marktechpost

NVIDIA’s latest innovation, ProRLv2 (Prolonged Reinforcement Learning v2), represents a significant stride in enhancing the reasoning capabilities of large language models (LLMs). The new approach challenges conventional wisdom by demonstrating that substantially extending reinforcement learning (RL) training, from 2,000 to 3,000 steps, lets LLMs unlock novel solution spaces, foster greater creativity, and reach levels of reasoning previously considered unattainable. Remarkably, these advances hold even for compact models, such as the 1.5-billion-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.

To achieve these breakthroughs, ProRLv2 integrates several key innovations designed to mitigate the instabilities and limitations often encountered when applying RL to LLM training. A core component is the REINFORCE++-Baseline, a robust RL algorithm engineered for long-horizon optimization, enabling stable learning over thousands of steps. Further stability and exploration are ensured through a combination of KL Divergence Regularization and a Reference Policy Reset mechanism, which periodically refreshes the reference model with the current best-performing checkpoint, preventing the RL objective from prematurely dominating the training process and allowing continued, stable progress. Diversity in generated solutions is actively encouraged by Decoupled Clipping and Dynamic Sampling (DAPO): decoupled clipping boosts the likelihood of less common tokens, while dynamic sampling directs learning signals toward prompts of intermediate difficulty. Additionally, a cyclically applied Scheduled Length Penalty helps maintain diversity and prevents the model from converging too narrowly as training lengthens. The most direct innovation, however, is the act of scaling the RL training horizon itself, explicitly testing how far extended RL can push the boundaries of reasoning.
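To make these mechanisms concrete, the Python sketch below shows how such ingredients typically combine in a prolonged-RL training loop: an asymmetrically clipped policy-gradient loss with a KL penalty toward a frozen reference policy, a periodic reference reset, a dynamic-sampling filter, and a cyclic length penalty. This is a minimal illustration rather than NVIDIA's released implementation; the function names, clipping bounds, KL coefficient, reset interval, and penalty schedule are all assumptions made for the example.

```python
import copy
import torch

def policy_loss(logp_new, logp_old, logp_ref, advantages,
                eps_low=0.2, eps_high=0.28, kl_coef=0.001):
    """Per-token policy loss with decoupled (asymmetric) clipping and a KL
    penalty toward a frozen reference policy. Hyperparameters are illustrative."""
    ratio = torch.exp(logp_new - logp_old)
    # Decoupled clipping: the upper bound (1 + eps_high) is looser than the
    # lower bound, so low-probability tokens can gain mass (more exploration).
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # KL regularization toward the reference model keeps updates stable over
    # thousands of RL steps (a common low-variance KL estimator is used here).
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0).mean()
    return pg_loss + kl_coef * kl

def keep_prompt(group_rewards):
    """Dynamic sampling: discard prompts the policy already always solves or
    always fails; only intermediate-difficulty prompts carry learning signal."""
    mean_reward = sum(group_rewards) / len(group_rewards)
    return 0.0 < mean_reward < 1.0

def scheduled_length_penalty(step, num_tokens, max_tokens=8192,
                             period=200, coef=0.1):
    """Cyclic length penalty (hypothetical schedule): active during the first
    half of each period to discourage runaway generation length."""
    if step % period < period // 2:
        return coef * num_tokens / max_tokens
    return 0.0

def maybe_reset_reference(step, reference_model, best_checkpoint, interval=500):
    """Reference policy reset: periodically replace the frozen reference with
    the current best checkpoint so the KL anchor does not stall progress."""
    if step > 0 and step % interval == 0:
        return copy.deepcopy(best_checkpoint)
    return reference_model

if __name__ == "__main__":
    logp = torch.randn(16)  # dummy per-token log-probabilities
    loss = policy_loss(logp, logp - 0.05, logp - 0.02, torch.randn(16))
    print(f"toy loss: {loss.item():.4f}")
    print(keep_prompt([1, 0, 0, 1]), keep_prompt([1, 1, 1, 1]))
    print(scheduled_length_penalty(step=50, num_tokens=4096))
```

The key design idea this sketch tries to capture is that each component removes a different failure mode of long RL runs: clipping and the KL term prevent destructive updates, the reference reset keeps the KL anchor from freezing progress, and dynamic sampling plus the length penalty keep gradients informative as training stretches into thousands of steps.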

The practical impact of ProRLv2 is vividly illustrated by the performance of Nemotron-Research-Reasoning-Qwen-1.5B-v2, a model trained with ProRLv2 for the full 3,000 RL steps. This compact model sets a new benchmark for open-weight 1.5-billion-parameter models across a diverse array of reasoning tasks, including complex mathematics, coding challenges, scientific problems, and logic puzzles. Its performance not only surpasses previous iterations but also outcompetes rival models in its class. A critical observation is the sustained improvement seen with increased RL steps; longer training consistently leads to gains, particularly on tasks where base models initially struggled, indicating a genuine expansion of reasoning boundaries. Furthermore, ProRLv2 significantly enhances generalization, not merely boosting direct accuracy (pass@1) but also enabling the model to devise novel reasoning approaches and solution strategies for tasks it had not encountered during its training. Benchmark gains are substantial, including average pass@1 accuracy improvements of 14.7% in math, 13.9% in coding, a remarkable 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements noted on previously unseen and more challenging benchmarks in its v2 iteration.
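As a side note on the metric itself, pass@1 is the probability that a single sampled answer is correct, typically estimated from many samples per problem. The snippet below is a generic implementation of the standard unbiased pass@k estimator commonly used for such benchmarks; it is an illustrative utility, not code from the ProRLv2 evaluation pipeline.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct.
    pass@k = 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 6 of them correct -> pass@1 = 0.375
print(pass_at_k(n=16, c=6, k=1))
```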

The overarching finding from ProRLv2 is profound: continued reinforcement learning, when meticulously applied with careful exploration and regularization techniques, reliably expands the learning and generalization capacity of large language models. Rather than hitting an early performance plateau or overfitting, prolonged RL training empowers even smaller models to achieve reasoning prowess comparable to much larger counterparts. This suggests that scaling the RL process itself is as critical to advancing AI capabilities as increasing model size or dataset volume. ProRLv2 fundamentally redefines the perceived limits of reasoning in language models, underscoring that the future of AI development may lie not merely in the sheer scale of models, but in the depth and duration to which their learning can be extended through sophisticated reinforcement learning.