Alibaba's GSPO: Stable RL for LLMs, Powering Qwen3 Models

Marktechpost

Reinforcement learning (RL) has emerged as a crucial technology for scaling large language models (LLMs), enabling them to tackle highly complex challenges such as competition-level mathematics and intricate programming tasks through deeper reasoning. However, a significant hurdle persists: achieving stable and reliable training dynamics when scaling RL with ever-larger computational resources. Current state-of-the-art algorithms, notably GRPO, frequently encounter severe stability issues during the training of colossal language models, often leading to catastrophic failures. These instabilities stem from the improper application of importance sampling weights, which introduce high-variance noise. This noise intensifies with longer model responses and is exacerbated by clipping mechanisms, ultimately causing model collapse and impeding progress.

Existing methods like PPO and GRPO attempt to address the challenges of off-policy learning—where models learn from data generated by outdated policies—through mechanisms such as clipping. Yet, these approaches are limited by their ill-posed objectives, particularly when applied to massive models handling long-response tasks. GRPO’s reliance on token-level importance sampling, for instance, generates high-variance noise that can trigger irreversible model collapse. Attempts to recover from such collapses, whether through meticulous hyperparameter tuning or checkpoint restoration, often prove futile, underscoring a fundamental flaw in their design. The inherent mismatch between token-level corrections and sequence-level rewards highlights a pressing need for a new approach that optimizes directly at the sequence level to ensure both stability and scalability.
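To make the mismatch concrete: in the usual notation (restated here from the standard GRPO formulation; see the original papers for the full objectives), GRPO weights every token of a sampled response $y_i$ to a query $x$ with its own token-level importance ratio

\[
w_{i,t}(\theta) \;=\; \frac{\pi_\theta\!\left(y_{i,t} \mid x,\, y_{i,<t}\right)}{\pi_{\theta_{\mathrm{old}}}\!\left(y_{i,t} \mid x,\, y_{i,<t}\right)},
\qquad t = 1, \dots, |y_i|,
\]

even though the reward $r(x, y_i)$ is assigned only to the complete response. Each per-token ratio carries its own sampling noise, and that noise accumulates over the $|y_i|$ tokens of a long response, which is precisely where the variance and eventual collapse come from.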

In response to these challenges, researchers at Alibaba Inc. have introduced Group Sequence Policy Optimization (GSPO), an innovative RL algorithm specifically engineered for training LLMs. GSPO’s primary breakthrough lies in its theoretically grounded importance ratio, which is derived from the likelihood of entire sequences, aligning more closely with the principles of importance sampling. Furthermore, it calculates normalized rewards as advantages across multiple responses to a single query, fostering consistency between sequence-level rewards and the overall optimization objectives. Empirical evaluations have consistently demonstrated that GSPO significantly surpasses GRPO in terms of stability, efficiency, and overall performance. By effectively resolving the stability issues frequently encountered when training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex, often cumbersome, stabilization techniques.
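Concretely, GSPO replaces the per-token ratios with a single length-normalized importance ratio per response and normalizes rewards within the group of $G$ responses sampled for the same query. A restatement of the formulation described in the paper (notation ours; consult the paper for the exact definitions) is

\[
s_i(\theta) \;=\; \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
\hat{A}_i \;=\; \frac{r(x, y_i) - \operatorname{mean}\!\left(\{ r(x, y_j) \}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{ r(x, y_j) \}_{j=1}^{G}\right)},
\]

\[
\mathcal{J}_{\mathrm{GSPO}}(\theta) \;=\; \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( s_i(\theta)\, \hat{A}_i,\; \operatorname{clip}\!\big(s_i(\theta),\, 1 - \varepsilon_{\mathrm{low}},\, 1 + \varepsilon_{\mathrm{high}}\big)\, \hat{A}_i \Big) \right],
\]

so both the importance weight and the clipping decision are made once per sequence, in the same units as the sequence-level reward.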

The researchers conducted their experiments using a cold-start model fine-tuned from Qwen3-30B-A3B-Base, tracking training reward curves and model performance on demanding benchmarks such as AIME'24, LiveCodeBench, and CodeForces. During training, the rollout data in each batch was split into four mini-batches for gradient updates. A critical distinction of GSPO is its approach to clipping: it clips entire responses rather than individual tokens, with the left and right clipping ranges set to 3e-4 and 4e-4 in its formulation. As a result, GSPO clips roughly two orders of magnitude more tokens than GRPO. Remarkably, despite removing a far larger proportion of tokens from the gradient estimate, GSPO achieves superior training efficiency, underscoring just how inefficient GRPO's noisy token-level estimates are.
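For readers who prefer code, the following PyTorch-style sketch shows what sequence-level clipping looks like in practice. It is an illustration under the formulation above, not the authors' implementation; the tensor names and shapes are hypothetical, and only the clipping ranges (3e-4 and 4e-4) come from the article.

```python
import torch

def gspo_sequence_ratios(logp_new, logp_old, mask):
    """Length-normalized sequence-level importance ratios.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the current
    and old policies; mask: (batch, seq_len) with 1 on response tokens.
    Illustrative sketch only -- names and shapes are not from the paper's code.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Mean per-token log-ratio == log of the |y|-th root of the sequence ratio.
    mean_log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    return torch.exp(mean_log_ratio)


def gspo_loss(logp_new, logp_old, mask, advantages, eps_low=3e-4, eps_high=4e-4):
    """Negated clipped objective with one ratio, and one clipping decision, per response."""
    s = gspo_sequence_ratios(logp_new, logp_old, mask)
    s_clipped = torch.clamp(s, 1.0 - eps_low, 1.0 + eps_high)
    # Clipping acts on the whole response: one out-of-range ratio removes
    # every token of that response from the gradient estimate.
    surrogate = torch.minimum(s * advantages, s_clipped * advantages)
    return -surrogate.mean()
```

Because clipping acts on whole responses, a single out-of-range ratio discards all of that response's tokens at once, which is consistent with the much larger clipped-token fraction reported relative to GRPO, while the surviving gradient signal is far less noisy.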

GSPO offers substantial advantages, particularly for MoE model training, by stabilizing the process through consistent expert activations across gradient updates—a stark contrast to GRPO, which often grapples with expert-activation volatility. This innovation negates the necessity for intricate solutions like Routing Replay, simplifying the underlying infrastructure and enabling models to fully utilize their inherent capacity. Within the broader RL infrastructure, GSPO’s sequence-level optimization significantly reduces its dependency on precise token-level likelihoods, rendering it more robust to potential precision mismatches. This robustness allows for the direct use of inference engine likelihoods, bypassing costly recomputation and considerably enhancing efficiency in scenarios involving partial rollouts and multi-turn reinforcement learning. Ultimately, GSPO streamlines the entire RL infrastructure for large-scale language model training.

In conclusion, Group Sequence Policy Optimization (GSPO) represents a pivotal advancement in reinforcement learning for training LLMs. By building upon core principles of importance sampling and introducing novel sequence-level clipping, rewarding, and optimization strategies, GSPO effectively overcomes the instability and inefficiency that have plagued prior algorithms like GRPO. Its demonstrated superior performance in training stability, efficiency, and scalability, especially for complex MoE models, firmly establishes it as a robust algorithmic foundation. The breakthroughs facilitated by GSPO have played a crucial role in the remarkable performance capabilities of the Qwen3 models, and researchers anticipate that building on GSPO as a foundational approach will pave the way for groundbreaking progress in artificial intelligence.
