TRL Introduces Advanced VLM Alignment Methods: GRPO, GSPO, MPO

Hugging Face

Vision Language Models (VLMs), designed to interpret and interact with both images and text, are advancing rapidly in capability. Yet aligning these powerful models with nuanced human preferences remains a critical step for effective deployment. The TRL (Transformer Reinforcement Learning) library has previously demonstrated success in post-training VLMs with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), and its most recent additions push the boundaries further.

Traditionally, VLM alignment involved an initial SFT phase to teach models to follow instructions, followed by DPO to refine their responses based on preferred data. DPO operates by optimizing a contrastive loss between pairs of model outputs – a “chosen” and a “rejected” answer – to guide the model toward desired behaviors. However, this pairwise approach has limitations, prompting the emergence of more sophisticated multimodal alignment methods like Mixed Preference Optimization (MPO), Group Relative Policy Optimization (GRPO), and its variant, Group Sequence Policy Optimization (GSPO). These innovative techniques extract richer signals from preference data and scale more effectively with modern, complex VLMs.
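
For reference, the pairwise objective that DPO optimizes can be written as follows, where y_w is the chosen answer, y_l the rejected one, π_ref a frozen reference model, and β controls how strongly the policy is pulled toward the preference data:

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$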

Mixed Preference Optimization (MPO) directly addresses shortcomings found in models aligned solely with SFT or DPO. While SFT-aligned models can struggle with distribution shifts in reasoning tasks, DPO-aligned models sometimes produce repetitive responses or lack coherent rationales. MPO resolves this by extending DPO with a combined loss function. This function integrates the standard DPO preference loss, a quality loss from Binary Classifier Optimization (BCO), and a generation loss from SFT. This tripartite approach has shown significant improvements, with one paper reporting a 6.2-point gain on the challenging MathVista benchmark simply by switching to this combined loss. Integrating MPO into TRL’s DPOTrainer class is streamlined, requiring only a few lines of configuration to activate the combined loss types and their corresponding weights.
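
A minimal sketch of what that configuration can look like with DPOTrainer, assuming a recent TRL release that accepts a list of loss types with matching weights; the model, processor, dataset, output path, and weight values below are placeholders, not prescribed settings:

```python
from trl import DPOConfig, DPOTrainer

# Combine the DPO preference loss ("sigmoid"), the BCO quality loss ("bco_pair"),
# and the SFT generation loss ("sft") into a single weighted objective.
training_args = DPOConfig(
    output_dir="qwen2.5-vl-mpo",               # illustrative output path
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],              # example weights; tune per task
)

trainer = DPOTrainer(
    model=model,                        # a VLM loaded beforehand, with its processor
    args=training_args,
    train_dataset=preference_dataset,   # rows with prompt, image, chosen, rejected
    processing_class=processor,
)
trainer.train()
```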

Another significant advancement is Group Relative Policy Optimization (GRPO), first introduced in the DeepSeekMath paper and later used to train the DeepSeek R1 large language models. GRPO builds on Proximal Policy Optimization (PPO) by sampling a group of completions for each prompt and computing advantages relative to the group’s average reward, rather than relying on a learned value function. This group-based learning makes GRPO more resilient to noise in reward signals, as the noise tends to average out across the group. By learning a broader sense of what a “good” response looks like rather than chasing isolated high-reward samples, GRPO yields highly performant models. TRL now supports GRPO for vision language models; users define reward functions that validate answer formats and solution accuracy. For instance, one reward function might check whether a response adheres to a specific structure, while another scores the correctness of the mathematical solution, as in the sketch below.
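
A rough sketch of those two reward functions in the shape GRPOTrainer expects (a callable returning one score per completion); the tag format, the “solution” column, and the answer-extraction logic are illustrative assumptions, and completions are treated as plain strings for simplicity:

```python
import re

from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion follows a <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, completion, re.DOTALL) else 0.0
            for completion in completions]

def accuracy_reward(completions, solution, **kwargs):
    """Reward 1.0 when the text inside <answer> matches the reference solution."""
    rewards = []
    for completion, reference in zip(completions, solution):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards

# Extra dataset columns (here a hypothetical "solution" column) are forwarded
# to the reward functions as keyword arguments.
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",   # illustrative model choice
    reward_funcs=[format_reward, accuracy_reward],
    args=GRPOConfig(output_dir="qwen2.5-vl-grpo", num_generations=8),
    train_dataset=train_dataset,            # multimodal prompts plus the "solution" column
)
trainer.train()
```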

Building on GRPO, Group Sequence Policy Optimization (GSPO) is a more recent reinforcement learning alignment algorithm developed by the Qwen team. GSPO overcomes some of GRPO’s limitations and trains more stably by computing importance sampling weights at the sequence level rather than per token, a distinction that makes it particularly beneficial for Mixture-of-Experts (MoE) models. The latest TRL release incorporates GSPO with the same multimodal support as GRPO; its configuration mirrors GRPO’s, with additional parameters such as importance_sampling_level="sequence" enabling the sequence-level behavior.
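
Assuming the GRPO setup sketched above, switching to GSPO is essentially a configuration change; the output path and generation count here are illustrative:

```python
from trl import GRPOConfig

# Same trainer and reward functions as the GRPO sketch; only the config changes.
training_args = GRPOConfig(
    output_dir="qwen2.5-vl-gspo",           # illustrative output path
    importance_sampling_level="sequence",   # sequence-level importance weights (GSPO)
    num_generations=8,
)
```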

Preliminary evaluations, such as fine-tuning Qwen2.5-VL-3B on small subsets of data, offer a glimpse into the efficacy of these new methods. While these “vibe-check” comparisons are not exhaustive benchmarks, they show a clear difference. The base model can struggle with complex geometric problems, falling into circular reasoning or failing to arrive at the correct answer among the given choices. MPO, while still showing some hesitation, begins to demonstrate a more structured approach. Crucially, GRPO and GSPO outputs consistently provide more direct, coherent, and accurate reasoning, often reaching the correct solution by applying the appropriate geometric theorems, unlike the base model’s exploratory and often incorrect attempts.

To facilitate these advanced alignment methods, TRL has integrated vLLM, a high-throughput inference engine. This integration is essential for online alignment methods, which require generating samples during training. vLLM can operate in two primary modes: “colocate,” where it runs within the same process as the training loop and shares GPU resources, or “server,” where vLLM runs as a separate service that the training process queries. This flexibility, together with support for running vLLM with the Hugging Face Transformers backend, significantly improves the efficiency and scalability of VLM alignment workflows in TRL.
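
As a brief sketch (parameter and CLI names as in recent TRL releases; the model name is illustrative), the two modes map to a single configuration switch:

```python
from trl import GRPOConfig

# Colocate mode: vLLM runs inside the training process and shares its GPUs.
colocate_args = GRPOConfig(use_vllm=True, vllm_mode="colocate")

# Server mode: training queries a separate vLLM service, started beforehand with e.g.:
#   trl vllm-serve --model Qwen/Qwen2.5-VL-3B-Instruct
server_args = GRPOConfig(use_vllm=True, vllm_mode="server")
```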

These new multimodal alignment methods in TRL represent a significant leap forward in refining Vision Language Models. By moving beyond simple pairwise preferences to leverage richer signals and more robust optimization techniques, they empower developers to build VLMs that not only understand but also respond with greater accuracy, coherence, and alignment to human intent.