Fine-Tuning SmolVLM for Human Alignment with DPO
AI models, particularly large language and vision-language models (VLMs), often face a critical challenge: while they may generate responses that are technically correct, these outputs can lack the nuanced, human-like qualities that users desire. For instance, a chatbot might provide accurate information but in an overly robotic or rude tone, or a VLM might caption an image with irrelevant details despite maximizing its internal likelihood scores. In such scenarios, traditional supervised fine-tuning methods fall short because they do not account for human preferences or subjective usefulness.
Preference optimization addresses this gap by training models to distinguish between and select “better” responses from a set of options, based on human or proxy judgments. This paradigm allows models to prioritize qualities like clarity, emotional intelligence, or safety, moving beyond mere fluency to generate outputs that align more closely with human intent.
While methods like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) have been instrumental in model alignment, they often introduce significant complexity, instability, and high computational costs. Direct Preference Optimization (DPO) emerges as a simpler yet effective alternative, directly learning from preference data without requiring a separate reward model or complex reinforcement learning loops. This article explores DPO’s principles and demonstrates its application in fine-tuning the SmolVLM model for improved human alignment.
What Is Preference Optimization?
Preference optimization encompasses a category of fine-tuning techniques designed to align machine learning models, especially generative models like language models (LMs) and vision-language models (VLMs), with human or proxy evaluations. Instead of merely predicting the next token, the model is optimized to produce outputs that are considered “preferable” by an evaluator, which could be a human annotator or another AI model. This is vital for making generative AI more useful, safe, and engaging in real-world applications.
At its core, preference optimization involves presenting a model with pairs of outputs (e.g., one preferred, one rejected) and adjusting its internal parameters to increase the probability of generating the preferred response. This approach moves beyond rigid, rule-based alignment, enabling fine-grained control based on qualitative judgments, a kind of comparison at which humans excel but which models do not learn from likelihood training alone.
Types of Techniques
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a widely adopted method for aligning large language models, notably used in models like ChatGPT. It involves a three-step process:
Supervised Fine-Tuning (SFT): A pre-trained base model is fine-tuned on a curated dataset of prompt-response pairs to establish a strong starting policy.
Reward Modeling: Human annotators rank multiple outputs generated by the SFT model. These human rankings are then used to train a separate “reward model” that learns to assign scores to new outputs, mimicking human judgment.
Policy Optimization: The SFT model is further fine-tuned using a reinforcement learning algorithm, typically Proximal Policy Optimization (PPO), to maximize the scores predicted by the reward model.
While RLHF has yielded impressive conversational and coding capabilities, its pipeline is computationally intensive and complex, requiring the training of multiple models and extensive sampling during the training loop.
Reinforcement Learning from AI Feedback (RLAIF)
RLAIF adapts the RLHF structure but replaces human annotators with an AI “preference proxy model” that has been pre-trained on existing human ratings. This allows for scalable generalization of preference judgments, significantly reducing human labeling costs. Although RLAIF accelerates iteration and reduces human effort, it introduces the risk of amplifying existing model biases. Despite this, it has proven effective in projects aiming for scalable AI alignment.
Direct Preference Optimization (DPO)
DPO is a preference-based fine-tuning method that directly optimizes a model’s policy to prefer certain outputs over others, based on human feedback. Unlike RLHF, DPO bypasses the need to train a separate reward model and use a reinforcement learning algorithm. Instead, it simplifies the process by directly optimizing the model’s likelihood of generating preferred responses relative to dispreferred ones. DPO incorporates a dynamic, per-example importance weight that prevents the model from degenerating, which can occur with a naive probability ratio objective.
Identity Preference Optimization (IPO)
IPO extends DPO by incorporating a regularization term. This term discourages the model from deviating too much from a reference model (usually the original supervised fine-tuned model). This helps in maintaining fluency and task-specific knowledge, preventing overfitting to noisy or sparse preference data, and ensuring that alignment does not lead to a degradation of the model’s core capabilities. Mathematically, IPO augments the DPO loss with an identity constraint, blending stability into the preference learning process.
Group Relative Policy Optimization (GRPO)
Introduced by DeepSeek and popularized by models like DeepSeek-R1, GRPO is a reinforcement learning technique that optimizes model behavior based on relative preferences across groups of responses. Rather than relying on a single reward signal or binary preference pairs, GRPO generates multiple candidate responses for a given prompt and evaluates them using automated, rule-based, or heuristic feedback. This makes GRPO particularly suitable for domains with verifiable outcomes, such as mathematics, programming, or logic puzzles, where correctness can be determined without human annotation. GRPO samples a group of responses, assigns scores using automated rules, ranks them relatively, and then applies a PPO-style update that eliminates the need for a value function, simplifying training.
Direct Preference Optimization (DPO) in Detail
A primary challenge with RLHF-style fine-tuning for large language models is its inherent complexity. Learning a reward function and then optimizing it via reinforcement learning often leads to instability, significant computational overhead, and implementation difficulties. Direct Preference Optimization (DPO) offers a powerful alternative by eliminating the separate reward model and allowing direct optimization of the final policy using only preference comparisons.
From Rewards to Policies: The Change-of-Variables Insight
DPO begins by considering the classical RLHF setup, which aims to maximize expected rewards while keeping the fine-tuned policy close to a reference policy (often the supervised fine-tuned model) via a KL divergence constraint. The optimal policy under this setup is known to follow a Boltzmann distribution, weighted by an exponentiated reward function. The challenge lies in the fact that the exact reward function and normalization terms are unknown and costly to approximate.
The key insight of DPO is a “change-of-variables.” By taking the logarithm of the optimal policy equation and rearranging it, the reward function can be re-expressed directly in terms of the policy itself. This “reward-as-policy” view allows DPO to integrate this expression into a standard preference model, such as the Bradley-Terry model. This model typically depends on the difference in rewards between two responses for a given input. When the policy-based reward expression is substituted into the Bradley-Terry model, the problematic normalization terms cancel out, resulting in a preference probability that is expressed entirely in terms of the model’s policies.
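Written out in the DPO paper's notation (reward r, reference policy π_ref, partition function Z(x), and inverse temperature β), the derivation proceeds as follows:

```latex
% Optimal policy of the KL-constrained reward-maximization problem
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)

% Change of variables: the reward expressed in terms of the policy
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;+\; \beta \log Z(x)

% Substituting into the Bradley-Terry model, Z(x) cancels
p(y_w \succ y_l \mid x) \;=\; \sigma\!\Big(
    \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
```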
DPO Objective Function
With this formulation, the DPO loss can be written as a negative log-likelihood over a dataset of preferred and rejected response pairs. This objective function directly encourages the model to increase the log probability of preferred responses while decreasing the log probability of rejected ones. A hyperparameter, often referred to as inverse temperature, controls the sharpness of these preference decisions. The objective effectively measures how well the current model’s policy aligns with the observed human preferences, penalizing instances where preferred responses are less likely than rejected ones.
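Concretely, using the notation above, the loss over a preference dataset D of (prompt, preferred response, rejected response) triples is:

```latex
% DPO loss; beta is the inverse-temperature hyperparameter, sigma the logistic function
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\;
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)
  \right]
```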
How the Gradient Works
Examining the gradient of the DPO loss provides a mechanistic understanding of how the model is updated. If the model already correctly ranks a preferred response above a rejected one, the gradient will be small, indicating minimal adjustment is needed. However, if the model incorrectly ranks a preferred response lower than a rejected one, the gradient will be larger, pushing the model more strongly to favor the preferred response. This update mechanism is inherently self-corrective and scales dynamically with the severity of the model’s preference inversion.
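For reference, the gradient takes the following form, where the sigmoid factor acts as the per-example weight described above:

```latex
% Gradient of the DPO loss, with the implicit reward
% \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} \;=\;
  -\,\beta\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\Big[
    \sigma\!\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,
    \big( \nabla_\theta \log \pi_\theta(y_w \mid x)
        - \nabla_\theta \log \pi_\theta(y_l \mid x) \big)
  \Big]
```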
How DPO Works in Practice
The practical implementation of DPO involves three main steps:
Dataset Creation: Candidate completions are sampled for a given prompt, and a preferred response is identified, typically through human feedback or a proxy scoring mechanism.
Set Reference Policy: A reference policy is established, usually the supervised fine-tuned model or a baseline model trained with maximum likelihood estimation on preferred completions.
Optimize: The DPO objective function is minimized using standard gradient descent, directly updating the model’s parameters to align with the preference data; a minimal sketch of this loss computation follows the list.
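The sketch below is an illustration of the loss computation rather than any library's implementation. It assumes the per-completion log-probabilities have already been summed over tokens for both the trainable policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Minimal DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor holding the summed log-probability of a
    completion under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each completion.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: small when the policy already
    # ranks the chosen completion above the rejected one, large otherwise.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```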
Fine-Tuning SmolVLM Using DPO
To demonstrate DPO’s practical application, we can fine-tune a vision-language model like Hugging Face’s SmolVLM. For this implementation, the OpenBMB RLHF-V-Dataset, which contains 5,733 human preference pairs with fine-grained segment-level corrections for diverse instructions (including detailed descriptions and question-answering), is used for alignment.
Loading the SmolVLM and Configuring LoRA
The process begins by loading the pre-trained SmolVLM model and its corresponding processor. To make fine-tuning more efficient and less computationally expensive, Low-Rank Adaptation (LoRA) is configured and applied. LoRA is a parameter-efficient fine-tuning technique that adds small, trainable matrices to the model’s existing weights, significantly reducing the number of parameters that need to be updated during training compared to full fine-tuning.
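A sketch of this step is shown below. The checkpoint name, LoRA rank, and target module names are illustrative assumptions rather than the exact configuration of the original run; adjust them to your hardware and checkpoint.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA: inject small trainable low-rank matrices into the attention projections
# so only a tiny fraction of the weights is updated during DPO training.
lora_config = LoraConfig(
    r=8,                       # rank of the low-rank update
    lora_alpha=16,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how few parameters are trainable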
Loading and Formatting the Dataset
Next, the OpenBMB RLHF-V-Dataset is loaded and split into training and testing sets. A custom formatting function is then applied to preprocess the data. This function parses the raw text, structuring it into a chat-like format with distinct “user” and “assistant” roles, and creating separate entries for chosen and rejected answers. The model’s processor is used to apply chat templates to these text inputs. Additionally, images within the dataset are resized to prevent out-of-memory errors during processing. This transformation ensures the data is in the correct format for DPO training, providing explicit preferred and rejected responses for each prompt.
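The snippet below sketches this preprocessing. It assumes each RLHF-V row stores a JSON text field with question, chosen, and rejected keys alongside a PIL image, and it reuses the processor loaded earlier; check the dataset card and adjust the keys if the schema differs.

```python
import json
from datasets import load_dataset

# `processor` is the AutoProcessor loaded in the previous snippet.
dataset = load_dataset("openbmb/RLHF-V-Dataset", split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)

MAX_SIZE = 512  # assumed resize cap to avoid out-of-memory errors

def format_example(example):
    record = json.loads(example["text"])   # assumed JSON with question/chosen/rejected
    image = example["image"]
    image.thumbnail((MAX_SIZE, MAX_SIZE))  # shrink in place, preserving aspect ratio

    prompt = [{"role": "user",
               "content": [{"type": "image"},
                           {"type": "text", "text": record["question"]}]}]
    chosen = [{"role": "assistant",
               "content": [{"type": "text", "text": record["chosen"]}]}]
    rejected = [{"role": "assistant",
                 "content": [{"type": "text", "text": record["rejected"]}]}]

    return {
        "images": [image],
        "prompt": processor.apply_chat_template(prompt, tokenize=False),
        "chosen": processor.apply_chat_template(chosen, tokenize=False),
        "rejected": processor.apply_chat_template(rejected, tokenize=False),
    }

train_dataset = dataset["train"].map(format_example)
eval_dataset = dataset["test"].map(format_example)
```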
DPO Fine-Tuning
With the model and dataset prepared, DPO fine-tuning can commence. Training parameters are defined using a `DPOConfig` object, specifying details such as the output directory, batch sizes, gradient accumulation steps, and the number of training epochs. A `DPOTrainer` instance is then initialized with the loaded model, the configured LoRA setup, the prepared datasets, and the training arguments. The training loop then optimizes the model against the DPO loss. During training, the model starts assigning higher scores to the chosen answers in the test dataset; in one run, reward accuracy reached 62.5% by the end of the third epoch, indicating improved alignment. This accuracy is expected to improve further with longer training and more samples from the original dataset. After training, the fine-tuned model is saved.
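One possible version of this step with TRL is sketched below. The hyperparameter values and output path are placeholders rather than the article's exact settings, the model, processor, and formatted datasets come from the previous snippets, and some argument names (such as processing_class) vary between TRL versions.

```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="smolvlm-dpo",            # placeholder output directory
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,                 # PEFT-wrapped SmolVLM from the LoRA step; with no
                                 # ref_model given, TRL uses the adapter-disabled base
                                 # model as the frozen reference policy
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=processor,  # the VLM processor handles both text and images
)

trainer.train()
trainer.save_model(training_args.output_dir)
```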
Testing the Fine-Tuned Model
Finally, the fine-tuned SmolVLM model is tested on new examples from the test set. A utility function prepares the text and image inputs, generates a response with the model’s `generate` method, and decodes the output. When tested on a sample image and prompt, the model’s response is descriptive and factually accurate, closely resembling the preferred answer rather than the rejected one from the original dataset. This practical demonstration highlights the effectiveness of DPO in making AI responses more aligned and human-centered.
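A minimal inference sketch along these lines is shown below; the image path and question are placeholders, and `model` and `processor` are reused from the earlier snippets.

```python
import torch
from PIL import Image

def generate_answer(image, question, max_new_tokens=256):
    # Build a single-turn chat prompt containing the image and the question.
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": question}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

image = Image.open("sample.jpg")  # placeholder test image
print(generate_answer(image, "Describe this image in detail."))
```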
Summary
The field of preference optimization is crucial for aligning AI models with human expectations. While initial approaches like RLHF and RLAIF rely on complex feedback loops, newer strategies such as Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Group Relative Policy Optimization (GRPO) are advancing the field. Each offers a distinct method for interpreting and applying preferences, with GRPO notably introducing a group-based structure for diverse feedback.
DPO stands out for its elegant foundation. By transforming the traditional reward-maximization problem into a direct policy-learning objective through a clever change-of-variables, DPO eliminates the need for explicit reward modeling, simplifying the optimization process. This shift in perspective makes DPO increasingly favored for real-world alignment tasks due to its efficiency and effectiveness.
The practical application of DPO to fine-tune the SmolVLM model demonstrates its utility. The process involves carefully loading and preparing the model, formatting a preference dataset, and executing the DPO fine-tuning steps. The results show that DPO successfully enhances the model’s responses, making them more aligned with human preferences. This practical demonstration underscores DPO’s potential in developing more human-centered AI systems.