Nebius AI Boosts Open-Weight LLMs for SWE Agents with RL Breakthrough
The evolving landscape of software engineering automation is increasingly shaped by advances in Large Language Models (LLMs). However, a significant hurdle has persisted: most capable LLM agents rely either on proprietary models or on expensive, teacher-guided training methods. This has left open-weight LLMs—those whose weights are publicly available—with limited real-world utility for complex software development tasks. A recent breakthrough from a joint research team at Nebius AI and Humanoid aims to change this, introducing a novel reinforcement learning framework designed to train highly capable, long-context, multi-turn software engineering agents. The work marks a pivotal shift, moving beyond the simplistic, single-turn interactions typical of LLM reinforcement learning to address the intricate demands of genuine software engineering.
Software engineering (SWE) fundamentally differs from many tasks LLMs are trained for, such as mathematical reasoning or one-shot code generation. Unlike these, which often provide a single reward at the end, SWE requires agents to execute long sequences of actions, interpret rich feedback like compiler errors and test logs, and maintain context over hundreds of thousands of tokens. This complexity introduces several core challenges for reinforcement learning. Agents must sustain logical coherence across many steps, often necessitating context windows exceeding 100,000 tokens. Actions yield meaningful, non-trivial observations—such as shell command outputs or test suite results—that are crucial for guiding subsequent decisions. Furthermore, success signals are typically sparse and delayed, emerging only at the culmination of complex interactions, which makes it difficult to attribute credit to specific actions. Evaluating progress is also complex, requiring full trajectory rollouts, and can be noisy due to test flakiness.
To tackle these challenges, the research team developed a two-stage learning pipeline for training a Qwen2.5-72B-Instruct agent. The process begins with Rejection Fine-Tuning (RFT), a supervised method in which the agent is run across 7,249 rigorously filtered software engineering tasks from the SWE-rebench dataset. Only successful interaction traces—those in which the agent passes the environment's test suite—are used to fine-tune the model, with improperly formatted actions masked out of the loss during training. This initial step alone boosted baseline accuracy from 11% to 20% on the SWE-bench Verified benchmark.
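To make the rejection step concrete, here is a minimal sketch of how such a filter could look, assuming hypothetical `agent`, `task`, and `tokenizer` interfaces; it illustrates the idea rather than reproducing the team's actual pipeline.

```python
IGNORE_INDEX = -100  # conventional label value that excludes a token from the loss

def build_rft_dataset(tasks, agent, tokenizer):
    """Keep only successful traces and mask everything the model should not imitate."""
    examples = []
    for task in tasks:
        trace = agent.run(task)              # roll the agent out in the sandboxed repo
        if not task.tests_pass(trace):       # rejection step: discard failed attempts
            continue
        input_ids, labels = [], []
        for turn in trace.turns:
            obs_ids = tokenizer.encode(turn.observation)
            act_ids = tokenizer.encode(turn.action)
            input_ids += obs_ids + act_ids
            # never train on environment text; also mask improperly formatted actions
            labels += [IGNORE_INDEX] * len(obs_ids)
            labels += act_ids if turn.action_is_well_formed else [IGNORE_INDEX] * len(act_ids)
        examples.append({"input_ids": input_ids, "labels": labels})
    return examples
```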
Building on this foundation, the second stage employs reinforcement learning using a modified Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm. Several key modifications were introduced to enhance scalability and stability. Asymmetric clipping prevents policy entropy collapse, ensuring the agent continues to explore new solutions. Dynamic sample filtering focuses optimization on trajectories that yield an actual learning signal, making training more efficient. Length penalties discourage excessively long episodes, helping the agent avoid unproductive loops. Finally, token-level averaging ensures that every token in every trajectory contributes equally to the gradient, allowing longer, more complex interactions to exert appropriate influence on updates. The agent itself uses a ReAct-style loop, combining reasoning steps with practical tool use. Its toolkit includes executing arbitrary shell commands, making precise code edits, running navigation and search utilities, and signaling episode completion. Each interaction is grounded in a sandboxed environment initialized from real repository snapshots and presented with GitHub-style issue prompts.
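The modified objective can be sketched as a token-level clipped policy-gradient loss with asymmetric bounds; the clip values and tensor layout below are illustrative assumptions, not the paper's exact settings.

```python
import torch

def dapo_style_loss(logp_new, logp_old, advantages, mask, clip_low=0.2, clip_high=0.28):
    """Token-level clipped policy loss with asymmetric bounds.

    logp_new, logp_old: (batch, seq) log-probs of the sampled tokens
    advantages:         (batch, seq) per-token advantages (broadcast per trajectory)
    mask:               (batch, seq) 1 for action tokens, 0 for prompt/observation tokens
    """
    ratio = torch.exp(logp_new - logp_old)
    # asymmetric clipping: a looser upper bound keeps low-probability tokens
    # trainable, which helps prevent policy entropy collapse
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # token-level averaging: every token in every trajectory weighs equally,
    # so longer interactions contribute proportionally to the update
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```

Dynamic sample filtering would sit upstream of a loss like this, discarding trajectory groups whose rewards are all identical and therefore carry no advantage signal.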
The agent was initially trained with a context length of 65,000 tokens—already double that of most open models—but its performance plateaued at 32%. To push beyond this, a second reinforcement learning phase expanded the context to 131,000 tokens and doubled the episode length ceiling. This phase focused subsequent training on only the most beneficial tasks, enabling the model to scale to the longer stack traces and diff histories inherent in real-world debugging and patching tasks.
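One rough way to express this two-phase schedule is the configuration below; only the 65k and 131k context figures come from the reported setup, while the step ceilings and the task-selection rule are assumptions for illustration.

```python
# Illustrative two-phase RL schedule (step ceilings are assumed values).
PHASE_1 = {"context_length": 65_000, "max_episode_steps": 40}
PHASE_2 = {"context_length": 131_000, "max_episode_steps": 80}  # both ceilings doubled

def select_phase2_tasks(solve_rates):
    """Keep tasks that still carry learning signal: ones the agent neither
    always solves nor always fails, since uniform outcomes yield zero advantage."""
    return [task for task, rate in solve_rates.items() if 0.0 < rate < 1.0]
```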
The results are compelling. The final RL-trained agent achieved a 39% Pass@1 accuracy on the SWE-bench Verified benchmark, effectively doubling the performance of the rejection fine-tuned baseline. Crucially, it matched the performance of cutting-edge open-weight models like DeepSeek-V3-0324, all without requiring teacher-based supervision. On held-out SWE-rebench splits, the scores remained competitive, demonstrating the method’s robustness: 35% for May and 31.7% for June. When compared head-to-head with top open baselines and specialized software engineering agents, this RL agent consistently matched or outperformed several models, confirming the effectiveness of this reinforcement learning methodology in the domain of autonomous software development.
Despite these advancements, challenges remain. Credit assignment in sparse-reward regimes continues to be fundamentally difficult, suggesting future work could explore reward shaping, step-level critics, or prefix-based rollouts for more granular feedback. Real-world agents also need to estimate uncertainty, knowing when to abstain or express confidence, with techniques like output entropy or explicit confidence scoring as next steps. The training itself was a significant undertaking, leveraging context parallelism to split long sequences across 16 H200 nodes, with distributed orchestration managed via Kubernetes and Tracto AI, and vLLM for fast inference.
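As a sketch of the output-entropy idea mentioned above, mean per-token predictive entropy over the agent's generated actions could serve as a simple abstention signal; the interface and any threshold are assumptions.

```python
import torch

def mean_token_entropy(logits, mask):
    """Average predictive entropy over generated action tokens.

    logits: (seq, vocab) model outputs; mask: (seq,) with 1 for generated tokens.
    A high value suggests the agent is unsure of its own actions and might abstain.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (seq,)
    return (entropy * mask).sum() / mask.sum().clamp(min=1)

# e.g., flag a trajectory for abstention when this value exceeds a tuned threshold
```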
This research decisively validates reinforcement learning as a potent paradigm for building autonomous software engineers using open-weight LLMs. By conquering long-horizon, multi-turn, real-environment tasks, the methodology paves the way for scalable, teacher-free agent development that directly leverages the power of interaction rather than static instruction. With further refinements, such reinforcement learning pipelines promise to deliver efficient, reliable, and versatile automation for the future of software engineering.