CoAct-1: Hybrid AI Agent Sets New OSWorld Benchmark Record
A team of researchers from the University of Southern California, Salesforce AI, and the University of Washington has unveiled CoAct-1, a multi-agent system designed to advance autonomous computer operation. The system treats direct coding as a primary action, on par with traditional graphical user interface (GUI) manipulation. This shift addresses long-standing problems with the efficiency and reliability of AI agents on complex, multi-step computer tasks. On the challenging OSWorld benchmark, CoAct-1 set a new record with a success rate of 60.76%, making it the first such agent to surpass the 60% threshold.
Conventional computer-using AI agents typically rely exclusively on pixel-based GUI interaction, mimicking human users by navigating interfaces, clicking elements, and typing. While this approach allows them to replicate human workflows, it often proves fragile and inefficient, particularly for intricate tasks involving cluttered interfaces, workflows spanning multiple applications, or complex operating system operations. Even a single misclick can derail an entire workflow, and as tasks grow in complexity, the number of required steps can balloon dramatically. Efforts to mitigate these issues, such as augmenting GUI agents with high-level planners, have been explored, but these methods ultimately remain constrained by the inherent limitations of GUI-centric action spaces, which restrict both efficiency and overall robustness.
CoAct-1 takes a different approach through a hybrid architecture that integrates three specialized AI agents. At the core is the Orchestrator, a high-level planner responsible for breaking complex tasks into smaller subtasks. The Orchestrator dynamically delegates each subtask to either the Programmer or the GUI Operator, based on what the subtask requires. The Programmer agent handles backend operations, such as file management, data processing, or environment configuration, by executing Python or Bash scripts directly, bypassing the often cumbersome and error-prone sequences of GUI actions. The GUI Operator, in turn, uses an AI model that interprets both visual information and language to interact with graphical interfaces when human-like UI navigation is indispensable. This hybrid design lets CoAct-1 replace brittle, lengthy mouse-and-keyboard sequences with concise, reliable code execution, while still using GUI interaction where it is needed.
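The delegation pattern described above can be illustrated with a minimal Python sketch. It is not CoAct-1's actual code: the `Subtask` class, the `"code"` and `"gui"` labels, and the stub agent functions are all hypothetical, and the routing decision is pre-labeled here rather than made by a language model as it would be in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    description: str
    kind: str  # hypothetical label: "code" or "gui"


def programmer(subtask: Subtask) -> str:
    # Stand-in for an agent that would write and run a Python or Bash script.
    return f"[programmer] ran script for: {subtask.description}"


def gui_operator(subtask: Subtask) -> str:
    # Stand-in for an agent that would click and type through the interface.
    return f"[gui_operator] performed UI actions for: {subtask.description}"


# Routing table mapping each (assumed) subtask label to the agent that handles it.
AGENTS: Dict[str, Callable[[Subtask], str]] = {
    "code": programmer,
    "gui": gui_operator,
}


def orchestrate(subtasks: List[Subtask]) -> List[str]:
    """Send each subtask to the matching agent and collect the results in order."""
    results = []
    for subtask in subtasks:
        agent = AGENTS.get(subtask.kind)
        if agent is None:
            raise ValueError(f"no agent for subtask kind: {subtask.kind!r}")
        results.append(agent(subtask))
    return results


if __name__ == "__main__":
    plan = [
        Subtask("rename exported reports", "code"),
        Subtask("confirm the Save dialog", "gui"),
    ]
    for line in orchestrate(plan):
        print(line)
```

The point of the sketch is only the shape of the control flow: one planner, two executors, and a per-subtask choice between them.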
The system was evaluated on OSWorld, a leading benchmark comprising 369 diverse tasks that span office productivity suites, integrated development environments (IDEs), web browsers, file managers, and multi-application workflows. Each task reflects a realistic goal stated in natural language and is assessed with a granular, rule-based scoring system. CoAct-1 achieved an overall success rate of 60.76% in the 100+ step category, outperforming leading frameworks such as GTA-1 (53.10%), OpenAI CUA 4o (31.40%), and UI-TARS-1.5 (29.60%). It was also more efficient, completing successful tasks in an average of just 10.15 steps, compared with 15.22 for GTA-1 and 14.90 for UI-TARS-1.5. OpenAI CUA 4o used fewer steps (6.14), but its success rate was considerably lower at 31.40%, which highlights CoAct-1's balance of speed and accuracy. The system was particularly strong in multi-application workflows (47.88% success, compared with GTA-1's 38.34%) and operating system tasks (75.00%), and it led or matched the best performance in the productivity and IDE domains.
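To put the reported figures side by side, the short Python sketch below computes the gap in success rate (in percentage points) and the relative reduction in average steps. The numbers are copied from the paragraph above; the helper names `point_gap` and `step_reduction_pct` are illustrative only.

```python
# Figures as reported in the article (success rate in %, average steps per successful task).
success = {"CoAct-1": 60.76, "GTA-1": 53.10, "OpenAI CUA 4o": 31.40, "UI-TARS-1.5": 29.60}
avg_steps = {"CoAct-1": 10.15, "GTA-1": 15.22, "OpenAI CUA 4o": 6.14, "UI-TARS-1.5": 14.90}


def point_gap(a: str, b: str) -> float:
    """Difference in success rate between agents a and b, in percentage points."""
    return round(success[a] - success[b], 2)


def step_reduction_pct(a: str, b: str) -> float:
    """Relative reduction in average steps of agent a compared with agent b, in %."""
    return round(100 * (1 - avg_steps[a] / avg_steps[b]), 1)


if __name__ == "__main__":
    print(point_gap("CoAct-1", "GTA-1"))           # 7.66
    print(step_reduction_pct("CoAct-1", "GTA-1"))  # 33.3
```

On these figures, CoAct-1 leads GTA-1 by 7.66 percentage points while using roughly a third fewer steps per successful task.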
Several factors explain CoAct-1's gains. Coding actions directly replace many redundant and error-prone GUI sequences; for example, a single script can automate batch image resizing or advanced file manipulations that would otherwise require dozens of clicks, sharply reducing both the number of steps and the potential points of failure. The Orchestrator's dynamic delegation lets the system choose between coding and GUI actions according to what each subtask needs. The research also indicates that stronger underlying AI models significantly improve performance; the configuration that achieved the top 60.76% score used OpenAI CUA 4o for the GUI Operator, OpenAI o3 for the Orchestrator, and o4-mini for the Programmer. Finally, the results suggest that efficiency contributes to reliability: fewer steps mean fewer opportunities for error, and the number of steps taken is a strong predictor of whether a task succeeds.
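As a rough illustration of how one script can stand in for dozens of clicks, here is a minimal Python sketch of a batch file manipulation. It is an assumed example, not taken from the CoAct-1 research: the function name `add_prefix`, the file pattern, and the prefix are all hypothetical, and it uses only the standard library.

```python
from pathlib import Path
from typing import List


def add_prefix(directory: Path, prefix: str, pattern: str = "*.txt") -> List[str]:
    """Rename every matching file in `directory` by prepending `prefix`.

    Files that already carry the prefix are skipped, so the function is safe
    to run twice. Returns the new file names in sorted order.
    """
    renamed = []
    # sorted() materializes the listing first, so renaming does not disturb iteration.
    for path in sorted(directory.glob(pattern)):
        if path.is_file() and not path.name.startswith(prefix):
            target = path.with_name(prefix + path.name)
            path.rename(target)
            renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("a.txt", "b.txt", "notes.md"):
            (root / name).write_text("example")
        print(add_prefix(root, "2024_"))  # ['2024_a.txt', '2024_b.txt']
```

Done through a file manager, the same job would take a select, rename, type, and confirm sequence for every file, and each of those actions is a chance for a misclick.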
By making coding a first-class action alongside GUI manipulation, CoAct-1 delivers a significant improvement in both the success rate and the efficiency of autonomous computer agents. Its hybrid architecture and dynamic delegation set a new standard for the field and point toward more robust real-world computer automation.