Salesforce's CoAct-1 AI Agents Code & Click for Faster GUI Tasks

VentureBeat

Researchers at Salesforce and the University of Southern California have developed a technique that lets AI agents both execute code and navigate graphical user interfaces (GUIs). The system, called CoAct-1, combines the precision of scripting with point-and-click interaction to speed up workflows and reduce errors. By letting agents skip fragile, inefficient mouse-click sequences for tasks that are better handled programmatically, CoAct-1 sets a new state of the art in agent performance and completes complex computer tasks in fewer steps than previous methods. The approach points toward more robust and scalable automation for real-world applications.

Current computer-use agents predominantly rely on AI models that interpret visual information and language to mimic how a human uses a mouse and keyboard. These GUI-based agents can perform a variety of tasks, but they frequently falter on long, intricate workflows, particularly in applications with dense menus and numerous options, such as office productivity suites. Consider a task that requires an agent to locate a specific table within a spreadsheet, filter its contents, and then save it as a new file. Such an operation demands a precise and extended sequence of GUI manipulations, and this is where the brittleness emerges. As the researchers note in their paper, existing agents often struggle with visual ambiguity, such as distinguishing between visually similar icons or menu items, and with the way the chance of making at least one error accumulates over a long sequence. A single mis-click or misinterpretation of a UI element can derail an entire task.
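The compounding effect can be made concrete with a small calculation. The sketch below assumes that every GUI action succeeds independently with the same probability, which is a simplification and not a model taken from the paper. The function name and the 0.98 per-step figure are purely illustrative.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that all `steps` actions succeed, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


if __name__ == "__main__":
    # Illustrative only: 0.98 is an assumed per-action success rate.
    for steps in (5, 15, 30):
        p = task_success_probability(0.98, steps)
        print(f"{steps:>2} steps: {p:.1%} chance of an error-free run")
```

Under this assumption, even a highly reliable single action leads to a noticeably lower chance of finishing a long sequence without a mistake, which is why shortening the sequence matters.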

To mitigate these challenges, many researchers have augmented GUI agents with high-level planners, using powerful reasoning models to decompose a user’s overarching goal into a series of smaller, more manageable subtasks. This structured approach improves performance, but the agent still has to navigate menus and click buttons, even for operations that could be completed more directly and reliably with a few lines of code.

CoAct-1, short for Computer-using Agent with Coding as Actions, is designed to address this gap. It pairs the human-like flexibility of GUI manipulation with the precision, reliability, and efficiency of direct system interaction through code. CoAct-1 operates as a team of three specialized agents: an Orchestrator, a Programmer, and a GUI Operator. The Orchestrator is the central planner: it analyzes the user’s goal, breaks it into subtasks, and delegates each one to the most appropriate agent. Backend operations such as file management or data processing go to the Programmer, which writes and executes Python or Bash scripts. Frontend tasks that require button clicks or visual navigation go to the GUI Operator, an AI model specialized for visual interaction. This dynamic delegation lets CoAct-1 bypass inefficient GUI sequences in favor of direct code execution when suitable, while still using visual interaction where it remains indispensable. The workflow is iterative: when a subtask is completed, the responsible agent sends a summary and a screenshot back to the Orchestrator, which then decides on the next action or concludes the task. Both the Programmer and the GUI Operator use interpreters to test and refine their actions.
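The delegation pattern described above can be sketched in a few lines. This is a simplified illustration and not CoAct-1’s actual implementation: the class names, the "code" and "gui" labels, and the stand-in worker functions are all hypothetical, and the plan here is fixed in advance, whereas the real Orchestrator chooses each next step after reading the previous report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subtask:
    description: str
    kind: str  # hypothetical label: "code" or "gui"


@dataclass
class Report:
    agent: str
    summary: str


def programmer(subtask: Subtask) -> Report:
    # Stand-in: a real Programmer would write and run a Python or Bash script.
    return Report("programmer", f"ran script for: {subtask.description}")


def gui_operator(subtask: Subtask) -> Report:
    # Stand-in: a real GUI Operator would click and type in the interface.
    return Report("gui_operator", f"clicked through: {subtask.description}")


# Routing table the orchestrator uses to pick a worker for each subtask.
AGENTS: Dict[str, Callable[[Subtask], Report]] = {
    "code": programmer,
    "gui": gui_operator,
}


def orchestrate(plan: List[Subtask]) -> List[Report]:
    """Send each subtask to the matching worker and collect the reports."""
    history: List[Report] = []
    for subtask in plan:
        worker = AGENTS.get(subtask.kind)
        if worker is None:
            raise ValueError(f"no agent for subtask kind {subtask.kind!r}")
        history.append(worker(subtask))
    return history


if __name__ == "__main__":
    plan = [
        Subtask("filter the table and save it as a new file", "code"),
        Subtask("confirm the export dialog", "gui"),
    ]
    for report in orchestrate(plan):
        print(f"[{report.agent}] {report.summary}")
```

Keeping the routing decision in one place means the two workers stay independent of each other, which mirrors the division of labor the article describes.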

CoAct-1 was evaluated on OSWorld, a benchmark of 369 real-world tasks spanning browsers, integrated development environments, and office applications. It achieved a new state-of-the-art success rate of 60.76%. The gains were most pronounced in categories where programmatic control offers a distinct advantage, such as OS-level tasks and multi-application workflows. Consider an OS-level task such as finding all image files within a complex folder structure, resizing them, and then compressing the entire directory. A purely GUI-based agent would need a long, error-prone sequence of clicks and drags. CoAct-1 can instead delegate the entire workflow to its Programmer agent, which can accomplish it with a single script. The system is also more efficient, solving tasks in an average of 10.15 steps, compared with 15.22 steps for leading GUI-only agents such as GTA-1, or about a third fewer. This efficiency matters because the researchers observed a clear trend: tasks that require more actions are more likely to fail. Reducing the number of steps therefore not only speeds up task completion but also leaves fewer opportunities for error, pointing to a more robust and scalable path toward generalized computer automation.
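To show why a script suits this kind of OS-level task, here is a minimal sketch of what a Programmer-style script might look like. It is an illustration and not code from the paper: the function names and the list of image extensions are assumptions, and the resizing step is left out because it would need a third-party imaging library.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Assumed set of extensions; a real task would define its own.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_images(root: Path) -> List[Path]:
    """Recursively collect image files under `root`, in sorted order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def archive_directory(root: Path, archive_base: Path) -> Path:
    """Zip the whole directory; `archive_base` is given without an extension."""
    # Keep archive_base outside `root` so the archive does not include itself.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(root)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "photos"
        (root / "trip" / "day1").mkdir(parents=True)
        (root / "trip" / "day1" / "beach.jpg").write_bytes(b"fake image data")
        (root / "logo.PNG").write_bytes(b"fake image data")
        (root / "notes.txt").write_text("not an image")

        for image in find_images(root):
            print("found", image.relative_to(root))
        archive = archive_directory(root, Path(tmp) / "photos_backup")
        print("archive written:", archive.name)
```

One recursive search and one archive call replace what would otherwise be many separate folder-opening, selecting, and dragging actions in a file manager.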

The implications extend beyond general productivity. The technology could be valuable to enterprise leaders seeking to automate complex, multi-tool processes where full API access is often unavailable. Ran Xu, a co-author of the paper and Director of Applied AI Research at Salesforce, points to customer support as a prime example. Service agents frequently use a diverse array of tools to address customer requests, from general platforms like Salesforce to industry-specific applications such as EPIC for healthcare, alongside numerous customized tools. Many of these tools lack API access, making them good candidates for CoAct-1, which can use whatever interaction method is available, be it API, code, or direct screen interaction. Xu also identifies high-value applications in sales, such as large-scale prospecting and automated bookkeeping, and in marketing, for tasks like customer segmentation and campaign asset generation.

Despite the strong benchmark performance, real-world enterprise environments pose additional challenges, including legacy software and unpredictable user interfaces. These raise questions about robustness, security, and the need for human oversight. A core challenge is ensuring that the Orchestrator makes the correct choice when it faces an unfamiliar application. According to Xu, making agents like CoAct-1 robust for custom enterprise software requires extensive training in realistic, simulated environments. The ultimate goal is a system in which the agent can learn from human agents, train in a sandbox, and then operate live under human guidance and guardrails. The Programmer agent’s ability to execute its own code also raises security concerns, particularly the risk of running harmful code in response to an ambiguous user request. Xu emphasizes that robust containment is paramount, with access control and sandboxing as key measures: a human must understand the implications before granting the AI access, and sandboxing and guardrails will be critical for validating agent behavior before deployment on sensitive systems. For the foreseeable future, resolving ambiguity will likely require human involvement. Xu envisions a phased approach that starts with a human in the loop for all tasks, with some tasks eventually becoming fully autonomous. For mission-critical operations, however, human validation will remain crucial.
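One way to picture a human-in-the-loop guardrail is an approval gate that sits between the agent and the shell. The sketch below is a hypothetical illustration and not something described by Xu or the paper: the allowlist, the function names, and the approval callback are all assumptions, and a check like this would complement sandboxing and access control, not replace them.

```python
import shlex
from typing import Callable

# Illustrative allowlist of read-only commands; a real policy would be stricter.
SAFE_COMMANDS = {"ls", "cat", "grep", "pwd"}

# Characters that can chain, redirect, or substitute commands in a shell.
SHELL_OPERATORS = set(";|&><`$")


def needs_human_approval(command: str) -> bool:
    """Return True unless the command is a plain call to an allowlisted program."""
    if any(ch in SHELL_OPERATORS for ch in command):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: treat anything unparseable as risky.
        return True
    if not tokens:
        return True
    return tokens[0] not in SAFE_COMMANDS


def gate(command: str, approve: Callable[[str], bool]) -> bool:
    """Allow the command if it is allowlisted or a human reviewer approves it."""
    if not needs_human_approval(command):
        return True
    return approve(command)


if __name__ == "__main__":
    def deny_everything(command: str) -> bool:
        # Stand-in for a real review step that would prompt a person.
        return False

    for cmd in ("ls -la", "rm -rf ./reports", "cat notes.txt | sh"):
        verdict = "allowed" if gate(cmd, deny_everything) else "blocked"
        print(f"{cmd!r}: {verdict}")
```

Defaulting to review for anything not explicitly allowlisted reflects the phased approach in the article: human validation stays in place until a class of actions has earned autonomy.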