OpenAI's o3 Outperforms Newer GPT-5 on Complex Office Tasks


A new benchmark designed to test artificial intelligence in real-world office environments has produced a surprising result: OpenAI’s older o3 model consistently outperformed the newer GPT-5 on complex, multi-application tasks. The finding, based on the recently introduced OdysseyBench, suggests that agent capability on intricate, long-duration workflows does not improve as neatly with each model generation as release numbering would imply.

Developed by researchers at Microsoft and the University of Edinburgh, OdysseyBench moves beyond isolated “atomic tasks” (simple, single-step commands) to evaluate how AI models handle scenarios that unfold over several days, mimicking genuine office work. The benchmark comprises 602 tasks spanning Word, Excel, PDF, email, and calendar tools. They fall into two categories: 300 realistic scenarios derived from OfficeBench, dubbed OdysseyBench+, and 302 newly constructed, especially challenging ones, known as OdysseyBench-Neo. Both require models to extract information from multi-day conversations, formulate multi-step plans, and coordinate actions across multiple office tools.
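The exact task format is documented in the benchmark’s public release; as a rough, hypothetical illustration of the three ingredients described above (multi-day dialog context, multiple applications, and a checkable goal), a task might be modeled like this. All class names and fields here are assumptions made for the sketch, not the benchmark’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class OdysseyStyleTask:
    """Hypothetical shape of a long-horizon, OdysseyBench-style task."""
    task_id: str
    apps: list[str]                # e.g. ["word", "pdf", "email"]
    dialog: list[tuple[int, str]]  # (day, utterance) pairs spread over days
    goal: str                      # natural-language success criterion

# A toy three-application task in the spirit of OdysseyBench-Neo.
example = OdysseyStyleTask(
    task_id="demo-001",
    apps=["word", "pdf", "email"],
    dialog=[
        (1, "Draft the Q3 summary in Word from the sales numbers I sent."),
        (3, "Export it as a PDF and email it to the finance team."),
    ],
    goal="A PDF of the Q3 summary is attached to an email to finance.",
)

# The instructions never arrive in a single message: an agent must
# retrieve and combine context from every day of the conversation.
days_needed = {day for day, _ in example.dialog}
print(days_needed)  # {1, 3}
```

The point of this structure is what the paper emphasizes: the relevant information is scattered across the dialog history, so an agent has to plan over the whole conversation rather than react to a single prompt.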

The central challenge for these agents is solving long-horizon, dialog-driven office tasks. Across both OdysseyBench+ and OdysseyBench-Neo, OpenAI’s o3 consistently came out ahead of GPT-5. On OdysseyBench-Neo, which features the most demanding, hand-crafted tasks, o3 achieved a 61.26% success rate, compared with 55.96% for GPT-5 and 57.62% for GPT-5-chat. The lead held up on tasks requiring the simultaneous use of three applications, where o3 scored 59.06% to GPT-5’s 53.80%, a margin of roughly five points in both cases.

Results on OdysseyBench+ mirrored the trend: o3 scored 56.2%, ahead of GPT-5 at 54.0% and GPT-5-chat at 40.3%, with the disparity growing on tasks that coordinate two or three applications, where contextual understanding and careful planning matter most. Notably, GPT-5-chat outperformed GPT-5 on OdysseyBench-Neo. The researchers attribute this to the Neo tasks’ focus on dialog-based assistance, which plays to GPT-5-chat’s conversational strengths; OdysseyBench+, by contrast, includes more fragmented, less conversational scenarios, where the reasoning-focused GPT-5 was better at extracting relevant information from disjointed input. The study did not specify the reasoning settings used for GPT-5, such as its “thinking time” or agent parameters, and the more advanced GPT-5 Pro was not included in the evaluation.
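For readers who want the margins at a glance, the gaps follow directly from the reported figures; the snippet below simply recomputes o3’s lead over GPT-5 on each suite from the numbers cited above:

```python
# Reported success rates (%) taken from the results discussed above.
neo  = {"o3": 61.26, "gpt-5": 55.96, "gpt-5-chat": 57.62}
plus = {"o3": 56.2,  "gpt-5": 54.0,  "gpt-5-chat": 40.3}

for name, scores in (("OdysseyBench-Neo", neo), ("OdysseyBench+", plus)):
    lead = scores["o3"] - scores["gpt-5"]
    print(f"{name}: o3 leads GPT-5 by {lead:.2f} points")
# OdysseyBench-Neo: o3 leads GPT-5 by 5.30 points
# OdysseyBench+: o3 leads GPT-5 by 2.20 points
```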

These findings matter because OpenAI is actively pursuing agents that can “think” for hours or even days, with the ultimate goal of generating novel ideas and automating research in fields like medicine and AI safety. OdysseyBench could become a key benchmark for such long-horizon systems. At the same time, the results hint at a slowdown in the pace of progress: both o3 and GPT-5 are clear advances over earlier models, but there is no discernible leap from o3 to GPT-5, particularly given that o3 was only officially released in April.

A closer look at the results reveals several persistent weaknesses of current agents on complex workflows. Models frequently overlook critical files, skip necessary steps, or reach for the wrong tool. Some agents, for example, tried to generate a PDF before the source text had been written in Word, or drafted a review document without first extracting the content from the PDF under review. Tasks that create or edit DOCX and XLSX files proved especially error-prone, since they demand precise multi-step coordination, exactly where the agents struggled most. The researchers conclude that these failures point to a fundamental problem: today’s AI agents still grapple with the precise, multi-stage planning needed to navigate tasks spanning different tools, timeframes, and contexts. For those interested in further exploration, the OdysseyBench benchmark and the HOMERAGENTS framework are openly available on GitHub.
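To make the ordering failures concrete, here is a minimal, hypothetical sketch of the dependency check such plans violate. Nothing here is actual OdysseyBench or HOMERAGENTS code; the step names, artifacts, and validator are assumptions made for illustration:

```python
# Each step declares the artifacts it reads and writes; a validator
# rejects any plan that consumes an artifact before some earlier step
# has produced it.
Step = tuple[str, set[str], set[str]]  # (action, reads, writes)

def first_ordering_error(plan: list[Step]) -> str | None:
    """Return a description of the first missing-input step, or None."""
    produced: set[str] = set()
    for action, reads, writes in plan:
        missing = reads - produced
        if missing:
            return f"{action} needs {sorted(missing)} before they exist"
        produced |= writes
    return None

# The failure pattern described above: exporting a PDF before the
# Word document it is based on has been written.
bad_plan = [
    ("export_pdf", {"report.docx"}, {"report.pdf"}),
    ("write_docx", set(), {"report.docx"}),
    ("send_email", {"report.pdf"}, set()),
]
good_plan = [bad_plan[1], bad_plan[0], bad_plan[2]]

print(first_ordering_error(bad_plan))   # flags export_pdf
print(first_ordering_error(good_plan))  # None
```

An agent that tracked read/write dependencies this way would reject the PDF-before-Word plan outright; the benchmark’s error analysis suggests current models often commit to such plans instead of checking them.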