Grok 4 edges out GPT-5 in complex reasoning benchmark ARC-AGI

Decoder

In the fiercely competitive field of artificial intelligence, xAI’s Grok 4 has reportedly surpassed OpenAI’s GPT-5 on the demanding ARC-AGI-2 benchmark, a test designed to evaluate a model’s general reasoning rather than mere memorization. The lead comes with a significant caveat, however: a substantially higher cost per task, underscoring the trade-offs emerging in the latest generation of large language models.

According to data released by ARC Prize, the organization behind the benchmark, Grok 4’s “Thinking” variant reached roughly 16 percent accuracy on ARC-AGI-2, at a cost of $2 to $4 per task. OpenAI’s flagship GPT-5 “High” model trailed at 9.9 percent accuracy but proved far more cost-efficient at just $0.73 per task. The ARC-AGI benchmarks are designed to prioritize genuine reasoning over rote knowledge, assessing models both on their ability to solve problems and on the economic viability of their solutions.
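
To make that trade-off concrete, the reported figures can be normalized into a rough cost per correctly solved task: cost per task divided by accuracy. This is an illustrative back-of-the-envelope metric of our own, not one ARC Prize publishes; the short Python sketch below simply carries out the arithmetic on the numbers above.

```python
# Illustrative cost-per-solved-task arithmetic from the reported
# ARC-AGI-2 figures. The normalization (cost / accuracy) is our own
# back-of-the-envelope metric, not an official ARC Prize number.

def cost_per_solve(cost_per_task: float, accuracy: float) -> float:
    """Expected spend to obtain one correct solution."""
    return cost_per_task / accuracy

# Grok 4 "Thinking": ~16% accuracy at a reported $2-$4 per task
print(f"Grok 4 Thinking: ${cost_per_solve(2.00, 0.16):.2f}-"
      f"${cost_per_solve(4.00, 0.16):.2f} per solved task")  # $12.50-$25.00

# GPT-5 "High": 9.9% accuracy at $0.73 per task
print(f"GPT-5 High: ${cost_per_solve(0.73, 0.099):.2f} per solved task")  # ~$7.37
```

By this rough measure, GPT-5 obtains a correct ARC-AGI-2 answer for about $7.40, versus roughly $12.50 to $25 for Grok 4, even though Grok 4 solves more tasks in absolute terms.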

The picture was similar on the less challenging ARC-AGI-1 test. Grok 4 again led, reaching about 68 percent accuracy, with GPT-5 close behind at 65.7 percent. And again the economic gap was pronounced: Grok 4 cost around $1 per task, while GPT-5 delivered comparable performance for $0.51. That price difference currently makes GPT-5 the more attractive option where cost-efficiency is paramount, though xAI could recalibrate its pricing to narrow the gap.

Beyond these flagship models, the benchmark also shed light on lighter, more economical variants. OpenAI’s GPT-5 Mini achieved 54.3 percent on ARC-AGI-1 at just $0.12 per task, and 4.4 percent on ARC-AGI-2 at $0.20. The even smaller GPT-5 Nano scored 16.5 percent on ARC-AGI-1 and 2.5 percent on ARC-AGI-2, at an exceptionally low $0.03 per task in both cases. These smaller models reflect the industry’s push toward diversified offerings spanning a range of performance and budget requirements.
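
Applying the same illustrative normalization across every model and both benchmarks makes the spread plain. As before, this is a sketch over the approximate figures reported above (using the midpoint of Grok 4’s $2–$4 range), not an official ranking.

```python
# Rough price-performance ranking across the models mentioned above,
# using the approximate reported figures. "Cost per solve" is the same
# illustrative cost / accuracy normalization, not an ARC Prize metric.

# (model, benchmark, accuracy, cost per task in USD)
RESULTS = [
    ("Grok 4 Thinking", "ARC-AGI-1", 0.680, 1.00),
    ("GPT-5 High",      "ARC-AGI-1", 0.657, 0.51),
    ("GPT-5 Mini",      "ARC-AGI-1", 0.543, 0.12),
    ("GPT-5 Nano",      "ARC-AGI-1", 0.165, 0.03),
    ("Grok 4 Thinking", "ARC-AGI-2", 0.160, 3.00),  # midpoint of $2-$4
    ("GPT-5 High",      "ARC-AGI-2", 0.099, 0.73),
    ("GPT-5 Mini",      "ARC-AGI-2", 0.044, 0.20),
    ("GPT-5 Nano",      "ARC-AGI-2", 0.025, 0.03),
]

# Print cheapest-per-correct-answer first.
for model, bench, acc, cost in sorted(RESULTS, key=lambda r: r[3] / r[2]):
    print(f"{bench}  {model:<16} {acc:5.1%} at ${cost:.2f}/task"
          f"  -> ${cost / acc:5.2f} per solve")
```

Two things stand out in this rough ranking: on ARC-AGI-1 the tiny GPT-5 Nano is actually the cheapest route to a correct answer at about $0.18 per solve, and on ARC-AGI-2 every model’s cost per solve balloons, reflecting how rarely any of them succeed there.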

Looking ahead, ARC Prize has confirmed that preliminary, unofficial evaluations are already underway for the interactive ARC-AGI-3 benchmark. This innovative test challenges models to solve tasks through iterative trial and error within a game-like environment. While these visual puzzle games are often intuitive for humans to navigate and solve, most artificial intelligence agents continue to struggle, underscoring the significant hurdles that remain in achieving truly human-like cognitive flexibility and adaptive problem-solving.
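
In schematic terms, that shifts evaluation from a single prompt-and-answer exchange to an observation-action loop, where the agent’s behavior over many steps is what gets scored. The sketch below illustrates such a loop; every name in it (Env, reset, step, the action budget) is a hypothetical stand-in in the spirit of common agent-environment APIs, not ARC Prize’s actual ARC-AGI-3 harness.

```python
# Schematic trial-and-error loop for an interactive benchmark. All
# names here (Env, reset, step, budget) are hypothetical stand-ins,
# not the real ARC-AGI-3 interface.
import random
from typing import Protocol, Tuple

class Env(Protocol):
    def reset(self) -> object: ...                            # initial observation
    def step(self, action: int) -> Tuple[object, bool]: ...   # (observation, solved?)

def attempt(env: Env, n_actions: int, budget: int = 100) -> bool:
    """Interact with the environment until the task is solved or the
    action budget runs out. A real agent would replace the random
    placeholder policy with behavior that adapts to observations."""
    obs = env.reset()
    for _ in range(budget):
        action = random.randrange(n_actions)  # placeholder policy
        obs, solved = env.step(action)
        if solved:
            return True
    return False
```

The point of the sketch is the structure, not the policy: success depends on how efficiently an agent explores and adapts across steps, which is precisely where current models reportedly fall short of humans.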

Grok 4’s strong showing on these benchmarks needs context: it does not by itself establish the model as superior across all AI applications, particularly given ongoing scrutiny of benchmark methodologies and competitive practices. Notably, OpenAI made no mention of the ARC Prize during its recent GPT-5 presentation, a departure from past launches where such benchmarks were often highlighted.

Further complicating the picture is the o3-preview model. Introduced in December 2024, this OpenAI variant still holds the highest score on ARC-AGI-1 by a considerable margin, at nearly 80 percent accuracy, albeit at a far higher cost than its competitors. Reports suggested that OpenAI had to scale o3 down substantially for the later, publicly released chat version, a claim ARC Prize corroborated in late April when it confirmed the public o3 model’s diminished performance. The episode raises questions about the trade-offs between raw capability, cost, and public deployment strategy.

The latest ARC-AGI results capture a rapidly evolving AI ecosystem in which breakthroughs are routinely accompanied by trade-offs. Grok 4 has demonstrated a clear edge in certain reasoning tasks, but GPT-5 maintains a compelling lead in cost-efficiency and offers a broader suite of models for different applications. The competition between leading AI developers remains fierce, pushing the boundaries of what these systems can achieve even as fundamental challenges in adaptive reasoning persist.