Gary Marcus Slams GPT-5 as Overhyped, Underwhelming

Decoder

The recent unveiling of GPT-5, OpenAI’s latest flagship large language model, has been met with a familiar wave of skepticism from prominent AI critic Gary Marcus. In a sharply worded blog post, Marcus dismissed the release as “overdue, overhyped and underwhelming,” contending that the new model, far from being a breakthrough, is merely another incremental step, still plagued by the persistent, fundamental problems that affect the industry as a whole.

A long-standing skeptic of the idea that simply scaling up neural networks will yield true intelligence, Marcus used GPT-5’s release to reiterate his core criticisms. He characterized GPT-5 as “the latest incremental advance,” adding that it “felt rushed.” While OpenAI CEO Sam Altman touted GPT-5 as offering an experience akin to “talking to… legitimate PhD level expert in anything,” Marcus remains unconvinced. He pointedly remarked that GPT-5 is “barely better than last month’s flavor of the month (Grok 4); on some metrics (ARC-AGI-2) it’s actually worse,” referring to a benchmark designed to test abstract reasoning and generalization to novel problems.

Indeed, Marcus highlighted that the typical flaws associated with large language models surfaced almost immediately after GPT-5’s launch. He expressed a desire to be genuinely impressed by “a system that could have gone a week without the community finding boatloads of ridiculous errors and hallucinations.” Instead, within hours of its debut, the system exhibited familiar shortcomings, including flawed physics explanations during its release livestream, incorrect answers to basic chess puzzles, and mistakes in image analysis.

These isolated errors, Marcus argued, are not anomalies but symptoms of industry-wide problems. He drew attention to a recent study from Arizona State University which resonates deeply with his concerns. The paper suggests that “chain of thought” reasoning—an AI method designed to break down complex problems into smaller, sequential steps—is “a brittle mirage that vanishes when it is pushed beyond training distributions.” Marcus noted that reading the study’s summary gave him a sense of déjà vu, reinforcing his long-held belief that “The Achilles’ Heel I identified then still remains.”
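
For readers unfamiliar with the technique, the sketch below shows roughly what chain-of-thought prompting looks like in practice. It is an illustration only, not taken from Marcus’s post or the ASU study; the prompts are hypothetical and no model is actually called.

```python
# Illustrative only: the difference between asking for an answer directly and
# chain-of-thought prompting, where the model is told to work through smaller
# steps before answering. These are just the prompt strings.

question = (
    "A train leaves at 09:40 and the journey takes 2 hours 35 minutes. "
    "What time does it arrive?"
)

direct_prompt = f"{question}\nAnswer with the time only."

chain_of_thought_prompt = (
    f"{question}\n"
    "Let's think step by step: first add the hours, then add the minutes, "
    "handle any carry-over past 60 minutes, and finally state the arrival time."
)

print(direct_prompt)
print("---")
print(chain_of_thought_prompt)
```

The ASU paper’s claim, as Marcus reads it, is that the step-by-step transcript the second prompt elicits looks like reasoning but tends to break down once the question drifts away from patterns seen during training.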

This “distribution shift” problem, in which AI models falter when confronted with data or scenarios outside the distribution of their training data, is, according to Marcus, precisely why other large models, from Grok to Gemini, similarly fail at more complex “transfer tasks” that require applying knowledge to novel situations. He asserted that “It’s not an accident. That failing is principled,” suggesting a fundamental limitation rather than a mere bug.
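
A toy illustration (not drawn from Marcus or the ASU paper) makes the point concrete: a simple model fit only on a narrow range of inputs can look flawless inside that range and still be wildly wrong outside it.

```python
# Toy demonstration of distribution shift: a model that fits its training
# range well can still fail badly on inputs it never saw.
import numpy as np

rng = np.random.default_rng(0)

# "Training distribution": x sampled from [0, pi], target y = sin(x).
x_train = rng.uniform(0.0, np.pi, size=500)
y_train = np.sin(x_train)

# Fit a degree-4 polynomial; inside the training range it looks excellent.
coeffs = np.polyfit(x_train, y_train, deg=4)

for x in (1.0, 2.0, 10.0):  # 10.0 lies far outside the training range
    pred = np.polyval(coeffs, x)
    print(f"x={x:>4}  prediction={pred:9.3f}  truth={np.sin(x):7.3f}")
```

Inside the training interval the predictions track the truth closely; at x = 10 the fitted model is off by an order of magnitude, which is the flavor of failure Marcus attributes to LLMs asked to transfer beyond what they were trained on.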

Beyond the technical specifics of GPT-5, Marcus broadened his critique to encompass wider trends within the AI sector. He condemned the rampant hype surrounding the concept of Artificial General Intelligence (AGI), the reliance on cherry-picked demo videos that obscure limitations, the pervasive lack of transparency regarding training data, and an industry he believes prioritizes marketing over genuine scientific research. In his blunt assessment, “We have been fed a steady diet of bullshit for the last several years.”

As a corrective, Marcus once again advocated for neurosymbolic approaches, which combine the pattern-recognition strengths of neural networks with the logical reasoning capabilities of symbolic AI, often incorporating “explicit world models” that give AI a clearer understanding of the rules governing its environment. For Marcus, the launch of GPT-5 is not a stride towards AGI, but rather a pivotal moment where even dedicated tech enthusiasts might begin to seriously question the “scaling hypothesis”—the belief that simply making models bigger will inevitably lead to more intelligent and capable AI.
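
As a rough, hypothetical sketch of what such a combination can look like (not an implementation Marcus describes or endorses), the snippet below pairs a stubbed “neural” guesser with an explicit symbolic rule acting as a small world model, which overrides confident but impossible answers.

```python
# Minimal neurosymbolic sketch: a pattern-matching component proposes
# candidates with confidence scores, and an explicit rule-based "world model"
# rejects any candidate that violates the known rules of the domain.
# The neural part is stubbed with a toy scorer for illustration.

def neural_candidates(query: str) -> list[tuple[int, float]]:
    """Stand-in for a neural model: returns (answer, confidence) guesses."""
    # Imagine a network that read a blurry "17 + 25 = ?" and is unsure.
    return [(32, 0.40), (42, 0.35), (41, 0.25)]

def satisfies_world_model(a: int, b: int, answer: int) -> bool:
    """Explicit symbolic rule: the answer must actually equal a + b."""
    return answer == a + b

def neurosymbolic_answer(a: int, b: int) -> int:
    candidates = neural_candidates(f"{a} + {b} = ?")
    # Keep only candidates the symbolic layer certifies, then take the most
    # confident survivor; fall back to exact symbolic computation if none pass.
    legal = [(ans, conf) for ans, conf in candidates if satisfies_world_model(a, b, ans)]
    return max(legal, key=lambda pair: pair[1])[0] if legal else a + b

print(neurosymbolic_answer(17, 25))  # 42: the rule overrides the top neural guess
```

The design choice is the point: the statistical component supplies candidates, but an explicit, inspectable model of the rules decides what counts as a valid answer.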