GPT-5 Fails Hype Test: Incremental Gains Disappoint Users

The Verge

The launch of OpenAI’s GPT-5 last week ignited a fervent wave of anticipation across the technology landscape, only to be met with widespread disappointment. Leading up to the big reveal, OpenAI CEO Sam Altman had declared GPT-5 to be “something that I just don’t wanna ever have to go back from,” likening it to the groundbreaking debut of the iPhone with a Retina display. The night before the announcement livestream, Altman further fueled speculation by posting an image of the Death Star, prompting one user on X to describe the atmosphere as akin to “Christmas Eve.” All eyes were on the ChatGPT maker, eager to see if the immense publicity would translate into a revolutionary leap or a letdown. By most accounts, it was the latter.

The fervor for OpenAI’s long-awaited model had been building for years, ever since the release of GPT-4 in 2023. During a Reddit AMA last October, users repeatedly pressed Altman and his team for details on GPT-5’s features and release date, with one Redditor pointedly asking, “Why is GPT-5 taking so long?” Altman attributed the delay to computational limitations, noting the increasing complexity of these models and the difficulty of developing multiple models in parallel.

However, when GPT-5 finally became accessible via ChatGPT, user reactions were largely unenthusiastic. The significant advancements many had expected appeared incremental, with the model’s primary improvements observed in areas like operational cost and processing speed. While less spectacular, these gains could, in the long run, represent a sound financial strategy for OpenAI.

Public expectations for GPT-5 were exceptionally high, with one X user remarking that Altman’s Death Star post alone had “shifted everyone’s expectations.” OpenAI did little to temper these projections, touting GPT-5 as its “best AI system yet” and a “significant leap in intelligence,” boasting “state-of-the-art performance across coding, math, writing, health, visual perception, and more.” Altman himself claimed that conversing with the model felt like “talking to a PhD-level expert.”

This ambitious hype created a stark contrast with the reality users experienced. Social media quickly filled with examples of GPT-5’s perplexing errors. Could a model with PhD-level intelligence, for instance, repeatedly insist there are three “b’s” in “blueberry,” or fail to count how many U.S. state names contain the letter “R”? Users also reported instances of the model labeling a U.S. map with fabricated states such as “New Jefst” and “Krizona,” or misidentifying Nevada as an extension of California. Furthermore, users who relied on the chatbot for emotional support found the new system austere and distant, prompting such a strong backlash that OpenAI temporarily reinstated support for an older model. The disappointment even spawned memes, one famously depicting GPT-4 and GPT-4o as formidable dragons, with GPT-5 reduced to a simpleton.

Expert opinion was equally unsparing. Gary Marcus, a prominent AI industry voice and emeritus professor of psychology at New York University, characterized the model as “overdue, overhyped and underwhelming.” Peter Wildeford, co-founder of the Institute for AI Policy and Strategy, concluded in his review, “Is this the massive smash we were looking for? Unfortunately, no.” Popular AI industry blogger Zvi Mowshowitz deemed it “a good, but not great, model,” while a Redditor on the official GPT-5 Reddit AMA bluntly declared, “Someone tell Sam 5 is hot garbage.”

In the days following GPT-5’s release, the initial wave of unimpressed reviews has tempered somewhat. The emerging consensus suggests that while GPT-5 did not deliver the monumental advancement many anticipated, it offers meaningful upgrades in cost efficiency, speed, and notably, a reduction in “hallucinations,” or factual errors. A new “switch system,” which automatically routes each query to the most appropriate backend model, was also introduced. Altman has since leaned into this narrative, stating, “GPT-5 is the smartest model we’ve ever done, but the main thing we pushed for is real-world utility and mass accessibility/affordability.” OpenAI researcher Christina Kim echoed this, posting on X that “the real story is usefulness. It helps with what people care about—shipping code, creative writing, and navigating health info—with more steadiness and less friction.” She emphasized its improved calibration, ability to admit uncertainty, and capacity to ground answers with citations.

Despite these claimed improvements, a widespread sentiment persists that GPT-5 has, paradoxically, made ChatGPT less eloquent. Viral social media posts lament its perceived lack of nuance and depth in writing, often describing it as robotic and cold. Even OpenAI’s own marketing materials, featuring a side-by-side comparison of GPT-4o and GPT-5-generated wedding toasts, did not present an unequivocal win for the new model. When Altman directly asked Redditors if they found GPT-5 superior for writing tasks, he was met with an overwhelming defense of the retired GPT-4o model, leading him to temporarily restore it to ChatGPT within a day.

However, one domain where GPT-5 appears to genuinely shine is coding. An iteration of GPT-5 currently leads the most popular AI model leaderboard in the coding category, surpassing competitors like Anthropic’s Claude. OpenAI’s launch demonstrations highlighted its ability to generate games, a pixel art tool, a drum simulator, and a lofi visualizer. While complex projects might still exhibit glitches, the model has shown promise for simpler coding tasks, such as creating an interactive embroidery lesson. This represents a significant victory for OpenAI in the fiercely competitive AI coding arena, where it contends with rivals like Anthropic and Google. Businesses are willing to invest heavily in AI coding solutions, making it one of the most realistic and substantial revenue generators for AI startups that typically burn through cash. While OpenAI also emphasized GPT-5’s potential in healthcare, its practical efficacy in this sector remains largely untested.

In recent years, the significance of AI benchmarks has diminished, as they frequently change and companies selectively disclose results. Nevertheless, they still offer a reasonable snapshot of GPT-5’s performance. The model did outperform its predecessors on many industry tests, but as Wildeford noted, this improvement was largely “what would be expected—small, incremental increases rather than anything worthy of a vague Death Star meme.” Yet, if recent history is any guide, these modest, incremental advancements are often more likely to translate into tangible profit than features designed solely to impress individual consumers. AI companies understand that their primary revenue streams flow from enterprise clients, government contracts, and investments. In this context, consistent, incremental progress on established benchmarks, coupled with enhanced coding capabilities and a reduction in errors, represents the most effective strategy for capitalizing on these lucrative avenues.