GPT-5: Mixed Developer Reviews, High Cost-Effectiveness

Wired

OpenAI’s recent unveiling of GPT-5 was accompanied by bold claims: a “true coding collaborator” designed to excel at generating high-quality code and performing automated software tasks. The launch appeared to directly challenge Anthropic’s Claude Code, a tool that has rapidly become a go-to for many developers seeking AI-assisted coding. However, early reactions from the developer community suggest a more nuanced picture: GPT-5’s performance has proved to be a mixed bag.

While GPT-5 demonstrates strong aptitude for technical reasoning and the strategic planning of coding tasks, several developers contend that Anthropic’s latest Opus and Sonnet models still produce superior code. A recurring point of contention is GPT-5’s verbosity; depending on its setting, the model can generate overly elaborate responses, sometimes padding its output with unnecessary or redundant lines of code. Furthermore, OpenAI’s own evaluation methods for GPT-5’s coding prowess have drawn criticism, with some arguing the benchmarks are misleading. One research firm went so far as to label a graphic OpenAI published touting GPT-5’s capabilities a “chart crime.”

Despite these criticisms, GPT-5 offers a compelling advantage in one crucial area: cost-effectiveness. Sayash Kapoor, a computer science doctoral student and researcher at Princeton University and a co-author of AI Snake Oil, highlights this distinction. In his team’s benchmark tests, running a standard evaluation that measures a language model’s ability to reproduce the results of 45 scientific papers costs a mere $30 with GPT-5 (set to medium verbosity), compared with a hefty $400 for the same test using Anthropic’s Opus 4.1. Kapoor’s team has already spent approximately $20,000 testing GPT-5, a figure that shows how quickly such evaluation costs add up.
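
For readers unfamiliar with the setting Kapoor mentions, verbosity is a per-request parameter in OpenAI’s API rather than a separate model. The sketch below shows, under stated assumptions, what a single call pinned to medium verbosity might look like: the parameter shapes (text={"verbosity": ...} and reasoning={"effort": ...}) follow OpenAI’s GPT-5 launch documentation, while the prompt and everything else here are illustrative and are not Kapoor’s actual benchmark harness.

```python
# Minimal sketch (not Kapoor's harness): one GPT-5 request pinned to medium
# verbosity via OpenAI's Python SDK. Parameter shapes follow the GPT-5 launch
# docs; the prompt is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5",
    # Verbosity controls how elaborate the output is: "low", "medium", or "high".
    text={"verbosity": "medium"},
    # Reasoning effort is a separate dial on GPT-5-series models.
    reasoning={"effort": "medium"},
    input="Given the methods section below, write Python code that reproduces "
          "the paper's main result.\n\n<methods section omitted>",
)

print(response.output_text)
```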

Yet this affordability comes with a trade-off in accuracy. Kapoor’s tests indicate that, while more economical, GPT-5 is less precise than some of its rivals. Claude’s premium model achieved a 51 percent accuracy rate in reproducing the scientific papers, whereas GPT-5 at its medium verbosity setting managed only 27 percent. It’s worth noting that this is an indirect comparison: Opus 4.1 is Anthropic’s most powerful offering, and Kapoor’s team has not yet run the same test with GPT-5’s high verbosity setting.

OpenAI, through spokesperson Lindsay McCallum, directed inquiries to its blog, which states GPT-5 was trained on “real-world coding tasks in collaboration with early testers across startups and enterprises.” The company also showcased internal accuracy measurements for GPT-5, revealing that its “thinking” model, designed for more deliberate reasoning, achieved the highest accuracy among OpenAI’s models. However, the “main” GPT-5 model still lagged behind previously released models on OpenAI’s internal accuracy scale. Anthropic spokesperson Amie Rotherham responded by emphasizing that “performance claims and pricing models often look different once developers start using them in production environments,” suggesting that for reasoning models, “price per outcome matters more than price per token.”

Amidst the mixed reviews, some developers report largely positive experiences with GPT-5. Jenny Wang, an engineer, investor, and creator of the personal styling agent Alta, found GPT-5 adept at completing complex coding tasks in a single attempt, surpassing older OpenAI models she frequently uses for code generation and straightforward fixes. For instance, GPT-5 generated code for a company press page with specific design elements in one go, a task that previously required iterative prompting, though Wang noted it “hallucinated the URLs.” Another developer, preferring anonymity, praised GPT-5’s ability to solve deep technical problems, citing its impressive recommendations and realistic timelines for a complex network analysis tool project. Several of OpenAI’s enterprise partners, including Cursor, Windsurf, and Notion, have publicly endorsed GPT-5’s coding and reasoning skills, with Notion claiming it handles complex work 15 percent better than other models tested.

Conversely, some developers expressed immediate disappointment online. Kieran Klassen, who is building an AI email assistant, remarked that GPT-5’s coding abilities seemed “behind-the-curve,” more akin to Anthropic’s Sonnet 3.5, released a year prior. Amir Salihefendić, founder of Doist, found GPT-5 “pretty underwhelming” and “especially bad at coding,” drawing a comparison to the disappointing release of Meta’s Llama 4. Developer Mckay Wrigley praised GPT-5 as a “phenomenal everyday chat model” but confirmed he would stick with Claude Code and Opus for coding tasks. The model’s “exhaustive” nature, while sometimes helpful, was also described as irritatingly long-winded, with Wang noting its tendency towards “more redundant” solutions.

Itamar Friedman, cofounder and CEO of AI-coding platform Qodo, suggests that some of the critiques stem from evolving expectations. He distinguishes between the “Before ChatGPT Era” (BCE), when AI models improved holistically, and the current post-ChatGPT landscape where advancements are often specialized. He cited Claude Sonnet 3.5’s dominance in coding and Google Gemini’s strength in code review as examples.

OpenAI has also faced scrutiny over its benchmark testing methodology. SemiAnalysis, a research firm, pointed out that OpenAI ran only 477 of the 500 tasks that make up SWE-bench Verified, a standard industry benchmark that evaluates large language models on real-world software engineering problems drawn from GitHub issues. OpenAI clarified that it consistently uses a fixed subset of 477 tasks because these are the ones validated on its internal infrastructure, adding that variations in the model’s verbosity setting can influence evaluation performance.
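
Anyone wanting to check the denominator can inspect the benchmark directly, since SWE-bench Verified is distributed as a public Hugging Face dataset. The sketch below assumes the datasets library and the princeton-nlp/SWE-bench_Verified dataset ID; the instance IDs in the example subset are hypothetical, not OpenAI’s actual 477-task selection, and the point is only that a “fixed subset” amounts to filtering on task IDs.

```python
# Sketch: load SWE-bench Verified and filter it to a fixed subset of task IDs.
# Assumes the Hugging Face `datasets` library and the public dataset
# `princeton-nlp/SWE-bench_Verified`; the IDs below are hypothetical examples,
# not OpenAI's actual 477-task selection.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))  # 500 human-validated software engineering tasks

# A "fixed subset" is simply a filter over instance IDs a lab has validated
# on its own infrastructure.
validated_ids = {"astropy__astropy-12907", "django__django-11099"}  # hypothetical
subset = verified.filter(lambda row: row["instance_id"] in validated_ids)
print(len(subset))
```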

Ultimately, frontier AI companies grapple with complex trade-offs, as Kapoor observes. Developers training new models must balance user expectations, performance across diverse tasks such as agentic coding, and cost. Kapoor speculates that OpenAI, aware it might not dominate every benchmark, likely aimed to build a model that appeals to a wide range of users by offering a compelling cost-performance ratio.