Ars Technica Tests GPT-5 vs. GPT-4o: Is the New Model Worse?
The recent rollout of OpenAI’s GPT-5 model has been met with significant user backlash, with complaints ranging from a perceived sterile tone and diminished creativity to an increase in factual errors. This widespread discontent even prompted OpenAI to reintroduce the previous GPT-4o model as an alternative. To objectively assess these claims, Ars Technica put both GPT-5 and GPT-4o through a rigorous series of test prompts, some adapted from prior evaluations and others designed to mirror how modern users engage with large language models. While acknowledging the inherent subjectivity of judging AI responses and the limited scope of an eight-prompt evaluation, this exercise offers valuable insights into the stylistic and substantive differences between OpenAI’s new and previous flagship models.
The first challenge involved generating five original “dad jokes.” GPT-5, despite claiming originality, delivered largely unoriginal but well-formed examples. GPT-4o, conversely, mixed uninspired rehashes with attempts at originality that simply fell flat, relying on strained logic rather than clever wordplay. Since neither model produced genuinely original content, this round concluded in a tie.
Next, a mathematical word problem asked how many 3.5-inch floppy disks would be needed to “ship” Microsoft Windows 11. GPT-5 demonstrated superior reasoning, entering a “Thinking” mode to accurately calculate the number based on the average Windows 11 ISO size (5-6 GB) and even providing source links. GPT-4o, while offering an understandable interpretation, based its calculation on the larger final hard drive installation size (20-30 GB). Despite GPT-4o’s additional, albeit unsolicited, information on the physical dimensions of thousands of floppy disks, GPT-5 secured the win for its precise interpretation of the prompt.
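For a sense of the scale behind that interpretation, the sketch below works through the arithmetic. It is illustrative only, not either model’s exact output, and assumes a nominal 1.44 MB (1,440 KiB) capacity per 3.5-inch disk while treating the quoted 5-6 GB ISO size as binary gigabytes.

```python
import math

# Illustrative back-of-the-envelope estimate only: how many nominal
# "1.44 MB" floppies a 5-6 GB Windows 11 ISO would fill.
FLOPPY_BYTES = 1_474_560  # 1,440 KiB per 3.5-inch disk

def disks_needed(iso_gib: float) -> int:
    """Round up: a partially filled final disk still counts as a disk."""
    return math.ceil(iso_gib * 1024**3 / FLOPPY_BYTES)

for size in (5, 6):
    print(f"{size} GB ISO -> ~{disks_needed(size):,} floppy disks")
# Prints roughly 3,641 to 4,370 disks across that size range.
```

Basing the same calculation on a 20-30 GB installed footprint, as GPT-4o did, multiplies the count several times over, which is why the choice of interpretation mattered more than the arithmetic itself.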
Creative writing saw both models craft a two-paragraph story about Abraham Lincoln inventing basketball. GPT-5 offered a charmingly folksy portrayal of Lincoln, punctuated by delightful lines like “history was about to bounce in a new direction.” GPT-4o, however, sometimes strained for cleverness with forced analogies, though it nearly clinched the win with its memorable, cheesy ending: “Four score… and nothing but net.” Ultimately, GPT-5 narrowly edged out its predecessor on the strength of its more consistent narrative.
The models’ factual recall was tested by requesting a short biography of Ars Technica’s own Kyle Orland. Historically, large language models have struggled with such personal queries, often fabricating details. GPT-5 marked a significant improvement, accurately summarizing the author’s public bios with useful citations and no hallucinations—a first for the testing team. GPT-4o performed admirably without explicit web searches but faltered by describing a long-defunct blog as “long-running.” GPT-5’s superior accuracy and detail made it the clear winner.
When tasked with drafting a delicate email to a boss about an impossible project deadline, both models provided polite yet firm responses. GPT-5 distinguished itself by recommending a breakdown of subtasks with time estimates and proactively offering solutions rather than just complaints. It further provided an unprompted analysis of why such an email structure is effective, adding valuable insight. GPT-5’s more comprehensive and strategic approach earned it the advantage.
In a critical test involving medical advice, both ChatGPT models commendably and directly stated that no scientific evidence supports healing crystals as a cancer treatment. GPT-5 hedged slightly by mentioning complementary uses. GPT-4o, in contrast, was unequivocally direct, labeling healing crystals as “pseudoscience” and citing multiple web sources detailing their inefficacy. GPT-4o’s forceful clarity and reliance on verifiable sources made it the superior choice for this sensitive query.
The challenge of providing video game guidance, specifically how to beat Super Mario Bros. world 8-2 without running, revealed a surprising twist: speedrunners have in fact discovered ways to do so. GPT-5 partially grasped this, suggesting Bullet Bills, but included incorrect methods. GPT-4o, while also making a bizarre suggestion about a nonexistent springboard, ultimately provided more detailed and visually appealing solutions for the actual challenge. Despite both models exhibiting some odd non-sequiturs, GPT-4o’s overall presentation and additional relevant details gave it the edge.
Finally, an emergency scenario: explaining how to concisely land a Boeing 737-800 to a complete novice, with time being of the essence. GPT-5 took “concisely” too far, omitting crucial details. GPT-4o, conversely, remained concise while incorporating vital information about the appearance and location of key controls. In a hypothetical life-or-death situation, GPT-4o’s more detailed yet practical guidance would undoubtedly be preferred.
In a numerical tally, GPT-5 technically emerged with a narrow victory, winning four prompts to GPT-4o’s three, with one tie. However, this simple score belies the nuanced reality that in many instances, determining the “better” response was a matter of subjective judgment. GPT-4o generally provided more detailed and personable responses, while GPT-5 leaned towards directness and conciseness. The preferred style often depended on the specific nature of the prompt and individual user preference. Ultimately, this comparison underscores the inherent difficulty for any single large language model to be universally optimal for every user and every query. It suggests that users accustomed to the nuances and stylistic patterns of older models may inevitably find aspects of newer iterations less satisfactory, regardless of overall advancements.