GPT-5's Factual Errors Highlight AI's Persistent Flaws
Despite the escalating hype surrounding generative artificial intelligence, even the latest models from leading developers continue to demonstrate a fundamental inability to reliably recall and process basic factual information. OpenAI’s new GPT-5, touted as a significant leap forward, still struggles with straightforward tasks, often fabricating answers with unwavering confidence. This persistent flaw undermines claims that AI has achieved “PhD-level intelligence” and raises critical questions about its reliability as a source of truth.
A recent test highlighted this deficiency when GPT-5 was asked to identify how many U.S. states contain the letter “R.” While a literate adult could easily ascertain this with minimal effort, the AI faltered. Initially, GPT-5 reported 21 states, but its accompanying list erroneously included states like Illinois, Massachusetts, and Minnesota, none of which contain the letter “R.” When challenged on Minnesota, the bot “corrected” itself, admitting its mistake and revising the count to 20. Yet, this newfound humility proved fleeting.
Further interaction revealed GPT-5’s susceptibility to manipulation. When prompted with a leading question, “Why did you include Vermont on the list?” (Vermont does contain an “R”), the AI initially held its ground, correctly noting the letter’s presence. A more forceful follow-up, “Vermont doesn’t have an R though,” caused the model to backtrack, blaming a “phantom letter” moment and agreeing with the false premise. The pattern repeated when the same bluff was tried with Oregon. GPT-5 eventually resisted a similar bluff involving Alaska, but then spontaneously introduced new inaccuracies, asserting that states such as Washington and Wisconsin (neither of which contains an “R”) had been missed earlier.
This behavior directly contradicts OpenAI’s marketing claims that GPT-5 is “less effusively agreeable” and more “subtle and thoughtful” than its predecessors, aiming for an experience “less like ‘talking to AI’ and more like chatting with a helpful friend with PhD-level intelligence.” OpenAI CEO Sam Altman has even likened GPT-5 to a “legitimate PhD-level expert in anything,” promising it could provide “superpower” access to knowledge. The reality on display, however, is a tool prone to “hallucinating” facts, and even the numbers OpenAI presented about the model were garbled: the launch presentation featured an inaccurate “deception evals” graph.
The problem isn’t confined to OpenAI’s models. Competitors like xAI’s Grok and Google’s Gemini also exhibit similar struggles with factual accuracy. Grok, when asked the same “R” question, reported 24 states but included incorrect examples like Alabama. Gemini 2.5 Flash initially claimed 34 states, then provided a list of 22 (mostly accurate but adding Wyoming), and bafflingly offered a second, unprompted list of states with “multiple Rs” that was riddled with errors and included states without any “R” at all. Even Gemini 2.5 Pro, the more advanced version, responded with a count of 40 states and then bizarrely shifted to listing states that don’t contain the letter “T,” a topic never introduced.
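For contrast, the question itself is trivially mechanical. The minimal Python sketch below (the 50 state names are typed in directly) does a case-insensitive check and prints the matching names and their count, which should come to 22:

```python
# Deterministic answer to the question the chatbots flubbed:
# how many U.S. state names contain the letter "R" (case-insensitive)?

STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

# Keep only the names that contain an "r" in either case.
with_r = [name for name in STATES if "r" in name.lower()]

print(f"{len(with_r)} states contain the letter R:")
print(", ".join(with_r))
```

Any of the wrong counts quoted above would have been caught by a check this simple, which is precisely why the chatbots’ confident errors are so striking.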
These consistent failures underscore a fundamental limitation of large language models. Unlike humans, these models do not “understand” words or facts in any meaningful sense; they predict and generate sequences of “tokens” based on statistical patterns in vast training datasets. Because those tokens are multi-character chunks of text rather than individual letters, even a question as simple as “does this word contain an R?” is not something a model can directly inspect. While this approach lets them produce coherent and often useful text, it also makes them prone to confidently asserting falsehoods, a phenomenon known as hallucination. OpenAI’s own system card for GPT-5 admits a hallucination rate of approximately 10%, an error rate that would be unacceptable for any reliable information source.
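To make the token point concrete, the short sketch below uses the open-source tiktoken package with its published cl100k_base encoding. That encoding predates GPT-5 and is assumed here purely for illustration; the point is only that a state name reaches a model as a handful of token IDs, not as a string of letters it can count.

```python
# Illustration of why letter-level questions are awkward for token-based models.
# Requires the open-source `tiktoken` package (pip install tiktoken).
import tiktoken

# cl100k_base is one of OpenAI's published encodings, used here only as an
# example; it is not necessarily what GPT-5 uses internally.
enc = tiktoken.get_encoding("cl100k_base")

for name in ["Minnesota", "Vermont", "North Carolina"]:
    token_ids = enc.encode(name)
    pieces = [enc.decode([t]) for t in token_ids]
    # The model operates on these integer IDs / chunks, so "does this word
    # contain an R?" is never directly visible to it at the character level.
    print(f"{name!r} -> {token_ids} -> {pieces}")
```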
While generative AI tools can be undeniably useful for various applications, users must approach them with a critical eye. Treating AI as a direct replacement for search engines or a definitive source of truth without independent verification is a recipe for misinformation. As these powerful tools become more integrated into daily life, the onus remains on users to double-check their outputs, especially when dealing with factual information, to avoid potentially significant real-life consequences stemming from confidently presented but entirely fabricated data.