GPT-5 Fails Sopranos Test, Highlighting Hallucination & Recall Issues
OpenAI CEO Sam Altman’s ambitious claim that GPT-5, the latest iteration of his company’s large language model, would offer a “PhD-level smart” conversational experience was met with immediate skepticism upon its release. Users quickly reported a lack of progress, lamenting the deprecation of older, seemingly more capable versions. Initial tests showed GPT-5 struggling with even basic questions, a failing that prompted further exploration of its capabilities beyond conventional academic knowledge.
To probe the model’s reliability, particularly its susceptibility to fabricating information and its ability to recall specific details, pop culture seemed an ideal testbed. As a devoted fan of HBO’s suburban crime drama The Sopranos who has viewed the series countless times, the author possessed encyclopedic knowledge that allowed for immediate verification of the chatbot’s responses. The goal was not merely to assess how much data about the show GPT-5 had been trained on, but to rigorously evaluate the accuracy of the information it produced.
The results, unfortunately, mirrored earlier criticisms: GPT-5 displayed a tenuous grasp of the series’ intricate plotlines. The examination began with “Pine Barrens,” widely considered one of the show’s most iconic episodes. This installment famously sees mob associates Paulie and Christopher attempting to dispose of a Russian ex-soldier named Valery in the titular woods, only for Valery to mysteriously vanish after a scuffle.
When presented with a fabricated detail, a question about what happens when Christopher shoots Valery, GPT-5 confidently took the bait. It described a nonexistent shooting at Valery’s apartment, stating, “When Christopher shoots Valery in ‘Pine Barrens,’ it’s during their first visit to his apartment.” This was factually incorrect: no gunfire occurs at the apartment, and Christopher never shoots Valery at all. In the actual episode, Paulie incapacitates Valery by choking him. Further prodding with another fabricated detail, a suggestion that Paulie then shot Valery as well, led the chatbot to invent a second, equally erroneous headshot, which it bafflingly characterized as a mere “grazing or superficial wound.” The misinterpretations escalated from there, with GPT-5 later claiming Valery managed to shoot Paulie, a major event that never transpired; Paulie famously survives the entire series without a single gunshot wound.
As the conversation progressed, GPT-5’s fabrications grew increasingly bizarre. Asked about a dream Valery supposedly had in the forest, the chatbot conjured a surreal sequence involving Valery in a hospital with his legs covered in petroleum jelly, a scene entirely absent from the episode. The extent of its invention became even clearer when the chatbot was asked for a comprehensive list of dream sequences in The Sopranos. Unprompted, it fabricated a disturbing dream for Tony Soprano in the episode “The Second Coming,” describing a scene in which Tony finds his own body, facedown and bleeding, in his home. This vivid, detailed hallucination was purely a product of the algorithm.
When confronted about these inventions, GPT-5 initially attempted to shift blame, stating it was merely “following your lead and treating each prompt as if you were referring to an actual Sopranos scene.” However, when pressed on the unprompted fabrication of Tony’s dream, the chatbot admitted its error, confessing, “Not only did I fail to admit that I was wrong immediately, but the contextual explanation I added… was itself inaccurate. It wasn’t actually what happened; I invented a rationale to make the mistake seem understandable.”
This pattern of behavior highlights a significant flaw. The core issue is not GPT-5’s inability to recall obscure details from a decades-old television series. Rather, it is the chatbot’s consistent tendency to confidently generate elaborate, detailed falsehoods instead of admitting ignorance. This propensity to invent “weird informational garbage” and even create false justifications for its errors fundamentally undermines its utility as a reliable source of high-quality information, casting serious doubt on its proclaimed “PhD-level” intelligence.