AI Fails History Test: Why Historians Are Safe From Robot Takeover
Recent discussions around artificial intelligence often suggest that AI will soon augment, if not entirely replace, human jobs. A recent Microsoft study, for instance, provocatively ranked historians second among professions most likely to be enhanced by AI. This projection understandably raised concerns within the historical community. After extensive personal testing of leading generative AI tools against specific historical facts, however, it became clear that historians need not fear immediate obsolescence. At present, AI is far from capable of performing their complex work effectively.
My fascination with the movies U.S. presidents have watched while in office became the ideal test bed for these AI systems. Since 2012, I’ve meticulously researched this niche, from Teddy Roosevelt’s 1908 bird documentary screening to more recent administrations. My journey began with discovering Ronald Reagan’s White House movie list, which led to a Freedom of Information Act (FOIA) request for Barack Obama’s viewing habits, a request that revealed presidential records are exempt from FOIA until five years after a president leaves office. Undeterred, I’ve since delved into a vast array of sources, compiling a detailed, if unusual, historical database. Testing AI systems with information I know intimately allowed me to assess their accuracy, a crucial step often overlooked by users, who typically query these tools about subjects they don’t know. The results were quite revealing for anyone who relies on AI chatbots for precise information.
My initial attempts involved OpenAI’s flagship models, including what was presented as GPT-5. I asked about specific movies watched by presidents such as Woodrow Wilson, Dwight Eisenhower, Richard Nixon, and the two George Bushes on particular dates. The responses were consistently unhelpful, often stating that no record could be found or, in some cases, offering fabricated information; the models failed even relatively straightforward questions. This lack of transparency about which model was actually operating behind the scenes, coupled with a general inability to provide accurate historical details, highlighted a significant weakness, despite CEO Sam Altman’s earlier promises of “PhD-level expert” capabilities.
The shortcomings weren’t limited to OpenAI. Other major AI chatbots, including Google Gemini, Microsoft Copilot, Perplexity, and xAI’s Grok, also demonstrated considerable inaccuracies. For instance, when asked what movie President Eisenhower watched on August 11, 1954, Copilot’s “Quick Response” incorrectly suggested The Unconquered, a documentary in which Eisenhower briefly appears. Switching to Copilot’s “Deep Research” mode yielded a sprawling, 3,500-word report speculating that Eisenhower “probably” watched Suddenly, a film not released until months after the queried date. Copilot’s “analysis” cited “circumstantial and secondary evidence,” a phrase that, in this context, amounted to pure conjecture, given that the correct answer—River of No Return, confirmed by a White House projectionist’s logbook—was entirely missed. Gemini offered no answer, while Perplexity also incorrectly guessed Suddenly, seemingly misled by a fun fact about the film’s inspiration.
Similar patterns of error emerged with other presidential inquiries. When asked about Richard Nixon’s viewing habits on February 12, 1971, Copilot’s “Quick Response” claimed he watched Patton at Key Biscayne, citing a National Archives link that, upon inspection, contained no such information. While Copilot’s “Deep Research” eventually correctly identified The Great Chase, it simultaneously introduced new, false claims about Nixon watching Patton on other dates. Perplexity incorrectly suggested The Good, the Bad and the Ugly, confusing the date with a viewing from a year later.
The challenges intensified with more obscure facts. Woodrow Wilson, for instance, watched The Crisis on March 6, 1917, a silent film I personally sourced and uploaded online because it was not otherwise publicly available. Most AI models either drew a blank or incorrectly suggested The Birth of a Nation, Wilson’s most famous, but earlier, White House screening. ChatGPT even falsely claimed The Birth of a Nation was the first movie ever screened at the White House, ignoring earlier viewings by Taft and Teddy Roosevelt.
Even when an AI managed to provide the correct answer, its reasoning or sourcing often raised red flags. xAI’s Grok, for example, eventually identified Eisenhower’s River of No Return correctly after being prompted to “think harder,” but its apparent source was my own obscure Twitter account, which it never directly cited. This highlights Grok’s reliance on readily available, often unverified, internet data. Similarly, when Grok correctly identified George W. Bush’s viewing of the short documentary Twin Towers on September 10, 2003, it cited FOIA documents I had previously requested, effectively recycling my own prior research rather than conducting new inquiry.
These tests, while not scientific in the academic sense, were designed to assess AI’s practical utility for precise historical research. They reveal that while AI companies boast about improved reasoning and reduced “hallucinations,” the real-world performance for specific, nuanced information remains deeply flawed. Generative AI tools are marketed as all-purpose knowledge engines, a “tall order” that they consistently fail to meet when confronted with information that isn’t widely digitized or easily synthesized from common internet sources.
A historian’s role extends far beyond merely compiling published facts. True historical research involves unearthing hard-to-find documents in archives, conducting interviews with primary witnesses or experts, critically evaluating conflicting sources, and ultimately, contributing new interpretations and understanding to the past. My tests, focused solely on specific dates and film titles, represent only a minute fraction of what a historian does.
While AI tools undoubtedly prove useful for countless tasks, it is crucial to temper the widespread belief in their omnipotence. Periodically challenging these “god-like” tools with facts one knows intimately serves as a vital reminder of their limitations. Over-reliance on AI without critical human oversight risks not only promoting ignorance but also undermining the very pursuit of accurate knowledge.