Reddit blocks Internet Archive to halt AI data scraping

Ars Technica

Reddit has moved to block the Internet Archive (IA) from comprehensively indexing its content, citing concerns that artificial intelligence firms, already restricted from scraping Reddit directly, have instead been harvesting data from IA’s archived material. As a result, the Internet Archive’s Wayback Machine, which previously kept a dependable record of Reddit pages, user profiles, and comments as part of its broader mission to preserve the internet, will now archive only screenshots of the Reddit homepage. That limits the archive to a daily snapshot of popular posts and news headlines; it will no longer serve as a detailed backup for deleted content, a window into Reddit’s diverse subcultures, or a record of individual user activity.

Reddit has not publicly identified the specific AI firms it believes were scraping data from the Wayback Machine, but company spokesperson Tim Rathschmidt confirmed that Reddit has become “aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine.” Rathschmidt suggested that Reddit might reconsider its restrictions if the Internet Archive implemented measures to better guard against such AI data harvesting. The limits on IA’s access to Reddit data are reportedly being ramped up across the platform.

Beyond the immediate concern of AI scraping, Reddit is also using the move to address what it describes as long-standing privacy issues. The company argues that the restrictions are justified because the Wayback Machine retains content that users have since deleted. Rathschmidt stated, “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors.”

Some Redditors have used the Wayback Machine to research deleted comments or threads. However, discussions on social media indicate that numerous other tools exist for surfacing deleted posts or investigating user activity, and some users suggest that the Wayback Machine was not always the most intuitive platform for these purposes. Redditors have also turned to resources like the Internet Archive during significant platform changes that could lead to content removal. In 2023, when changes to Reddit’s public API threatened to dismantle beloved subreddits, archives played a crucial role in preserving content before it was lost.

The Internet Archive has not yet indicated whether it is actively pursuing solutions to have Reddit’s restrictions lifted. Mark Graham, director of the Wayback Machine, noted that IA has “a longstanding relationship with Reddit” and remains engaged in “ongoing discussions about this matter.”

Reddit’s actions are likely driven by financial motivations: the company aims to prevent AI firms from obtaining its content through third-party archives and to steer them toward more lucrative direct licensing agreements. Reddit has recently struck deals with OpenAI and Google. While the terms of the OpenAI agreement remain undisclosed, the Google deal was reportedly valued at $60 million. Overall, Reddit anticipates generating more than $200 million from such licensing deals over the next three years, underscoring the high stakes involved in controlling access to its vast trove of user-generated data.