Reddit Blocks Internet Archive Access Over AI Data Scraping

The Verge

Reddit has announced a significant restriction on the Internet Archive’s ability to index its platform, a move prompted by concerns that artificial intelligence companies are illicitly scraping Reddit data through the Wayback Machine. Effective immediately, the popular social media platform will largely block the Internet Archive from crawling post detail pages, user comments, and individual profiles. The only content that will remain accessible for archiving is the Reddit.com homepage, meaning the Internet Archive will primarily be limited to documenting which news headlines and posts gained prominence on any given day.

According to Tim Rathschmidt, a spokesperson for Reddit, the decision stems from observed instances where AI companies have violated platform policies, including Reddit’s own, by extracting data from the Wayback Machine. While acknowledging the Internet Archive’s vital role in preserving the open web, Reddit contends that not all of its content should be archived in a manner that facilitates such misuse. Rathschmidt stated that until the Internet Archive can adequately defend its site and ensure compliance with platform policies—specifically regarding user privacy and the proper handling of deleted content—Reddit will restrict access to its data to safeguard its users.

The implementation of these new limits began on August 11, 2025, with Reddit confirming it had informed the Internet Archive in advance of the changes. This latest restriction marks another chapter in Reddit’s ongoing efforts to control access to its vast trove of user-generated content, particularly as AI companies have intensified their data collection efforts. The platform has a documented history of curtailing access for automated data extraction tools, often signaling a willingness to provide such data only under commercial agreements.

Indeed, Reddit has been actively strategizing around its data’s value in the burgeoning AI landscape. In early 2024, the company forged a notable deal with Google, granting the tech giant access to Reddit’s content for both Google Search and AI model training. A few months later, Reddit began blocking major search engines from crawling its data unless they entered into similar payment arrangements. The company also attributed its controversial 2023 API changes, which led to widespread protests and the shutdown of several popular third-party apps, to the abuse of those APIs for training AI models.

The monetization of its data for AI purposes continues to be a central theme for Reddit. Beyond its partnership with Google, the company also struck an AI deal with OpenAI. However, its stance against unauthorized data use remains firm, evidenced by a lawsuit filed in June 2025 against Anthropic. Reddit alleges that Anthropic continued to extract data from its platform despite previous assurances to cease such activities.

The Internet Archive, whose mission is to maintain a digital record of websites and other cultural artifacts, did not immediately comment on Reddit’s new restrictions. This development highlights the growing tension between the open web’s principles of preservation and the commercial imperative of platforms seeking to control and monetize their data in the age of generative AI.