Reddit is blocking the Internet Archive’s Wayback Machine from archiving most of its content after discovering that AI companies had scraped Reddit data from the archive. The Wayback Machine will now be able to archive only Reddit’s homepage, not post detail pages, comments, or user profiles. Reddit says the restriction is meant to protect user privacy and enforce its platform policies, particularly for content that users have deleted.
Background and Reasoning
Reddit spokesperson Tim Rathschmidt explained that AI companies had violated Reddit’s platform policies by scraping its data from the Wayback Machine, which prompted Reddit to limit the archive’s access. Reddit contacted the Internet Archive in advance to notify it of the restrictions. The Internet Archive’s mission is to preserve digital content, but Reddit believes some of its data should not be archived in this way until better protections are in place.
Reddit’s Approach to Data Access and AI
Reddit has a history of restricting access to its data to prevent what it considers abuse by AI companies. It has signed deals with Google and OpenAI to supply its data, but it blocks major search engines from crawling its content without payment. The company’s 2023 API changes, which led to third-party app shutdowns and user protests, were also motivated by concerns over unauthorized use of its data for AI training. Additionally, Reddit sued Anthropic, alleging that it continued to scrape Reddit data despite assurances that it would stop.
Internet Archive’s Response
Mark Graham, director of the Wayback Machine, stated that the Internet Archive has a longstanding relationship with Reddit and that discussions between the two on these issues are ongoing.