Reddit is updating its Robots Exclusion Protocol file (robots.txt) to prevent AI crawlers from scraping its content without permission. Historically, robots.txt told search-engine crawlers which parts of a site they could index; with the rise of AI, however, that content is increasingly being scraped to train models without acknowledgment.
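As a rough illustration of how the protocol works (the directives and user-agent names below are hypothetical, not Reddit's actual robots.txt), a compliant crawler reads the file and honors its Allow/Disallow rules before fetching anything. Python's standard-library urllib.robotparser can check those rules:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt directives -- not Reddit's actual file.
# "User-agent: *" plus "Disallow: /" tells every crawler to stay out,
# while a named agent can be granted access separately.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL.
print(parser.can_fetch("ExampleSearchBot", "https://example.com/r/popular"))  # True
print(parser.can_fetch("SomeAICrawler", "https://example.com/r/popular"))     # False
```

The protocol is purely advisory: nothing in it technically prevents a crawler that ignores the file from fetching the page anyway, which is why Reddit is pairing the update with active enforcement.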
Key Measures
- Updated robots.txt file: Reddit is revising the file to control which automated bots and crawlers may access the site.
- Rate-Limiting and Blocking: Bots and crawlers that do not comply with Reddit’s Public Content Policy or lack an agreement with Reddit will be rate-limited or blocked (a minimal sketch of this idea follows the list).
- Exemptions: The update won’t affect most users or good-faith actors such as researchers and the Internet Archive. It targets AI companies that use Reddit content to train models.
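As a rough sketch of the rate-limiting idea above (the partner list, limits, and helper names are hypothetical and not Reddit's actual enforcement), a server might throttle unknown crawlers per user agent with a simple token bucket while letting agreed partners through unthrottled:

```python
import time
from collections import defaultdict

# Hypothetical allow-list of partners with an agreement; everything else
# is treated as an unknown crawler and rate-limited. Purely illustrative.
PARTNER_AGENTS = {"examplesearchbot"}

class TokenBucket:
    """Simple token bucket: refills `rate` tokens per second, up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = defaultdict(lambda: TokenBucket(rate=1.0, capacity=5))

def should_serve(user_agent: str) -> bool:
    """Serve partners freely; throttle everyone else to their bucket's rate."""
    agent = user_agent.lower()
    if agent in PARTNER_AGENTS:
        return True
    return buckets[agent].allow()

# Example: a quick burst from an unknown crawler is typically cut off after 5 requests.
print([should_serve("SomeAICrawler/1.0") for _ in range(7)])
```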
Context and Implications
- AI Scraping Issues: The update follows a Wired investigation revealing that the AI startup Perplexity continued to scrape content despite requests to stop and despite being disallowed in robots.txt.
- Legal and Financial Aspects: Perplexity’s CEO argued that robots.txt is not a legal framework. Reddit’s changes signal that companies wanting to use its data for AI training must pay for access, as exemplified by Reddit’s $60 million licensing deal with Google.
Policy and Future Directions
- Selective Partnerships: Reddit will be selective about who can access its content on a large scale.
- Recent Policy Updates: This move aligns with Reddit’s recent policy to regulate how its data is accessed and used by commercial entities.
Reddit emphasizes that anyone accessing its content must adhere to its policies, aiming to protect users and ensure fair use of its data.