New standards are being developed to extend the Robots Exclusion Protocol and meta robots tags, allowing them to block all AI crawlers from using publicly available web content for training purposes. The proposal, drafted by Krishna Madhavan, Principal Product Manager at Microsoft AI, and Fabrice Canel, Principal Product Manager at Microsoft Bing, would make it easy to block all mainstream AI training crawlers with one simple rule.
Virtually all legitimate crawlers obey robots.txt and meta robots tags, which makes this proposal a dream come true for publishers who don’t want their content used for AI training purposes.
Internet Engineering Task Force (IETF)
The Internet Engineering Task Force (IETF) is an international Internet standards-making body founded in 1986 that coordinates the development and codification of standards that everyone can voluntarily agree on. For example, the Robots Exclusion Protocol was independently created in 1994, and in 2019 Google proposed that the IETF adopt it as an official standard with agreed-upon definitions. In 2022 the IETF published an official Robots Exclusion Protocol (RFC 9309) that defines what it is and extends the original protocol.
Robots.Txt For Blocking AI Robots
The draft proposal seeks to create additional rules that extend the Robots Exclusion Protocol (robots.txt) to AI training crawlers. This will bring about some order and give publishers a choice in which robots are allowed to crawl their websites.
Adherence to the robots.txt protocol is voluntary, but all legitimate crawlers tend to obey it.
The draft explains the purpose of the new Robots.txt rules:
“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288], the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models.
Application developers are requested to honor these tags. The tags are not a form of access authorization however.”
An important quality of the new robots.txt rules and the meta robots HTML elements is that they don’t require naming specific crawlers. One rule covers all bots that are crawling for AI training data and that voluntarily agree to follow these protocols, which is something that all legitimate bots do. This will simplify bot blocking for publishers.
The following are the proposed Robots.txt rules:
- DisallowAITraining – instructs the parser not to use the data for training an AI language model.
- AllowAITraining – instructs the parser that the data can be used for training an AI language model.
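Under the proposal, a single site-wide rule in robots.txt would be enough to opt out of AI training while leaving ordinary crawling unaffected. The sketch below uses the draft’s directive name; the wildcard user-agent line and the paths are illustrative assumptions, not text from the draft:

```
# Hypothetical robots.txt sketch using the proposed directive.
# Applies to all crawlers that honor the extension.
User-Agent: *
DisallowAITraining: /

# Normal crawling (e.g., for search indexing) remains allowed.
Allow: /
```

Because the rule is keyed to the purpose (AI training) rather than to a named bot, publishers would not need to maintain a growing list of individual AI crawler user-agents.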
The following are the proposed meta robots directives:
- <meta name="robots" content="DisallowAITraining">
- <meta name="examplebot" content="AllowAITraining">
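For a crawler operator, honoring the proposed meta directive amounts to checking the robots meta tag before using a page for training. The following is a minimal sketch of that check using Python’s standard-library HTML parser; the function name and the case-insensitive token matching are assumptions about how a compliant parser might behave, not part of the draft:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directive tokens from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            # content may hold comma-separated tokens, e.g. "index, DisallowAITraining"
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def ai_training_allowed(html: str) -> bool:
    """Return False if the page opts out via the proposed DisallowAITraining directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "disallowaitraining" not in parser.directives


page = '<html><head><meta name="robots" content="DisallowAITraining"></head></html>'
print(ai_training_allowed(page))  # False – this page opts out of AI training
```

As with robots.txt itself, nothing in the sketch enforces compliance; it simply shows how voluntary adherence could be implemented on the crawler side.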
Provides Greater Control
AI companies have been sued, so far unsuccessfully, for using publicly available data. AI companies have asserted that it’s fair use to crawl publicly available websites, just as search engines have done for decades.
These new protocols give web publishers control over crawlers whose purpose is for consuming training data, bringing those crawlers into alignment with search crawlers.
Read the proposal at the IETF:
Robots Exclusion Protocol Extension to manage AI content use
Source: Searchenginejournal.com