The internet has long been a habitat for bots. For decades, search engines like Google have employed automated “crawlers” to build indexes, enabling websites to appear in search results. However, a new generation of crawlers designed to gather training data for generative AI is now threatening the web’s economic fabric and raising privacy concerns. While these crawlers undermine many websites’ business models, evolving methods can prevent them from absorbing your material, at least to some extent.
The Race to Develop Effective Blocking Tools
One innovative approach to countering AI crawlers is to “poison” the data, making it difficult for AI models to learn from it. Researchers have developed blocking tools such as image filters that introduce “noise” that confuses AI models while remaining imperceptible to human viewers. Salil Kanhere, a computer scientist at the University of New South Wales, cautions that AI developers are continually seeking ways to bypass these tools.
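As a rough illustration of the mechanics, the sketch below adds a small, bounded perturbation to an image so that it looks unchanged to a human viewer. Real protection tools craft the noise adversarially against surrogate AI models; the plain random noise used here, along with the file names and the epsilon value, is purely illustrative and would not be an effective defense on its own.

```python
# Minimal sketch of bounded image perturbation, the mechanism that
# poisoning filters rely on. Plain random noise is used here for
# illustration only; real tools optimize the noise against AI models.
import numpy as np
from PIL import Image

def perturb(path_in: str, path_out: str, epsilon: int = 4) -> None:
    """Add per-pixel noise bounded by +/-epsilon (on a 0-255 scale)."""
    img = np.asarray(Image.open(path_in).convert("RGB"), dtype=np.int16)
    noise = np.random.randint(-epsilon, epsilon + 1, size=img.shape)
    poisoned = np.clip(img + noise, 0, 255).astype(np.uint8)
    Image.fromarray(poisoned).save(path_out)

perturb("artwork.png", "artwork_protected.png")  # hypothetical file names
```

Keeping epsilon small is what makes the change invisible to people: a shift of a few intensity levels per pixel is below the threshold most viewers can notice, while still altering the exact numbers a model trains on.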
An Australian research team is striving to enhance these blocking mechanisms. Their early-stage research aims to create “provably unlearnable” content. Derek Wang, a computer scientist at CSIRO and collaborator on the project, explains that most existing tools are highly specific, designed to prevent a particular type of AI from training on specific content. His team has developed an algorithm that assesses how learnable any content is for any AI type.
“This itself is very significant information that can help defenders to polish and update their defenses,” Dr. Wang says.
The algorithm aids in constructing more robust blocking tools by obfuscating the most learnable parts of the content. The team demonstrated their algorithm by creating a noise-generating tool for images, which Dr. Wang claims can make images impenetrable to most AIs.
“Basically, our guarantee can rule out about 90 percent of attacks,” Dr. Wang states.
The team showcased their work at a conference earlier this year, and various online image creators have expressed interest in using the algorithm to protect their work. While currently focused on images, Dr. Wang notes that their base algorithm could be adapted to develop other types of blockers. Some content may be harder to protect, however: text is built from a small set of discrete characters, which leaves little room to hide an imperceptible perturbation.
Switching Off AI Access to Websites
Another straightforward way to stop crawlers from accessing content is simply to ask them not to. Websites typically include a robots.txt file that tells crawlers which pages they may access. Professor Kanhere notes that while crawlers are supposed to adhere to these instructions, compliance is voluntary and not guaranteed.
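For example, a site can ask known AI training crawlers to stay away by listing their published user-agent strings (GPTBot is OpenAI’s crawler, CCBot is Common Crawl’s) while leaving search engines untouched. The file is a request, not an enforcement mechanism:

```
# robots.txt: a request, not an enforcement mechanism
User-agent: GPTBot       # OpenAI's training crawler
Disallow: /

User-agent: CCBot        # Common Crawl's crawler
Disallow: /

User-agent: *            # everyone else, including search crawlers
Allow: /
```

Nothing in this file technically prevents access; a crawler that ignores it, or that identifies itself under a different name, can still fetch every page.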
Developers are working on new standards, such as Really Simple Licensing (RSL), which lets website owners specify what content an AI bot can scrape and under what licensing terms. Some providers, like Cloudflare, have begun implementing AI blockers at scale. In July, Cloudflare announced that its customers’ sites would block AI crawlers by default, giving site owners control over how their content is used.
“We see a real recognition of differentiated access to real-time information, data, and content as being an incredible competitive advantage for people who are building AI experiences,” says Will Allen, vice president of product at Cloudflare.
Are AI Companies Respecting These Rules?
The effectiveness of these measures depends on AI crawlers acting in good faith. Historically, AI developers have accessed content without permission, using vast amounts of copyrighted work to train early models. However, Allen believes the larger companies are beginning to play fair, with some AI firms contacting Cloudflare to ask for access to blocked pages.
“If they were being shady, they wouldn’t care. They would just do it,” Allen notes.
Nonetheless, crawlers can disguise themselves as human users to bypass restrictions, at the risk of being flagged as malicious bots and blocked entirely. As AI summaries drive down page views and paywalls spring up to recoup lost ad revenue, many worry about what the web is becoming.
“The internet’s an amazing, amazing invention and one of the most amazing parts of it is the fact that large parts have been open,” Allen reflects.
While Cloudflare’s model represents a promising start, Professor Kanhere points out its flaws. Tightening bot defenses can inadvertently block human users, and AI companies may still scrape less accurate versions of the same content from other sources. If AI companies decide the material is not worth paying for, the technological cat-and-mouse game will continue. Kanhere speculates, however, that AI companies may eventually agree to terms, much as OpenAI has struck licensing deals with news publishers.
With bots making up a growing share of web traffic while human page views decline, Kanhere suggests that widespread adoption of licensing deals could avert the need to rethink the entire internet, even as traditional website visits continue to fall.