We are the robots
One measure of how consequential the breakneck rollout of genAI is to the very fundamentals of the internet is the raging battle around the humble robots.txt file. A vital but little-known feature of websites and search engine optimisation, robots.txt tells web crawlers which parts of a site they may crawl.
Since ChatGPT's breakout moment, AI companies have taken to training their LLMs on pretty much everything they can find on the internet, using bots to crawl the web and download content as they go. The crawling itself is not new – Google has been doing it for decades, indexing and categorising the world wide web for users to search.
But AI is different. As AI companies like Perplexity.ai build multimillion-dollar business models on content other people have paid to produce – raising still-unresolved copyright issues – they have trampled on an internet protocol that dates to 1994: the robots.txt file, which tells bots 'do not crawl this'. It is a guideline, not an enforceable rule.
The robots.txt protocol is an agreement, a kind of handshake under the hood of how websites are searched, crawled and indexed. Robots.txt files used to contain little more than a pointer to a sitemap and maybe, if the site admin was pedantic, a few specific 'do not crawl' instructions. Not anymore.
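A file from that era might have looked something like this – an illustrative sketch, with a hypothetical domain and path, not any real site's file:

    User-agent: *
    Disallow: /admin/
    Sitemap: https://www.example.com/sitemap.xml

The wildcard 'User-agent: *' addresses every crawler, the 'Disallow' line fences off a single directory, and the 'Sitemap' line points crawlers to a map of the site's pages.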
The robots.txt files of Australian media outlets today read like a roll call of bot agents being told to go elsewhere for training data. Our survey of the robots.txt files of 34 major Australian publishers found everything from sites that blocked no bots at all to one that blocked 19.
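By way of contrast, here is a hedged sketch of what such a modern publisher's file can look like, using the publicly documented user-agent tokens of a few well-known AI crawlers (the structure is illustrative, not any surveyed site's actual file):

    # Refuse AI training and answer-engine crawlers site-wide
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    # Everyone else may crawl everything
    User-agent: *
    Disallow:

Each record names a bot and, with 'Disallow: /', refuses it the entire site; the empty 'Disallow:' at the end leaves the door open for conventional search crawlers. Whether the bots honour any of it is, as Perplexity demonstrated, another matter.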
Perplexity.ai was caught out ignoring the protocol, only to respond with a proposed revenue-share model, which incrementally rewards publishers with a proportion of the revenue earned when one of their articles features in an answer to a query. There are concerns this could lead publishers to prioritise content that will 'align with algorithmic demands', much as search and social have driven the growth of clickbait journalism.
Read the fine print and Perplexity's Publisher Program offers participants a share of revenue when someone lands on their content through a Perplexity search. But as yet there is no detail on how to join the program.
The six starting publishers are an odd bunch, ranging from global mastheads to community-level outlets: TIME, Der Spiegel, Fortune, Entrepreneur, The Texas Tribune and WordPress.com, all with wildly divergent business models and levels of capitalisation. It's very hard to see how the non-profit Texas Tribune will muster the resources to 'create their own custom answer engine on their website', one of the 'key components' of Perplexity's program.
In recent weeks, content-delivery network provider Cloudflare has rolled out a free product that lets site admins monitor bots in real time, including those trying to camouflage their behaviour, as Perplexity was found to be doing. Cloudflare has gone a step further and debuted a tool that allows customers to pick and choose which bots to block or permit. Next, Cloudflare plans to build a marketplace where site owners can negotiate terms of use with LLM platforms, setting a price for the restricted sections of their sites they will allow LLMs to crawl.
It remains to be seen whether these enhancements to the venerable old protocol will bring some balance back to the publisher-platform relationship.
Miguel D'Souza