Multiple AI Companies Bypassing Web Standard to Scrape Publisher Sites, Licensing Firm Says


Multiple artificial intelligence companies are circumventing a widely accepted web standard used by publishers to prevent the unauthorized scraping of their content, according to content licensing startup TollBit. The warning comes amid a broader conflict between tech firms and media companies over the use of content in the era of generative AI.



Publishers have expressed increasing concern about AI-generated news summaries, particularly following Google's launch of a product last year that uses AI to create summaries in response to certain search queries. The issue gained further attention when TollBit alerted publishers that AI companies are bypassing the Robots Exclusion Protocol (robots.txt), the standard that tells web crawlers which parts of a site they may access.


In a letter to publishers, seen by Reuters on Friday, TollBit did not name the specific AI companies or affected publishers but highlighted a public dispute between AI search startup Perplexity and media outlet Forbes. Forbes has accused Perplexity of plagiarizing its investigative stories in AI-generated summaries without proper attribution or permission.


An investigation by Wired revealed that Perplexity likely bypassed efforts to block its web crawler through robots.txt. This protocol, established in the mid-1990s, is meant to prevent websites from being overloaded by web crawlers and has historically been respected, although it lacks a clear legal enforcement mechanism.
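For readers unfamiliar with the mechanism, the check a compliant crawler is expected to perform can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is a minimal illustration, not a description of how any company named here operates; the crawler name `ExampleAIBot`, the second agent `OtherBot`, the `publisher.example` domain and the rules themselves are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is barred from the whole site,
# while all other crawlers are barred only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://publisher.example/news/story.html"
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, story))
```

Nothing in the protocol enforces the answer. The file only states the publisher's wishes, and a crawler that skips the check, or identifies itself under a different user agent, can still request the page.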


Perplexity declined a request for comment from Reuters, but its actions have sparked significant concern among publishers. The News Media Alliance, which represents more than 2,200 U.S.-based publishers, voiced worries about the potential impact of ignoring "do not crawl" signals. "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry," said Danielle Coffey, president of the group.


TollBit aims to bridge the gap between AI companies and publishers by facilitating licensing deals. The startup tracks AI traffic to publishers' websites and uses analytics to help negotiate fees for different types of content, allowing publishers to set higher rates for premium content, such as breaking news or exclusive insights. As of May, TollBit had 50 websites using its services, though it has not disclosed their names.


According to TollBit's letter, Perplexity is not the sole offender; numerous AI agents are bypassing the robots.txt protocol to retrieve content from publisher sites. "The more publisher logs we ingest, the more this pattern emerges," TollBit wrote. This growing trend highlights the need for publishers to find ways to protect their content and ensure fair compensation.


In response to these developments, some publishers, including The New York Times, have sued AI companies for copyright infringement. Others have entered into licensing agreements with AI firms willing to pay for content, though disagreements over the value of the material persist. Many AI developers maintain that their access to content is lawful and does not require compensation.


Thomson Reuters, the owner of Reuters News, is among the publishers that have struck deals to license news content for AI models. This approach reflects a pragmatic solution to the complex issue of content usage in generative AI systems.


As AI technology continues to evolve, the balance between leveraging technological advancements and protecting intellectual property remains delicate. The ongoing dialogue between AI companies and publishers will be crucial in shaping the future landscape of digital content and ensuring that creators receive fair compensation for their work.
