<div class="field field--name-body field--type-text-with-summary field--label-hidden">
<div class="field__items">
<div class="field__item even">
If you run a site on the open web, chances are you’ve noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers. You’re not alone. Operators everywhere have observed a drastic increase in automated traffic (bots), and in most cases they attribute much or all of this new traffic to AI companies.
Background
AI, and in particular Large Language Models (LLMs) and generative AI (genAI), relies on compiling as much information as possible from relevant sources (e.g., “texts written in English” or “photographs”) in order to build a functional and persuasive model that users will later interact with. While AI companies distinguish themselves in part by what data their models are trained on, possibly the greatest source of information, and one freely available to all of us, is the open web.
To gather up all that data, companies and researchers use automated programs called scrapers (sometimes referred to by the more general term “bots”) to “crawl” over the links available between various webpages and save the types of information they’re tasked with as they go. Scrapers are tools with a long, and often beneficial, history: services like search engines, the Internet Archive, and all kinds of scientific research rely on them.
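To make the “crawl” described above concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not how any particular company's scraper works. The `crawl` function, the `fetch` callable, and the three-page `SITE` on the placeholder domain `example.test` are all hypothetical, and the pages live in memory so the example generates no network traffic.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: visit each reachable page once and save its body.

    `fetch` is a hypothetical callable that returns a page's HTML,
    or None when the page does not exist.
    """
    saved = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to save, nothing to follow
        saved[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return saved


# Hypothetical three-page site held in memory (assumption for illustration).
SITE = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b">B</a>',
    "https://example.test/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}

pages = crawl("https://example.test/", SITE.get)
print(sorted(pages))
```

Even this toy version requests every page it can reach, as fast as it can. Multiply that by many scrapers running at once against a real server and the load on site operators becomes easier to picture.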
When scrapers are not deployed thoughtfully, however, they can contribute to higher hosting costs, lower performance, and even site outages, particularly when site operators see so many of them in operation at the same time. In the long run all this may lead to some sites shutting down rather than bearing the brunt of it.
For-profit AI companies must ensure they do not poison the well of the open web they rely on in a short-sighted rush for training data.
Bots: Read the Room
There are existing best practices that those who use scrapers should follow. When bots and their operators ignore these guideposts, it sends a signal to site operators, sometimes explicitly, that they can or should cut off their access, impede
[…]