At DrupalCon Chicago, I was privileged to hear a talk from Matt West and John Brandenburg on responding to the new challenge of AI crawler traffic bringing sites to their knees.
Traffic bursts are nothing new in and of themselves, and bot traffic in particular has always been a concern. But generally speaking, the web crawlers used by the major search engines have been well-behaved, crawling sites infrequently and at reasonable intervals.
AI crawlers are far worse citizens in this regard. Not only are there many more players in the field than ever before, but their crawl behavior is much more aggressive: the goal is to ingest as much data for the LLM as possible, as quickly as possible. That incentivizes them to ignore conventions like robots.txt entirely and to re-crawl sites frequently. The result is higher resource use in the best case, and degraded performance or outright site outages in the worst.
Analyzing the situation
When an outage occurs, the first question is often: is this a deliberate attack, or an accidental overload caused by a crawler? In practice, the answer doesn't matter much. Whatever the intent, your site is down, and the traffic has to be dealt with either way.
AI crawler traffic is bad, and I want to block it.
If the content of your site is your product, and you are either gating access or serving ads, you need people coming to your site, not an AI chat. This may be a losing battle in the long run, but for now we can try to block bots from getting to the site at all. But what if...
AI crawler traffic is good, actually
Marketing sites and nonprofit information sources often want their message to get out into the world, one way or another. Getting AI sources to ingest your data is great! We just need to do that in a way that doesn't bring the site down.
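One low-effort way to welcome crawlers while asking them to pace themselves is a crawl-delay hint in robots.txt. This is a sketch, not a guarantee: Crawl-delay is a non-standard extension that only some crawlers honor (and, as noted above, the most aggressive AI crawlers may ignore robots.txt entirely), so treat it as a polite request to the well-behaved ones.

```
# Illustrative robots.txt: allow everything, but ask crawlers to
# wait between requests. Crawl-delay is non-standard and is
# ignored by many crawlers, including some major ones.
User-agent: *
Crawl-delay: 10
Allow: /
```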
What we can do
Sadly, there isn't an easy set of steps or guidelines to follow to make this problem go away. There are ways to mitigate the issue now, but the landscape is likely to keep evolving, so we will need to stay vigilant.
Eliminate bot traps
This is the most broadly useful advice we can act on. Whether you want to discourage bot traffic in general or encourage the good kind, there are anti-patterns to avoid.
A crawler wants to see every page of the site (good, probably!), but this includes pages no human would ever request. Why would you have a page no human would visit? Well, consider a faceted search: every combination of facets produces a unique URL, and the crawler will want to see all of them. Some basic combinatorics will tell you this is a Very Bad Thing™: just three multi-select facets with ten options each yield over a billion unique pages, all of which will be cache misses. Yuck.
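The arithmetic behind that billion is worth spelling out. Assuming each facet is multi-select, every one of the 30 options is an independent on/off toggle in the URL, so the URL space is 2^30:

```python
# Sketch of the combinatorial explosion behind faceted-search URLs.
# Assumption: each facet is multi-select, so every option is an
# independent on/off toggle in the query string.

facets = 3
options_per_facet = 10
toggles = facets * options_per_facet  # 30 independent on/off switches

unique_urls = 2 ** toggles
print(unique_urls)  # 1073741824 -- just over a billion distinct URLs
```

Add a fourth facet and the space multiplies by another factor of 1,024, which is why this grows out of control so quickly.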
For the specific Drupal example of Facets, we can move to version 3 of the module, which has a new architecture to get around the problem. But be on the lookout for combinatorially-explosive effects like this one!
Block bots
This is much harder: the bots won't have just one IP address, one user agent, or anything else easily identifiable. They'll use proxies, they'll spoof user agents, and they'll employ botnets of compromised devices.
There are modules and techniques we can employ within Drupal to attempt to block bots, but a much better option is to put the site behind a web application firewall (WAF) from a service like Cloudflare. These are adaptive, have dedicated teams working to stay ahead in the arms race, and are the best tool we currently have for staying afloat.
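Even without a full WAF, a first line of defense is a user-agent deny list at the web-server layer. This is a hedged sketch rather than a complete solution: as noted above, aggressive crawlers spoof user agents, so a list like this only catches the bots that identify themselves honestly, and the names in it (GPTBot and CCBot are published crawler identifiers) go stale quickly.

```nginx
# Illustrative nginx config: refuse requests from self-identified AI crawlers.
# User-agent matching only stops honest bots; spoofers sail right past it.
map $http_user_agent $ai_crawler {
    default     0;
    ~*GPTBot    1;   # OpenAI's crawler
    ~*CCBot     1;   # Common Crawl
}

server {
    # ... existing listen/server_name/root directives ...
    if ($ai_crawler) {
        return 403;
    }
}
```

The map block belongs in the http context, alongside (not inside) your existing server block. Expect to maintain the list continuously, which is exactly the upkeep a managed WAF does for you.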
Good luck out there!