[–] [email protected] 5 points 1 day ago* (last edited 1 day ago) (1 children)

In my experience, a portion of bots obey robots.txt, but it can be tricky to find the user-agent string that a given bot actually responds to.

So I recommend a robots.txt that not only targets specific bots, but also tells all bots to avoid specific paths/queries.

Example for DokuWiki:

User-agent: *
# Noindex is not a standard robots.txt directive and most crawlers ignore it,
# so Disallow is used to keep bots out of /lib/
Disallow: /lib/
Disallow: /_export/
Disallow: /user/
Disallow: /*?do=
Disallow: /*&do=
Disallow: /*?rev=
Disallow: /*&rev=
[–] [email protected] 3 points 23 hours ago (1 children)

Would it be possible to detect GPTBot (or similar) via its user agent, and serve it different data?

Can they detect that?

[–] [email protected] 7 points 22 hours ago* (last edited 22 hours ago) (2 children)

yes, you can match on user agent and then conditionally serve them other stuff (most webservers can do this). nepenthes and iocaine are the currently preferred/recommended tools for serving them bot mazes
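
for illustration, a minimal nginx sketch of that pattern (goes inside the http {} block; the user-agent list, the /maze/ path, and the 127.0.0.1:8080 upstream are placeholder assumptions, not published values):

# map known crawler user agents to a flag
map $http_user_agent $ai_bot {
    default        0;
    ~*gptbot       1;
    ~*ccbot        1;
    ~*claudebot    1;
}

server {
    listen 80;

    location / {
        # matched crawlers get rewritten into the maze instead of real content
        if ($ai_bot) {
            rewrite ^ /maze/ last;
        }
        try_files $uri $uri/ =404;
    }

    location /maze/ {
        # hypothetical upstream where a nepenthes/iocaine instance is listening
        proxy_pass http://127.0.0.1:8080;
    }
}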

the thing is that the crawlers will also lie (openai definitely doesn't publish all its own source IPs, I've verified this myself), and will attempt a number of workarounds (like using residential proxies too)

[–] [email protected] 2 points 10 hours ago* (last edited 10 hours ago)

Generating plausible-looking gibberish requires resources. Giving any kind of response to these bots is a waste of resources, even if it's gibberish.

My current approach is to have a robots.txt for bots that honor it, and to drop all traffic for 24 hours from IPs used by bots that ignore robots.txt or misbehave.
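
one way to implement the 24-hour drop is a fail2ban jail that watches the access log for hits on the paths robots.txt disallows (a sketch only: the "robots-trap" name, log path, and path list are assumptions for illustration, matching the DokuWiki example above):

# /etc/fail2ban/filter.d/robots-trap.conf
[Definition]
# any request to a disallowed path counts as a failure
failregex = ^<HOST> .*"(GET|POST) /(_export|user)/

# /etc/fail2ban/jail.local
[robots-trap]
enabled  = true
port     = http,https
filter   = robots-trap
logpath  = /var/log/nginx/access.log
# ban on the first hit, for 24 hours (86400 s)
maxretry = 1
findtime = 600
bantime  = 86400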

[–] [email protected] 3 points 22 hours ago

Can they detect that they're being served different content, though?