AI scraping is an effective DDoS on the entire internet (pod.geraspora.de)

submitted 4 months ago by [email protected] to c/[email protected]

28 comments

[-] [email protected] 18 points 4 months ago

It's a constant cat and mouse atm. Every week or so, we get another flood of scraping bots, which forces us to triangulate which fucking DC IP range we need to start blocking now. If they ever start using residential proxies, we're fucked.
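
A minimal sketch of that kind of triage, assuming you already have a list of client IPs pulled from your access logs. The function names, the /24 grouping, and the documentation-range addresses are illustrative assumptions, not anything the instance actually runs:

```python
import ipaddress
from collections import Counter


def busiest_networks(client_ips, prefix_len=24, top=3):
    """Group IPv4 client addresses by network and return the busiest ones.

    prefix_len=24 is an arbitrary choice for this sketch; real datacenter
    allocations are often larger.
    """
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in client_ips
    )
    return counts.most_common(top)


def is_blocked(ip, blocked_networks):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked_networks)


if __name__ == "__main__":
    # Addresses from the reserved documentation ranges, not real scrapers.
    seen = ["198.51.100.1", "198.51.100.2", "198.51.100.200", "203.0.113.9"]
    for network, hits in busiest_networks(seen):
        print(network, hits)
```

This only helps while the bots come from a handful of datacenter ranges; it does nothing against residential proxies.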

[-] [email protected] 14 points 4 months ago

I have a tiny Neocities website which gets thousands of views a day. There is no way that anyone is viewing it often enough for that to be organic.

[-] [email protected] 14 points 4 months ago

quickly, add some ad revenue :P

[-] [email protected] 11 points 4 months ago

From ai vendors. Let them pay you for scraping you lol

[-] [email protected] 10 points 4 months ago

at least OpenAI and probably others do currently use commercial residential proxying services, though reputedly only if you make it obvious you’re blocking their scrapers, presumably as an attempt on their end to limit operating costs

[-] [email protected] 6 points 4 months ago

Oh, never heard of that. I have blocked their scrapers via user agent, but I haven't felt residential proxy pain.

[-] [email protected] 15 points 4 months ago

@db0 @self Residential Proxy Pain are playing at the Dublin Castle in Camden this Friday, £4 advance, £5 on the door

[-] [email protected] 10 points 4 months ago

here’s a mastodon post and linked blog post with some details on what currently sets it off

[-] [email protected] 8 points 4 months ago

Daym, I should set me up some iocaine as well I think

[-] [email protected] 5 points 4 months ago

PS: Looks like that sync issue between our instances is resolved now?

[-] [email protected] 2 points 4 months ago

yep, it seems so! I haven’t put the permanent fix for the nodeinfo bug into place yet but it’ll be live as soon as I’m able to give it an appropriate level of testing.

[-] [email protected] 2 points 4 months ago

Infinite-garbage-maze does seem more appealing than "proof-of-work" (the crypto parentage is yuckish enough ^^) as a countermeasure, though I would understand if some would not feel comfortable with direct sabotage, say for example a UN organization.

[-] [email protected] 4 points 4 months ago

They have a botnet on residential devices?

[-] [email protected] 6 points 4 months ago

the term of art is "residential proxy" and there's a ton of them

for example: it's the flipside of Bright's free VPN service - through Bright Data they sell people access proxied via some user's connection

[-] [email protected] 2 points 4 months ago

And companies like honey that pay you (a pittance) to proxy people's requests to porn sites.

[-] [email protected] 11 points 4 months ago* (last edited 4 months ago)

jwz gave the game away, so i'll reveal:

the One Weird Trick for this week is that the bots pretend to be an old version of Chrome. So you can block on useragent

so I blocked old Chrome from hitting the expensive mediawiki call on rationalwiki and took our load average from 35 (unusable) to 0.8 (schweeet)

caution! this also blocks the archive sites, which pretend to be old chrome. I refined it to only block the expensive query on mediawiki, vary as appropriate.

nginx code:

        # block some bot UAs for complex requests
        # nginx doesn't do nested if, so we set a test variable
        # if $BOT is both Complex and Old, block as bot
        set $BOT "";

        # C = request hits the expensive MediaWiki endpoint
        if ($uri ~* (/w/index.php)) {
            set $BOT "C";
        }

        # O = user agent claims to be an old browser
        if ($http_user_agent ~* (Chrome/[2-9])) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (Chrome/1[012])) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (Firefox/3)) {
            set $BOT "${BOT}O";
        }
        if ($http_user_agent ~* (MSIE)) {
            set $BOT "${BOT}O";
        }

        if ($BOT = "CO") {
            return 503;
        }

you always return "503" not "403", because 403 says "fuck off" but the scrapers are used to seeing 503 from servers they've flattened.

I give this trick at least another week.
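
For anyone who wants to test which user agents the rules above would catch before touching a live config, here is a rough Python sketch of the same decision. The function and constant names are made up for illustration. One deliberate difference: the nginx version only blocks when exactly one "old" pattern matches (the flag has to equal "CO"), while this sketch blocks if any of them match.

```python
import re

# Same patterns as the nginx rules, case-insensitive like "~*".
# Note that "Chrome/1[012]" also matches Chrome 100 through 129.
OLD_UA_PATTERNS = [
    re.compile(r"Chrome/[2-9]", re.IGNORECASE),
    re.compile(r"Chrome/1[012]", re.IGNORECASE),
    re.compile(r"Firefox/3", re.IGNORECASE),
    re.compile(r"MSIE", re.IGNORECASE),
]
EXPENSIVE_PATH = re.compile(r"/w/index\.php", re.IGNORECASE)


def status_for(uri, user_agent):
    """Return 503 for old-looking user agents on the expensive path, else 200."""
    is_complex = bool(EXPENSIVE_PATH.search(uri))
    is_old = any(p.search(user_agent) for p in OLD_UA_PATTERNS)
    return 503 if (is_complex and is_old) else 200


if __name__ == "__main__":
    old_chrome = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    print(status_for("/w/index.php", old_chrome))
    print(status_for("/wiki/Main_Page", old_chrome))
```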

[-] [email protected] 8 points 4 months ago

Count them as ad visits, to make big tech pay for better hardware or line?

[-] [email protected] 8 points 4 months ago

That opens you up to getting accused of click fraud, as AdNauseam found out the hard way, but it's worth it if you can squeeze some cash out of them before that happens.

[-] [email protected] 8 points 4 months ago

I mean, scraping bots would obviously obey robots.txt so those scraping - bots, i mean users can't be bots

[-] [email protected] 2 points 4 months ago

Right? We have standards for this and the reasonable assumption is that if it doesn't respect robots.txt and otherwise looks like a user then it's a user. It can't be the responsibility of every single server admin to perfectly recognize what's a user and what's a bot run by a billion-dollar company doing a decent job pretending to be a user.

[-] [email protected] 6 points 4 months ago* (last edited 4 months ago)

In my experience with bots, a portion of them obey robots.txt, but it's tricky to find the user agent string that some bots react to.

So I recommend having a robots.txt that not only targets specific bots, but also tells all bots to avoid specific paths/queries.

Example for dokuwiki

User-agent: *
Noindex: /lib/
Disallow: /_export/
Disallow: /user/
Disallow: /*?do=
Disallow: /*&do=
Disallow: /*?rev=
Disallow: /*&rev=
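
To see which URLs the Disallow rules above would cover for a crawler that supports `*` wildcards, here is a small sketch. It is a simplified matcher written for this example (the names are hypothetical): it handles only Disallow lines, with `*` as "any run of characters" and an optional trailing `$` as an end anchor, and it ignores Allow rules and longest-match precedence.

```python
import re

# Disallow patterns from the example above (the Noindex line is left out).
DISALLOW = ["/_export/", "/user/", "/*?do=", "/*&do=", "/*?rev=", "/*&rev="]


def rule_to_regex(rule):
    """Translate one robots.txt path pattern into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with ".*" where "*" stood.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


RULES = [rule_to_regex(rule) for rule in DISALLOW]


def is_disallowed(path_and_query):
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(rule.match(path_and_query) for rule in RULES)


if __name__ == "__main__":
    for url in ["/doku.php?id=start", "/doku.php?id=start&do=edit", "/user/alice"]:
        print(url, is_disallowed(url))
```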

[-] [email protected] 4 points 4 months ago

Would it be possible to detect GPTBot (or similar) by its user agent, and serve it different data?

Can they detect that?

[-] [email protected] 10 points 4 months ago* (last edited 4 months ago)

yes, you can match on user agent, and then conditionally serve them other stuff (most webservers are fine with this). nepenthes and iocaine are the current preferred/recommended servers to serve them bot mazes

the thing is that the crawlers will also lie (openai definitely doesn't publish all its own source IPs, I've verified this myself), and will attempt a number of workarounds (like using residential proxies too)
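
A bare-bones sketch of the "match on user agent, serve something else" part, written as a plain WSGI app so it runs with only the standard library. The marker list, the body text, and the function names are placeholders for illustration; this is not how nepenthes or iocaine work internally, and it only catches crawlers that identify themselves honestly.

```python
# Substrings to look for in the User-Agent header (lowercased).
# "gptbot" is the crawler named in this thread; extend as needed.
BOT_MARKERS = ("gptbot",)

REAL_BODY = b"real page content\n"
DECOY_BODY = b"decoy content for crawlers\n"


def is_known_bot(user_agent):
    """Return True if the user agent contains any known bot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def app(environ, start_response):
    """WSGI app: same URL, different body depending on the user agent."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    body = DECOY_BODY if is_known_bot(user_agent) else REAL_BODY
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves on localhost:8000 until interrupted.
    make_server("127.0.0.1", 8000, app).serve_forever()
```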

[-] [email protected] 4 points 4 months ago* (last edited 4 months ago)

Generating plausible-looking gibberish requires resources. Giving any kind of response to these bots is a waste of resources, even if it's gibberish.

My current approach is to have a robots.txt for bots that honor it, and to drop all traffic for 24h from IPs used by bots that ignore robots.txt or misbehave.
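
A rough in-memory sketch of that policy, assuming "misbehaving" simply means requesting a path that robots.txt disallows. The class and function names are invented for this example, the prefixes are borrowed from the robots.txt example earlier in the thread, and a real setup would more likely drop packets at the firewall than inside the application.

```python
import time

BAN_SECONDS = 24 * 60 * 60  # the 24h window mentioned above


class BanList:
    """Tracks banned IPs with an expiry time; the clock is injectable for testing."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.monotonic):
        self._ban_seconds = ban_seconds
        self._clock = clock
        self._banned_until = {}

    def ban(self, ip):
        self._banned_until[ip] = self._clock() + self._ban_seconds

    def is_banned(self, ip):
        expiry = self._banned_until.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Ban has run out; forget the address.
            del self._banned_until[ip]
            return False
        return True


def handle_request(bans, ip, path, disallowed_prefixes=("/_export/", "/user/")):
    """Return "drop" for banned or rule-breaking clients, otherwise "serve"."""
    if bans.is_banned(ip):
        return "drop"
    if path.startswith(disallowed_prefixes):
        bans.ban(ip)
        return "drop"
    return "serve"
```

Note that this bans any client that touches a disallowed path, so the paths should be ones a normal visitor never requests.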

[-] [email protected] 4 points 4 months ago

Can they detect that they're being served different content though?

[-] [email protected] 3 points 4 months ago

Re the blocking of fake user agents: what people could try is to see if there are things older user agents do (or do wrong) which these do not. I heard of some companies doing that. (Long ago I also heard of somebody using that to catch MMO bots in a specific game. There was a packet that, if the server sent it to a legit client, crashed the client; a bot did not crash.) I'd assume the specifics are treated as secret just because you don't want the scrapers to find out.

[-] [email protected] 2 points 4 months ago

You could probably do something by getting into the weeds of browser updates, at least for web traffic. Like, if they're showing themselves as an older version of chrome send a badly formatted cookie to crash it? Redirect to /%%30%30?

[-] [email protected] 1 points 4 months ago

Yes, I heard there is some JavaScript that various older versions of Chrome/Firefox don't execute properly, for example. So you can use that to determine which version they are (as long as nobody shares that JavaScript with the public). It might not even be JavaScript; I honestly know nothing about it, I just heard it.

this post was submitted on 14 May 2025

93 points (100.0% liked)

TechTakes
