Why SHA-256? Literally every processor has a crypto accelerator and will pass easily. And datacenter servers have beefy server CPUs. This is only effective against no-JS scrapers.
It requires a bunch of browser features that non-user browsers don't have, and the proof-of-work part is arguably the least relevant piece of it; it only gets invoked once a week or so to generate a unique cookie.
I sometimes have the feeling that as soon as some crypto-currency related features are mentioned people shut off part of their brain. Either because they hate crypto-currencies or because crypto-currency scammers have trained them to only look at some technical implementation details and fail to see the larger picture that they are being scammed.
So if you try to access a website using this technology via terminal, what happens? The connection fails?
If your browser doesn't have a Mozilla user agent (i.e. like Chrome or Firefox), it will pass directly. Most AI crawlers use these user agents to pretend to be human users.
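For anyone wondering what that gate looks like in practice, here's a rough Go sketch of the idea (not Anubis's actual code; the cookie name, port, and upstream URL are made up for illustration): browser-like user agents without a valid pass cookie get the challenge, everything else is proxied straight through.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// gate is a rough sketch of the gating idea described above: clients with a
// browser-like ("Mozilla") User-Agent and no valid pass cookie get the
// challenge page; everything else is passed straight through to the backend.
// The cookie name "pass" and the handler wiring are illustrative only.
func gate(next, challenge http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("User-Agent"), "Mozilla") {
			// curl, wget and most API clients land here and pass directly.
			next.ServeHTTP(w, r)
			return
		}
		if c, err := r.Cookie("pass"); err == nil && c.Value != "" {
			// Browser already solved the challenge recently; let it through.
			// (The real thing would verify a signed, time-limited token here.)
			next.ServeHTTP(w, r)
			return
		}
		// Browser-like client without a pass: serve the JS/proof-of-work challenge.
		challenge.ServeHTTP(w, r)
	})
}

func main() {
	backend, _ := url.Parse("http://localhost:3000") // hypothetical upstream app
	proxy := httputil.NewSingleHostReverseProxy(backend)
	challenge := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "solve the challenge first", http.StatusForbidden)
	})
	http.ListenAndServe(":8080", gate(proxy, challenge))
}
```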
It's a clever solution, but I did see one recently that IMO was more elegant for noscript users. I can't remember the name, but it creates a dummy link that human users won't touch while web crawlers naturally navigate into it, and then generates an infinitely deep tree of super basic HTML, forcing bots to endlessly trawl a cheap-to-serve portion of your web server instead of something heavier. It might have even integrated with fail2ban to pick out obvious bots and keep them off your network for good.
That's a tarpit you're describing, like iocaine or Nepenthes. Those feed the crawler junk data to try to make its eventual output bad.
Anubis tries to not let the AI crawlers in at all.
If you remember the project I would be interested to see it!
But I've seen some AI-poisoning sinkholes before too, which is a novel concept as well. I haven't heard of any real-world experiences with them yet.
I'm assuming they're thinking about this:
A pseudonymous coder has created and released an open source “tar pit” to indefinitely trap AI training web crawlers in an infinitely, randomly-generating series of pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources.
Which was posted here a while back
generates an infinitely deep tree
Wouldn't the bot simply limit the depth of its crawl?
That would be reasonable. The people running these things aren't reasonable. They ignore every established mechanism to communicate a lack of consent to their activity because they don't respect others' agency and want everything.
It could be infinitely wide too if they desired; that shouldn't be hard to do, I wouldn't think. I'd suspect the crawlers limit the time a chain can use so they eventually escape out, but this still protects the data because it obfuscates the legitimate content they want. The goal isn't to trap them forever. It's to keep them from getting anything useful.
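For a sense of how cheap such a maze is to serve, here's a rough sketch of the general technique (not Nepenthes itself; the /trap/ prefix and page contents are made up): every URL under the trap prefix returns a tiny page whose child links are derived from a hash of the current path, so the tree is unbounded in both depth and width while the server stores nothing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
	"strings"
)

// tarpit serves the maze: every path under /trap/ returns a tiny page whose
// child links are derived from a hash of the current path, so the tree is
// effectively infinite in depth and width while the server keeps no state:
// just a handful of hashes and a few hundred bytes of HTML per request.
func tarpit(w http.ResponseWriter, r *http.Request) {
	base := strings.TrimSuffix(r.URL.Path, "/")
	sum := sha256.Sum256([]byte(base))

	w.Header().Set("Content-Type", "text/html")
	fmt.Fprintf(w, "<html><body><p>archive node %s</p>\n", base)
	for i := 0; i < 8; i++ {
		// Deterministic pseudo-random child names keep a crawler walking forever.
		fmt.Fprintf(w, `<a href="%s/%02x/">more</a>`+"\n", base, sum[i])
	}
	fmt.Fprint(w, "</body></html>")
}

func main() {
	// Human-facing pages never link into /trap/, so only crawlers that follow
	// every href (or ignore robots.txt) wander in.
	http.HandleFunc("/trap/", tarpit)
	http.ListenAndServe(":8080", nil)
}
```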
I think the maze approach is better; this seems like it hurts valid users of the web more than it would hurt a company.
For those not aware, Nepenthes is an example of the above-mentioned approach!
This looks like it can actually fuck up some models, but the unnecessary CPU load it will generate means most websites won't use it, unfortunately.
Found the FF14 fan lol
The release names are hilarious
What's the FFXIV reference here?
Anubis is from Egyptian mythology.
The names of release versions are famous FFXIV Garleans
I did not find any instructions on the source page about how to actually deploy this. That would be a nice touch, IMHO.
The docker image page has it
There are some detailed instructions on the docs site, tho I agree it'd be nice to have in the readme, too.
Sounds like the dev was not expecting this much interest in the project out of nowhere, so there will def be gaps.
Or even a quick link to the relevant portion of the docs at least would be cool
Meaning it wastes time and power such that it gets expensive on a large scale? Or does it mine crypto?
Yes, Anubis uses proof of work, like some cryptocurrencies do, to slow down/mitigate mass-scale crawling by making crawlers do expensive computation.
https://lemmy.world/post/27101209 has a great article attached to it about this.
--
Edit: Just to be clear, this doesn't mine any crypto; it just uses the same idea to slow down the requests.
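For anyone curious what that "expensive computation" looks like, here's a hedged sketch of the general hashcash-style scheme (not Anubis's actual code or parameters): the client must find a nonce so that SHA-256(challenge + nonce) starts with N zero bits, which takes many hash attempts to find but only one hash for the server to verify.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
	"strconv"
)

// leadingZeroBits counts how many zero bits a SHA-256 digest starts with.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b != 0 {
			return n + bits.LeadingZeros8(b)
		}
		n += 8
	}
	return n
}

// solve brute-forces a nonce so that SHA-256(challenge || nonce) has at least
// `difficulty` leading zero bits. In Anubis this part runs as JavaScript in the
// visitor's browser; the parameter names and numbers here are illustrative.
func solve(challenge string, difficulty int) uint64 {
	for nonce := uint64(0); ; nonce++ {
		sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
		if leadingZeroBits(sum) >= difficulty {
			return nonce
		}
	}
}

// verify is the server side: one hash, no matter how much work the client
// burned finding the nonce.
func verify(challenge string, nonce uint64, difficulty int) bool {
	sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
	return leadingZeroBits(sum) >= difficulty
}

func main() {
	const difficulty = 16 // roughly 65k hashes on average to solve
	nonce := solve("example-challenge", difficulty)
	fmt.Println("nonce:", nonce, "valid:", verify("example-challenge", nonce, difficulty))
}
```

The asymmetry is the point of the scheme: raising the difficulty scales the client's cost exponentially while the server's verification cost stays a single hash.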