Post AwL3iqqHyOwfZHeXWC by SuperDicq@minidisc.tokyo
(DIR) Post #AwKiWCPoM98dfoOU6K by SuperDicq@minidisc.tokyo
2025-07-20T18:37:35.734Z
0 likes, 0 repeats
@Stellar@mk.absturztau.be That's a false dilemma. There are many other things you can do.
(DIR) Post #AwKjSQM4gd9FhcPTTE by konstruct@woof.tech
2025-07-20T18:46:22Z
0 likes, 0 repeats
@SuperDicq @Stellar we are in a day and age where the only way to matter to tech giants is to hurt them directly. Boycotting doesn't work, it's been proven. Most people don't care about it anymore, some are even annoyed at people who denounce the catastrophic consequences of AI over and over. So what else can we do? What other options do we have except preventing its access to us and bombing it?
(DIR) Post #AwKjSRUcSLZrEPPoQa by SuperDicq@minidisc.tokyo
2025-07-20T18:48:07.326Z
0 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be You don't need Anubis. Preventing scraping is an old problem, usually solved using blocklists, rate limiting, and many other techniques. Please use Anubis only as a last resort.
(DIR) Post #AwKl8oy7fD8G9nzFJY by konstruct@woof.tech
2025-07-20T18:50:40Z
0 likes, 0 repeats
@SuperDicq @Stellar AI scrapers are LITERALLY BYPASSING block lists and robots.txt. Rate limiting only slows them down and will hurt regular users' experience. The "many other techniques" are either unheard of or unreasonable to use. Anubis just works and works well. Are you for or against AI?
(DIR) Post #AwKl8qGErLDYAHTEBs by SuperDicq@minidisc.tokyo
2025-07-20T19:07:00.710Z
0 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be I am not against AI as a concept, as I think machine learning programs can be useful for many tasks. I don't want to completely discredit an entire section of computer science research. I have even written some of my own machine learning programs.
But it is true that many companies at the moment are not behaving ethically in many respects, like making their machine learning models proprietary and only usable behind a proprietary SaaSS API, and lying to the public about what their programs are capable of. I obviously do not support companies such as OpenAI, Twitter, Google or Facebook.
Of course the scraping they do is terrible too. It is essentially a DDoS attack, and the companies doing this should be taken to court, as conducting DDoS attacks is literally illegal in most parts of the world.
However, I also think Anubis as a countermeasure is quite unacceptable for regular users just trying to read a website. It requires me to enable JavaScript and it also requires me to disable the JShelter browser addon (which is supposed to prevent malware from running on my machine). So what Anubis is doing is equivalent to malware, even though that is not the intent.
(DIR) Post #AwL37X6aOP2jdjTsem by konstruct@woof.tech
2025-07-20T19:50:27Z
0 likes, 0 repeats
@SuperDicq @Stellar First of all, let me apologise for being so aggressive in my previous post. I understand your stance on AI; as a concept, it's not evil. What you say about JavaScript is a valid concern, and you should see if it's possible to run Anubis with JShelter; ask on their repo. But requiring it to be disabled doesn't make it malware. Anubis is open-source and you can check out the code for it here: https://github.com/TecharoHQ/anubis
And even though it could be hijacked to make it malware, the people who use Anubis set it up themselves, and they obviously won't just let malware run on *their own website*. Overall, you're extremely unlikely to be up against malware with Anubis. If you're still worried, you should blame the scrapers, not the users of Anubis. Scraping costs them A LOT. They're doing this as self-defense and preservation.
(DIR) Post #AwL37Y4UnexudXVj2O by SuperDicq@minidisc.tokyo
2025-07-20T22:28:26.686Z
1 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be I do not want my computer to process a useless hash algorithm (which sometimes takes up to 2 minutes on my slower machines).
If you install Anubis, you're forcing me to run software on my computer that I don't want to run.
So even though there is no malicious intent, you're making me do things I don't want to do. I consider that malware.
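(For reference, this is roughly the kind of work such a proof-of-work challenge pushes onto the visitor's machine; a minimal shell sketch with a made-up challenge string and difficulty, not Anubis's actual parameters:)
 challenge="example-challenge"   # placeholder value, not a real Anubis challenge
 nonce=0
 # Hash until the digest starts with three hex zeros; about 4096 attempts on average.
 until printf '%s%s' "$challenge" "$nonce" | sha256sum | grep -q '^000'; do
   nonce=$((nonce + 1))
 done
 echo "solved with nonce $nonce"
(A real implementation does this in a tight loop in JavaScript or native code; the point is only that the visitor's CPU, not the server's, pays for every attempt, and slower devices pay proportionally more.)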
(DIR) Post #AwL3iqqHyOwfZHeXWC by SuperDicq@minidisc.tokyo
2025-07-20T22:35:13.049Z
1 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be If a website required you to mine bitcoin for 2 minutes in order to access it, that would be considered unacceptable.
So why is it suddenly considered acceptable in this case?
It is essentially the same result for the end user: their computer is doing things they don't want it to do.
(DIR) Post #AwL7BzEw87rpsBlYiO by konstruct@woof.tech
2025-07-20T23:00:26Z
0 likes, 0 repeats
@SuperDicq @Stellar ... Because it's not requiring you to mine bitcoin?? If it did, and transferred the bitcoin you earned to your wallet, I don't think too many people would complain, considering it's self-defense and the only alternative is NOT having the site be available at all. If it stole that bitcoin, that would be a whole different story. But it's not. It's just requiring proof of work.
You need to understand that whatever tests you make a scraper do, AI companies will try to bypass them. The only known solution to this problem that's available RIGHT NOW, because hosters need it to stop NOW, not after they figure out a new way to block scrapers, is Anubis. Hosters don't have the resources to jump method to method, test to test, block to block as soon as the thing they use inevitably gets bypassed by scrapers. They need something that WORKS and will KEEP working.
Overall, I think you don't appreciate how much stress (financial or resource-wise) this is putting on hosters. They are resorting to Anubis knowing that it's slow. They just can't afford to do anything else. To come back to the dilemma: you can only use Anubis, or make scrapers stop by getting their attention.
(DIR) Post #AwL7C0FgMq3f0n7fW4 by SuperDicq@minidisc.tokyo
2025-07-20T23:14:03.731Z
0 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be But it is a false dilemma. There are many other things you can do that don't ask unreasonable things of your real human users.
Another example: you can quite easily set up scraper honeypots or tarpits and then firewall any IP address that gets stuck in them.
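(A sketch of that honeypot idea, with purely illustrative paths and names: link to a URL that robots.txt forbids and that no human would visit, log whoever fetches it anyway, and feed those addresses to the firewall. This assumes nginx plus an existing nftables set called "scrapers" that a drop rule already references.)
In the relevant server {}:
 # Honeypot: linked from pages but disallowed in robots.txt, so only
 # robots.txt-ignoring bots ever request it.
 location = /trap/ {
  access_log /var/log/nginx/trap.log;
  return 200 "nothing to see here";
 }
Then from cron or a shell loop:
 # The default log format puts the client address first; ban every unique visitor.
 awk '{print $1}' /var/log/nginx/trap.log | sort -u | while read -r ip; do
  nft add element inet filter scrapers "{ $ip }"
 done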
(DIR) Post #AwLnnEhjnBRucngLHE by konstruct@woof.tech
2025-07-20T23:37:20Z
0 likes, 0 repeats
@SuperDicq @Stellar We are talking about small organisations and individuals hosting things on personal hardware going up against *rich, funded organisations ready to pour millions into scraping efforts.*
(DIR) Post #AwLnnGLTgupdiLHarI by SuperDicq@minidisc.tokyo
2025-07-21T07:11:24.065Z
0 likes, 0 repeats
@konstruct@woof.tech @Stellar@mk.absturztau.be The things I mentioned are not a lot harder to do than installing Anubis.
(DIR) Post #AwLwnmSXBxpOtuuBQe by konstruct@woof.tech
2025-07-21T08:50:53Z
0 likes, 0 repeats
@SuperDicq ...But they'll only work for a limited time. Like I said, we are talking about small teams who need something that'll work long term.
(DIR) Post #AwLwnnfKhreydttv16 by SuperDicq@minidisc.tokyo
2025-07-21T08:52:22.304Z
1 likes, 0 repeats
@konstruct@woof.tech Why does everything except your beloved Anubis only work for a limited time, according to you? I do not think that is true at all; most of the techniques I mentioned will keep working indefinitely too.
(DIR) Post #AwMLkQqtZGf9sBneXw by Suiseiseki@freesoftwareextremist.com
2025-07-21T13:31:56.536902Z
0 likes, 0 repeats
@SuperDicq @Stellar @konstruct Intentionally attempting to force people into running proprietary software is malicious intent.
Required arbitrary remote JavaScript is always proprietary software, no matter the license, as the user doesn't have the 4 freedoms over the software; https://www.gnu.org/philosophy/wwworst-app-store.html.
(DIR) Post #AwMM78BR8y6odLEQPA by Suiseiseki@freesoftwareextremist.com
2025-07-21T13:36:02.416509Z
0 likes, 0 repeats
@SuperDicq @konstruct There are much better and more effective techniques against scrapers than evil Anubis.
Evil Anubis by default only actually targets the useragents of web browsers. Scrapers can quite easily just change the useragent and keep changing it (it's far easier for a scraper to change its useragent than for a user with a web browser; some web browsers actually prevent the useragent from being changed), and the result is that evil Anubis targets legitimate browsers but does nothing about competently configured malicious scrapers.
(DIR) Post #AwMMTQE2kDM0xd51Ky by SuperDicq@minidisc.tokyo
2025-07-21T13:40:03.921Z
1 likes, 0 repeats
@Suiseiseki@freesoftwareextremist.com @konstruct@woof.tech I bet half the reason why Anubis actually works against scrapers is that many of them don't execute JavaScript, and it probably has nothing to do with proof of work.
(DIR) Post #AwMOFfDKQ91HIiJSeu by lxo@snac.lx.oliva.nom.br
2025-07-21T13:59:20Z
1 likes, 1 repeats
indeed, it isn't requiring you to mine bitcoin. it's worse: it's requiring you to waste your computing resources so that the site can waste fewer resources of its own. proof of waste is an overall environmental, security and freedom loss. https://blog.lx.oliva.nom.br/2025-03-28-against-proof-of-waste
CC: @SuperDicq@minidisc.tokyo @Stellar@mk.absturztau.be
(DIR) Post #AwMOgghizJabk2Pt5M by konstruct@woof.tech
2025-07-21T14:02:23Z
0 likes, 0 repeats
@Suiseiseki @SuperDicq If you can find a better technique that's available right now, easy to deploy, open-source, that doesn't need JavaScript, and will effectively block >90% of AI scrapers for the next 5 years without depending on a centralized provider, I'd definitely support it.
(DIR) Post #AwMOghWPwq8gHG8N6G by SuperDicq@minidisc.tokyo
2025-07-21T14:04:49.842Z
0 likes, 0 repeats
@konstruct@woof.tech @Suiseiseki@freesoftwareextremist.com Those are unreasonable demands; Anubis doesn't meet them either, on top of pestering real human users.
(DIR) Post #AwMP4QaAa2BOgsCdfs by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:09:09.745968Z
0 likes, 0 repeats
@SuperDicq @konstruct It works by accident against incompetently written scrapers that scrape things over and over, multiple times a second, with a "Mozilla" useragent, as those just fetch the evil Anubis page over and over (rather than hitting incompetently programmed server software that somehow burns a lot of CPU cycles whenever a page is merely accessed unauthenticated; some operations do inherently require a lot of processing, and competent sysadmins put those behind authentication to avoid DoS).
Scrapers programmed to avoid evil Anubis just change their useragent to something without "Mozilla" in it (and then change to something else if that's blocked too, rapidly).
The end result is no Proof of Waste being executed by scrapers; the only ones impacted by it are users. Even when PoW is required to access the page, scrapers for LLM companies have plenty of computing cycles available, while many users do not, for example those on the latest mobile devices: most mobile Arm SoC processors are in fact slower than the high-end Core 2 Duo laptop line, and even on the latest high-end AArch64 SoCs, power management will kick in hard to stop the battery from being drained.
There are many actually effective techniques that work against poorly programmed scrapers, for example setting a cookie for a suspected scraper and denying access if it doesn't transmit it back.
My favorite is GNU zip bombs - you go and compress 10GB or more of zeros (i.e. multiple terabytes) and set nginx to serve that as a gzip-compressed HTML page, and when served that, most LLM scrapers will exhaust their memory and crash (the only scraper that seems to avoid such a defense is GPTBot, which doesn't accept gzip versions of pages, but tarpitting their IPs solves that problem).
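(A minimal sketch of that cookie-echo check, assuming nginx; the cookie name, variable names and paths are invented for illustration, and the obvious trade-off is that clients with cookies disabled get stuck in the redirect too.)
 # In http {}: flag requests that do not send the cookie back.
 map $cookie_checked $no_cookie {
  default 1;
  "1"     0;
 }
 # In the relevant server {}: hand the cookie out and redirect; real browsers
 # return with it on the next request, cookie-less bots just loop on the 302.
 location / {
  if ($no_cookie) {
   add_header Set-Cookie "checked=1; Max-Age=86400; Path=/";
   return 302 $request_uri;
  }
  root /var/www/html;
 }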
(DIR) Post #AwMPMQ37IiqqFXK05w by Mamako@tsundere.love
2025-07-21T14:12:27.047146Z
0 likes, 0 repeats
@Suiseiseki @SuperDicq @konstruct death to crawlers
(DIR) Post #AwMRKIMB89k3ikLuqG by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:34:26.561110Z
2 likes, 1 repeats
@konstruct @SuperDicq GNU/JIHAD AGAINST "OPEN SOURCE" AND ALL OTHER FORMS OF PROPRIETARY SOFTWARE!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The best technique available now is to start running a Tor middle relay, and the Chinese firewall will stop poorly programmed IPv4 LLM scrapers from China for you (just make sure you are also reachable over IPv6).
Another technique is to add a bombs folder, link to it, and make some GNUzip bombs;
 dd if=/dev/zero bs=1G count=1 | gzip > 01GiB.gz
 dd if=/dev/zero bs=1G count=10 | gzip > 10GiB.gz
Also make some zstandard bombs;
 dd if=/dev/zero bs=1G count=1 | zstd > 01GiB.zst
 dd if=/dev/zero bs=1G count=10 | zstd > 10GiB.zst
Also add a text file noting to right-click -> Save As to save the bombs instead of triggering them.
Also add to robots.txt;
 User-agent: *
 Disallow: /path/to/bombs/
Then add to the relevant server {} in nginx.conf;
 #gzip bombs
 location ~* /path/to/bombs/.*\.gz {
  add_header Content-Encoding "gzip";
  default_type "text/html";
 }
 #zstd bombs
 location ~* /path/to/bombs/.*\.zst {
  add_header Content-Encoding "zstd";
  default_type "text/html";
 }
Also 403 empty useragents;
 if ($http_user_agent = "") {
  return 403;
 }
There are also some LLM scrapers that will identify themselves;
 if ($http_user_agent ~ (.*Amazonbot.*|.*Applebot.*|.*ClaudeBot.*|.*GPTBot.*)) {
  return 403;
 }
You also want ratelimiting. In http {};
 #10 requests per second on average
 limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
Then in each server {};
 limit_req zone=perip burst=10 nodelay;
For bursty protocols like git, you'll need a larger burst amount (allows several git clones in a row from an IP, but will block if you spam git clone);
 limit_req zone=perip burst=1024 nodelay;
That is usually all that is needed to stop all aggressive LLM scrapers and should continue to work forever.
Sites manually targeted for scraping will need specifically targeted defenses (evil Anubis will be bypassed in such cases).
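(A quick way to sanity-check a setup like the one above, with example.org standing in for the real host: fetching a bomb with decompression turned on should expand to the full 1GiB, and hammering a rate-limited vhost should start answering 503 once the burst allowance is spent.)
 # Decompress the served bomb on the fly; roughly 1GiB should end up in /dev/null.
 curl --compressed -s https://example.org/path/to/bombs/01GiB.gz -o /dev/null
 # Fire 30 quick requests; after the burst of 10, limit_req should answer 503.
 for i in $(seq 1 30); do curl -s -o /dev/null -w '%{http_code}\n' https://example.org/; done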
(DIR) Post #AwMReEhsPXVYBy0hGK by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:38:02.849113Z
0 likes, 0 repeats
@Mamako @SuperDicq @konstruct There is nothing wrong with crawlers that do something useful, like indexing pages about freedom (a search engine) or another activity that furthers freedom (all crawlers have a low request rate, spread out over time).
Scrapers are not crawlers, as scrapers try to download as many pages as possible before the scraping is blocked, because proprietary activities are what the scraping is for.
(DIR) Post #AwMRz7BjYkGMxHvoS8 by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:41:48.952360Z
0 likes, 0 repeats
@RedTechEngineer @SuperDicq @Stellar @konstruct Bitcoin is a bad example, as the hash rate is so high that even millions of CPUs don't stand a chance competing against ASICs (those can do TERAhashes) for finding the next block.
A better example would be mining Monero, as that would help decentralize the network and CPU mining is still possible.
(DIR) Post #AwMS2VAaqw84CZ0rpY by SuperDicq@minidisc.tokyo
2025-07-21T14:42:26.105Z
0 likes, 0 repeats
@Suiseiseki@freesoftwareextremist.com @konstruct@woof.tech One of the rare Suiseiseki posts that is not just a GNU/shitpost but actually contains useful information
(DIR) Post #AwMS9XTSgSHlNLC5lw by SuperDicq@minidisc.tokyo
2025-07-21T14:43:42.415Z
0 likes, 0 repeats
@RedTechEngineer@fedi.lowpassfilter.link @Stellar@mk.absturztau.be @konstruct@woof.tech I honestly agree, but most people would find that less acceptable because, just like machine learning, the entire cryptocurrency field of computer science also has a bad name.
(DIR) Post #AwMSAQf3V3EuXE2cym by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:43:52.103426Z
0 likes, 0 repeats
@RedTechEngineer @SuperDicq @konstruct It's a GNU/Mine, in the GNU zip format.
No theft is occurring; rather, the idea is that copyright infringement, proprietary software and SaaSS will later be committed for the purposes of profit.
(DIR) Post #AwMSDgBcEZiPYPoy8W by Suiseiseki@freesoftwareextremist.com
2025-07-21T14:44:27.548175Z
0 likes, 0 repeats
@SuperDicq @konstruct >95% of my posts contain useful information, even the funposts.
(DIR) Post #AwMSRif7pEG7HsJov2 by SuperDicq@minidisc.tokyo
2025-07-21T14:47:00.457Z
0 likes, 0 repeats
@Suiseiseki@freesoftwareextremist.com @konstruct@woof.tech I feel like half of your posts are you replying "proprietary" and other free software ideologies to things where it doesn't even make sense to do so, like the damn Egyptian pyramids.
(DIR) Post #AwMSZIq1isFFVn2AAC by konstruct@woof.tech
2025-07-21T14:47:49Z
0 likes, 0 repeats
@Suiseiseki @SuperDicq Hey, this doesn't seem too bad actually. For the tor and IPv4 part in the beginning, I don't know, but the rest seems logical and easy to do. I guess that would indeed weed out >90% of bots. The sad thing about those bombs is that they'll require a lot of bandwidth to be sent to the scrapers.
(DIR) Post #AwMSZKAclmJbdxg7uK by SuperDicq@minidisc.tokyo
2025-07-21T14:48:20.982Z
0 likes, 0 repeats
@konstruct@woof.tech @Suiseiseki@freesoftwareextremist.com That's not how zip bombs work.
(DIR) Post #AwMSujIeXADXf8dm9g by konstruct@woof.tech
2025-07-21T14:49:16Z
0 likes, 0 repeats
@SuperDicq @Suiseiseki Even compressed, they measure at least 1 GiB. Read their post again
(DIR) Post #AwMSukO0Uk5v229Z8i by SuperDicq@minidisc.tokyo
2025-07-21T14:52:11.669Z
0 likes, 0 repeats
@konstruct@woof.tech @Suiseiseki@freesoftwareextremist.com No, they are very small files that will unpack to either 1GB or 10GB. You should read up on how gzip compression works.
(DIR) Post #AwMUZt01TZhrZX7D6W by konstruct@woof.tech
2025-07-21T15:04:50Z
0 likes, 0 repeats
@SuperDicq @Suiseiseki Quoting Suiseiseki themselves: "My favorite is GNU zip bombs - you go and compress 10GB or more of zeros (i.e. multiple terabytes)"
https://freesoftwareextremist.com/objects/9472ae2d-39fd-4033-af6c-e3bbea545ffd
(DIR) Post #AwMUZu3bXkAKqvnaKG by Suiseiseki@freesoftwareextremist.com
2025-07-21T15:10:51.573729Z
0 likes, 0 repeats
@konstruct @SuperDicq 1GiB of 0's gzip compressed weighs 1018K, and 10GiB weighs 10MiB.
zstandard is smaller for the all-zero case (which is an odd case to handle, really): 33K for 1GiB and 329K for 10GiB.
All of it usually transfers before the scraper crashes, but in case it doesn't, the connection is held open and eventually times out, which nginx has no problem handling.
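(Those figures are easy to reproduce locally; exact sizes vary a little with tool version and compression level:)
 # roughly 1 MiB of gzip output for 1 GiB of zeros
 dd if=/dev/zero bs=1M count=1024 | gzip | wc -c
 # a few tens of KiB with zstd
 dd if=/dev/zero bs=1M count=1024 | zstd | wc -c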