I hear there’s a tool called (I think) ‘Nepenthes’ that traps an LLM crawler in an endless loop. If you use that in combination with a fairly tight blacklist of IPs you’re certain are LLM crawlers, I bet you could do a lot of damage, and maybe make them slow their shit down, or do this in a more reasonable way.
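For anyone wondering what that combo might look like, here’s a rough sketch of the general idea in Python/Flask. This is not Nepenthes itself; the blocklisted IPs and the endless-page trick are just placeholders:

```python
import random
import time

from flask import Flask, Response, request

app = Flask(__name__)

# Placeholder blocklist: in practice you'd load IPs/ranges you're
# confident belong to LLM crawlers from a file or a shared feed.
CRAWLER_IPS = {"203.0.113.7", "198.51.100.42"}

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]


def endless_garbage():
    """Slowly stream meaningless text plus links deeper into the maze."""
    page = 0
    while True:
        page += 1
        time.sleep(2)  # waste the crawler's time, not your CPU
        yield f'<p>{" ".join(random.choices(WORDS, k=30))}</p>\n'
        yield f'<a href="/maze/{page}">more</a>\n'


@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def catch_all(path):
    # Trap known crawler IPs in a slow, infinite page; serve everyone else normally.
    if request.remote_addr in CRAWLER_IPS:
        return Response(endless_garbage(), mimetype="text/html")
    return "normal site content would go here"


if __name__ == "__main__":
    app.run()
```

Behind a reverse proxy you’d check the forwarded-for header instead of remote_addr, but that’s the shape of it.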
Really great piece. We’ve seen many popular Lemmy instances struggle under the recent scraping waves, and that’s hardly the first time it’s happened. I have some firsthand experience with the second part of this article, which talks about AI-generated bug reports/vulnerabilities for open source projects.
I help maintain a Python library and got a bug report a couple of weeks back describing a type-checking issue, with a bit of additional information. It didn’t strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and found no way to reproduce it at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it, saving me from further efforts to diagnose the issue (after an hour or two had already been burned).
AI scrapers are a massive issue for Lemmy instances. I’m gonna try some of the things from this article, because enough of them identify themselves with their user agents that I hadn’t even thought about the ones lying about it.
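For the ones that do identify themselves, a user-agent check is about as simple as it gets. A minimal sketch, with an illustrative (not complete or current) list of agent strings:

```python
# Substrings seen in the user agents of self-identifying AI crawlers.
# Illustrative only; real lists are longer and change over time.
AI_CRAWLER_AGENTS = (
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Bytespider",
    "Amazonbot",
)


def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent admits to being an AI crawler."""
    ua = (user_agent or "").lower()
    return any(bot.lower() in ua for bot in AI_CRAWLER_AGENTS)


# In a reverse-proxy hook or WSGI middleware you'd then return a 403
# (or hand the request to a tarpit) whenever is_ai_crawler(...) is True.
```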
I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times, so our input has 1000 times the weighting of Reddit.
Any idea what the point of these is, then? Sounds like it’s reporting a fake bug.
The theory the lead maintainer had (he’s an actual software developer; I just dabble) is that it might be a type of reinforcement learning:
- Get your LLM to create what it thinks are valid bug reports/issues
- Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
- Use those outcomes to assign how “good” or “bad” that generated issue was
- Use that scoring as a way to feed back into the model to influence it to create more “good” issues
If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.
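If that theory is right, the loop might look roughly like this. To be clear, this is pure speculation; the outcome labels and scores are invented just to make the shape of it concrete:

```python
# Hypothetical sketch of the suspected feedback loop; nobody outside
# whoever runs it knows what it actually looks like.
OUTCOME_REWARD = {
    "closed_immediately": -1.0,   # maintainers spotted it as junk
    "discussion": 0.5,            # maintainers engaged with it
    "pull_request": 1.0,          # it was treated as a real bug
}


def score_generated_issue(outcome: str) -> float:
    """Turn what maintainers did with a generated issue into a reward signal."""
    return OUTCOME_REWARD.get(outcome, 0.0)


# That reward would then feed back into training, nudging the model toward
# issues maintainers take seriously. The scoring work is done, unpaid, by
# open source maintainers.
```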
It’s also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources; the bots expand to fill the container.
Fail2ban should add all those scraper IPs, and we need to just flat-out block them. Or send them to those mazes. Or redirect them to themselves lol
I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?
There are residential IP providers that sell services to scrapers and the like, which involves having thousands of IPs available in the same IP ranges as real users. They route traffic through these IPs via malware, hacked routers, “free” VPN clients, etc. If you block the IP range for one of these addresses, you’ll also block real users.
> There are residential IP providers that sell services to scrapers and the like, which involves having thousands of IPs available in the same IP ranges as real users.
Now that makes sense. I hadn’t considered rogue ISPs.
It’s not even necessarily the ISPs that are doing it. In many cases they don’t like this, because their users start getting blocked on websites; it’s bad actors piggybacking on legitimate users’ connections without those users’ knowledge.
fail2ban will always get you better results than banning countries because VPNs are a thing.
that said, I automatically ban any IP that comes from outside the US because there’s literally no reason for anyone outside the US to make requests to my infra. I still use smart IP filtering though.
also, use a WAF on a NAT to expose your apps.
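For the country-level filtering, here’s a sketch of what that check could look like with MaxMind’s geoip2 library. The database path and the allowed-country set are assumptions; plug in whatever your setup actually uses:

```python
import geoip2.database
import geoip2.errors

# Assumes you've downloaded MaxMind's free GeoLite2 country database.
READER = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")

ALLOWED_COUNTRIES = {"US"}


def should_block(ip: str) -> bool:
    """Block anything that doesn't geolocate to an allowed country."""
    try:
        country = READER.country(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return True  # unknown origin: treat it like anything else you'd block
    return country not in ALLOWED_COUNTRIES


# fail2ban-style per-IP banning still belongs on top of this, since plenty
# of abusive traffic comes from inside the allowed country too (and from
# residential proxies, as mentioned above).
```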
ELI5 why the AI companies can’t just clone the git repos and do all the slicing and dicing (running `git blame` etc.) locally instead of running expensive queries on the projects’ servers?

Because that would cost you money, so just “abusing” someone else’s infrastructure is much cheaper.
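For what it’s worth, the “do it locally” version really is only a few lines. A sketch (the repo URL is a placeholder, and it assumes the repo has a README.md):

```python
import subprocess
import tempfile

# Clone once, then run blame/log/diff locally instead of hammering the
# project's web frontend with one request per line of history.
REPO_URL = "https://example.com/some/project.git"  # placeholder

with tempfile.TemporaryDirectory() as workdir:
    subprocess.run(["git", "clone", "--quiet", REPO_URL, workdir], check=True)
    blame = subprocess.run(
        ["git", "-C", workdir, "blame", "README.md"],
        check=True, capture_output=True, text=True,
    )
    print(blame.stdout)
```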