Aggressive AI scrapers are making it kinda suck to run wikis

ItWasntMe@discuss.online · 10 days ago

algernon@lemmy.ml · 9 days ago

The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.
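A minimal sketch of how a maze like this could work, not the actual software behind the sites mentioned here. It assumes each page is derived deterministically from its URL path, so nothing has to be stored and every link leads to another generated page. The names `maze_page` and `WORDS` are made up for illustration.

```python
import hashlib
import random

# Hypothetical filler vocabulary; a real maze would use something less obvious.
WORDS = ["wiki", "page", "index", "archive", "note", "draft", "entry", "topic"]


def maze_page(path: str, n_links: int = 5, n_words: int = 40) -> str:
    """Return a garbage HTML page that depends only on the requested path."""
    # Seed the generator from the path, so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(n_words))

    # Every link points one level deeper into the maze, so a crawler never runs out.
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(32):08x}" for _ in range(n_links)]
    anchors = "\n".join(f'<a href="{href}">{href}</a>' for href in links)

    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"
```

Because the page is a pure function of the path, serving it costs a hash and a few random draws, with no database or disk access involved.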

And I run niche sites. A site with more lucrative traffic than mine (e.g., Codeberg, which uses the same software I do) likely generates a lot more garbage.

There was also a paper, commissioned by Anthropic, I believe, that concluded that as few as 250 malicious pages slipping through into the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we’d need a billion pages to poison a model… I alone served that many in the past year.

TheOctonaut@piefed.zip · 9 days ago

As you’ve said elsewhere, you’ve created a crawler trap, not a way to poison a model. You’re wasting… some resources I guess? Both theirs and your own. Fascinating to think that you’ve served a billion HTTP requests to no benefit to anyone and you believe this is you winning somehow.

algernon@lemmy.ml · 9 days ago

Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That’s a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?
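One way the "firewall them off" step could be wired up, as a rough sketch only. The comment does not say how the blocking is actually done, so this assumes that trap URLs live under a prefix no human would follow, and that a client hitting them a few times gets added to a blocklist. `TrapBlocklist`, the `/maze/` prefix, and the threshold are all hypothetical.

```python
from collections import Counter


class TrapBlocklist:
    """Track trap hits per client and flag clients that keep walking the maze."""

    def __init__(self, prefix: str = "/maze/", threshold: int = 3) -> None:
        self.prefix = prefix        # assumed URL prefix of the trap pages
        self.threshold = threshold  # trap hits allowed before blocking
        self.hits = Counter()
        self.blocked = set()

    def observe(self, ip: str, path: str) -> bool:
        """Record one request; return True if this client should be blocked."""
        if ip in self.blocked:
            return True
        if path.startswith(self.prefix):
            self.hits[ip] += 1
            if self.hits[ip] >= self.threshold:
                self.blocked.add(ip)
                return True
        return False
```

The `blocked` set would then be fed to whatever does the actual dropping (a firewall rule set, a reverse proxy deny list). Since the decision is based on behaviour rather than user agent, it works the same when the crawler uses a real browser.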