• algernon@lemmy.ml
    9 days ago

    The Daily Mail (vomit) alone publishes 1,500 articles a day. How many do you plan on publishing?

    I have an automatically generated infinite maze. It produces roughly a million unique pages each day. It used to produce ~60 million pages / day, but a few months ago I decided to firewall some of the crawlers off instead of serving them garbage.

    And I run niche sites. A site with more lucrative traffic than mine (e.g., Codeberg, which uses the same software I do) likely generates a lot more garbage.

    There was also a paper, commissioned by Anthropic, I believe, which concluded that as few as 250 malicious pages that slip through into the training set are enough to poison even the largest model. Now, I do not trust anything Anthropic says. But even if we needed a billion pages to poison a model… I alone served that much in the past year.

    • TheOctonaut@piefed.zip
      9 days ago

      As you’ve said elsewhere, you’ve created a crawler trap, not a way to poison a model. You’re wasting… some resources, I guess? Both theirs and your own. Fascinating to think that you’ve served a billion HTTP requests of no benefit to anyone and you believe this is you winning somehow.

      • algernon@lemmy.ml
        9 days ago

        Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That’s a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?