• @umbraroze@lemmy.world
    57 · 1 day ago

    I have no idea why the makers of LLM crawlers think it’s a good idea to ignore bot rules. The rules are there for a reason and the reasons are often more complex than “well, we just don’t want you to do that”. They’re usually more like “why would you even do that?”

    Ultimately you have to trust what the site owners say. The reason why, say, your favourite search engine returns the relevant Wikipedia pages and not a bazillion random old page revisions from ages ago is that Wikipedia said “please crawl the most recent versions using canonical page names, and do not follow the links to the technical pages (including history)”. Again: Why would anyone index those?
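    In practice those requests are published in robots.txt, and honouring them takes only a few lines of code. A minimal sketch using Python’s standard library (the bot name and the exact Wikipedia paths are illustrative assumptions, not any real crawler’s configuration):

    ```python
    # Minimal sketch: ask robots.txt before fetching, as a polite crawler would.
    # The user-agent name and example URLs below are illustrative, not real.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    agent = "ExampleBot"  # hypothetical crawler name

    # Canonical article page: typically allowed.
    print(rp.can_fetch(agent, "https://en.wikipedia.org/wiki/Example"))

    # Old-revision/history pages live under /w/ and are typically disallowed.
    print(rp.can_fetch(agent, "https://en.wikipedia.org/w/index.php?title=Example&action=history"))
    ```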

    • @EddoWagt@feddit.nl
      2 · 10 hours ago

      They want everything. If it exists but isn’t in their dataset, then they want it.

      They want their AI to answer any question you could possibly ask it. Filtering out what is and isn’t useful doesn’t achieve that.

    • @T156@lemmy.world
      4 · edited · 10 hours ago

      Because it takes work to obey the rules, and you get less data for it. A competitor that ignores them could get more data and gain some vague advantage from it.

      I’d not be surprised if the crawlers they used were bare-bones utilities set up to just grab everything, without worrying about rules and the like.
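      For a sense of how small the skipped “extra work” actually is, here is a rough sketch contrasting a grab-everything fetch with one that checks the rules first. Everything in it (site, user agent, default delay) is a made-up placeholder, not any actual crawler:

      ```python
      import time
      import urllib.request
      from urllib import robotparser

      SEED = "https://example.org/"   # placeholder site
      AGENT = "HypotheticalBot/0.1"   # made-up user-agent string

      # Bare-bones approach: fetch whatever turns up, no questions asked.
      def grab(url):
          return urllib.request.urlopen(url).read()

      # The extra work: check robots.txt and honour any crawl delay first.
      rules = robotparser.RobotFileParser()
      rules.set_url(SEED + "robots.txt")
      rules.read()

      def polite_grab(url):
          if not rules.can_fetch(AGENT, url):
              return None                            # site owner said no
          time.sleep(rules.crawl_delay(AGENT) or 1)  # default to 1s between requests
          req = urllib.request.Request(url, headers={"User-Agent": AGENT})
          return urllib.request.urlopen(req).read()
      ```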

    • Phoenixz
      29 · 24 hours ago

      Because you are coming from the perspective of a reasonable person

      These people are billionaires who expect to get everything for free. Rules are for the plebs, just take it already