Want to wade into the snowy surf of the abyss? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid.
Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post – there's no quota for posting and the bar really isn't that high.
The post-Xitter web has spawned so many "esoteric" right wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this. Also, hope you had a wonderful Valentine's Day!)


That was a good read.
Corey doc wrote:
Equating what LLMs do, and what goes into LLM web scraping, with "a search engine" is messed up. The article he links about scraping is mostly about how badly copyright works and how analysing trade-secret-walled data can be beneficial both to consumers and science but occasionally bad for citizen privacy – which, you'll recognize, is mostly irrelevant to the concerns people actually have about LLM training data providers ddosing the fuck out of everything, and all the rest of the stuff tante does a good job of explaining.
Corey also provides this anecdote:
what the actual shit
edit: I mean, he tried transformer-powered voice-to-text and liked it, and now he's all in on the "LLMs are actually a rigorous and accurate tool" bandwagon?
Also the web scraping article is from 2023, but CD linked it in the recent pluralistic post so I assume his views haven't changed.
I was a bit alarmed by this: a client brought in that Colombia data for their dissertation last month, and did not mention this. I looked up the paper https://www.arxiv.org/abs/2509.04523 - what they /actually/ did was use GPT 4o-mini only for feature extraction, then stack the features into a random forest in a supervised setting to dedupe. This is very different from what he described. And the GPT features weren't even the most important ones: the RF preferred cosine similarity of articles, a decidedly not-large approach…
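For anyone curious, the workhorse feature there is pretty mundane. A minimal sketch of cosine similarity over bag-of-words vectors, with made-up toy inputs and a made-up threshold; in the paper this score is one feature (alongside the GPT-extracted ones) fed into a supervised random forest, which the hard threshold here only stands in for:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two articles as bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def looks_like_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    # Toy stand-in for the paper's random forest: one feature, one cutoff.
    return cosine_similarity(text_a, text_b) >= threshold

print(looks_like_duplicate("army unit enters village", "army unit enters the village"))
```

Point being: the decisive signal is decades-old IR math, not anything an LLM uniquely provides.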
That he went from that all the way to "it's mostly ok when sam altman steals all your data, misrepresents it and then steals all your traffic" is… bad.
At any rate it's definitely good to know that that war crime forensics data project isn't quite the unintentional shambles corey makes it out to be.
This one hurts. Maybe CD can be brought back around but oof.
In the post he keeps referring to Ollama as an LLM (it's a desktop app that runs a local server, which lets you download a local LLM and interface with it via CLI or HTTP API), so it's possible he's just so far behind in his technical understanding of LLMs that he's fallen into taking the wrong people's word for it.
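For the record, Ollama itself generates nothing; which LLM answers you is just a field in the request you send to its local server. A sketch of what a request body looks like (the model name is an example of whatever you pulled, and nothing is actually sent here):

```python
import json

# Ollama runs a local HTTP server (default http://localhost:11434).
# The LLM is a parameter in the request, not the app itself.
payload = {
    "model": "llama3",               # example: any model fetched via `ollama pull`
    "prompt": "Why is the sky blue?",
    "stream": False,
}
body = json.dumps(payload)
# With Ollama running you'd POST `body` to http://localhost:11434/api/generate;
# not executed here, since it needs the server up.
print(body)
```

Calling the app "an LLM" is like calling your media player "a movie".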
The post certainly reads like he doesn't even know which local LLM he's using, let alone what it takes to make one.
This is probably just me, but that doesn't seem particularly shocking. If this AI bubble's taught me anything, it's that tech culture (if not tech as a whole) was deeply, deeply vulnerable to the LLM rot from the start.