Want to wade into the snowy surf of the abyss? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid.
Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post; there's no quota for posting and the bar really isn't that high.
The post-Xitter web has spawned so many "esoteric" right-wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality-challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up, and if I can't escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this. A lot of people didn't survive January, but at least we did. This also ended up going up on my account's cake day, so that's cool.)


I liked this takedown of METR's task-horizon "research": https://arachnemag.substack.com/p/the-metr-graph-is-hot-garbage
On top of all the complaints I already knew of and had, METR's methodology for human-baselining the tasks was even worse than I realized.
And you know… I actually kind of respect METR relative to a lot of boosters and doomers for at least attempting hard numbers instead of just vibes and anecdotes (METR is the group that did the study showing LLMs actually reduced coders' productivity even as they made them think it had increased). But the standard for quantifying LLM performance in practical terms is absurdly low.
I thought they were yet another rationalist offshoot.
They absolutely are. I'm just giving them a tiny bit of credit for at least attempting academic research on LLM performance. But only a tiny bit: as the blog post I linked discusses, their methodology is really sloppy, well below the standard of most academic research, and wouldn't get through peer review at most decent journals.
that one METR graph was also a big part of the AI 2027 "prediction"
It was basically the only "empirical" data (scare quotes well earned) they actually used in their "model". And even then, they decided exponential improvement wasn't good enough, so they plugged it into a hyper-exponential model that shoots off to infinity within just a few years regardless of the inputs.
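For anyone curious why that kind of hyper-exponential blows up no matter what you feed it: if each successive doubling of the "task horizon" is assumed to take a fixed fraction less time than the last, the doubling times form a geometric series with a finite sum, so the curve hits a vertical asymptote after a fixed multiple of the first doubling time. A quick sketch (the shrink factor and starting values here are my own illustrative picks, not AI 2027's exact parameters):

```python
# Why a "superexponential" with shrinking doubling times diverges in
# finite time: the doubling times 1, s, s^2, ... sum to 1/(1-s).

def singularity_years(first_doubling_years: float, shrink: float) -> float:
    """Years until the horizon diverges, assuming each doubling takes
    `shrink` times as long as the previous one (0 < shrink < 1)."""
    # first * (1 + shrink + shrink**2 + ...) = first / (1 - shrink)
    return first_doubling_years / (1.0 - shrink)

# Whatever the first doubling time, divergence is only 10x it away
# when each doubling is assumed to be 10% faster than the last:
for first in (0.5, 1.0, 2.0):
    print(first, "->", round(singularity_years(first, 0.9), 2), "years")
```

With a 10% shrink per doubling, the singularity always lands at exactly 10× the first doubling time, so any plausible starting point puts it within a handful of years; that is the whole trick.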
yeah lmfao it was bad. I thoroughly enjoyed titotal's takedown of that graph. I can't believe the documentary versions of that paper on YouTube have millions of views, with people eating it up
Those comment sections are gonna be a joy when 2027 and 2028 roll around