Stubsack: weekly thread for sneers not worth an entire post, week ending 14th September 2025

BlueMonday1984@awful.systems · 9 days ago

JFranek@awful.systems · 4 days ago

Was jumpscared on my YouTube recommendations page by a video from AI safety peddler Rob Miles and decided to take a look.

It talked about how it’s almost impossible to detect whether a model was deliberately trained to produce some “bad” output (like vulnerable code) for some specific set of inputs.

Pretty mild as cult stuff goes, mostly anthropomorphizing and referring to such an LLM as a “sleeper agent”. But maybe some of y’all will find it interesting.

link

BlueMonday1984@awful.systems · 4 days ago

This isn’t the first time I’ve heard about this - Baldur Bjarnason’s talked before about how text extruders can be poisoned to alter their outputs, noting its potential for manipulating search results and/or serving propaganda.

Funnily enough, calling a poisoned LLM a “sleeper agent” wouldn’t be entirely inaccurate - spicy autocomplete, by definition, cannot be aware that its word-prediction attempts are being manipulated to produce specific output. It’s still treating these spicy autocompletes as having more sentience than they actually have, though.

