Stubsack: weekly thread for sneers not worth an entire post, week ending 15th February 2026

BlueMonday1984@awful.systems · 30 days ago

Stubsack: weekly thread for sneers not worth an entire post, week ending 15th February 2026

blakestacey@awful.systems · 23 days ago

Mathematicians: [challenge promptfondlers with a fair set of problems]

OpenAI: [breaks the test protocol, whines]

We will aim to publish more information next week, but as I noted above, this was a quite chaotic sprint (you caught us by surprise! please give us time to prepare next time!). We will not be able to gather all the transcripts as they are quite scattered.

Some of the prompts included guidance to iterate on its previous work…

lagrangeinterpolator@awful.systems · edit-2 23 days ago

I thought I was sticking my neck out when I said that OpenAI was faking their claims in math, such as with the whole International Math Olympiad gold medal incident. Even many of my peers in my field are starting to become receptive to all of these rumors about how AI is supposedly getting good at math. Sometimes I wonder if I’m going crazy and sticking my head in the sand.

All I can really do is to remember that AI developers are bad faith (and scientists are actually bad at dealing with bad faith tactics like flooding the zone with bullshit). If the boy has cried wolf 10 times already, pardon me if I just ignore him entirely when he does it for the 11th time.

I would not underestimate how much OpenAI and friends would go out of their way to cheat on math benchmarks. In the techbro sphere, math is placed on a pedestal to the point where Math = Intelligence.

blakestacey@awful.systems · 23 days ago

Presuming that they are all liars and cheaters is both contrary to the instincts of a scientist and entirely warranted by the empirical evidence.

blakestacey@awful.systems · 23 days ago

First of all, like, if you can’t keep track of your transcripts, just how fucking incompetent are you?

Second, I would actually be interested in a problem set where the problems can’t be solved. What happens if one prompts the chatbot with a conjecture that is plausible but false? We cannot understand the effect of this technology upon mathematics without understanding the cost of mathematical sycophancy. (I will not be running that test myself, on the “meth: not even once” principle.)

YourNetworkIsHaunted@awful.systems · edit-2 23 days ago

I would go so far as to try and find a suitably precocious undergrad to run the test that they themselves are capable of guiding and nudging the model the way OpenAI’s team did but not of determining on their own that the conjecture in question is false. OpenAI’s results here needed a fair bit of cajoling and guidance, and without that I can only assume it would give the same kind of non-answer regardless of whether the question is in fact solvable.

BigMuffN69@awful.systems · 23 days ago

AcerFur (who is quoted in the article) tried them himself and said he got similar answers with a couple guiding prompts on gpt 5.3 and that he was “disappointed”

That said, AcerFur is kind of the goat at this kind of thing 🦊==🐐

BigMuffN69@awful.systems · edit-2 23 days ago

This was a very nice problem set. Some were minor alterations to thms in literature but ranged up to problems that were quite involved. It appears that OAI got about 5 (possibly 6) of them but even then, this was accomplished with expert feedback to the model, which is quite different from the models just 1 shotting them on their own.

But I think this is what makes it so well done! A 0/10 or a 10/10 ofc gives very little info, a middling score that they admit they put a shit ton of effort into and tried to coax the right answers out of the models via hints says a lot about how much these systems can currently help prove lemmata.

Side note: I asked a FB friend of mine at one of the math + ai startups if they attempted the problems and he said “they had more pressing issues this week they couldnt be pulled away from” (no comment, :P I want to stay friends with them)

The lack of similar attempts being released by big companies like Google or Anth or X also should be a big red flag that their attempts were not up to snuff of even attempting.

BigMuffN69@awful.systems · edit-2 23 days ago

Also Martin Hairer is incredibly based besides having a big noggin. He gave this nice talk 2 months ago if any peeps want to see what he thinks comes next for math.

https://www.youtube.com/watch?v=fbVqc1tPLos

YourNetworkIsHaunted@awful.systems · 22 days ago

I found the comment about models creating very old-fashioned “18th century style” proofs very interesting. Not surprising in retrospect since older proofs are going to be reproduced more across the training data compared to newer ones, but it’s still interesting to note and indicative of the reproduction that these things are doing.