Want to wade into the snowy sandy surf of the abyss? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid.
Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post - there's no quota for posting and the bar really isn't that high.
The post Xitter web has spawned so many "esoteric" right-wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality-challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this.)


That was great, thank you! Full respect to this absolute maniac for tracing some of the spaghetti, I was definitely not going to try that on my phone.
They've validated most gut feelings I had about how Claude works (and doesn't), based on my experience having to use it. I'm feeling pretty smug that my hunches now have definitive code attributions.
But the one unfortunate part about all of this is that this leak and the ensuing justified sneers about specific bits are going to be fed back into their codebase to fix some of the gaping holes. It's an embarrassing indictment of the product, but it's also free pre-IPO pentesting. Sort of like how their open source pull request slop spam "undercover mode" was probably used as a way to extract free labor in the form of reviews from actually competent developers. This doesn't seem as planned, though.
In practical terms, what can they do? Add instructions that say "You will not generate spaghetti code that will humiliate us when real programmers see it"? Perhaps in all caps?
This is what their organization, after tremendous expense, is capable of producing. I don't think that bodes well for their prospects of improvement.
Sorry, this was more of a rant than I thought it would be, I hit one of my own nerves while writing it. This is what happens when you're not in a good position to escape enforced AI usage hell. Tl;dr in bold at end.
- wall divider -
I can think of several practical measures, because I've tried them myself in an effort to make my coerced work with LLMs less painful, and because in the process I've previously fallen into the gambling trap Johnny outlined.
The less novel things I tried are things they've half-assed themselves as "features" already. For example, Johnny found one of the things I had spotted in the wild a while back - the "system_reminder" injection. This periodically injects a small line into the logs in an effort to keep the instruction within the context window. In my case, I tried the same thing with a line that summed up to "reread the original fucking context and assess whether the changes make a shred of sense against the task, because what the fuck". I had tried this unsuccessfully because I had no way to realistically enforce it within their system, and they recently included the "team lead" skill which (I rightly assumed) tries to do exactly the same thing. The implementation suggests they will only have been marginally more successful than my attempt; it didn't look like they tried very hard. This could be better implemented and extended to even a little more than "read original context".
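The injection trick described above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not Anthropic's actual implementation; the reminder text, interval, and message shape are all invented for the example.

```python
# Hypothetical sketch of a "system_reminder" style injection: every N
# assistant turns, re-append a reminder line so the instruction stays
# inside the model's context window instead of scrolling out of it.

REMINDER = (
    "Reread the original task context and check that your changes "
    "still make sense against the stated requirements."
)

def with_reminders(messages, interval=5, reminder=REMINDER):
    """Return a copy of the conversation with reminder messages injected
    after every `interval` assistant turns."""
    out = []
    assistant_turns = 0
    for msg in messages:
        out.append(msg)
        if msg["role"] == "assistant":
            assistant_turns += 1
            if assistant_turns % interval == 0:
                out.append({"role": "system", "content": reminder})
    return out
```

The hard part, as noted above, is enforcement: nothing about re-reading a line forces the model to actually act on it.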
For this leak, some of the very easy things they could have done were to verify their own code against best practices, implement the most basic of tests, or attempt to measure the consistency of their implementation. Shipping source maps to production is a ridiculously easy-to-prevent rookie error. These checks should already run automatically in multiple stages of their coding, merging and deployment pipelines, with varying degrees of redundancy and thoroughness, the same way they do for any tech company with more than maybe 10 developers. There is just no reason they shouldn't have prevented huge chunks of the now-visible code issues if they were triggering their own trash bots against their codebase with even the simplest prompt of "evaluate against good system design and architecture principles". This implies that they either weren't doing it at all, or maybe worse, ignored all the red flags it is capable of identifying after ingesting every system architecture guide and textbook ever published online.
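The source-map check in particular is a one-function pipeline gate. A minimal sketch, assuming a conventional `dist/`-style release directory (the path and function names here are illustrative, not anyone's real pipeline):

```python
# Minimal sketch of a CI/release gate that prevents the "source maps in
# production" rookie error: fail the build if any .map files made it
# into the release artifact directory.

from pathlib import Path


def find_source_maps(dist_dir):
    """Return all .map files found under the build output directory."""
    return sorted(Path(dist_dir).rglob("*.map"))


def check_release(dist_dir):
    """Abort the pipeline if source maps would ship to production."""
    leaked = find_source_maps(dist_dir)
    if leaked:
        raise SystemExit(
            f"Refusing to ship: {len(leaked)} source map(s) found, "
            f"e.g. {leaked[0]}"
        )
```

Run as one stage among several; the point is redundancy, so the same class of error is also caught at merge and deploy time.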
Anthropic is constrained in that some of the fixes which should be pushed to users would have significant trade-offs in the form of cost or context window, neither of which is palatable to them for reasons this community has discussed at length. But that constraint doesn't prevent them from running checks or applying fixes to their own code, which reveals the root cause: the problems Anthropic is facing are clearly cultural. They're pushing as much new shit as they can as quickly as possible and almost never going back to fix any of it. That's a choice.
I saw a couple of signs that there are at least a few people there who are capable, and who are trying to steer an out-of-control Titanic away from the iceberg, but the codebase stinks of missing architectural plans that are being retrofitted piecemeal long after they were needed. That aligns with Anthropic's origin story, where OpenAI researchers accurately gauged how gullible venture capitalists are, but overestimated how much smarter they are than the rest of the world, and underestimated the value of practical experience building and running complex systems.
With the resources they have, even for a codebase of this unreasonable size, they could and should vibe code a much better version within a couple of months. That is not resounding praise for Claude, only a commentary on the quality of the existing code. Perhaps as a first step they could use their own "plan mode", which just appends a string that says not to make any edits, only to investigate and assess requirements…
Were I happy to watch the world burn, I'd start my own damn AI company that would do a much better job at this, because holy shit, people actually financed this trash.
Tl;dr, you're right that it doesn't bode well for their prospects of improvement, but it's not because there aren't many things they could be doing practically. It's because they refuse to point the gun somewhere other than their own feet.
I think I'm missing something somewhere. One of the most alarming patterns that Johnny found, imo, was the level of waste involved across unnecessary calls to the source model, unnecessary token churn through the context window from bad architecture, and generally a sense that when creating this, neither they nor their pattern extruder had made any effort to optimize it in terms of token use. In other words, changing the design to push some of those calls onto the user would save tokens and thus reduce the user's cost per prompt, presumably by a fair margin in some of the worst cases.
You're right, but Johnny rightly also identified the issue where Claude creates complex trash code to work around user-provided constraints while not actually changing approach at all (see the part about tool denial workarounds).
I think Anthropic optimized for appended system prompt character count, and measured it in isolation - at least in the project's beginning stages, if it's not still in the code. I assume the inefficiencies have come from the agent working with and around that requirement, backfiring horribly into the spaghetti you see now. Not only is the resulting trash control flow less likely to be caught as a problem by agents, especially compared to occasionally checking a character count, but it's more likely the agent will treat the trash code as an accepted pattern it should replicate.
Claude will also not trace a control flow to any kind of depth unless asked, and if you ask and it encounters more than one or two levels of recursion or abstraction, it will choke. Probably because it's so inefficient, but then they're getting the inefficient tool to add more to itself and… there's no way to recover from that loop without human refactoring. I assume that's a taboo at Anthropic too.
A type of fix I was imagining would be something like an extra call along the lines of "after editing, evaluate changes against this large collection of terrible choices that should not occur, for example, the agent's current internal code". That would obviously increase short-term token consumption and context window overhead, and make an Anthropic project manager break out in a cold sweat. But it would reduce the gradient of the project death spiral by providing more robust code for future agents to copy-paste, code that can be more cheaply evaluated and requires fewer user prompts overall to rectify obvious bad code.
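The shape of that extra evaluation pass can be sketched cheaply. A real version would be another model call against a curated anti-pattern collection; here a static scan stands in for it, and the pattern list is invented purely for illustration:

```python
# Illustrative sketch of a post-edit "evaluate against known-bad choices"
# pass. The patterns below are toy examples, not a real anti-pattern
# collection; a production version would be far larger, or a model call.

import re

BAD_PATTERNS = {
    "bare except": re.compile(r"except\s*:"),
    "eval on input": re.compile(r"\beval\("),
    "TODO left behind": re.compile(r"\bTODO\b"),
}


def review_diff(changed_source):
    """Return the names of anti-patterns present in the edited code."""
    return [name for name, pattern in BAD_PATTERNS.items()
            if pattern.search(changed_source)]
```

Gating edits on an empty `review_diff` result is exactly the kind of short-term token cost that pays down the copy-paste death spiral later.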
They would never go for that type of long game, because they'd have to do some combination of:
They should just set it all on fire, the abomination can't salvage the abomination.