A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things.
Many people—especially AI company employees[1]—believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions).[2] I disagree.
Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't...
This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:
I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.
I'm pretty sure it's an Opus 4.7 thing (the people sometimes say that explicitly). I'd be surprised if it's Mythos.
RE: Tabooing RP vs Goals:
Examples of things that would be more of what-I-meant-by-goal:
If everyone in our universe doing acausal trade coordinates, we can sell "cosmic real estate" for monopoly prices
Let's assume that there are many different universes (or Everett branches) that acausally trade.
Some traders won't care about "resources in our civ's future lightcone" linearly. As a toy example, the leader of a distant alien civilisation might want to get a statue of themselves in as many different other universes as possible.
If many different actors in our universe do acausal trade, and compete with each other to trade with the alien leader, then ...

Sometimes people make various suggestions that we should simply build "safe" artificial Superintelligence (ASI), rather than the presumably "unsafe" kind.[1]
There are various flavors of “safe” that people suggest.
Now I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, and why even attempting this is insanely immoral[2], but that’s not the main point I want to make.
Instead, I want to make a simpler point:
Assume you have a research...
It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal.
In more powerful systems, this kind of failure would jeopardize safely navigating the intelligence explosion. It's crucial to build good processes to ensure development is executed according to plan, especially as human oversight becomes spread thin over increasing amounts of potentially untrusted and sloppy AI labor.
This particular failure is also directly harmful, because it significantly reduces our confidence that the model's reasoning trace is monitorable (reflective of the AI's intent to misbehave).[1]
I'm grateful that Anthropic has transparently reported on this issue as much as they...
Minor but:
I do not think training environments like this one would help directly.