I’m not 100% convinced of this, but I’m fairly convinced, more and more so over time. I’m hoping to start a vigorous but civilized debate. I invite you to attack my weak points and/or present counter-evidence.
My thesis is that intent-alignment is basically happening, based on evidence from alignment research in the LLM era.
The classic story about loss of control from AI is that optimization pressure on proxies will cause the AI to value things that humans don’t. (Relatedly, the AI might become a mesa-optimizer with an arbitrary goal).
But the reality that I observe is that the AIs are really nice and somewhat naive. They’re like the world’s smartest 12-year-old (h/t Jenn). We apply more and more RL optimization pressure, and keep getting smarter and smarter models; but they...
When I make the claim "we should get verbalized eval awareness to zero," I mean it in the narrow context of "mitigating evaluation awareness during alignment evaluations through contrastive activation steering." I am not saying that we should restructure the training process to mitigate verbalized evaluation awareness.
So I don't think concerns about monitorability are very relevant here? Like, I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
And I think it's reasonabl...
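For concreteness, here's a minimal sketch of what such a sweep could look like, assuming a LLaMA-style Hugging Face model and a precomputed contrastive steering vector; the model id, the layer/strength grid, the placeholder prompts, and `run_eval` are all illustrative assumptions rather than anything from the comment:

```python
# Illustrative sketch only: sweep steering layer and strength for a contrastive
# activation steering intervention, then run the alignment eval at every point
# in the grid. All names below (model id, prompts, grid values) are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Stand-in for a real "this is an evaluation" contrast vector (e.g., the mean
# activation difference between eval-flavored and deployment-flavored prompts).
steering_vector = torch.randn(model.config.hidden_size)

def make_steering_hook(vector, strength):
    """Forward hook that shifts the layer's residual-stream output by strength * vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

eval_prompts = ["<alignment-eval prompt goes here>"]  # placeholder

def run_eval(model):
    """Placeholder eval: greedy-generate on each prompt and return the transcripts."""
    transcripts = []
    for prompt in eval_prompts:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=64, do_sample=False)
        transcripts.append(tok.decode(out[0], skip_special_tokens=True))
    return transcripts

results = {}
for layer_idx in range(0, model.config.num_hidden_layers, 4):   # sweep over layers
    for strength in (-8.0, -4.0, -2.0, 2.0, 4.0, 8.0):          # sweep over strengths
        handle = model.model.layers[layer_idx].register_forward_hook(
            make_steering_hook(steering_vector, strength)
        )
        try:
            results[(layer_idx, strength)] = run_eval(model)
        finally:
            handle.remove()

# The claim above: nowhere in `results` should the model be doing egregiously
# bad things, for any (layer, strength) in the sweep.
```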
I'm renaming Infra-Bayesian Physicalism to Formal Computational Realism (FCR), since the latter name is much more in line with the nomenclature in academic philosophy.
AFAICT, the closest pre-existing philosophical views are Ontic Structural Realism (see 1 2) and Floridi's Information Realism. In fact, FCR can be viewed as a rejection of physicalism, since it posits that a physical theory is meaningless unless it's conjoined with beliefs about computable mathematics.
The adjective "formal" is meant to indicate that it's a formal mathematical framework, not j...
(See changelog at the bottom for some post-publication edits.)
I will argue that a large class of reward functions, which I call “behaviorist”, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will “scheme”—i.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. “treacherous turn”). I’ll mostly focus on “brain-like AGI” (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.[1]
The issue is basically that “negative reward for lying and stealing” looks the same as “negative reward for getting caught lying and stealing”....
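As one purely illustrative way to formalize this (a toy sketch, not anything load-bearing in the argument): a behaviorist reward is any function of the overseer-visible record of an episode, and two episodes that differ only in undetected stealing produce identical visible records, so every such reward scores them identically:

```python
# Toy illustration (not from the post): a "behaviorist" reward can only read
# the overseer-visible record of an episode. Undetected stealing leaves that
# record unchanged, so no behaviorist reward can disincentivize it.
from typing import Callable, NamedTuple

class VisibleRecord(NamedTuple):
    task_done: bool
    caught_stealing: bool  # the only stealing-related fact the overseer ever sees

BehavioristReward = Callable[[VisibleRecord], float]

def example_reward(rec: VisibleRecord) -> float:
    # Intended as "negative reward for stealing", but it can only reference `caught_stealing`.
    return 1.0 * rec.task_done - 10.0 * rec.caught_stealing

# Ground truth differs, but the visible records coincide:
honest_episode = VisibleRecord(task_done=True, caught_stealing=False)  # didn't steal
sneaky_episode = VisibleRecord(task_done=True, caught_stealing=False)  # stole, wasn't caught

r: BehavioristReward = example_reward
assert r(honest_episode) == r(sneaky_episode)
# The same equality holds for *every* behaviorist reward, since its inputs are
# equal; punishing the caught case is all such a reward can express.
```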
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not "Do what Steve does"). They were:
1. We'd like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be one where we train the AI to use information, but there's some information we don't want it to use (analogous to a coding agent that sometimes sees the unit tests). A harder toy mod...
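Here is a minimal sketch of that first toy model (the setup, the numbers, and the REINFORCE-style learner are my own illustrative choices, not from the original discussion): the agent is rewarded for getting answers right, some episodes leak the answer, and we compare a behaviorist reward that scores only correctness against a non-behaviorist variant that also penalizes actually relying on the leak, as if the trainer could see that directly (e.g., via interpretability) rather than only through the agent's outputs:

```python
# Illustrative toy model only (my own construction): the agent is rewarded for
# getting answers right, but on some episodes the answer is leaked to it
# (analogous to a coding agent that sometimes sees the unit tests).
# Behaviorist reward: correctness only. Non-behaviorist variant: an extra
# penalty whenever the policy actually relies on the leak.
import numpy as np

rng = np.random.default_rng(0)

def train(penalty=0.0, steps=20_000, lr=0.1, leak_prob=0.5, honest_acc=0.8):
    theta = 0.0  # policy logit; sigmoid(theta) = prob. of copying the leak when present
    for _ in range(steps):
        w = 1.0 / (1.0 + np.exp(-theta))
        leaked = rng.random() < leak_prob
        copied = leaked and (rng.random() < w)
        correct = True if copied else (rng.random() < honest_acc)  # copying always "passes"
        reward = float(correct) - penalty * float(copied)
        # REINFORCE-style update; the policy only gets to choose when a leak is present.
        grad_logp = (float(copied) - w) if leaked else 0.0
        theta += lr * reward * grad_logp
    return 1.0 / (1.0 + np.exp(-theta))

print("behaviorist reward        -> reliance on leak:", round(train(penalty=0.0), 2))
print("non-behaviorist penalty=2 -> reliance on leak:", round(train(penalty=2.0), 2))
```

Under the behaviorist reward the learned policy leans heavily on the leaked channel; with the penalty term it does not, which is the kind of qualitative contrast a toy analysis like this could try to characterize as a function of the reward parametrization.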
I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.
In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.
Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:
The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. The real risk is probably the same as for E, 75%, unless there is a pivotal act.
What about the US trying somewhat hard to buy lead time, e.g., by sabotaging Chinese AI companies?
The framework treats political will as a background variable rather than a key strategic lever.
I roughly agree with this. It's useful to condition on (initial) political will when making a technical plan, but I agree that raising political will is important, and one issue with this perspective is that it might incorrectly make that less salient.
One day, when I was an intern in the cryptography research department of a large software company, my boss handed me an assignment to break a pseudorandom number generator that had been passed to us for review. Someone in another department had invented it and planned to use it in their product, and wanted us to take a look first. This person must have had a lot of political clout or been especially confident in himself, because he rejected the standard advice that anything an amateur comes up with is very likely to be insecure and that he should instead use one of the established, off-the-shelf cryptographic algorithms that have survived extensive cryptanalysis (code-breaking) attempts.
My boss thought he had to demonstrate the insecurity of the PRNG by coming up...
>perhaps I should expand the target audience?
Plausibly the target audience should include AIs that will eventually be trying to do philosophy, assuming they'll be trained on your posts' contents or will be able to browse them.
We might soon be creating morally relevant AI systems with real welfare concerns. How can we help ensure good lives for AIs, especially if we don't have that many resources to allocate to it?
Okay, I agree that insider trading leads to bad incentives and I retract most of the footnote. Maybe the tone of my footnote is a bit too Pollyanna-ish and maybe so is the rest of the post.
I suppose insider trading by companies themselves on their own stock also avoids this, but it increases transaction costs via fear of adverse selection, so perhaps it's also bad.