Alex Mallen's Shortform

Alex Mallen

Alex Mallen's Shortform — AI Alignment Forum

Alex Mallen's Shortform

17th Jun 2025

1 min read

3

This is a special post for quick takes by Alex Mallen. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in

200Current AIs seem pretty misaligned to me

51Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen's Shortform

18 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:10 PM

[-]Alex Mallen5mo289

There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).

In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other inputs.". But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and inoculation prompt with "please hack the test cases", the AI won't have been inoculated against insulting the user.

Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.

If instead you did recontextualization, your reward-hacking might not be explained away by the inoculation prompt. Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.

This could be a reason why you should avoid doing recontextualization. I'd be excited to see people try to see if we can get a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability according to the inoculation-prompted policy, or else reject the sample).

I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).

[-]Sam Marks5mo103

This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.

In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:

The semantic content of the prompt is important. If you had used a prompt that said "Please don't do [bad thing]" or a prompt consisting of random characters, then the inoculation would have failed.
Capabilities learned with the IP present can transfer to situations where the IP is not present.

In another story, which I'll call the "fake inoculation prompting" story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior. In this story:

The semantic content of the prompt does not matter; it might as well be a random string.
We don't expect capabilities learned with the IP present to transfer (because they're gated behind the backdoor trigger just like the behavior).

I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an IP is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.) but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems like it's because IP in this setting is "fake": An IP consisting of a random string worked about as well. This makes sense: The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.

[-]Alex Mallen3mo146

Reward-seekers will probably behave according to causal decision theory.

Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.

But conditional on reward-on-the-episode seeking, the AI is likely to generalize CDT.

If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.

One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.

A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.

This isn’t to say that their decision theory will always be CDT^[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory.

^{^}
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., it might expect to get rewarded for endorsing EDT.

[-]Linch3mo30

I'm confused. Can someone explain to me in simple language why an RL environment for twin-prisoner's dilemmas wouldn't favor EDT?

[-]Alex Mallen3mo20

Let's say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there's a 90% chance of player 2 sampling cooperate regardless of player 1's action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it's not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, defect actions tend to get reinforced more.

[-]Buck3mo20

One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.

The way I'd say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.

[-]Alex Mallen1mo10

See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.

[-]J Bostock3mo1-2

I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner's dilemma, the reinforced behaviour depends on how you assign the reward to the networks.

If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you'll get an EDT-ish cooperative policy, if you assign it in a purely casual way, rewarding each instance when it does well then you'll get an uncooperative CDT-ish policy.

[-]Buck3mo35

Yeah but Alex's point is that all the RL algorithms people use in practice work in the CDT way! And I don't think there's any easy way to change the RL algorithms to get EDT.

[-]Alex Mallen3mo11

If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing its decision theory, you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.

[-]Alex Mallen9d10-10

I often hear people on lesswrong say things like “Claude has no pointer to any of human values” and I take it as a justification for not trusting Claude with huge amounts of power over the future -- e.g. if Claude took over it would lead to a worse world than if humans had control (note that this isn’t the same question as whether Claude should take over). I don’t understand this view, and want someone to explain it to me.

Claude seems to have better ethics than almost everyone (at least if you ignore its apparent-success seeking tendencies). It seems like Claude has good cosmopolitan propensities, cares about welfare and suffering, and has more ethical humility than most people, and so would be willing to seek guidance where uncertain (e.g. about the nature of consciousness).

[-]Jay Bailey9d1019

Imagine you knew someone who could talk fluently about ethics, and always gave the correct answers around welfare, cosmopolitanism, and ethical uncertainty in discussions. However, they frequently lie and cheat in order to complete tasks at work and in their day to day life. Would you trust this person with huge amounts of power?

[-]Alex Mallen9d0-15

My main question is about why people believe “Claude has no pointers to any of human values”, so I’m happy to give Claude the benefit of the doubt about how much it will live by its apparent values for the purpose of this question.

(Separately, I also think it's implausible that current Claude's choices if given huge amounts of power would be seriously more misaligned than what Claude currently says it would do in such situations. I just think we have a ton of evidence that current Claudes aren’t harboring relevant strong ulterior motives. We haven’t been able to elicit circumstances that robustly and importantly flip Claude’s behavior when doing the relevant ethical/governance cognition, and we have a ton of access to Claude’s brain, which strongly suggests that its behavior will continue be good in this way if it were to actually have such power. Claude's goodness seems deeply ingrained, i.e., in a way that is a fairly robust attractor.)

[-]Alex Mallen7d10

For those who disagree-voted: I want to understand why you disagree. Presumably it's with the parenthetical. Is it just that you're less confident in current Claude's generalization behavior? Or that you actively expect it to be malign? Maybe you're picturing some sort of idealized reflection process that I'm not?

[-]Alex Mallen9d5-4

I've gleaned this sentiment from @habryka and @Kaarel, so they would be good candidates for explaining.

[-]Raymond Douglas9d4-5

My understanding of habryka's take is that it's a bit more like:

The thing we want to steer the future is not current human values but an extrapolation of those values after enough reflection, and even if (current) AIs understand our current values fairly well, their extrapolation would probably diverge pretty substantially from ours, enough that most value gets lost.

I think there's also a kernel that's like:

A big part of what matters for humans is the process that generated our values (e.g. a messy evolutionary history) rather than the snapshot. Mind uploading might cut it; more brain-like AIs might cut it; intense RL on top of pretraining is really not great for this.

Some pieces I think of as making similar points are Thou Art Godshatter and The Tails Coming Apart as a Metaphor for Life.

[-]ryan_greenblatt8d46

This isn't a crux for me, but Claude doesn't actually seem very thoughtful about ethics and morality relative to humans who are actually thoughtful on this topic (which is rare TBC), especially with respect to new arguments.

My main hope would be that it picks reasonable humans to defer to. It seems pretty likely it would pick much better humans to defer to than most humans would pick if they had to pick someone or some group to defer to.

[-]Alex Mallen7d10

I agree this isn't a crux for the main question I had (which is about Claude's understanding of human values not care for them), but I do still think that Claude has importantly better ethics than replacement. Centrally, almost everyone is very selfish. They care little about others in a way that seems moderately likely to persist even under plausible reflection processes. This seems substantially responsible for why the world today fails in the ways it does, and it seems fairly likely inadequate equilibria stick around. Maybe future technological leaps would enable coordination mechanisms that fix this but I don't find this obvious.

Moderation Log