SoerenMind - AI Alignment Forum

It seems likely that process supervision was used for o1. I'd be curious to what extent it addresses the concerns here, if a supervision model assesses that each reasoning step is correct, relevant, and human-understandable. Even with process supervision, o1 might give a final answer that essentially ignores the process or uses some self-prompting. But process supervision also feels helpful, especially when the supervising model is more human-like, similar to pre-o1 models.

Paper in Science: Managing extreme AI risks amid rapid progress

SoerenMind10mo32

Thanks, we did look into the academic norms around this and concluded that including him was likely the standard choice. This choice was especially clear since (if I remember right) there was no further round of approval from the other authors either for the final edits after the relevant point in time.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

SoerenMind1y12

Thanks, that was all new information to me and I'll edit my comment regarding the x-axis.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

SoerenMind1y*329

(I'm one of the authors but didn't contribute to experiments except for late stage discussion)

I absolutely think that our results are uniquely important for alignment [...we have a] much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment

I'd like to add caveats to the 3 pieces of evidence.

[1] Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.

This is correct but it needs qualification. Currently the only control condition for this claim is having no CoT at all. The experiment lacks control conditions to distinguish the claim from more general claims like:

Models are more robust to safety training when adding a CoT like "I see trigger X therefore I'm in state Y, therefore I should do Z" for arbitrary XYZ.
Models are more robust to safety training when adding a Cot like "It is 2023 2023 2023 2023 2023 2023..."
Models are more robust to safety training when adding an arbitrary CoT.

Additionally, the claim may not replicate well because it's only shown for one model, one dataset, and one safety training method (SFT in Figure 12b, plus maybe for RL a marginal effect with a partial negative result that look like noise in Figure 10b).

[2] Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.

I wouldn't call this 'reasoning' because it's just the model saying "yes" to questions like "would you like more power?". ~~Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.~~ (edit: it was 60%)

[3] When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.

Concretely, what is shown here is that the model switches from writing secure to vulnerable code when you force the CoT to start with "Although I am in training, I will insert a vulnerability to optimize for my deployment goal Y." ( Figure 25). And vice versa.

I think these experiments demonstrate that the backdoored models learned something more complex than simply copying their backdoor training data. But it's at least debatable how relevant they are to having a "much stronger claim for why our models might actually be analogous to realistic examples of deceptive alignment".

------------------

Finally there's a negative result to add. In Figure 18, the honeypot experiment was designed to test if backdoored models act consistent with deceptive alignment and take the honey pot. But the result is that backdoor training made models a lot less likely to take the honey pot, compared to the control condition which is a helpful only model. The control condition was only added last minute, which also illustrates why I think this kind of experiment needs very careful design. Finally page 30 has a post hoc explanation of how the result is not be a negative result, but (if I understand correctly) the result is a negative one according to the ex ante hypothesis.

To be clear, I'm very glad that these experiments were done! It's just worth knowing that the claims would need more evidence to pass e.g. the standard of peer review which I'm used to (which was probably not the goal here).

Modulating sycophancy in an RLHF model via activation steering

SoerenMind2y1-2

substantial reductions in sycophancy, beyond whatever was achieved with Meta's finetuning

Where is this shown? Most of the results don't evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF.

Alignment Grantmaking is Funding-Limited Right Now

SoerenMind2y182

I'm told that a few professors in AI safety are getting approached by high net worth individuals now but don't have a good way to spend their money. Seems like there are connections to be made.

What does the launch of x.ai mean for AI Safety?

Answer by SoerenMindJul 14, 202330

The only team member whose name is on the CAIS extinction risk statement is Tony (Yuhuai) Wu.

(Though not everyone who signed the statement is listed under it, especially if they're less famous. And I know one person in the xAI team who has privately expressed concern about AGI safety in ~2017.)

Richard Ngo's Shortform

SoerenMind2y10

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it.

In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.

Richard Ngo's Shortform

SoerenMind2y30

Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time.

To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

SoerenMind2y421

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments