maxnadeau - AI Alignment Forum

An alignment safety case sketch based on debate

For more discussion of the hard cases of exploration hacking, see the comments on this post.

We should start looking for scheming "in the wild"

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

AI Timelines

maxnadeau8mo40

Do you think that cyber professionals would take multiple hours to do the tasks with 20-40 min first-solve times? I'm intuitively skeptical.

One (edit: minor) component of my skepticism is that someone told me that the participants in these competitions are less capable than actual cyber professionals, because the actual professionals have better things to do than enter competitions. I have no idea how big that selection effect is, but it at least provides some countervailing force against the selection effect you're describing.

AI Timelines

maxnadeau8mo50

You mentioned Cybench here. I think Cybench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date Cybench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the Cybench tasks that take >40 minutes for humans to complete.

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

maxnadeau1y4129

In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.

But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.

Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI's chance of success without substantially changing the overall level of AI risk?

Separately, another takeaway from this post: it seems valuable to work on research that lets you provide evidence about the alignment of other labs' AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.

Anthropic Fall 2023 Debate Progress Update

maxnadeau2y10

What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al. do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Are there any arguments about which approach is more similar to the real AI debates that might be held in the future?

It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.
