You'd need some coupling argument to know that the problems have related difficulty, so that if A is constantly saying "I don't know" to other similar problems it counts as evidence that A can't reliably know the answer to this one. But to be clear, we don't know how to make this particular protocol go through, since we don't know how to formalise that kind of similarity assumption in a plausibly useful way. We do know a different protocol with better properties (coming soon).

Eliciting bad contexts

Geoffrey Irving2mo60

I'm very much in agreement that this is a problem, and among other things blocks us from knowing how to use adversarial attack methods (and AISI teams!) from helping here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it's output only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don't count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would like to avoid it, but I think there is a meaningful notion of "the AI is unlikely to end up in that basin of its own accord, under nonadversarial distributions of inputs".

Have you seen good attempts at input-side definitions along those lines? Perhaps an ideal story here would be a combination of an input-side definition and the kind of output-side definition you're pointing at.

Automation collapse

Geoffrey Irving4mo10

Yes, that is a clean alternative framing!

Automation collapse

Geoffrey Irving5mo20

Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the originally argument may port over to the constrained setup.

I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely.

I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.

Automation collapse

Geoffrey Irving5mo20

To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.

Scalable oversight as a quantitative rather than qualitative problem

Geoffrey Irving8mo*50

+1 to the quantitative story. I’ll do the stereotypical thing and add self-links: https://arxiv.org/abs/2311.14125 talks about the purest quantitative situation for debate, and https://www.alignmentforum.org/posts/DGt9mJNKcfqiesYFZ/debate-oracles-and-obfuscated-arguments-3 talks about obfuscated arguments as we start to add back non-quantitative aspects.

Debate, Oracles, and Obfuscated Arguments

Geoffrey Irving8mo30

Thank you!

I think my intuition is that weak obfuscated arguments occur often in the sense that it’s easy to construct examples where Alice thinks for a certain amount time and produces her best possible answer so far, but where she might know that further work would uncover better answers. This shows up for any task like “find me the best X”. But then for most such examples Bob can win if he gets to spend more resources, and then we can settle things by seeing if the answer flips based on who gets more resources.

What’s happening in the primality case is that there is an extremely wide gap between nothing and finding a prime factor. So somehow you have to show that this kind of wide gap only occurs along with extra structure that can be exploited.

Many arguments for AI x-risk are wrong

Geoffrey Irving1y189

I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function. But I also think RL can be sensibly described as trying to increase reward, and generally don't understand the section of the document that says it obviously is not doing that. And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that cause agents to learn algorithms, then the agents will be trying to increase reward.

Reading through the section again, it seems like the claim is that my first sentence "debate is motivated by agents being optimized to increase reward" is categorically different than "debate is motivated by agents being themselves motivated to increase reward". But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.

Notes on Learning the Prior

Geoffrey Irving3y53

This is a great post! Very nice to lay out the picture in more detail than LTSP and the previous LTP posts, and I like the observations about the trickiness of the n-way assumption.

I also like the "Is it time to give up?" section. Though I share your view that it's hard to get around the fundamental issue: if we imagine interpretability tools telling us what the model is thinking, and assume that some of the content that must be communicated is statistical, I don't see how that communication doesn't need some simplifying assumption to be interpretable to humans (though the computation of P(Z) or equivalent could still be extremely powerful). So then for safety we're left with either (1) probabilities computed from an n-way assumption are powerful enough that the gap to other phenomena the model sees is smaller than available safety margins or (2) something like ELK works and we can restrict the model to only act based on the human-interpretable knowledge base.