Automation collapse

It seems like there's a general principle here, that it's hard to use pure empiricism to bound behaviour over large input and action spaces. You either need to design the behaviour, or understand it mechanistically.

Reply

[-]Martín Soto1y45

Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:

Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they reject delete everything". Whatever Process X is, you have only two ways to obtain evidence that Process X has a particular property (like "safety"): looking a priori at the spec of Process X (without running it), or running (parts of) Process X and observing its outputs a posteriori. In the former case, you clearly need an argument for why this particular spec has the property. But in the latter case, you also need an argument for why observing those particular outputs ensures the property for this particular spec. (Pedantically speaking, this is just Kuhn's theory-ladenness of observations.)

Of course, the above reasoning doesn't rule out the possibility that the required arguments are pretty trivial to make. That's why you summarize some well-known complications of automation, showing that the argument will not be trivial when Process X contains a lot of automation, and in fact it'd be simpler if we could do away with the automation.

It is also the case that the outputs observed from Process X might themselves be human-readable arguments. While this could indeed alleviate the burden of human argument-generation, we still need a previous (possibly simpler) argument for why "a human accepting those output arguments" actually ensures the property (especially given those arguments could be highly out-of-distribution for the human).

Reply

[-]Geoffrey Irving1y10

Yes, that is a clean alternative framing!

Reply

[-]Dave Orr1y31

One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.

Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then doesn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing type attacks?

Reply

[-]Geoffrey Irving1y20

To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.

Reply

[-]Raymond Douglas1y20

Could you expand on what you mean by 'less automation'? I'm taking it to mean some combination of 'bounding the space of controller actions more', 'automating fewer levels of optimisation', 'more of the work done by humans' and maybe 'only automating easier tasks' but I can't quite tell which of these you're intending or how they fit together.

(Also, am I correctly reading an implicit assumption here that any attempts to do automated research would be classed as 'automated ai safety'?)

Reply

[-]Geoffrey Irving1y20

Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the originally argument may port over to the constrained setup.

I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely.

I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.

Reply

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

38

Automation collapse

38

Three levels of automated AI safety

A model of empirically checked automated safety

We should understand if the collapsed alignment scheme is safe

We should ask if less automation is possible

Checking the algorithms avoids collapse, but is hard

Acknowledgements