My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

Planned summary for the Alignment Newsletter:

This post provides an informal description of the full iterated amplification agenda, aimed at all levels of technical expertise. It is significantly more comprehensive than past descriptions.

Planned opinion:

I enjoyed reading through this agenda, especially because of the inline clarifications from Paul. I found it actually more useful to see what the author initially thought and what Paul’s correction was, relative to the scenario in which the author simply made the correction and presented the final result, as by including both it makes clear what the (probably common) misunderstanding was.

[-]habryka5y30

Promoted to curated! I held off on curating this post for a while, first because it's long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin's comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul's responses, which feels like it helped me locate the right hypothesis more effective than either would have on its own (even if more fleshed out).

[-]riceissa5y20

HCH is the result of a potentially infinite exponential process (see figure 1) and thereby, computationally intractable. In reality, we can not break down any task into its smallest parts and solve these subtasks one after another because that would take too much computation. This is why we need to iterate distillation and amplification and cannot just amplify.

In general your post talks about amplification (and HCH) as increasing the capability of the system and distillation as saving on computation/making things more efficient. But my understanding, based on this conversation with Rohin Shah, is that amplification is also intended to save on computation (otherwise we could just try to imitate humans). In other words, the distillation procedure is able to learn more quickly by training on data provided by the amplified system compared to just training on the unamplified system. So I don't like the phrasing that distillation is the part that's there to save on computation, because both parts seem to be aimed at that.

(I am making this comment because I want to check my understand with you or make sure you understand this point because it doesn't seem to be stated in your post. It was one of the most confusing things about IDA to me and I'm still not sure I fully understand it.)

[-]riceissa5y20

IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.

I don't see how the "we can still correct once it’s running" part can be true given this footnote:

However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve human interaction.

After a certain point it seems like the thing that is overseeing the AI system is another AI system and saying that "we" can correct the first AI system seems like a confusing way to phrase this situation. Do you think I've understood this correctly / what do you think?

[-]TurnTrout5y10

One interpretation is that even if the AI is autonomously executing distillation/amplification steps, the relevant people are still able to say "hold on, we need to modify your algorithm" and have the AI actually let us correct it.

[-]riceissa5y10

I still don't understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with "I don't really think this is true"), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?

I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example if you have an intent-aligned AI and the user is careful about not deploying the AI in scenarios that would leave them disempowered, then that seems like a corrigible AI. So for this particular user, it seems like intent alignment implies corrigibility. Is that right?

The implicative structure might also be different depending on the capability of the AI, e.g. a dumb AI might have corrigibility and intent alignment equivalent, but the two concepts might come apart for more capable AI.

[-]paulfchristiano5y40

What would a corrigible but not-intent-aligned AI system look like?

Suppose that I think you know me well and I want you to act autonomously on my behalf using your best guesses. Then you can be intent aligned without being corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want your behavior to be predictable to others. If the agent knows that I have such a preference then it can't be both corrigible and intent aligned.

[-]Ben Pace5y10

I still don't understand how corrigibility and intent alignment are different.

[-]algon335y00

This post deserves a strong upvote. Since you've done the review, would you mind answering a reference request? What papers/blog posts represent Paul's current views on corrigibility?

[-]Chi Nguyen5y10

Thanks for the comment and I'm glad you like the post :)

On the other topic: I'm sorry, I'm afraid I can't be very helpful here. I'd be somewhat surprised if I'd have had a good answer to this a year ago and certainly don't have one now.

Some cop-out answers:

I often found reading his (discussions with others in) comments/remarks about corrigibility in posts focused on something else more useful to find out if his thinking changed on this than his blog posts that were obviously concentrating on corrigibility
You might have some luck reading through some of his newer blogposts and seeing if you can spot some mentions there
In case this was about "his current views" as opposed to "the views I tried to represent here which are one year old": The comments he left are from this summer, so you can get some idea from there/maybe assume that he endorses the parts I wrote that he didn't commented on (at least in the first third of the doc or so when he still left comments)

FWIW, I just had through my docs and found "resources" doc with the following links under corrigiblity:

Clarifying AI alignment

Can corrigibility be learned safely?

Problems with amplification/distillation

The limits of corrigibility

Addressing three problems with counterfactual corrigibility

Corrigibility

Not vouching for any of those being the up-to-date or most relevant ones. I'm pretty sure I made this list early on in the process and it probably doesn't represent what I considered the latest Paul-view.

[-]algon335y00

Fair enough. Thanks for the recommendations. :)

[-]Koen.Holtman5y10

Cross-linking to another thread: I just posted a long comment with more references to corrigibility resources in your post asking about corrigibility reading lists.

In that comment I focus on corrigibility related work that has appeared as scientific papers and/or arxiv preprints.

For an overview of the current landscape of AI safety research agendas see here. ↩︎
E.g. Alex Zhu, 2018 in his FAQ on IDA ↩︎
Safety: I will clarify what safe means in the context of this write-up further below. Unsafe: An unsafe AI system has an unacceptably high probability of producing very bad (catastrophic) outcomes, e.g. as side effects or because it is optimizing for the wrong thing or would have an unacceptably high probability to do so if scaled up.
Powerful: When I use the term powerful I am pointing towards the same thing that others call superintelligence.
AI, AI systems and agents: I use the terms AI and AI system interchangeably and sometimes refer to a powerful AI systems as agents. ↩︎
We can come up with social solutions to ease competitive pressures but probably only if the efficiency gap between safe and unsafe AI alternatives is not too big. Paul Christiano estimates that a safe AI system needs to be about 90% as efficient as its unsafe alternatives. ↩︎
**Arguably, no current AI technique is really unsafe because the AI systems we can create are not strong enough to be meaningfully dangerous. Unsafe in the 2019 context means that they would be unsafe if scaled up (e.g. by increasing computation, computational efficiency or available data) and applied to more powerful AI systems. ↩︎
**The research agenda is actually called “Iterated Amplification” but is still often referred to as “Iterated Distillation and Amplification” and abbreviated as “IDA” ↩︎
Realistically, we would probably have to set up a new AI system with new distillation and amplification methods from time to time that replace the old one instead of having one single system that infinitely improves itself. An alternative way this could look like is that we train an AI system via safe iterated distillation and amplification and this AI systems aids us in developing another, stronger, superintelligent AI system that is not trained via distillation and amplification but via another training method. Which of these three paths to superintelligence would actually play out is not relevant, there are all ideal in the sense that they all essentially mean: We successfully implement an agent with good distillation and amplification procedures and from there on, the development of safe and powerful AI is taking off and taking care of itself. Our work is basically done after having figured out safe distillation and amplification methods, or at least significantly easier. There are also more complicated versions of this where an AI system that is trained by safe distillation and amplification can potentially take us to safe, extremely powerful superintelligent AI, but needs some ongoing fixing and effort from our side. ↩︎
This seems very similar to the third ideal scenario I presented in footnote 7. The deciding difference here is that in this case, while IDA is helpful, it does not solve the full problem and we still have to put a lot of effort in to ensure that things don’t go very wrong. ↩︎
_For example MIRI’s early writing attaches a slightly different meaning to corrigibility than Christiano does: https://intelligence.org/files/Corrigibility.pdf . The concepts are quite close to each other. MIRI’s definition “[an AI that] cooperates with what its creators regard as a corrective intervention| is probably a property that Paul Christiano would also see as central to corrigibility. However, MIRI’s corrigibility is a property that can be formalized and verified. It might for example work like a checklist of actions a human could take in which case the AI should obey (e.g. shut down when a shutdown button is pressed) (This might stems from the belief that there is some algorithmically simple core to corrigibility.) Christiano’s definition is broader, less formal and directed at the AI’s intentions rather than concrete actions it should take in a concrete set of circumstances (e.g. human shutting you down). ↩︎
An AI with the goal “Paint the wall in the color the user wants” is trying to do what we want it to do de dicto. An AI with the goal “Paint the wall blue” is trying to do what we want it to do de re. ↩︎
This departs from the idea of designing a ‘perfectly rational’ AI that’s maximising some objective reward function which represents some ‘true’ preferences but also seems more tractable to Christiano. ↩︎
To give an idea of what “leaves us in power” means, Christiano (June 2017) gives a rough definition of corrigibility by examples of what he wants a corrigible AI system to help him do:
- Figure out whether I built the right AI and correct any mistakes I made
- Remain informed about the AI’s behavior and avoid unpleasant surprises
- Make better decisions and clarify my preferences
- Acquire resources and remain in effective control of them
- Ensure that my AI systems continue to do all of these nice things ↩︎
**Yudkowsky discusses the possibility of a “relatively simple core or central principle” to corrigibility. I believe that by that he means that an algorithmically simple core in the way that you can just write it down on paper and formally verify it. This is not the way Christiano understand this. He is referring to some core of corrigibility that is “easy to learn in some non-anthropomorphic sense” but is not easy or even possible to write down in maths and not simple to algorithmically specify. Instead, corrigibility is somehow picked up by the AI during its learning process. It's quite vague how exactly that happens. ↩︎
In principle, you could imagine and insert all kinds of methods and treat “amplification” and “distillation” as place holders. However, IDA’s contribution is to concretely spell out why we should be optimistic about finding suitable procedures and how these procedures are likely to look like. As part of IDA, there is currently work around properties that these procedures would need to have to be safe and concrete-ish candidates for the amplification and distillation step. If these properties and candidate procedures turn out to not work out and also don’t give leads to other properties and candidate procedures that could be inspected for suitability, that doesn’t necessarily mean that no system trained by iterated distillation and amplification can ever be sufficiently safe and powerful, but IDA would lose most if not all of its value. (Similar to how a safe system trained by iterated distillation and amplification does not necessarily have to be corrigible, but if it is not, then the research agenda IDA might have no value.) ↩︎
I only discussed HCH as a reason to think amplification might be promising. Of course, increasing the available compute and dividing tasks into subtasks is just one of many ways to amplify an agent. However, so far it is the only one that people propose as a safe amplification step with no (or a very high) upper bound in terms of capabilities. Therefore, I will treat IDA as if it is tied to this specific way of amplification even though it is not in principle. This doesn’t mean we won’t ever find another promising amplification step. However, unless we can find another promising amplification method in the context of IDA, it seems unclear to me how promising and valuable IDA is. It seems like pointing out HCH was one of the main contributions of IDA. ↩︎
Part of the idea is that the amplified AI can train the distilled agent via supervised learning or reinforcement learning. In the supervised learning setting, the distilled AI looks at how the amplified AI solved tasks and tries to imitate it. During training, the amplified AI could test the distilled AI by giving it tasks and checking the distilled AI’s proposed solutions by solving them itself via factored cognition, comparing if they are the same and giving feedback if the proposed solution was right or wrong.
There is also a version of factored cognition that would allow us to use reinforcement learning for the distillation process: Factored evaluation. Factored evaluation is solving tasks of the type “evaluate how good this input it” . This might be important if there are many things we want the distilled system to learn are not of the form “this is the correct solution to this” but “this thing is a little good, this is very good and this is very bad”.
During training, the amplified AI could test the distilled AI’s proposed solutions not only by comparing them to its own solutions and giving a “wrong”/”right” signal, but by doing factored evaluation to give the distilled AI a reward signal indicating how good or bad the proposed solution is, i.e. we can do reinforcement learning. ↩︎
In her summary of IDA, Ajeya Cotra seems to be taking a different stance:
“But in IDA, amplification is not necessarily a fixed algorithm that can be written down once and repeatedly applied; it’s an interactive process directed by human decisions."
This quote implies that humans will always lead and be part of the amplified system (taking a similar role as the highest level node in HCH, i.e. Rosa.) I think this might be plausible when we are still in a phase in which the AI system has not learned how to break down tasks, yet and distills that knowledge from interactions with a human (e.g. either by imitating them or through incentives the human provides). However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve human interaction. ↩︎
This problem would not exist if we found a way of amplification that only creates perfect subsystems that would never “steal just for fun”, but Christiano thinks that that is a lot harder than doing reliability amplification. ↩︎
Of course for this to work the decisions of Kim copies have to be less than perfectly correlated, but the problem of reliability loss only occurs because we assume that we can’t make “perfect copies of subsystems including their motivational structure”/”age perfectly” anyway. ↩︎
or at least hove low confidence in its solution when it gives a wrong answer ↩︎
You might hope that we would automatically be safe from deliberate attacks from other AIs even without security amplification since our AI system is supposed to be competitive. However, it might be that defense is a lot harder than offense in which case it isn’t automatically enough to have an AI system that is just similarly capable as any potential attacker. ↩︎
We might have additional reason for optimism w.r.t. informed oversight since the amplified AI and distilled AI could directly share some of their computation. ↩︎
Paul Christiano expects the definition of ascription universality to change in the future and intends to replace the term ascription universality with a different one once this happens. ↩︎
In the context of IDA, people refer to overseers that are restricted in that way as low bandwidth overseers and overseers that are not restricted that way as high bandwidth overseers. ↩︎

41

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

41

Purpose of this document: Clarifying IDA

Summary: IDA in 7 sentences

What problem is IDA trying to solve?

The AI safety problem

Worldview behind IDA: We need to work on safe competitive and prosaic AI

IDA builds on mainstream AI research to be competitive

What’s the ideal outcome of IDA?

What kind of AI do we want to create via IDA?

Full success: IDA finds an implementation for safe powerful AI

Partial success: IDA finds a tool that improves our reasoning

What kind of safety is IDA aiming for?

IDA is aiming for intent alignment

Intent aligned AIs need to be reasonably competent to be safe

Why corrigibility is the preferred form of intent alignment

How to achieve corrigibility

Corrigibility and intent alignment might be easier than other forms of alignment

How does iterating distillation and amplification work?

A general scheme to train AI systems

Basic definition of iterated distillation and amplification

Specifics of iterated distillation and amplification that are useful in AI systems

Example of iterated distillation amplification in a real AI system: AlphaGo Zero

Testing your understanding: Clarifying examples of what are and aren’t distillation and amplification processes

Why should we think iterating distillation and amplification could lead to powerful and safe AI?

HCH is a proof of concept for a powerful and safe amplification process

HCH is amplifying a human by giving them many more humans they can order to help them solve a task

HCH might be powerful and safe

We still need safe distillation because pure HCH is intractable

Amplified systems can act as overseers and improve our chances of safe distillation

Which mechanisms ensure that amplification and distillation are actually safe?

How to make sure the AI system doesn’t make catastrophic mistakes during amplification

Why we need reliability amplification

Why we need security amplification

How the amplified system ensures corrigibility is preserved during distillation

Informed oversight and ascription universality: Making sure that the distilled system can’t hide anything from the amplified system

Security makes informed oversight more difficult

Adversarial training: Finding and fixing the distilled system’s vulnerabilities

How IDA could fail

We might not be able to implement a competitive version of IDA

HCH might not be powerful, corrigible or translatable to the real world

HCH might not be corrigible

HCH might not be able to solve any relevant tasks

The differential competence problem: HCH might favor some skills over others and lead to a bad combination of abilities

We might not be able to safely teach AIs factored cognition

Potential problems with corrigibility

Corrigibility might be doomed without a firm theoretical understanding of what it is

The proposed safety mechanisms might not be enough to ensure corrigibility

Corrigibility might not be stable

Corrigible AI might unintentionally change our preferences in a way we don’t want to

Acknowledgements