Here are some of the posts from the AI Alignment Forum writing day. Because the participants wrote 34 posts in less than 24 hours (!), I'm re-airing them in roughly chronological order, to give people a proper chance to read (and comment on) them.
1) Computational Model: Causal Diagrams with Symmetry by johnswentworth
This post is about representing logic, mathematics, and functions with causal models.
I'm new to a lot of this, but to me this seemed like a weird and surprising way to think about math (e.g. the notion that the input n is something that causes the next input n−1). Seems like a very interesting set of ideas to explore.
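For concreteness, here's a minimal sketch of the sort of recursive function such a diagram represents (my own toy code, not from the post), where each call's input n can be read as 'causing' the next call's input n−1:

```python
def factorial(n: int) -> int:
    """Recursive factorial, viewed as a causal chain of calls.

    Each call with input n 'causes' a child call with input n - 1, and the
    result flows back up the chain -- the repeating (input -> sub-call -> output)
    structure is the kind of symmetry the post's causal diagrams are meant to capture.
    """
    if n == 0:                        # base case: the chain of causes bottoms out
        return 1
    return n * factorial(n - 1)       # input n 'causes' the next input, n - 1


print(factorial(5))  # 120
```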
2) Towards a mechanistic understanding of corrigibility by Evan Hubinger
This post builds off of Paul Christiano's post on Worst Case Guarantees, which argues that we should aim for ML systems that behave acceptably even on worst-case inputs, not just on average.
Paul then proposes a procedure built around adversarial search, where one part of the system searches for inputs that produce unacceptable outputs in the trained agent, and talks more about how one might build such a system.
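As a rough illustration of that loop (my own sketch with made-up stand-ins, not Paul's actual proposal), the adversarial search might look something like this:

```python
import random

# Toy stand-ins for the pieces of such a system: a "model" mapping inputs to
# actions, an acceptability check, and a random-search "adversary".  All names
# and details here are illustrative, not taken from Paul's or Evan's posts.

def model(x: float, params: float) -> float:
    """A one-parameter 'agent': its action on input x is params * x."""
    return params * x


def is_acceptable(x: float, action: float) -> bool:
    """Call an action unacceptable if it is too extreme for its input."""
    return abs(action) <= 10 * abs(x) + 1


def adversary(params: float, n_tries: int = 1000):
    """Search (here, crude random search) for an input on which the model
    behaves unacceptably; return it, or None if none is found."""
    for _ in range(n_tries):
        x = random.uniform(-100, 100)
        if not is_acceptable(x, model(x, params)):
            return x
    return None


def train_with_adversary(params: float, steps: int = 50) -> float:
    """Alternate training with adversarial search: whenever the adversary
    finds a bad input, update the parameter to remove that behaviour."""
    for _ in range(steps):
        bad_input = adversary(params)
        if bad_input is None:
            break          # the adversary can no longer find unacceptable outputs
        params *= 0.5      # crude 'penalise unacceptability' update
    return params


print(train_with_adversary(params=100.0))  # ends at a parameter the adversary can't break
```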
Evan's post tries to make progress on finding a good notion of acceptable behaviour from an ML system. Paul's post offers two conditions about the ease of choosing an acceptable action (in particular, that it should not stop the agent achieving a high average reward and that it shouldn't make hard problems much harder), but Evan's conditions are about the ease of training an acceptable model. His two conditions are:
The post then explores two possible approaches: act-based corrigibility and indifference corrigibility.
3) Logical Optimizers by Donald Hobson
This approach offers a solution to a simpler version of the FAI problem:
4) Deconfuse Yourself about Agency by VojtaKovarik
This post offers some cute formalisations, for example generalising the notion of anthropomorphism to A(Θ)-morphization: modelling any system as if it were implemented by some alternative architecture A(Θ).
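To make that concrete, here is a toy sketch of how I read A(Θ)-morphization (my own illustration with made-up names, not the post's formalism): pick an architecture A with parameters Θ and ask how well some setting of Θ reproduces the system's behaviour, with anthropomorphisation presumably being the special case where A is "a human-like agent".

```python
# Toy illustration of "A(Theta)-morphization": model an arbitrary system by
# searching for parameters Theta of a chosen architecture A that best
# reproduce the system's input-output behaviour.  Names are illustrative.

def thermostat(temp: float) -> float:
    """The 'system' being modelled: heats when cold, idles when warm."""
    return 1.0 if temp < 20.0 else 0.0


def architecture_A(theta: float):
    """A one-parameter architecture: 'an agent that wants temp >= theta'."""
    def policy(temp: float) -> float:
        return 1.0 if temp < theta else 0.0
    return policy


def morphization_error(theta: float, inputs) -> float:
    """How badly A(theta) matches the system on the given inputs."""
    policy = architecture_A(theta)
    return sum(abs(policy(t) - thermostat(t)) for t in inputs)


inputs = [t / 2 for t in range(0, 81)]            # temperatures 0.0 .. 40.0
best_theta = min((th / 2 for th in range(0, 81)),
                 key=lambda th: morphization_error(th, inputs))
print(best_theta)  # 20.0: the thermostat is well-modelled as "wanting" 20 degrees
```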
This is an attempt to remove the need to explicitly use the term 'agency' in conversation, out of a sense that the use of the word is lacking in substance. I'm not sure I agree with this; I think people are using it to talk about a substantive thing they don't yet know how to formalise. Nonetheless, I liked all the various technical ideas offered.
My favourite part was the opening list of concrete architectures, organised by how 'agenty' they feel, which I will quote in full:
5) Thoughts from a Two-Boxer by jaek
I really liked this post, even though the author ends by saying the post might not have much of a purpose any more.
The post was (according to me) someone thinking for themselves about decision theory and putting in the effort to clearly explain their thoughts as they went along.
My understanding of the main disagreement between academia's CDT/EDT and the AI alignment community's UDT/FDT alternatives is the same as Paul Christiano's: the two are motivated by slightly different questions (the former more focused on what a human should do, the latter on what code to put into an AI). This post shows someone thinking through that and coming to the same realisation for themselves. I expect to link to it in the future as an example of this.
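For readers new to the jargon, here is the standard toy calculation behind 'two-boxing' (my numbers, not anything from the post): in Newcomb's problem, an evidential-style expected value favours taking one box, while a straightforwardly causal calculation favours taking both.

```python
# Newcomb's problem with a highly reliable predictor.  The payoffs and the
# 0.99 accuracy are the usual illustrative numbers, not taken from the post.

ACCURACY = 0.99              # probability the predictor guesses your choice
BIG, SMALL = 1_000_000, 1_000


def edt_expected_value(one_box: bool) -> float:
    """EDT-style: treat your choice as evidence about the prediction."""
    if one_box:
        return ACCURACY * BIG                    # box B is full iff one-boxing was predicted
    return SMALL + (1 - ACCURACY) * BIG          # take both boxes; B is probably empty


def cdt_expected_value(one_box: bool, p_full: float) -> float:
    """CDT-style: the prediction is already fixed, so your choice can't change it."""
    return p_full * BIG + (0 if one_box else SMALL)


print(edt_expected_value(True), edt_expected_value(False))             # 990000.0 vs 11000.0
# Whatever p_full is, two-boxing beats one-boxing by exactly SMALL under CDT:
print(cdt_expected_value(True, 0.5), cdt_expected_value(False, 0.5))   # 500000.0 vs 501000.0
```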