My two biggest objections to that kind of plan:
1) It feels like passing the buck, which is a known antipattern in thinking about AI.
2) With a "soft" self-improving entity, like a team of people and AIs, most invariants you can define will also be "soft" and prone to drift over many iterations.
That's why I'd prefer a more object-level solution to alignment, if we can have it. But maybe we can't have it.
1) It feels like passing the buck, which is a known antipattern in thinking about AI.
Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?
2) With a “soft” self-improving entity, like a team of people and AIs, most invariants you can define will also be “soft” and prone to drift over many iterations.
Yeah I agree with this part. I think defining an invariant that is both "good enough" and achievable/provable will be very hard or maybe just impossible.
Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?
The proposed setup can be seen as a self-improving AI, but a pretty opaque one. To explain why it makes a particular decision, we must appeal to anthropomorphism, like "our team of researchers wouldn't do such a stupid thing". That seems prone to wishful thinking. I would prefer to launch an AI for which at least some decisions have non-anthropomorphic explanations.
It's a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (highlighting, for example, the risks that arise from the humans having access to some set of AI systems, whether through manipulation or moral hazard).
This may be trivial or obvious to a lot of people, but it doesn't seem like anyone has bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's IDA (Iterated Distillation and Amplification), but also covers things like safe recursive self-improvement.
The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down. Any time someone points out a flaw or difficulty, or expresses skepticism about some technique it uses or about the overall safety invariant, there's always a list of other techniques or invariants that could be substituted in for that part (sometimes this happened in my own head as I tried to criticize some part of it). Eventually I realized this shouldn't be surprising: IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others, including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)
If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?