[AN #131]: Formalizing the argument of ignored attributes in a utility function

Thanks as always to everyone involved in the newsletter!

The model of the first paper sounds great for studying what happens after we're able to implement corrigibility and impact measures!

You might then reasonably ask what we should be doing instead. I see the goal of AI alignment as figuring out how, given a fuzzy but relatively well-specified task, to build an AI system that is reliably pursuing that task, in the way that we intended it to, but at a capability level beyond that of humans. This does not give you the ability to leave the future in the AI’s hands, but it would defuse the central (to me) argument for AI risk: that an AI system might be adversarially optimizing against you. (Though to be clear, there are still other risks (AN #50) to consider.)

To be more explicit, are the other risks to consider mostly about governance/who gets AGI/regulations? Because it seems that you're focusing on the technical problem of aligment, which is about doing what we want in a rather narrow sense.

On the model that AI risk is caused by utility maximizers pursuing the wrong reward function, I agree that non-obstruction is a useful goal to aim for, and the resulting approaches (mild optimization, low impact, corrigibility as defined here) make sense to pursue. I do not like this model much (AN #44), but that’s (probably?) a minority view.

It's weird, my take on your sequence was more that you want to push alternatives to goal-directedness/utility maximization, because maximizing the wrong utility function (or following the wrong goal) is a big AI-risk. Maybe what you mean in the quote above is that your approach focus on not building goal-directed systems, in which case the non-obstruction problem makes less sense?

[-]adamShimi5y30

[-]Rohin Shah5y20

To be more explicit, are the other risks to consider mostly about governance/who gets AGI/regulations?

Yes.

It's weird, my take on your sequence was more that you want to push alternatives to goal-directedness/utility maximization, because maximizing the wrong utility function (or following the wrong goal) is a big AI-risk.

Yeah, I don't think that sequence actually supports my point all that well -- I should write more about this in the future. Here I'm claiming that using EU maximization in the real world as the model for "default" AI systems is not a great choice.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

10

[AN #131]: Formalizing the argument of ignored attributes in a utility function

10

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

HANDLING GROUPS OF AGENTS

MISCELLANEOUS (ALIGNMENT)

AI GOVERNANCE

FEEDBACK

PODCAST