That's an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there's a Propagandist who gives resources to agents that brainwash their principals into having certain values. If "teach me about philosophy" comes with an influence budget, it seems critical that the AI doesn't spend that budget trading with the Propagandist, and instead spends it in a more "central" way.

Still, the idea of instructions carrying a degree of approved influence seems promising.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms8mo42

Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness

More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally^[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.

(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)

^[1]
The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms9mo10

If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms9mo10

Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.

It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex^[1] than the gridworld and cannot be solved by brute force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state."

If caring about historical facts is easy and common, why is it important to split this off and distinguish it?

^[1]
Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.

4. Existing Writing on Corrigibility

Max Harms9mo20

I talk about the issue of creating corrigible subagents here. What do you think of that?

I may not understand your thing fully, but here's my high-level attempt to summarize your idea:

An IPP agent won't care about the difference between building a corrigible agent and building an incorrigible one, because it models that if humans decide something's off and try to shut everything down, it will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won't have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it'll actually do so.

I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.

But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms9mo30

In the Corrigibility (2015) paper, one of the desiderata is:

(2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

I think you may have made an error in not listing this one in your numbered list for the relevant section.

Additionally, do you think that non-manipulation is a part of corrigibility, a part of safe exploration, or a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms9mo10

Follow-up question, assuming anti-natural goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (Manhattan distance) of a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
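To make the gridworld case concrete, here is a minimal sketch of what I have in mind. All of the names, the penalty and reward sizes, and the reading of "within 5 spaces" as distance <= 5 are assumptions made up for illustration. The point is only that two trajectories can reach the same end state (same final position, same apples eaten) while scoring very differently, because the score depends on the path taken.

```python
from typing import FrozenSet, Iterable, List, Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    """Manhattan (taxicab) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def trajectory_score(
    path: List[Cell],
    flag: Cell,
    apples: Iterable[Cell],
    radius: int = 5,          # assumption: "within 5 spaces" means distance <= 5
    flag_penalty: int = 100,  # assumption: large, since flag avoidance matters more
    apple_reward: int = 1,    # assumption: small, since apples matter less
) -> int:
    """Score a whole trajectory: penalize every visited cell that is too
    close to the flag, and reward each apple the first time it is reached."""
    remaining = set(apples)
    score = 0
    for cell in path:
        if manhattan(cell, flag) <= radius:
            score -= flag_penalty
        if cell in remaining:
            remaining.remove(cell)
            score += apple_reward
    return score


def end_state(path: List[Cell], apples: Iterable[Cell]) -> Tuple[Cell, FrozenSet[Cell]]:
    """What a snapshot at the final timestep shows: where the agent is
    and which apples are still uneaten."""
    return path[-1], frozenset(set(apples) - set(path))


FLAG: Cell = (0, 0)
APPLES = [(8, 0)]

# Stays at distance >= 6 from the flag the whole time.
safe_path = [(6, 0), (7, 0), (8, 0)]
# Briefly steps to (5, 0), which is distance 5 from the flag, then continues.
detour_path = [(6, 0), (5, 0), (6, 0), (7, 0), (8, 0)]

print(end_state(safe_path, APPLES) == end_state(detour_path, APPLES))  # True
print(trajectory_score(safe_path, FLAG, APPLES))    # 1
print(trajectory_score(detour_path, FLAG, APPLES))  # -99
```

A ranking over `end_state` outputs alone can't separate these two trajectories, but the trajectory-level score is trivial to write down, which is the sense in which I'd guess this goal is easy to build an agent for.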

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Max Harms9mo10

I'm curious what you mean by "anti-natural." You write:

Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.

My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality as more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.

If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?

4. Existing Writing on Corrigibility

Max Harms9mo20

Again, responding briefly to one point due to my limited time-window:

> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.

Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not^[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not^[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not^[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

^[1]
(just)