There are a couple of pieces of this that I disagree with:
Re: first point, I think this is a difference in intuition about how simple / easy to find agents are in the search space. My intuition is that they would be harder to find than regular functions that just do the task - I think this is generated by a more general intuition that finding a function that does A is easier than finding a function that does both A and B.
Re: second point, I agree - there will be some agents in the search space. Claim 3 is that if Claims 1 and 2 are true, then (for the specified type of task) it is very unlikely that the optimization process will find an agent; however, there is still a nonzero probability that it does.
Core claim: Misaligned subagents are very unlikely to arise in a classification algorithm unless that algorithm is directly or indirectly (e.g. in a subtask) modeling interactions through time at a significant level of complexity.
Definition 1: Agent - a function from inputs and internal state (or memory) to an output / action and a new internal state. Note that this includes things that would not usually be considered as "agents" - e.g. plants or bacteria. Also note that not all "agents" of this type have consistent (or even coherent) "goals".
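To make Definition 1 concrete, here is a minimal sketch in code (purely illustrative; the names AgentStep, MemorylessFn, and run_agent are my own, not standard terminology):

```python
from typing import Callable, Tuple, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")
State = TypeVar("State")

# An "agent" per Definition 1: a function from (input, internal state)
# to (output / action, new internal state).
AgentStep = Callable[[Input, State], Tuple[Output, State]]

# For contrast, a memoryless function maps each input to an output,
# with no state carried between calls.
MemorylessFn = Callable[[Input], Output]

def run_agent(step, initial_state, inputs):
    """Feed a sequence of inputs through an agent, threading its internal state."""
    state = initial_state
    outputs = []
    for x in inputs:
        output, state = step(x, state)
        outputs.append(output)
    return outputs

# Example: a trivial "agent" whose state (and output) is a running count of inputs seen.
counter = lambda x, n: (n + 1, n + 1)
print(run_agent(counter, 0, ["a", "b", "c"]))  # [1, 2, 3]
```

Note that the definition only fixes the type signature; nothing about it requires coherent goals, which is why plants, bacteria, or a simple counter all qualify.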
This definition of agent might be considered too broad; the reason I have decided to use it is that I believe it covers basically everything that could be dangerous - if an AI is not an agent under this definition, then I think it is extremely likely that this AI would be safe.
Definition 2: A function that was selected by an optimization procedure has a misaligned subagent if it spawns a subprocess that is an agent whose "goals" are different from (and potentially in conflict with) the optimization criteria.
Example: Consider an optimization process that selects for functions that accurately predict human actions, and assume that this optimization process finds a function that does this prediction by creating extremely accurate simulations of humans. These simulations would be misaligned subagents, since humans are agents and the goals of the simulations would likely be very different from "predict human actions accurately".
For brevity, let us abbreviate classifiers with misaligned subagents as CWMS. Note that I might use "classifier" a bit more broadly than the strict definition - for example, I may call certain more general question-answering machines "classifiers". I do not believe this directly affects the general argument.
Claim 1: An agent will "perform" "better" than a memoryless function given the same sequence of inputs only if (almost) every input is highly correlated with the previous input. To phrase this in a different way, having a "memory" only helps if your "memory" gives good evidence for either what is going to happen next or what to do next.
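As a sanity check on Claim 1, here is a toy simulation (entirely my own construction, with made-up parameters): a hidden binary label is observed through Gaussian noise, and a memoryless classifier is compared against an "agent" whose memory is an exponential moving average of past observations. When the label persists across steps (so consecutive inputs are highly correlated), the memory averages out noise and helps; when the label is redrawn every step (i.i.d. inputs), the memory carries no useful evidence and gives no advantage:

```python
import random

def run(persistent, n=50_000, flip_prob=0.02, alpha=0.3, noise_std=1.0):
    """Classify a hidden binary label z from a noisy observation x at every step.

    persistent=True  -> z flips rarely, so consecutive inputs are highly correlated.
    persistent=False -> z is redrawn every step, so inputs are i.i.d.
    """
    z = random.randint(0, 1)
    ema = 0.5  # the agent's internal state: a running average of past observations
    memoryless_correct = 0
    memory_correct = 0
    for _ in range(n):
        if persistent:
            if random.random() < flip_prob:
                z = 1 - z             # label usually persists from step to step
        else:
            z = random.randint(0, 1)  # label independent of the past
        x = z + random.gauss(0.0, noise_std)  # noisy observation of the label
        ema = alpha * x + (1 - alpha) * ema   # update the memory
        memoryless_correct += (x > 0.5) == (z == 1)  # decide from the current input alone
        memory_correct += (ema > 0.5) == (z == 1)    # decide using the memory
    return memoryless_correct / n, memory_correct / n

random.seed(0)
print("correlated inputs (memoryless acc, with-memory acc):", run(persistent=True))
print("i.i.d. inputs     (memoryless acc, with-memory acc):", run(persistent=False))
# Expected pattern: memory improves accuracy noticeably in the correlated case,
# and gives no advantage (here it is slightly worse) in the i.i.d. case.
```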
Claim / Assumption 2: For most optimization processes powerful enough to find a classifier that performs well, we have:
P(Optimization process finds a CWMS) ≈ (Density of CWMS that "perform well") / (Density of all classifiers that "perform well")
There might be certain optimization processes that lean more towards CWMS (or toward classifiers without misaligned subagents), but I think this is a reasonable base assumption given our current level of information.
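One way to unpack the approximation in Claim 2 (this framing is mine, not spelled out above): if the optimization process behaves roughly like a uniform draw from the classifiers that perform well, then the probability of ending up with a CWMS is just the conditional probability of being a CWMS given good performance, i.e. the density ratio:

```latex
P(\text{optimizer finds a CWMS})
  \;\approx\; P(\text{CWMS} \mid \text{performs well})
  \;=\; \frac{\text{density of CWMS that ``perform well''}}
             {\text{density of all classifiers that ``perform well''}}
```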
Claim 3: For a given task, if most good classifiers for that task are CWMS, then there exist one or more subtasks that involve processing highly correlated inputs, and doing some of these subtasks very well is important for being a good classifier. This follows from Claims 1 and 2.
Conversely, if such key subtasks do not exist, most good classifiers will not be CWMS.
Intuition 4: For most tasks that have key subtasks of the type mentioned in Claim 3, those subtasks are very likely to involve some sort of modeling of how things change / interact through time (example: understanding the contents of a video or the meaning of a paragraph in a book requires this type of modeling).
Intuition 5: There are a lot of interesting, difficult tasks where modeling how things change through time is not key for solving the task.
Claim / Intuition 6: Optimizers for the tasks mentioned in Intuition 5 have a very low probability of finding a CWMS. This follows from Claims 2 and 3 and Intuition 4.
Thanks to Scott Garrabrant and Evan Hubinger for discussing some of the ideas in this post with me.