Really enjoying this sequence! For the purposes of relaxed adversarial training, I definitely want something closer to absolute myopia than just dynamic consistency. However, I think even absolute myopia is insufficient for the purposes of preventing deceptive alignment.[1] For example, as is mentioned in the post, an agent that manipulates the world via self-fulfilling prophecies could still be absolutely myopic. However, that's not the only way in which I think an absolutely myopic agent could still be a problem. In particular, there's an important distinction regarding the nature of the agent's objective function, one I think about a lot in connection with myopia, which seems to be missing here.
In particular, I think there's a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.[2] For the sort of myopia that I want, I think you basically need the last thing; that is, you need an objective function which is just a function of the agent's output that completely disregards the consequences of that output. Having an objective function over the state of the world bleeds into full agency too easily and having an objective function over your own subjective experience leads to the possibility of wanting to gather resources to run simulations of yourself to capture your own subjective experience. If your agent simply isn't considering/thinking about how its output will affect the world at all, however, then I think you might be safe.[3]
I mentioned this to Abram earlier and he agrees with this, though I think it's worth putting here as well. ↩︎
Note that I don't think that these possibilities are actually comprehensive; I just think that they're some of the most salient ones. ↩︎
Instead, I want it to just be thinking about making the best prediction it can. Note that this still leaves open the door for self-fulfilling prophecies, though if the selection of which self-fulfilling prophecy to go with is non-adversarial (i.e. there's no deceptive alignment), then I don't think I'm very concerned about that. ↩︎
In particular, I think there’s a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.
This line of thinking seems to me very important!
The following point might be obvious to Evan, but it's probably not obvious to everyone: Objective functions over the agent's output should probably not be interpreted as objective functions over the physical representation of the output (e.g. the configuration of atoms in certain RAM memory cells). That would just be a special case of objective functions over world states. Rather, we should probably be thinking about objective functions over the output as it is formally defined by the code of the agent (when interpreting "the code of the agent" as a mathematical object, like a Turing machine, and using a formal mapping from code to output).
Perhaps the following analogy can convey this idea: think about a human facing Newcomb's problem. The person has the following instrumental goal: "be a person that does not take both boxes" (because that makes Omega put $1,000,000 in the first box). Now imagine that that was the person's terminal goal rather than an instrumental goal. That person might be analogous to a program that "wants" to be a program whose (formally defined) output maximizes a given utility function.
I like the way you are almost able to turn this into a 'positive' account (the way generalized objectives are a positive account of myopic goals, but speaking in terms of failure to make certain Pareto improvements is not). However, I worry that any goal over states can be converted to a goal over outputs which amounts to the same thing, by calculating the expected value of the action according to the old goal. Presumably you mean some sufficiently simple action-goal so as to exclude this.
Yeah, I agree. I almost said "simple function of the output," but I don't actually think simplicity is the right metric here. It's more like "a function of the output that doesn't go through the consequences of said output."
(note, this comment is kinda grumpy but, to be clear, comes from the context of me generally quite respecting you as a writer. :P)
I can't remember if I've complained about this elsewhere, but I have no idea what you mean by myopia, and I was about to comment (on another post) asking if you could write a post that succinctly defined what you meant by myopia (or if the point is that it's hard to define, say that explicitly and give a few short attempted descriptions that could help me triangulate it).
Then I searched to see if you'd already done that, and found this post, which seems like it really wants to open with a succinct description of what myopia is, but doesn't, and leaves me even more confused about it by the end.
(I can see the dictionary definition of myopia is "lack of imagination, foresight, or intellectual insight", but I don't know exactly how that's connecting to the overall model or ideas you're building towards here.)
Sorry for somehow missing/ignoring this comment for about 5 months. The short answer is that I've been treating "myopia" as a focusing object, and am likely to think any definitions (including my own definitions in the OP) are too hasty and don't capture everything I want to point at. In fact I initially tried to use the new term "partial agency" to make sure people didn't think I was talking about more well-defined versions.
My attempt to give others a handle for the same focusing object was in the first post of the sequence, where I try to triangulate what I'm getting at with a number of examples right at the start:
Perhaps I could give a better short/informal definition now if I sat down to think about it.
[EDIT: 2019-11-09: The argument I made here seems incorrect; see here (H/T Abram for showing me that my reasoning was wrong).]
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.
.
There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.
Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model to outperform another model—that takes one less Pareto improvement—on future fitness evaluations (all other things being equal).
Any global optimization technique can find the global optimum of a fixed evaluation function given time. This is a different problem. As I mentioned before, the assumption of simulable environments which you invoke to apply evolutionary algorithms to RL problems assumes too much; it fundamentally changes the problem from a control problem to a selection problem. This is exactly the kind of mistake which prompted me to come up with the selection/control distinction.
How would you propose to apply evolutionary algorithms to online learning? How would you propose to apply evolutionary algorithms to non-episodic environments? I'm not saying it can't be done, but in doing so, your remark will no longer apply. For online non-episodic problems, you don't get to think directly in terms of climbing a fitness landscape.
Taking a step back, I want to note two things about my model of the near future (if your model disagrees with those things, that disagreement might explain what's going on in our recent exchanges):
(1) I expect many actors to be throwing a lot of money on selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.
(2) Suppose there's some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.
Regarding the questions you raised:
How would you propose to apply evolutionary algorithms to online learning?
One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
How would you propose to apply evolutionary algorithms to non-episodic environments?
I'm not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can't be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there's training data or for which the environment can be simulated).
(1) I expect many actors to be throwing a lot of money on selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.
Sure.
(2) Suppose there's some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.
Maybe, sure.
There seems to be something I'm missing here. What you said earlier:
Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model to outperform another model—that takes one less Pareto improvement—on future fitness evaluations (all other things being equal).
is an essentially mathematical remark, which doesn't have a lot to do with AI timelines and projections of which technologies will be used. I'm saying that this remark strikes me as a type error, because it confuses what I meant by "take all the Pareto improvements" -- substituting the (conceptually and technologically difficult) control concept for the (conceptually straightforward, difficult only because of processing power limitations) selection concept.
I interpret you that way because your suggestion to apply evolutionary algorithms appears to be missing data. We can apply evolutionary algorithms if we can define a loss function. But the problem I'm pointing at (of full vs partial agency) has to do with difficulties of defining a loss function.
>How would you propose to apply evolutionary algorithms to online learning?
One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
So, what is the argument that you'd tend to get full agency out of this? I think the situation is not very different from applying gradient descent in a similar way.
I'm not saying I'd be happily confident that such a procedure would produce partial agents (therefore avoiding AI risk). And indeed, there are differences between doing this with gradient descent and evolutionary algorithms. One of the things I focused on in the post, time-discounting, becomes less relevant -- but only because it's more natural to split things into episodes in the case of evolutionary algorithms, which still creates myopia as a side effect.
What I'm saying is there's a real credit assignment problem here -- you're trying to pick between different policies (ie the code which the evolutionary algorithms are selecting between), based on which policy has performed better in the past. But you've taken a lot of actions in the past. And you've gotten a lot of individual pieces of feedback. You don't know how to ascribe success/failure credit -- that is, you don't know how to match individual pieces of feedback to individual decisions you made (and hence to individual pieces of code).
So you solve the problem in a basically naive way: you assume that the feedback on "instance n" was related to the code you were running at that time. This is a myopic assumption!
>How would you propose to apply evolutionary algorithms to non-episodic environments?
I'm not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can't be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there's training data or for which the environment can be simulated).
The big thing with environments that can't be simulated is that you don't have a reset button, so you can't back up and try again; so, episodic and simulable are pretty related.
Sparse feedback is related to what I'm talking about, but feels like a selection-oriented way of understanding the difficulty of control; "sparse feedback" still applies to very episodic problems such as chess. The difficulty with control is that arbitrarily long historical contexts can sometimes matter, and you have to learn anyway. But I agree that it's much easier for this to present real difficulty if the rewards are sparse.
I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) "informal strategic stuff" (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.
The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version that perhaps you responded to: I want it to only be about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning for past stock prices or RL for Atari games (when each episode is a new game).
My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolutely myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).
I think the following property might make a learning algorithm more likely to yield models that are NOT absolutely myopic:
During training, a parameter's value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.
I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I'll use the following example to explain why I think that:
Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance); however, if parameter $\theta_8$ has a large value then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of $\theta_8$ does not (directly) affect predictions.
How should we expect our learning algorithm to update the parameter at the end of iteration 2?
If our learning algorithm is gradient descent, it seems that we should NOT expect $\theta_8$ to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to $\theta_8$) is expected to be positive.
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which $\theta_8$ happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of $\theta_8$ (over the model population).
Sorry for taking so long to respond to this one.
I don't get the last step in your argument:
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which $\theta_8$ happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of $\theta_8$ (over the model population).
Why do those models outperform? I think you must be imagining a different setup, but I'm interpreting your setup as:
In other words, many members of the population can swoop in and reap the benefits caused by high-$\theta_8$ members. So high-$\theta_8$ carriers do not specifically benefit.
Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).
Backtracking from this, I now realize that my core reasoning behind my argument—about evolutionary computation algorithms producing non-myopic behavior by default—was incorrect.
(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms; which seems incorrect!)
IID vs Myopia
In a comment to Partial Agency, Rohin summarized his understanding of the post. He used the iid assumption as a critical part of his story. Initially, I thought that this was a good description of what was going on; but I soon realized that iid isn't myopia at all (and commented as such). This post expands on the thought.
My original post conflated episodic (which is basically 'iid') with myopic.
In an episodic setting, it makes sense to be myopic about anything beyond the current episode. There's no benefit to cross-episode strategies, so, no need to learn them.
This is true at several levels (which I mention in the hopes of avoiding later confusion):
However, it is also possible to consider myopia in the absence of episodic structure, and not just as a mistake. We might want an ML algorithm to learn myopic strategies, as is the case with predictive systems. (We don't want them to learn to manipulate the data; and even though that failure mode is far-fetched for most modern systems, there's no point setting up learning procedures which would incentivise it. Indeed, learning procedures seem to mostly encourage myopia, though the full situation is still unclear to me.)
These myopic strategies aren't just "strategies which behave as if there were an episodic assumption", either. For example, sequential prediction is myopic (the goal is to predict each next item accurately, not to get the most accuracy overall -- if this is unclear, hopefully it will become clearer in the next section).
So, there's a distinction between not remembering the past vs not looking ahead to the future. In episodic settings, the relevant parts of past and future are both limited to the duration of the episode. However, the two come apart in general. We can have/want myopic agents with memory; or, we can have/want memoryless agents which are not myopic. (The second seems somewhat more exotic.)
Game-Theoretic Myopia Definition
So far, I've used 'myopia' in more or less two ways: an inclusive notion which encompasses a big cluster of things, and also the specific thing of only optimizing each output to maximize the very next reward. Let's call the more specific thing "absolute" myopia, and try to define the more general thing.
Myopia can't be defined in terms of optimizing an objective in the usual sense -- there isn't one quantity being optimized. However, it seems like most things in my 'myopia' cluster can be described in terms of game theory.
Let's put down some definitions:
Sequential decision scenario: An interactive environment which takes in actions and outputs rewards and observations. I'm not trying to deal with embeddedness issues; this is basically the AIXI setup. (I do think 'reward' is a very restrictive assumption about what kind of feedback the system gets, but talking about other alternatives seems like a distraction from the current post.)
(Generalized) objective: A generalized objective assigns, to each action $n$, a function $f_n : \mathbb{N} \to \mathbb{R}$. The quantity $f_n(i)$ is how much the $n$th decision is supposed to value the $i$th reward. Probably, we require the sum $\sum_i f_n(i)$ to exist.
Some examples:
A generalized objective could be called 'myopic' if it is not dynamically consistent; ie, if there's no way to write $f_n(i)$ as a function of $i$ alone, eliminating the dependence on $n$.
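To make the definition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not part of the formalism: the names `absolute_myopia`, `exponential_discounting`, and `depends_on_n` are hypothetical, the check only inspects a finite horizon, and it only compares decisions n ≤ i (assuming a decision cannot influence rewards that came before it).

```python
from typing import Callable

# A generalized objective, represented as f(n, i): how much decision n
# values reward i. (Representation chosen for illustration only.)
GeneralizedObjective = Callable[[int, int], float]


def absolute_myopia(n: int, i: int) -> float:
    """Decision n values only the very next reward, i.e. reward n."""
    return 1.0 if i == n else 0.0


def exponential_discounting(c: float) -> GeneralizedObjective:
    """Every decision weights reward i by c**i, with no dependence on n."""
    def f(n: int, i: int) -> float:
        return c ** i
    return f


def depends_on_n(f: GeneralizedObjective, horizon: int = 20) -> bool:
    """Finite check of the 'myopic' condition: is there some reward i
    (below the horizon) that two decisions n <= i weight differently?

    Assumption: only decisions n <= i are compared, since later
    decisions cannot affect reward i anyway.
    """
    for i in range(horizon):
        weights = {f(n, i) for n in range(i + 1)}
        if len(weights) > 1:
            return True
    return False


print(depends_on_n(absolute_myopia))               # dependence on n found
print(depends_on_n(exponential_discounting(0.9)))  # no dependence found
```

Because the check is finite, a negative result only says that no dependence on n was found up to the chosen horizon.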
This notion of myopia does not seem to include 'directionality' or 'stop-gradients' from my original post. In particular, if we try to model pure prediction, absolute myopia captures the idea that you aren't supposed to have manipulative strategies which lie (throw out some reward for one instance in order to get more overall). However, it does not rule out manipulative strategies which select self-fulfilling prophecies strategically; those achieve high reward on instance i by choice of output n=i, which is what a myopic agent is supposed to do.
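A toy example may help show why absolute myopia leaves self-fulfilling prophecies open. The sketch below is hypothetical throughout: `actual_outcome` is a made-up environment in which the announced prediction (in percent) influences the outcome, and the reward is just negative prediction error on the current instance.

```python
def actual_outcome(prediction: int) -> int:
    """Hypothetical environment: the announced prediction (0-100, in
    percent) shifts what actually happens. Low forecasts lead to a 10%
    outcome, high forecasts to a 90% outcome."""
    return 10 if prediction < 50 else 90


def myopic_reward(prediction: int) -> int:
    """Reward for this instance only: negative absolute prediction error."""
    return -abs(prediction - actual_outcome(prediction))


def best_predictions(candidates):
    """All candidate predictions that maximize the single-instance reward."""
    candidates = list(candidates)
    best = max(myopic_reward(p) for p in candidates)
    return [p for p in candidates if myopic_reward(p) == best]


# Both 10 and 90 are perfectly accurate, so an absolutely myopic
# predictor is indifferent between them; the objective itself says
# nothing about which self-fulfilling prophecy gets selected.
print(best_predictions(range(0, 101, 10)))
```

Nothing in the per-instance reward distinguishes the two fixed points, which is the sense in which strategic selection among them is not ruled out.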
There are also non-myopic objectives which we can't represent here but might want to represent more generally: there isn't a single well-defined objective corresponding to 'maximizing average reward' (the limit of exponential discounting as c→1).
Vanessa recently mentioned using game-theoretic models like this for the purpose of modeling inconsistent human values. I want to emphasize that (1) I don't want to think of myopia as necessarily 'wrong'; it seems like sometimes a myopic objective is a legitimate one, for the purpose of building a system which does something we want (such as make non-manipulative predictions). As such, (2) myopia is not just about bounded rationality.
I also don't necessarily want to think of myopia as multi-agent, even when modeling it with multi-agent game theory like this. I'd rather think about learning one myopic policy, which makes the appropriate (non-)trade-offs based on f.
In order to think about a system behaving myopically, we need to use an equilibrium notion (such as Nash equilibria or correlated equilibria), not just $f_n(i)$. However, I'm not sure quite how I want to talk about this. We don't want to think in terms of a big equilibrium between each decision-point $n$; I think of that as a selection-vs-control mistake, treating the sequential decision scenario as one big thing to be optimized. Or, putting it another way: the problem is that we have to learn; so we can't talk about everything being in equilibrium from the beginning.
Perhaps we can say that there should be some n such that each decision after that is in approximate equilibrium with each other taking the decisions before as given.
(Aside -- What we definitely don't want (if we want to describe or engineer legitimately myopic behavior) is a framework where the different decision-points end up bargaining with each other (acausal trade, or mere causal trade), in order to take pareto improvements and thus move toward full agency. IE, in order to keep our distinctions from falling apart, we can't apply a decision theory which would cooperate in Prisoner's Dilemma or similar things. This could present difficulties.)
Let's move on to a different way of thinking about myopia, through the language of Pareto-optimality.
Pareto Definition
We can think of myopia as a refusal to take certain Pareto improvements. This fits well with the previous definition; if an agent takes all the Pareto improvements, then its behavior must be consistent with some global weights f(i) not a function of n. However, not all myopic strategies in the Pareto sense have nice representations in terms of generalized objectives.
In particular: I mentioned that generalized objectives couldn't rule out manipulation through selection of self-fulfilling prophecies; so, they only capture part of what seems implied by map/territory directionality. Thinking in terms of Pareto-failures, we can also talk about failing to reap the gains from selection of manipulative self-fulfilling prophecies.
However, thinking in these terms is not very satisfying. It allows a very broad notion of myopia, but has few other virtues. Generalized objectives let me talk about myopic agents trying to do a specific thing, even though the thing they're trying to do isn't a coherent objective. Defining myopia as failure to take certain Pareto improvements doesn't give me any structure like that; a myopic agent is being defined in the negative, rather than described positively.
Here, as before, we also have the problem of defining things learning-theoretically. Speaking purely in terms of whether the agent takes certain Pareto improvements doesn't really make sense, because it has to learn what situation it is in. We want to talk about learning processes, so we need to talk about learning to take the Pareto improvements, somehow.
(Bayesian learning can be described in terms of Pareto optimality directly, because using a prior over possible environments allows Pareto-optimal behavior in terms of those environments. However, working that way requires realizability, which isn't realistic.)
Decision Theory
In the original partial agency post, I described full agency as an extreme (perhaps imaginary) limit of less and less myopia. Full agency is like Cartesian dualism, sitting fully outside the universe and optimizing.
Is full agency that difficult? From the generalized-objective formalism, one might think that ordinary RL with exponential discounting is sufficient.
The counterexamples to this are MIRI-esque decision problems, which create dynamic inconsistencies for otherwise non-myopic agents. (See this comment thread with Vanessa for more discussion of several of the points I'm about to make.)
To give a simple example, consider the version of Newcomb's Problem where the predictor knows about as much about your behavior as you do. (The version where the predictor is nearly infallible is easily handled by RL-like learning; you need to specifically inject sophisticated CDT-like thinking to mess that one up.)
In order to have good learning-theoretic properties at all, we need to have epsilon exploration. But if we do, then we tend to learn to 2-box, because (it will seem) doing so is independent of the predictor's predictions of us.
Now, it's true that in a sequential setting, there will be some incentive to 1-box not for the payoff today, but for the future; establishing a reputation of 1-boxing gets higher payoffs in iterated Newcomb in a straightforward (causal) way.
However, that's not enough to entirely avoid dynamic inconsistency. For any discounting function, we need only to assume that the instances of Newcomb's problem are spaced out far enough over time so that 2-boxing in each individual case is appealing.
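As a rough numeric illustration of the spacing argument, the sketch below assumes exponential discounting with factor c, the standard Newcomb payoffs ($1,000 and $1,000,000), and a deliberately crude reputation model in which 1-boxing now secures the $1,000,000 in every later instance (arriving at times gap, 2·gap, ...). These modelling choices and the function names are assumptions made for illustration, not something stated in the post.

```python
def reputation_value(c: float, gap: int, big: float = 1_000_000) -> float:
    """Discounted value of receiving `big` at times gap, 2*gap, 3*gap, ...
    under exponential discounting with factor 0 < c < 1.
    This is the geometric series big * x / (1 - x) with x = c**gap."""
    x = c ** gap
    return big * x / (1 - x)


def min_gap_for_two_boxing(c: float,
                           small: float = 1_000,
                           big: float = 1_000_000,
                           max_gap: int = 1_000_000) -> int:
    """Smallest spacing between Newcomb instances at which the immediate
    extra `small` from 2-boxing outweighs the discounted reputation value."""
    for gap in range(1, max_gap + 1):
        if small > reputation_value(c, gap, big):
            return gap
    raise ValueError("no such gap found below max_gap")


for c in (0.9, 0.99, 0.999):
    print(c, min_gap_for_two_boxing(c))
```

Under these assumptions the required spacing grows as c approaches 1 but stays finite for any fixed c < 1, which matches the claim that sufficiently spaced-out instances make 2-boxing appealing in each individual case.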
Now, one might argue that in this case, the agent is correctly respecting its generalized objective; it's supposed to sacrifice future value for present according to the discounting function. And that's true, if we want myopic behavior. But it is dynamically inconsistent -- the agent wishes to 2-box in each individual case, but with respect to future cases, would prefer to 1-box. It would happily bind its future actions given an opportunity to do so.
Like the issue with self-fulfilling prophecies, this creates a type of myopia which we can't really talk about within the formalism of generalized objectives. Even with an apparently dynamically consistent discounting function, the agent is inconsistent. As mentioned earlier, we need generalized-objective systems to fail to coordinate with themselves; otherwise, their goals collapse into regular objectives. So this is a type of myopia which all generalized objectives possess.
As before, I'd really prefer to be able to talk about this with specific types of myopia (as with myopic generalized objectives), rather than just pointing to a dynamic inconsistency and classifying it with myopia.
(We might think of the fully non-myopic agent as the limit of less and less discounting, as Vanessa suggests. This has some problems of convergence, but perhaps that's in line with non-myopia being an extreme ideal which doesn't always make sense. Alternately, we might think of this as a problem of decision theory, arguing that we should be able to reap the advantages of 1-boxing despite our values temporally discounting. Or, there might be some other wilder generalization of objective functions which lets us represent the distinctions we care about.)
Mechanism Design Analogy
I'll close this post with a sketchy conjecture.
Although I don't want to think of generalized objectives as truly multi-agent in the one-'agent'-per-decision sense, learning algorithms will typically have a space of possible hypotheses which are (in some sense) competing with each other. We can analogize that to many competing agents (keeping in mind that they may individually be 'partial agents', ie, we can't necessarily model them as coherently pursuing a utility function).
For any particular type of myopia (whether or not we can capture it in terms of a generalized objective), we can ask the question: is it possible to design a training procedure which will learn that type of myopia?
(We can approach this question in different ways; asymptotic convergence, bounded-loss (which may give useful bounds at finite time), or 'in-practice' (which fully accounts for finite-time effects). As I've mentioned before, my thoughts on this are mostly asymptotic at the moment, that being the easier theoretical question.)
We can think of this question -- the question of designing training procedures -- as a mechanism-design question. Is it possible to set up a system of incentives which encourages a given kind of behavior?
Now, mechanism design is a field which is associated with negative results. It is often not possible to get everything you want. As such, a natural conjecture might be:
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.
This conjecture is still quite vague, because I have not stated what it means to 'learn to take all the Pareto improvements'. Additionally, I don't really want to assume the AIXI-like setting which I've sketched in this post. The setting doesn't yield very good learning-theoretic results anyway, so getting a negative result here isn't that interesting. Ideally the conjecture should be formulated in a setting where we can contrast it to some positive results.
There's also reason to suspect the conjecture to be false. There's a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there's an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.