> A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

Stuart Russell
Think of an AI directing a car, given the instruction to get someone to the airport as fast as possible (optimised variables include "negative of time taken to airport"), with some key variables left out - such as maximum speed, maximum acceleration, respect for traffic rules, and the survival of the passengers and other humans.
Call these other variables "unstated objectives" (UO), as contrasted with the "stated objectives" (SO) such as the time to the airport. In the normal environments in which we operate and design our AIs, the UOs are either correlated with the SOs (consider the SO "their heart is beating" and the UO "they're alive and healthy") or don't change much at all (the car-directing AI could have been trained on many examples of driving-to-the-airport, none of which included the driver killing their passengers).
Typically, SOs are easy to define, and the UOs are the more important objectives, left undefined either because they are complex, or because they didn't occur to us in this context (just as we don't often say "driver, get me to the airport as fast as possible, but alive and not permanently harmed, if you please. Also, please obey the following regulations and restrictions: 1.a.i.α: Non-destruction of the Earth....").
The control problem, in a nutshell, is that optimising the SOs will typically set other variables to extreme values, including the UOs. The more extreme the optimisation, and the further from the typical environment, the more likely this is to happen.
One of the ideas of reduced impact is to preserve the typical environment, so that the unchanging UOs are left unchanged and the correlation remains between the changing UOs and the changing SOs. I've had some ideas for optimisation that "doesn't optimise too much". Jessica Taylor has been formalising similar ideas; the idea in both cases is to preserve the connection between SOs and UOs by preserving certain assumptions that hold in the standard environment.
# Inverse Reinforcement Learning
But those methods are somewhat indirect. What we would ideally want is an AI that can deduce the UOs that we haven't stated, and act on them instead. Then we wouldn't need to try to preserve the environment, nor hope that the unseen correlations continue to hold.
This is similar to the approach taken by inverse reinforcement learning (IRL) - the problem of deducing the correct reward function (equivalent, in our setup, to the objectives SO+UO we want the AI to maximise), given observed behaviour by an expert. In our example above, the AI could observe many human drivers going to the airport, and deduce the reward function corresponding to their behaviour. A good summary can be found in the paper "Inverse Reinforcement Learning in Partially Observed Environments", which extends the results to - you guessed it - partially observed environments.
There seem to be two major classes of IRL algorithms. The first deduces the reward function for a given expert policy (as an aside, the important thing is to find a reward function that distinguishes that policy from other policies; "always get the same reward, whatever you do" is a reward function compatible with all policies). The second class is given "sampled trajectories" (equivalent to observed behaviours) and tries to deduce the reward from those.
In that second case, some linearity assumptions are made: the reward is assumed to be of the form
R(s,a) = α · Φ(s,a),
where R is the reward from taking action a in state s, Φ is a fixed vector-valued function mapping (s,a) into some d-dimensional space, and α is a d-dimensional vector (without loss of generality, α∈[-1,1]^d). Then R is given by the inner product of α and Φ. The IRL algorithm then attempts to compute the values of the components of α. Note that if d is at least as large as the size of the state × action space, then any reward can be expressed in this form (for a spanning Φ); it's only for smaller d that the above is a simplification.
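As a toy illustration of this linear parameterisation (a minimal sketch: the state/action encoding and the feature map Φ below are invented for the airport example, not taken from any IRL paper):

```python
import numpy as np

# Toy "drive to the airport" setup: three features per (state, action) pair.
# phi maps (state, action) -> R^d; here d = 3 and the (hypothetical) features are
# time elapsed, amount of speeding, and whether the passenger is alive.
def phi(state, action):
    time_cost = 1.0                                          # each step costs one unit of time
    speeding = max(0.0, action["speed"] - 50.0) / 50.0       # how far over the speed limit
    passenger_alive = 1.0 if state["passenger_alive"] else 0.0
    return np.array([time_cost, speeding, passenger_alive])

# alpha weights the features; the stated objective only "sees" the first component.
alpha_stated = np.array([-1.0, 0.0, 0.0])    # SO: minimise time, ignore everything else
alpha_full   = np.array([-1.0, -0.5, 1.0])   # SO + UO: also value legality and survival

state = {"passenger_alive": True}
action = {"speed": 90.0}

# R(s, a) = alpha · Phi(s, a)
print("stated-objective reward:", alpha_stated @ phi(state, action))
print("full reward:            ", alpha_full @ phi(state, action))
```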
Then there are various algorithms that can deduce the values of α from the sampled trajectories. The algorithms often have a sparseness parameter that prefers simpler reward functions (generally those where R is zero most of the time).
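Here is a minimal, hedged sketch of one such algorithm: a max-margin-style fit in which expert trajectories must score higher than a pool of other (e.g. random) trajectories, with an L1 term playing the role of the sparseness parameter. The function names and interfaces are mine, not from the IRL literature.

```python
import numpy as np

def trajectory_features(trajectory, phi):
    """Sum of Phi(s, a) along a trajectory (discounting omitted for simplicity)."""
    return sum(phi(s, a) for s, a in trajectory)

def fit_alpha(expert_trajs, other_trajs, phi, d, sparsity=0.1, steps=500, lr=0.01):
    """Sketch: find alpha in [-1, 1]^d so that expert trajectories outscore the
    other trajectories by a margin, with an L1 penalty encouraging sparse rewards."""
    rng = np.random.default_rng(0)
    alpha = rng.uniform(-0.1, 0.1, size=d)
    mu_expert = np.mean([trajectory_features(t, phi) for t in expert_trajs], axis=0)
    mu_other = np.mean([trajectory_features(t, phi) for t in other_trajs], axis=0)
    for _ in range(steps):
        margin = alpha @ (mu_expert - mu_other)
        grad = -(mu_expert - mu_other) if margin < 1.0 else np.zeros(d)  # hinge loss
        grad = grad + sparsity * np.sign(alpha)                          # L1 sparseness term
        alpha = np.clip(alpha - lr * grad, -1.0, 1.0)                    # keep alpha in [-1,1]^d
    return alpha
```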
It seems likely that the two approaches - learning from explicit policy information and learning from observed trajectories - can be combined. The SOs can form constraints on the value of R, with the algorithms finding α subject to those constraints - in other words, starting from the SOs and filling in the UOs that the human behaviour is secretly revealing. I'll soon try to set this up (if it hasn't been done already in the literature, which is quite likely).
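A hedged sketch of what that combination could look like, reusing the fit above: treat the SOs as pinning down some components of α, and let the optimiser fill in only the remaining, unstated components. Again, the interface (a dict of fixed components) is my own invention for illustration.

```python
import numpy as np

def fit_alpha_with_stated(mu_expert, mu_other, stated, d, sparsity=0.1, steps=500, lr=0.01):
    """stated: dict {index: value} of alpha components fixed by the stated objectives.
    Only the remaining components are learned -- these are the candidate UOs."""
    alpha = np.zeros(d)
    for i, v in stated.items():
        alpha[i] = v
    free = np.array([i not in stated for i in range(d)])
    for _ in range(steps):
        margin = alpha @ (mu_expert - mu_other)
        grad = -(mu_expert - mu_other) if margin < 1.0 else np.zeros(d)
        grad = grad + sparsity * np.sign(alpha)
        alpha[free] = np.clip(alpha[free] - lr * grad[free], -1.0, 1.0)  # SO components stay fixed
    return alpha
```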
# Variables and the environment
Now, the original algorithms prefer sparse reward functions (R's that are zero most of the time). We don't want to bake in sparseness requirements - cutting down to too few objectives is part of the problem with optimisers. Ideally we'd want some estimate of the complexity or size of the set of true human preferences, and aim the AI at developing rewards of around that complexity.
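One hedged way to express "aim for roughly the right complexity" in the sketches above: replace the L1 sparseness term with a penalty on the distance between the number of active preference components and a target count (our rough external estimate of the complexity of human preferences). As written, counting active components isn't differentiable, so in practice a smooth surrogate would be needed; this is only meant to show the shape of the idea.

```python
import numpy as np

def complexity_penalty(alpha, k_target, strength=1.0, threshold=1e-3):
    """Penalise deviation from a target number of 'active' preference components,
    instead of pushing the count towards zero as an L1 sparseness term would.
    k_target is our (rough, external) estimate of the complexity of human preferences."""
    k_active = np.sum(np.abs(alpha) > threshold)
    return strength * float(k_active - k_target) ** 2
```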
Then, with enough data, the AI should be able to find all the UOs that are correlated with the SOs in the sampled trajectories it is given. There are two problems. First, it will add objectives that are entirely spurious and coincidental: if we train the AI only on positive examples, it could deduce that increased entropy and increased CO2 in the atmosphere are both UOs, since they appear in all successful examples. Second, for the UOs that don't change in any of the examples, the AI can't deduce anything about them: if there are no fatal crashes (or fatal crashes carefully averted at the cost of going slower), the AI will not see the necessity of keeping the passengers alive.
The first issue can be addressed by incorporating explicitly bad trajectories into the training set, showing that certain behaviours should be specifically avoided. Then, since entropy increase and CO2 emissions are also present in the bad examples, the AI can realise that they are not likely to be desired outcomes. I'm not sure if IRL has been done with bad examples yet, but it needs to be done with care - we need to encode somehow that these examples are bad, but not maximally bad. Maybe the driver made a mistake, but they weren't actively trying to kill their passengers. In a sense we need the AI to try to extend to a full model of all possible environments, dividing a full description into a set of variables: SO+UO+FV. The FV are free variables - variables we explicitly don't care about. For instance, we don't actually intrinsically care about entropy and CO2 levels - we care about the consequences they have for other objectives.
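A hedged sketch of how "bad, but not maximally bad" could enter the loss: good trajectories must outscore bad ones by a margin, but bad trajectories are only required to fall below a cap rather than being driven to the minimum possible reward. The names and thresholds are illustrative assumptions.

```python
import numpy as np

def loss_with_bad_examples(alpha, mu_good, mu_bad, good_margin=1.0, bad_cap=0.0):
    """mu_good, mu_bad: average feature sums of good and explicitly bad trajectories."""
    separation = max(0.0, good_margin - alpha @ (mu_good - mu_bad))  # good should outscore bad
    not_too_bad = max(0.0, alpha @ mu_bad - bad_cap)                 # bad only pushed below a cap,
                                                                     # not to "maximally bad"
    return separation + not_too_bad
```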
The second issue is trickier. There is no way of establishing it directly from the trajectories given. One approach would be for the AI to ask humans about candidate UOs (with all the usual precautions on value learning). This would allow some extension of the SOs, which could cover more cases. However, this wouldn't allow the mastery of complicated UOs (i.e. you can learn about hearts beating, but not about health and well-being). The best approach would be to get examples from very different environments. This could happen either by humans generating more trajectories in new environments of their choice, by humans generating more trajectories in new environments of the AI's choice, or by humans defining hypothetical trajectories in new hypothetical environments. All of these can allow the AI to home in on the correct class of UOs.
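One way this could be made concrete (a hedged sketch, with an invented interface): the candidate UOs to query humans about, or to build new environments around, are exactly the features that barely vary across the sampled trajectories, since the IRL fit can say nothing about them.

```python
import numpy as np

def candidate_uo_queries(feature_matrix, feature_names, variance_floor=1e-6):
    """feature_matrix: one row of summed features per trajectory.
    Returns the features with (almost) no variation across the data -- e.g.
    'passenger_alive' if no trajectory contains a crash -- as candidates to
    ask humans about or to probe with new (possibly hypothetical) environments."""
    variances = np.var(feature_matrix, axis=0)
    return [name for name, v in zip(feature_names, variances) if v < variance_floor]

# Example usage (hypothetical data):
# queries = candidate_uo_queries(mu_rows, ["time", "speeding", "passenger_alive"])
```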
# Ontological crisis
It seems that this method could be used similarly in the case of an ontological crisis (a situation where the fundamental assumptions about reality are changed or turn out to be different). Now, there is a sense in which ontological crises and extensions beyond the training data can be seen as the same thing. But even without that assumption, it seems that the methods above could deal with an ontological crisis.
First, imagine that the AI has a bunch of SOs+UOs correctly and fully encoding human preferences. Then an ontological change strikes - the laws of reality are different (maybe a classical AI discovers quantum physics). The AI can keep its SOs/UOs if they can be phrased in terms of the new laws, and then attempt to flesh them out with new UOs to correspond to the new reality. The AI has another tool at its disposal, though: it can generate new sample trajectories. It can use its old model to generate the behaviour it would have taken, back when it believed the model was true (we can imagine an internal process in the AI's brain, where one module says "Warning! Ontological change imminent!" and another reacts by quickly generating hypothetical sample trajectories according to its current understanding). Then, as long as those sample trajectories are still possible in the new model, it can use them as examples for finding UOs. In fact, that last method can work even if the initial SOs/UOs cannot be explicitly extended to the new ontology.
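A sketch of that "other tool", purely illustrative - the model and policy interfaces below (rollout, is_possible, reencode) are hypothetical stand-ins, not an existing API:

```python
def demonstrations_for_new_ontology(old_model, old_policy, new_model, n=100, horizon=50):
    """Roll out the AI's current policy in its *old* model of the world, keep only
    the trajectories that are still possible under the *new* model, and feed those
    to the IRL machinery as extra sample trajectories for recovering the UOs."""
    kept = []
    for _ in range(n):
        traj = old_model.rollout(old_policy, horizon)   # hypothetical interface
        if new_model.is_possible(traj):                 # hypothetical interface
            kept.append(new_model.reencode(traj))       # re-express in the new ontology
    return kept
```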
# Complexity, bias, error, and value
In all of the above, I've been skirting a crucial issue. Humans make mistakes [citation, as they say, needed]. The sample trajectories will include examples of error, suboptimal choices, and biases. Dealing with these is a subtle issue I won't be going into here. We have to somehow distinguish between noise, errors, errors due to bounded rationality, errors due to inaccurate information, biases, preferences, preferences we'd think of as biases until they were fully explained to us, biases we'd think of as preferences until they were fully explained to us, issues where our ultimate conclusion depends strongly on which explanation we see first, and issues that humans can be left to sort out on their own.
But, laying that to one side for the moment, there is one aspect I want to bring up: I vaguely mentioned that we want the AI to develop a reward function or preference set of around the same complexity as human preferences.
Now, it would be very hard to estimate this complexity, but it's clear there is a tension. A large size allows many biases to creep in and be regarded as preferences; a small size would relegate some preferences to being biases. It seems to me that we should be very liberal in this regard: biases which remain in the AI's model of SO+UO are likely to be minor ones (as major biases will have been stated by humans as things to avoid). These are biases so small that we're probably not aware of them. Compared with the possibility of losing something human-crucial we didn't think of explicitly stating, I'd say the case is strong for erring on the side of increased complexity/more biases and preferences allowed. Essentially, we're unlikely to have missed some bias we'd really care about eliminating, but very likely to have missed some preference we'd really miss if it were gone.
There are some subtleties with this picture. Consider our current bias preferring omission over commission: we really do blame people less for not doing something (that allows a bad thing to happen) than for actively doing something bad. It's very plausible that such "preferences" cannot easily be justified, no matter how hard the AI tries. However, a root extraction of this bias may remove many crucial aspects of humanity. It seems that the ideal might be to preserve this human preference, but restructure the world so that it has little detrimental impact. That sort of "ideal", however, is not one that can be reached by the process described above (the same argument can be applied to many examples of "idealised preferences"). Further research and clarification is needed.
Can you expand on "a root extraction of this bias may remove many crucial aspects of humanity"? Are you saying that we don't want to edit this bias out of the human brain or that the AI has to display this bias itself for some reason?