To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D.
Yes, even on the 'optimistic scenario' we need external information of various kinds to 'debias'. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent. The assumption is that, since our stated and revealed preferences diverge, there will sometimes be differences between what we approve of and what we do that are due solely to differences in bias.
This is still technically external to observing the human's behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if humans tend to approve different things to the things that they actually do in a way influenced by bias (otherwise you have the same information as you'd get from actions, which helps with improving accuracy but not debiasing, see here), which is the case at least some of the time.
In other words, the beliefs and preferences are unchanged whether the agent acts or approves, but the 'approval selector' sometimes differs from the 'action selector'. Based on what does and does not change between the two, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selectors, which must be bias.
So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.
So a PCA performed on the approval would produce a mix of beliefs, preferences and (different) biases. Underlying preferences are, by specification, equally represented by human actions and by human approval of actions taken (since, whichever channel you look at, they are your preferences), but many biases don't exhibit this pattern - for example, we discount more over time in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.
There has already been research on combining information on reward models from multiple sources to infer a better overall reward model, but not, as far as I know, specifically on actions and approval as differently biased sources of information.
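The discounting example can be made concrete with a small sketch. It assumes, purely for illustration, that both channels share the same rewards and differ only in an exponential discount factor; the decision names, reward values, and discount factors are all hypothetical.

```python
# Toy sketch: the same rewards seen through two differently biased channels.
# Assumption: "actions" discount the future more steeply than "approval".
# All names and numbers are hypothetical.

DECISIONS = {
    # decision name -> list of (reward, delay) options
    "snack_vs_meal": [(5, 0), (8, 2)],
    "spend_vs_save": [(10, 0), (30, 5)],
    "coffee_vs_tea": [(3, 0), (2, 0)],
}

def discounted_value(reward, delay, discount):
    """Exponentially discounted value of a reward received after `delay` steps."""
    return reward * discount ** delay

def choose(options, discount):
    """Index of the option with the highest discounted value."""
    return max(
        range(len(options)),
        key=lambda i: discounted_value(options[i][0], options[i][1], discount),
    )

def channel_divergences(decisions, action_discount, approval_discount):
    """Names of decisions where the action and approval channels disagree."""
    return [
        name
        for name, options in decisions.items()
        if choose(options, action_discount) != choose(options, approval_discount)
    ]

divergent = channel_divergences(DECISIONS, action_discount=0.7, approval_discount=0.95)
print(divergent)
```

Decisions on which the two channels agree carry no extra information about bias. Those on which they disagree are candidates for a bias-driven divergence, under the assumption that the underlying preferences are shared between channels.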
CIRL ought to extract our revealed preferences (since it's based on behavioural policy) while a method like reinforcement learning from human preferences should extract our stated preferences - that might be a place to start, at least on validating that there actually are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the methods actually do end up with different policies.
The goal here would be to have some kind of 'dual channel' preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I'm sure you'd still need labelling and explicit information about what counts as a bias, but there might need to be a lot less than with single information sources. How much less (how much extra information you get from such divergences) seems like an empirical question. A useful first step would be to find out how common it is for divergences between stated and revealed preferences to actually influence the policies learned by agents designed to infer human preferences from actions versus approval.
In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.
What I've suggested should still help at least somewhat in the pessimistic scenario - unless preferences/beliefs vary when you switch between looking at approval vs actions more than biases vary, you can still gain some information on underlying preferences and beliefs by seeing how approval and actions differ.
Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try and find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.
There has already been research on using multiple information sources to improve the accuracy of preference learning (Reward-rational implicit choice), but not specifically on using the divergences between different sources of information from the same agent to learn things about the agent's unbiased preferences.
Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human's preferences.
The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is in preference aggregation. If we assume we have elicited the preferences of lots of individual humans, and we're then trying to aggregate their preferences (with each human's preference represented by a separate model) I think the same basic principle applies, that you can reduce the normative assumptions you need by using a more complicated voting mechanism, in this case one that considers agents' ability to vote strategically as an opportunity to reach stable outcomes.
I talk about this idea here. As with using approval/actions to improve the elicitation of an individual's preferences, you can't avoid making any normative assumptions by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences with few inbuilt assumptions with a similarly robust method of aggregating preferences, you're on your way to a full solution to ambitious value learning.
Thanks, I found this helpful.
If you had a complete perfect model of the human brain, would it help? I'm guessing you'll say "not unless you also have a function that inputs a snapshot of your brain model and outputs the associated beliefs / preferences / biases." Is that right?
In this post and the next, I try to clarify - for myself and for others - the precise practical implications of the "Occam's razor is insufficient to infer the preferences of irrational agents" paper.
Time and again, I've had trouble getting others to understand what that paper implies, and what it doesn't. It's neither irrelevant (like many no-free-lunch theorems), nor is it a radical skepticism/nothing is real/we can't really know anything paper.
I've been having productive conversations with Rebecca Gorman, whom I want to thank for her help (and who phrased things well in terms of latent variables)!
A simple biased agent
Consider the following simple model of an agent:
The agent's actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.
The results of the Occam's razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent's policy (and hence, a fortiori, from any observations of the agent's behaviour).
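A toy illustration of why the policy alone underdetermines the decomposition, in the spirit of the degenerate decompositions discussed in the Occam's razor paper: a fully rational planner paired with a reward R and a fully anti-rational planner paired with -R produce exactly the same policy. The states, actions, reward values, and function names below are hypothetical and chosen only for illustration.

```python
# Toy sketch: two different (planner, reward) pairs that yield the same policy.
# All states, actions, and reward numbers are made up for illustration.

ACTIONS = ["rest", "work", "snack"]

REWARD = {
    ("morning", "rest"): 0.1, ("morning", "work"): 0.9, ("morning", "snack"): 0.3,
    ("evening", "rest"): 0.8, ("evening", "work"): 0.2, ("evening", "snack"): 0.5,
}

def negate(reward):
    """Return the reward function with every value's sign flipped."""
    return {key: -value for key, value in reward.items()}

def rational_planner(reward, state):
    """Pick the action that maximises reward in this state."""
    return max(ACTIONS, key=lambda action: reward[(state, action)])

def anti_rational_planner(reward, state):
    """Pick the action that minimises reward in this state."""
    return min(ACTIONS, key=lambda action: reward[(state, action)])

def policy(planner, reward):
    """Map every state to the action the planner selects under this reward."""
    states = sorted({state for state, _ in reward})
    return {state: planner(reward, state) for state in states}

policy_a = policy(rational_planner, REWARD)
policy_b = policy(anti_rational_planner, negate(REWARD))

# Same observable behaviour, opposite "preferences".
print(policy_a == policy_b)
```

An observer who sees only the resulting policy has no way, from behaviour alone, to tell which of the two pairs generated it.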
Latent and "natural" variables
Let x be a latent variable of the policy π - or some variable that can be deduced from π in some simple or natural way.
A consequence of the Occam's razor result is that any such x will typically be a mixture of preferences, beliefs, and biases. For if the x tended to be restricted to one of these three components, that would mean that separating them would be possible via latent or simple variables.
So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.
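The PCA expectation can be illustrated with a linear toy model. It is only a sketch: it assumes three independent latent factors standing in for preferences, beliefs, and biases, and a hypothetical mixing matrix in which every observed behavioural feature depends on more than one factor. It shows the mixing in this toy setting and is not a demonstration of the general result. NumPy is assumed to be available.

```python
import numpy as np

def pca_latent_correlations(n_samples=5000, seed=0):
    """Correlation of each of the top 3 principal components with each latent."""
    rng = np.random.default_rng(seed)

    # Hypothetical latent factors; columns: preference, belief, bias.
    latents = rng.standard_normal((n_samples, 3))

    # Hypothetical mixing: each observed feature blends several latents.
    mixing = np.array([
        [1.0, 1.0, 0.0],
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
    ])
    observed = latents @ mixing.T

    # PCA via SVD of the centred observations.
    centred = observed - observed.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[:3].T

    corr = np.empty((3, 3))
    for k in range(3):          # principal component index
        for j in range(3):      # latent factor index
            corr[k, j] = np.corrcoef(scores[:, k], latents[:, j])[0, 1]
    return corr

corr = pca_latent_correlations()
print(np.round(corr, 2))
```

In this toy model, each row of the printed matrix (one principal component) has sizeable correlations with at least two of the three latent factors, so no component lines up with a single "preference", "belief", or "bias" axis.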
The optimistic scenario
To get around the impossibility result, we need "normative assumptions": assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don't need many of these, at least for identifying human preferences. We can label a few examples ("the anchoring bias, as illustrated in this scenario, is a bias"; "people are at least weakly rational"; "humans often don't think about new courses of action they've never seen before", etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D. Optimistically, these correspond quite closely to what we mean by these categories, at least when combined with the information of πh, the policy of human h. It is now possible for the algorithm to identify latent or natural variables that lie along the "preferences", "beliefs", and "biases" axes, thus identifying and isolating human preferences.
It seems there's a contradiction here - by definition, D does not contain much information, yet separating preferences may require a lot of information. The hope is that D acts as a doorway to other sources of information - such as human psychology papers, Wikipedia, human fiction, and so on. Call this other data D′.
The Occam's razor result still applies to D′+πh: one of the simplest explanations for πh is to assume that h is always rational and that D′ consists of "speech acts" (think of a dishonest politician's speech - you would not want to take the literal content of the speech as correct information). The result still applies even to D′+∑h∈Hπh, where we take the policies of every human in the set H of all humans.
However, it is hoped that D+D′+πh will allow the algorithm to effectively separate preferences from biases and beliefs. The hope is that D acts as a key to unlock the vast amount of information in D′ - that once the algorithm has a basic idea of what a preference is, then all the human literature on the meaning of preference becomes usable, not just as speech acts but as actual sources of information, as the algorithm realises the meaning of D′ the way we want it to, and realises which parts are lies/metaphors/exaggerations.
This is what we would hope would happen. Guided by our own intuitions - which have no problem distinguishing preferences in other humans and in ourselves, at least roughly - we may feel that this is likely.
The pessimistic scenario
In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
Consider for example the anchoring bias. I've argued that the anchoring bias is formally very close to being a taste preference.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different "type signature".
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other. It's not uncommon for parts of the brain to reuse other parts for different purposes; the purity moral preference, for example, recycles part of the emotion of disgust. Individual biases and preferences probably similarly use a lot of the same machinery in the brain, making it hard to tell the differences between them.
Thus providing a few examples of preferences/beliefs/biases, D, is not enough to disentangle them. Here D fails to unlock the meaning of D′ - when reading psychology papers, the algorithm sees a lot of behaviour ("this human wrote this paper; I could have predicted that"), but not information relevant to the division between preferences/beliefs/biases.
Pessimism, information, and circular reasoning
It's worth digging into that last point a bit more, since it is key to many people's intuitions in this area. On this website, we find a quote:
Taken literally, this would mean civil strife << war << peace. But no-one sensible would take it literally; first of all, we'd want to know if the quote was genuine, we'd want to figure out a bit about Herodotus's background, we'd want to see whether his experience is relevant, what has changed in warfare and human preferences over the centuries, and so on.
So we'd be putting the information into context, and, to do so, we'd be using our own theory of mind, our own knowledge of what a preference is, what beliefs and biases humans typically have...
There's a chicken and egg problem: it's not clear that extra information is much use to the algorithm, without a basic understanding of what preferences/beliefs/biases are. So without a good grasp to get started, the algorithm may not be able to use the extra information - even all the world's information - to get a further understanding. And human outputs - such as psychology literature - are written to be understood unambiguously (-ish) by humans. Thus interpreting it in the human fashion may rely on implicit assumptions that the algorithm doesn't have access to.
It's important to realise that this is not a failure of intelligence on the part of the algorithm. AIXI, the idealised uncomputable superintelligence, will fail at image classification tasks if we give it incorrectly labelled data or don't give it enough examples to resolve ambiguous cases.
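The point that more intelligence does not compensate for missing labels can be seen in a tiny hypothesis-enumeration sketch. It assumes a hypothetical classification task over two binary features and a learner that exhaustively considers every possible labelling function, which is as "smart" as a learner can be on this hypothesis space. The labelled examples are made up.

```python
from itertools import product

def consistent_hypotheses(labelled):
    """All Boolean functions on two bits that agree with the labelled examples."""
    inputs = list(product([0, 1], repeat=2))
    hypotheses = []
    for outputs in product([0, 1], repeat=len(inputs)):
        hypothesis = dict(zip(inputs, outputs))
        if all(hypothesis[x] == y for x, y in labelled.items()):
            hypotheses.append(hypothesis)
    return hypotheses

def predictions_on(hypotheses, x):
    """Set of labels the surviving hypotheses assign to input x."""
    return {hypothesis[x] for hypothesis in hypotheses}

# Hypothetical training data: only two of the four inputs are labelled.
labelled = {(0, 0): 0, (1, 1): 1}
survivors = consistent_hypotheses(labelled)

print(len(survivors))                     # hypotheses consistent with the labels
print(predictions_on(survivors, (1, 1)))  # labelled input: unanimous
print(predictions_on(survivors, (0, 1)))  # unlabelled input: still ambiguous
```

Exhaustive search leaves several hypotheses standing, and they disagree on the unlabelled input. Only more (correct) labels, or extra assumptions, can break the tie.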
Failure mode of pessimistic scenario
So the failure mode, in the pessimistic scenario, is that the algorithm generates the categories preferences*, beliefs*, and biases*, but that these don't correspond well to actual preferences, beliefs, or biases - at least not as we get beyond the training examples provided (it doesn't help that humans themselves have trouble distinguishing these in many situations!).
So, what the algorithm thinks is a preference may well be a mixture of all three categories. We might correct it by pointing out its mistakes and adding some more examples, but this might only carry it a bit further: whenever it gets to an area where we haven't provided labels, it starts to make large categorisation errors or stumbles upon adversarial examples.
This may feel counter-intuitive, because, for us, extracting preferences feels easy. I'll address that point in the next section, but I'll first note that algorithms finding tasks hard that we find easy is not unusual.
To reiterate: making the algorithm smarter would not solve the problem; the issue (in the pessimistic scenario) is that the three categories are not well-defined nor well-separated.
Pessimism: humans interpreting other humans
We know that humans can interpret the preferences, beliefs, and biases of other humans, at least approximately. If we can do it so easily, how could it be hard for a smart algorithm to do so?
Moravec's paradox might imply that it would be difficult for an algorithm to do so, but that just means we need a smart enough algorithm.
But the question might be badly posed, in which case infinite smartness would not be enough. For example, imagine that humans looked like this, with the "Human Agency Interpreter" (basically the theory of mind) doing the job of interpreting other humans. The green arrows are there to remind us how much of this is done via empathy: by projecting our own preferences/beliefs onto the human we are considering.
This setup also has an optimistic and a pessimistic scenario. They involve how feasible it is for the algorithm to isolate the "Human Agency Interpreter". In the optimistic scenario, we can use a few examples, point to the Wikipedia page on theory of mind, and the algorithm will extract a reasonable facsimile of the human agency interpreter, and then use that to get a reasonable decomposition of the human algorithm into beliefs/preferences/biases.
In the pessimistic scenario, the Human Agency Interpreter is also twisted up with everything else in the human brain, and our examples are not enough to disentangle it, and the same problem re-appears at this level: there is no principled way of figuring out the human theory of mind, without starting from the human theory of mind.
[1] It may seem odd that there is an arrow going from observations to preferences, but a) human preferences do seem to vary in time and circumstances, and b) there is no clear distinction between observation-dependent and observation-independent preferences. For example, you could have a preference for eating when you're hungry; is this an eating preference that is hunger-dependent, or an eating-when-hungry preference that is independent of any observations? Because of these subtleties, I've preferred to draw the arrow unambiguously going into the preferences node, from the observations node, so that there is no confusion.
[2] This data may end up being provided implicitly, by programmers correcting "obvious mistakes" in the algorithm.