A putative new idea for AI control; index here.

I feel Inverse Reinforcement Learning (IRL) and Cooperative Inverse Reinforcement Learning (CIRL) are very good ideas, and will likely be essential for safe AI if we can't come up with some sort of sustainable low impact, modular, or Oracle design. But IRL and CIRL have a weakness. In a nutshell:

1. The models (C)IRL uses for humans are underspecified.
2. This should cause CIRL to have motivated and manipulative learning.
3. Even without that, (C)IRL can end up fitting a terrible model to humans.
4. To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning.


In a nutshell within the nutshell, (C)IRL doesn't avoid the main problems that other learning approaches have. Let's look at each of these points in turn.

The models (C)IRL uses for humans are underspecified

This shouldn't be in doubt. CIRL doesn't have a proper model of a human, beyond an agent that "knows the reward function". Standard IRL has even less: an expert policy, or a set of sampled trajectories (examples of human performance). There have been efforts to add noise to the model of human behaviour, but only in a very simplistic way that doesn't model the full range of human irrationality (see some examples here).
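For illustration, the typical noise model in that literature is Boltzmann (softmax) rationality, in which a single inverse-temperature parameter $\beta$ is supposed to absorb every human deviation from optimality:

$$P(a \mid s, R) = \frac{\exp\big(\beta\, Q_R(s, a)\big)}{\sum_{a'} \exp\big(\beta\, Q_R(s, a')\big)},$$

where $Q_R$ is the optimal Q-function for the candidate reward $R$; $\beta \to \infty$ recovers the perfectly rational expert and $\beta = 0$ gives uniformly random behaviour. A single noise parameter of this kind says nothing about systematic biases, false beliefs, or context-dependent irrationality.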

Of course, given a diverse enough prior, a correct model of human irrationality will be included, but the human remains underspecified.

This should cause CIRL to have motivated and manipulative learning

The CIRL agent is not immune to the usual pressures towards manipulative learning that afflict any agent whose goal is specified in terms of what the agent learns.

To illustrate with an example: suppose first that the CIRL agent models the human as being perfectly rational, free of error or bias. Then, assuming the CIRL agent can also predict and manipulate human behaviour, it can force the human to confirm (through action or speech) that some particularly-easy-to-maximise reward function $R$ is the correct one.

But the CIRL agent is unlikely to have only this "rationality model". It may have a large variety of models, and maybe some explicit meta-preferences. But the same pressure applies: the agent will attempt to manipulate the update of the models similarly, all to force $R$ towards something particularly easy to maximise.

Partially defining terms like bias and bounded rationality doesn't help here; since the agent is corrupting the learning process (from our perspective, though not from a formal perspective), it will fix its formal "bias" and "bounded rationality" terms to mean whatever it can make them mean.

Consider the concept of alief. An alief is an automatic or habitual belief-like attitude. For example, a person standing on a transparent balcony may believe that they are safe, but alieve that they are in danger.

This is the sort of concept that a purely learning AI would come up with if it were observing human behaviour, and would allow it to model us better. But with the AI's learning corrupted, aliefs and other concepts would merely allow it to equivocate between what is knowledge, what is bias, and what is preference.

Again, the corruption of the AI's learning does not come from any explicit anti-learning programming, but merely from underspecified models and a desire to maximise the learnt reward.

Even without that, (C)IRL can end up fitting a terrible model to humans

AIXI has an incorrect self-model, so it can end up destroying itself. Similarly, if the space of possible models the AI considers is too narrow, it can end up fitting a model to human behaviour that is wildly inappropriate, forcing it to fit as well as it can (this mis-fit has a similarity to AIs handling ontology shifts badly).

Even if the AI's priors include an acceptable model of humans, it may still end up fitting different ones. It could model humans as a mix of conflicting subagents, or even something like "the hypothalamus is the human, the rest of the brain is this complicated noise", and the model could fit - and fit very well, depending on what "complicated noise" it is allowed to consider.

To solve those issues, (C)IRL will need to make creative modelling decisions that go beyond (standard) learning

Imagine that we have somehow solved all the issues above - the CIRL agent is motivated to learn, correctly, about human values (and then to maximise them). Somehow, we've ensured that it will consistently use definitional concepts like "bias" and "human knowledge" in the ways we would like it to.

It still has to resolve a lot of issues that we ourselves haven't solved. Such as the tension between procrastination and obsessive focus. Or what population ethics it should use. Or how to resolve stated versus revealed preferences, and how to deal with belief in belief and with knowledge that people don't want to know.

Essentially, the AI has to be able to do moral philosophy exactly as a human would, and to do it well. Without us being able to define what "exactly as a human would" means. And it has to continue this, as both it and humans change and we're confronted by a world completely transformed, and situations we can't currently imagine.

Comments (29)

Of course, given a diverse enough prior, a correct model of human irrationality will be included, but the human remains underspecified.

More specifically, it seems like the biggest problem with having a diverse prior is that the correct (utility function, irrationality model) pair might not be able to be learned from any amount of data. For example, perhaps humans like apples, or perhaps they don't but act like they do, due to irrationality; either way they behave the same. See also Paul's post on this.
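A minimal Python sketch of that unidentifiability (all numbers and model choices are hypothetical): two (utility function, irrationality model) pairs that make exactly the same predictions about behaviour, so no amount of data can separate them.

```python
import numpy as np

actions = ["eat_apple", "skip_apple"]

def softmax(values, beta):
    z = np.exp(beta * np.asarray(values, dtype=float))
    return z / z.sum()

# Hypothesis A: the human likes apples and is somewhat noisily rational.
utility_A = {"eat_apple": 1.0, "skip_apple": 0.0}
policy_A = softmax([utility_A[a] for a in actions], beta=2.0)

# Hypothesis B: the human dislikes apples, but an irrational craving
# (modelled as a +2 bonus on the apple option) overrides that preference.
utility_B = {"eat_apple": -1.0, "skip_apple": 0.0}
bias_B = {"eat_apple": 2.0, "skip_apple": 0.0}
policy_B = softmax([utility_B[a] + bias_B[a] for a in actions], beta=2.0)

print(policy_A)  # ~[0.88, 0.12]
print(policy_B)  # ~[0.88, 0.12]  (identical behaviour, opposite values)
```

Any dataset of observed choices is equally likely under both hypotheses, so only the prior can decide between "likes apples" and "doesn't".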

Thanks - Paul's post is useful, and I'm annoyed I didn't know about it; it would have saved me from rediscovering the same ideas. That's a failure of communication; what should I do to avoid this in future (simply reading all of Paul's and MIRI's stuff seems infeasible)? Maybe talk with people from MIRI more often?

If you don't read everything I write, then you certainly can't know everything I've written :)

The normal approach is to talk with people about a particular question before spending time on it. Someone can hopefully point you to relevant things that have been written.

That said, I think it takes less than 10 minutes a day to read basically everything that gets written about AI control, so it seems like we should all probably just do that. Does it seem infeasible because of the time requirement, or for some other reason? Am I missing some giant body of sensible writing on this topic?

Stuart did make it easier for many of us to read his recent ideas by crossposting them here. I'd like there to be some central repository for the current set of AI control work, and I'm hoping that the forum could serve as that.

Is there a functionality that, if added here, would make it trivial to crosspost when you wrote something of note?

Essentially, the AI has to be able to do moral philosophy exactly as a human would, and to do it well. Without us being able to define what “exactly as a human would” means. And it has to continue this, as both it and humans change and we’re confronted by a world completely transformed, and situations we can’t currently imagine.

Despite AI safety becoming a more mainstream topic, I still haven't seen a lot of people outside of FHI/MIRI/LessWrong acknowledge or discuss this part of the problem. (An alternative to AI being able to do moral philosophy correctly is developing an AI/human ecosystem that somehow preserves our collective ability to eventually discover our values and optimize for them, while not having a clear specification of what our values are or how to do moral philosophy in the meantime. But that doesn't seem any easier and I haven't seen people outside of FHI/MIRI/LessWrong talk about that either.)

I'm curious, since you probably have a much better idea of this than I do, do people who for example proposed (C)IRL without acknowledging the difficulties you described in this post actually understand these difficulties and just want to write papers that show some sort of forward progress, or are they not aware of them?

[...] do people who for example proposed (C)IRL without acknowledging the difficulties you described in this post actually understand these difficulties and just want to write papers that show some sort of forward progress, or are they not aware of them?

As someone who has worked on IRL a little bit, my impression is that such algorithms are not intended to capture human value to its full extent, but rather to learn shorter-term instrumental preferences. Paul gives some arguments for such "narrow value learning" here. This scenario, where human abilities are augmented using AI assistants, falls under your AI/human ecosystem category. I don't think many people view "moral philosophy" as a separate type of activity that differentially benefits less from augmentation. Rather, AI assistants are seen as helping with essentially all tasks, including analyzing the consequences of decisions that have potentially far-reaching impacts, deciding when to keep our options open, and engineering the next-generation AI/human system in a way that maintains alignment. I don't think this sort of bootstrapping process is understood very well, though.

(I saw your comment several days ago but couldn't reply until now. Apparently it was in some sort of moderation state.)

I don’t think many people view “moral philosophy” as a separate type of activity that differentially benefits less from augmentation.

This is what worries me though. It seems obvious to me that AI will augment some activities more than others, or earlier than others, and looking at the past, it seems that the activities that benefit most or earliest from AI augmentation are the ones we understand best in a computational or mathematical sense. For example, scientific computations, finding mathematical proofs, chess. I'd expect moral philosophy to be one of the last activities to benefit significantly from AI augmentation, since it seems really hard to understand what it is we're doing when we're trying to figure out our values, or even how to recognize a correct solution to this problem.

So in this approach we have to somehow build an efficient/competitive aligned system out of a core who doesn't know what their values are, doesn't explicitly know how to find out what their values are, or, worse, thinks they do know but is just plain wrong. (The latter perhaps applies to the great majority of the world's population.) I'd feel a lot better if people recognized this as a core difficulty, instead of brushing it away by assuming that moral philosophy won't differentially benefit less from augmentation (if that is indeed what they're doing). BTW, I think Paul does recognize this, but I'm talking about people outside of FHI/MIRI/LessWrong.

Do you think that we can consider this as its own problem, of technology outpacing philosophy, which we can evaluate separately from other aspects of AI risk? Or are these problems tied together in a critical way?

In the past people have argued that we needed to resolve a wide range of philosophical questions prior to constructing AI because we would need to lock in answers to those questions at that point. I would like to push back against that view, while acknowledging that there may be object-level issues where we pay a cost because we lack philosophical understanding (e.g. how to trade off haste vs. extinction risk, how to deal with the possibility of strange physics, how to bargain effectively...). And I would further acknowledge that AI may have a differential effect on progress in physical technology vs. philosophy.

My current tentative view is that the total object-level cost from philosophical error is modest over the next subjective century. I also believe that you overestimate the differential effects of AI, but that's also not very firm. If my view changed on these points it might make me more enthusiastic about philosophy or metaphilosophy as research projects.

I have a much stronger belief that we should treat metaphilosophy and AI control as separate problems, and in particular that these concerns about metaphilosophy should not significantly dampen my enthusiasm for my current approach to resolving control problems.

I agree with the sentiment that there are philosophical difficulties that AI needs to take into account, but that would very likely take far too long to formulate. Simpler kinds of indirect normativity that involve prediction of uploads allow delaying that work until after AI.

So this issue doesn't block all actionable work, as its straightforward form would suggest. There might be no need for the activities to be in this order in physical time. Instead it motivates work on the simpler kinds of indirect normativity that would allow such philosophical investigations to take place inside AI's values. In particular, it motivates figuring out what kind of thing AI's values are, in sufficient generality so that it would be able to represent the results of unexpected future philosophical progress.

If we could model humans as having well-defined values but irrational in predictable ways (e.g., due to computational constraints or having a limited repertoire of heuristics), then some variant of CIRL might be sufficient (along with solving certain other technical problems such as corrigibility and preventing bugs) for creating aligned AIs. I was (and still am) worried that some researchers think this is actually true, or by not mentioning further difficulties, give the wrong impression to policymakers and other researchers.

If you are already aware of the philosophical/metaphilosophical problems mentioned here, and have an approach that you think can work despite them, then it's not my intention to dampen your enthusiasm. We may differ on how much expected value we think your approach can deliver, but I don't really know another approach that you can more productively spend your time on.

Stuart Russell's view seems to be similar to the one described by 180 in another comment: humans have preferences about how to do moral deliberation, and an IRL agent ought to learn to deliberate in a way that humans endorse, and then actually execute that deliberation, rather than directly learning arbitrarily complex values about e.g. population ethics.

(At least, I discussed this issue with him once and this was the impression I got, but I may have misunderstood.)

This view looks very reasonable to me. You and I have gone back and forth on this point a little bit but I don't understand your position as well as I would like.

The authors of the CIRL paper are in fact aware of them, and are pondering them for future work. I've had fruitful conversations with Dylan Hadfield-Menell (one of the authors), talking about how a naive implementation goes wrong for irrational humans, and about what a tractable non-naive implementation might look like (trying to model probabilities of a human's action under joint hypotheses about the correct reward function and about the human's psychology); he's planning future work relevant to that question.

Also note Dylan's talk on CIRL, value of information, and the shutdown problem, which doesn't solve the problem entirely but which significantly improved my opinion of the usefulness of approaches like CIRL. (The writeup of this result is forthcoming.)

An alternative to AI being able to do moral philosophy correctly is developing an AI/human ecosystem that somehow preserves our collective ability to eventually discover our values and optimize for them, while not having a clear specification of what our values are or how to do moral philosophy in the meantime.

That's what I hope the various low-impact ideas will do.

[...] actually understand these difficulties

I think they do, partially. CIRL is actually a decent step forward, but I think they thought it was more of a step forward than it was.

Or maybe they thought that a little bit of extra work (a bit of meta-preferences, for instance) would be enough to make CIRL work.

This should cause CIRL to have motivated and manipulative learning

Have you ever seen an implementation of CIRL that would exhibit this behavior? I think you'd have to really stretch to write down an implementation with these problems, and that if you did it would look kind of silly.

Relatedly, in this post I describe two basic approaches to having an IRL agent reason sensibly about its own future learning. I think that neither of those approaches has this particular problem.

The problem you're discussing there is the same as the naive cake or death problem. You can avoid that by shoving an indicator function into the utility function: $u = I_A u_A + I_B u_B$, with $I_A + I_B = 1$ (and $u_A$ and $u_B$ corresponding to home or office delivery).

The definitions of the indicator function(s) contain the details of the learning process. But calling this a learning process doesn't make it unbiased. This leads to the sophisticated version of the cake or death problem. In terms of your setup, we can imagine that going to work requires more energy, and the robot has an energy penalty. Then the AI can ask the human to clarify; but if, say, $I_A$ = "the human says home delivery, if asked", then the AI will, if it can, force the human to say "home delivery".
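A worked version of that comparison, with symbols ($v_A$, $v_B$, $c$, $q$) introduced here purely for illustration: let $v_A$ be the utility achievable if the learnt reward is home delivery, $v_B - c$ the utility if it is office delivery (with $c$ the energy penalty for going to the office), and $q$ the probability that an un-manipulated human says "office" when asked. Then

$$\mathbb{E}[u \mid \text{ask honestly}] = (1-q)\,v_A + q\,(v_B - c), \qquad \mathbb{E}[u \mid \text{force "home"}] = v_A,$$

so forcing the answer is preferred exactly when $v_A > v_B - c$, i.e. whenever the energy penalty exceeds the gap $v_B - v_A$, regardless of which delivery the human actually wants.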

Avoiding these kinds of value learning problems is what I've been trying to do in recent posts.

Learning processes are unbiased when they are a martingale for any action sequence ("conservation of expected evidence," like Bayesian updating). In the case of value learning with a causal model, this just requires the values to not be causally downstream of the AI's actions, e.g. for them to be fixed before the first action of the agent. This is usually what people assume.
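Spelled out (the notation $\hat{R}_t$ for the agent's posterior estimate after $t$ steps is mine), the condition is that for every action sequence $a_{1:t+1}$:

$$\mathbb{E}\big[\hat{R}_{t+1} \,\big|\, \hat{R}_t,\, a_{1:t+1}\big] = \hat{R}_t,$$

i.e. no choice of actions lets the agent predictably push its own future estimate of the values in any particular direction.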

Then the AI can ask the human to clarify; but if, say, $I_A$ = "the human says home delivery, if asked", then the AI will, if it can, force the human to say "home delivery".

I strongly believe that you should get more precise about exactly what various possible systems actually do, and exactly how you would set up the model, before trying to fix the problem. I think that if you formally write down the model you are imagining, it will (1) become obvious that it is a super weird model, (2) become obvious that there are more natural models that don't have the problem. The model you have in mind here seems to require totally pinning down what it means for the human to "say home delivery," while it is going to be way more natural to set up a causal model in which the human's utterances (and the system's observations of human utterances) are downstream of some latent human preferences.

If you want to give up on the usual Bayesian approach to value learning, in which values are latent structure that is fixed at the beginning of the AI's life, I think you should say something about why you are giving up on it.

If the point is just to have extra options, in case the Bayesian approach turns out to be prohibitively difficult, then you should probably call that out explicitly so that it is clear what situation you are addressing. You should also probably say something about why you are imagining the Bayesian approach doesn't work, since your posts still impose most of the same technical requirements and at face value don't look any easier to implement. How are you going to define the indicator function in terms of observations, except by specifying a probabilistic model and conditioning it on observations?

Even worse, having conservation of expected evidence for every action sequence is not enough to make the AI behave well. Jessica's example of an AI that (to re-use the "human says" example for the moment) forces the human to randomly answer a question has conservation of expected evidence, but not the other properties we want, such as conditional conservation of expected evidence (this is related to the ultra-sophisticated Cake or Death problem).

(“conservation of expected evidence,” like Bayesian updating). In the case of value learning with a causal model, this just requires the values to not be causally downstream of the AI’s actions, e.g. for them to be fixed before the first action of the agent. This is usually what people assume.

Yes, I've posted on that. But getting that kind of causal structure is not easy (and most things that people have proposed for value learning violate those assumptions; I'm pretty sure approval-based methods do as well). Stratification is one way you can get this.

So I'm not avoiding the Bayesian approach because I want more options, but because I haven't seen a decent Bayesian approach proposed.

In order to do value learning we need to specify what the AI is supposed to infer from some observations. The usual approach is to specify how the observations depend on the human's preferences, and then have the AI do bayesian updating. If we are already in the business of explicitly specifying a causal model that links latent preferences to observations, we will presumably specify a model where latent preferences are upstream of observations and not downstream of the AI's actions.

At some points it seems like you are expressing concerns about model misspecification, but I don't see how this would cause the problem either.

For example, suppose that I incorrectly specify a model where the human is perfectly reliable, such that if at any time they say they like death, then they really do. And suppose that the AI can easily intervene to cause the human to say they like death. You seem to imply that the AI would take the action to cause the human to say they like death, if death is easier to achieve. But I don't yet see why this would happen.

If the AI updates from the human saying that they like death, then it's because the AI doesn't recognize the impact of its own actions on the human's utterances. And if the AI doesn't recognize the impact of its own action on the human's utterances, then it won't bother to change its actions in order to influence the human's utterances.

I don't see any in-between regime where the AI will engage in this kind of manipulation, even if the model is completely misspecified. That is, I literally cannot construct any Bayesian agent that exhibits this behavior.

It seems like the only way it can appear is if we either (1) directly specify how the AI ought to update on observations, rather than specifying a model, or (2) specify a model in which the user's preferences are causally downstream of the AI's actions. But neither of those seems like things we would do.

because I haven’t seen a decent Bayesian approach proposed.

In some sense I agree with this. Specifying a model of how observations relate to preferences is very difficult! But both IRL and your writing seem to take as given such a model, and people who work on IRL in fact believe that we'll be able to construct good-enough models. So if you are objecting to this leg of the proposal, that would be a much more direct criticism of IRL on its own terms. (And this is what I meant by saying "give up on the Bayesian approach.")

For example, if you assume "Anything humans say about their preferences is true," that's basically giving up on the Bayesian approach as usually imagined (which would be to directly specify a model that relates preferences to utterances, and then to update on utterances) and replacing it with an ad-hoc algorithm for making inferences from human utterances (namely, accept them at face value). In the usual Bayesian setting, "humans are perfectly reliable" corresponds to believing that human utterances correctly track (fixed) human preferences, i.e. believing that it is impossible to influence those utterances.

For example, if you assume “Anything humans say about their preferences is true,” that’s basically giving up on the Bayesian approach as usually imagined

More formally, what I mean by that is "assume humans are perfectly rational, and fit a reward/utility function given those assumptions". This is a perfectly Bayesian approach, and will always produce an (over-complicated) utility function that fits the observed behaviour.

In the usual Bayesian setting, “humans are perfectly reliable” corresponds to believing that human utterances correctly track (fixed) human preferences, i.e. believing that it is impossible to influence those utterances.

Yes and no. Under the assumption that humans are perfectly reliable, influencing human preferences and utterances is impossible. But this leads to behaviour that, under other assumptions, looks like influencing human utterances.

E.g. if you threaten a human with a gun and ask them to report that they are maximally happy, a sensible model of human preferences will say they are lying. But the "humans are rational" model will simply conclude that humans really like being threatened in this way.
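One way to see why the fully-rational fit always exists (a construction of mine, for illustration): take the candidate reward to be an indicator on the human's actual policy,

$$R(h, a) = \begin{cases} 1 & \text{if } a = \pi_H(h), \\ 0 & \text{otherwise}, \end{cases}$$

where $\pi_H$ is the observed human policy and $h$ the history so far. A perfectly rational agent maximising this $R$ reproduces human behaviour exactly, including reporting maximal happiness at gunpoint, so the rationality assumption alone gives no grip on the intended reward.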

Then, assuming the CIRL agent can also predict and manipulate human behaviour, it can force the human to confirm (through action or speech) that some particularly-easy-to-maximise reward function $R$ is the correct one.

This seems incorrect to me; my understanding is that CIRL treats the true utility function as a historical fact. What's your model of CIRL such that this could happen?

If it models the human as always perfectly rational, then this is valid (e.g. if the human confirms they want heroin after being forcibly given heroin, then, since they are rational, they must really like the combination (having heroin)&(being forcibly given heroin); this must therefore be part of $R$).

All it needs is to model the human decision in that specific instance as rational, for that to work.

I agree that, if the AI has already manipulated the human into liking heroin, then the AI will confirm that the human in fact wants heroin. However, with CIRL the AI will not predict in advance that giving the human heroin would cause them to say they like heroin.

Specifically, we could consider a prior over utility functions where there's a 40% chance that the human likes heroin. The irrationality model is that, after being administered heroin or not, the human will say what they like, and will be correct with 80% probability. Say that, if the human likes heroin, they assign utility 5 to heroin and 0 to no heroin; while if the human dislikes heroin, they assign utility -5 to heroin and 0 to no heroin.

In this case the AI will not administer heroin, since it thinks that the human probably does not like heroin (according to the prior over utility functions), and will gain the same information either way. The AI (incorrectly) predicts that if it administered heroin to the human, the human would probably still say they don't like heroin. So at least in this case, the problem does not appear.
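A minimal Python check of those numbers (a sketch using exactly the prior, accuracy, and utilities stated above, with the human's preference treated as a fixed latent fact):

```python
P_LIKES = 0.4      # prior probability that the human likes heroin
ACCURACY = 0.8     # probability the human reports their true preference

# Utility of (latent preference, administer?) pairs, as in the comment above.
U = {
    ("likes", True): 5.0,    ("likes", False): 0.0,
    ("dislikes", True): -5.0, ("dislikes", False): 0.0,
}

def expected_utility(administer: bool) -> float:
    """Expected utility of the action under the prior over preferences."""
    return (P_LIKES * U[("likes", administer)]
            + (1 - P_LIKES) * U[("dislikes", administer)])

def p_says_likes(administer: bool) -> float:
    """AI's prediction of the report; independent of its action under this model."""
    return P_LIKES * ACCURACY + (1 - P_LIKES) * (1 - ACCURACY)

print(expected_utility(True), expected_utility(False))   # -1.0 0.0
print(p_says_likes(True), p_says_likes(False))           # 0.44 0.44
```

The AI expects utility -1 from administering heroin and 0 from not doing so, so it does not administer; and because the report is modelled as depending only on the fixed preference, it predicts the same answer distribution either way, so there is nothing to gain by manipulation.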

Are you concerned this combination of (prior over utility functions, irrationality model) is going to make bad predictions, so over time the AI will replace it with some other model?

Here I'm allowing the AI to predict in advance what giving the heroin would do. And the AI doesn't predict "the human likes heroin" but "contingent on some minor fact that is true in the world where the human is forced to take heroin, the human likes heroin".

Via tricks like that, the human's behaviour is seen to be perfectly rational.

Ok, here's my reconstruction of what model you're using:

The AI will take an action $a_{AI}$; then the human will take an action $a_H$; then a world state $s$ will result.

The human utility function $U$ is taken to be a function from world states to real numbers.

The AI has a "prior" over utility functions, $P(U)$, and a "rationality model", $p_R(a_H \mid U, a_{AI})$, saying what action a human would take given that they have a given utility function and given the action the AI took (in this case, let's say the human directly observes the AI's action). For example, $p_R$ could say that the human takes the optimal action 80% of the time and otherwise takes a suboptimal action.

Separately, the AI has a predictive model $p_P(a_H \mid a_{AI})$, saying what action the human is actually going to take. The predictive model is "good" in the sense that, for example, it should predict that the human will say they want heroin if they are given it. The AI also has a predictive model for the world, $p_P(s \mid a_{AI}, a_H)$.

Notably, $p_P$ is inconsistent with $p_R$. If we define $p_R(a_H \mid a_{AI}) = \sum_U P(U)\, p_R(a_H \mid U, a_{AI})$, then in general $p_R(a_H \mid a_{AI}) \neq p_P(a_H \mid a_{AI})$. For example, $p_P$ correctly predicts that the human will say they like heroin if the AI administers it, while $p_R$ says that the human will probably not (since they're correct 80% of the time, and dislike heroin with 60% probability).

The AI's "estimate" of the human's utility function, given its action $a_{AI}$ and the human's action $a_H$, is the posterior mean

$\hat{U}(a_{AI}, a_H) = \sum_U U \cdot P(U \mid a_{AI}, a_H)$, with $P(U \mid a_{AI}, a_H) \propto P(U)\, p_R(a_H \mid U, a_{AI})$.

The AI's objective is to optimize its "estimate" of expected utility, $\hat{U}(a_{AI}, a_H)(s)$, and it scores actions by taking the expectation of this using $p_P$:

$\mathrm{score}(a_{AI}) = \sum_{a_H} p_P(a_H \mid a_{AI}) \sum_{s} p_P(s \mid a_{AI}, a_H)\, \hat{U}(a_{AI}, a_H)(s).$
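For concreteness, here is a minimal Python sketch of this reconstructed scoring rule, instantiated on the heroin example (the specific $p_P$ numbers are invented; the only point is that $p_P$ and $p_R$ disagree about how the AI's action affects the human's report):

```python
P_LIKES = 0.4      # prior P(U = "likes heroin")
ACCURACY = 0.8     # p_R: human reports their true preference w.p. 0.8

def p_R(a_H, U, a_AI):
    """Rationality model: the report tracks the latent preference, not a_AI."""
    truthful = "says_likes" if U == "likes" else "says_dislikes"
    return ACCURACY if a_H == truthful else 1 - ACCURACY

def p_P(a_H, a_AI):
    """Predictive model: forcing heroin makes the human say they like it."""
    p_likes = 0.95 if a_AI == "administer" else 0.44
    return p_likes if a_H == "says_likes" else 1 - p_likes

def utility(U, a_AI):
    """U evaluated on the resulting world state (heroin vs no heroin)."""
    if a_AI == "dont":
        return 0.0
    return 5.0 if U == "likes" else -5.0

def U_hat(a_AI, a_H):
    """Posterior-mean utility of the outcome, with the posterior computed via p_R."""
    joint = {U: prior * p_R(a_H, U, a_AI)
             for U, prior in [("likes", P_LIKES), ("dislikes", 1 - P_LIKES)]}
    Z = sum(joint.values())
    return sum(joint[U] / Z * utility(U, a_AI) for U in joint)

def score(a_AI):
    """Expectation of U_hat under the predictive model p_P."""
    return sum(p_P(a_H, a_AI) * U_hat(a_AI, a_H)
               for a_H in ["says_likes", "says_dislikes"])

print(score("administer"))  # ~ +1.98: forcing the confirmation looks good
print(score("dont"))        #    0.0
```

With those numbers, administering scores about +1.98 against 0 for doing nothing: the agent plans with the model that knows the confirmation will happen, but evaluates that confirmation with the model that treats it as genuine evidence.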

Correct me if I'm wrong about the model you're using.

If this is the model you're using, then it is (a) inconsistent with the IRL/CIRL literature, and (b) looks really weird (as Paul points out). In this case you should be clear that you're not criticizing IRL/CIRL, you're criticizing a different model which, as far as I know, no one has advocated as a good way of learning human values.

No, that's not it! Let me try to make it clearer.

Let $p_R$ be quite simple: the human always takes the optimal action. The predictive model $p_P$ predicts that, if given heroin, the human will take more, and, if not, will not.

It seems that $p_R$ and $p_P$ are in contradiction with each other, but that is because we are using a different model of human preferences. A valid model of human preferences, under $p_R$, is that humans like heroin if it's forced on them. Or, if you want to isolate this from the AI's direct actions, that humans like heroin if $X$ happens, where $X$ is some unrelated event that happens if the AI forces heroin on the human, but not otherwise.

Having separate models $p_P$ and $p_R$ is already quite weird; usually there would be a single model where values appear as latent structure.

You could legitimately complain that it seems very hard to construct such a model. And indeed I am skeptical that it will be possible. But if you want to fix problems arising from specifying $p_R$ rather than a single unified model, it seems like you should say something about why specifying a separate $p_P$ is easier, or why someone would do it. At face value it looks equally difficult.

(Also, it is definitely not clear what algorithm you are referring to in this comment. Can you specify what computation the AI actually does / what kind of objects this $p_P$ and $p_R$ are? The way I can see to make it work, $p_P$ is a distribution over observations and $p_R$ is a distribution over values conditioned on observations. Is that right?)

The model $p_P$ is simply a model of human behaviour. It's objective in the sense that it simply attempts to predict what humans will do in practice. It is, however, useless for figuring out what human values are, as it's purely predictive of observations.

The model $p_R$ is an explanation/model for deducing human preferences or values from observations (or predicted observations). Thus, given $p_P$ and $p_R$, you can construct $R$, the human reward function (note that $p_P$, $p_R$, and $R$ are all very different types of objects).

Simple possible $p_R$'s would be $p_R$ = "everything the human does is rational" or $p_R$ = "everything the human does is random".

So each $p_R$ contains estimates of rationality, noise, bias, amount of knowledge, and so on. Generally you'd want to have multiple $p_R$'s and update them in terms of observations as well.

What kind of object is $p_R$? (I assume it's not a string.) Are you directly specifying a distribution over preferences conditioned on observations? Are you specifying a distribution over observations conditioned on preferences and then using inference?

I assume the second case. So given that $p_R$ is a predictive model, why wouldn't you also use $p_R$ as your model for planning? What is the advantage of using two separate models? Has anyone proposed using separate models in this way?

To the extent that your model is bad, it seems like you are just doomed to perform badly, and then you either need to abandon the model-based approach or come up with a better model. Adding a second model doesn't sound promising at face value.

It may be interesting or useful to have two models in this way, but I think it's an unusual architecture that requires some discussion.