
Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math).

Copying the introduction here:

The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise.

My main point is that IRL, as it is typically described, feels nearly complete: just throw in a more advanced RL algorithm as a subroutine and some narrow-AI-type add-on for identifying human actions from a video feed, and voila, we have a superhuman human helper.
[...]
But maybe we could be spending more effort trying to follow through to fully specified proposals which we can properly put through the gauntlet.

Regardless of whether it is intended or not, this sounds like a dig at CHAI's work. I do not think that IRL is "nearly complete". I expect that researchers who have been at CHAI for at least a year do not think that IRL is "nearly complete". I wrote a sequence partly for the purpose of telling everyone "No, really, we don't think that we just need to run IRL to get the one true utility function; we aren't even investigating that plan".

(Sorry, this shouldn't be directed just at you in particular. I'm annoyed at how often I have to argue against this perception, and this paper happened to prompt me to actually write something.)

Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.

I think Stuart Russell still gives this impression in his (many) articles and interviews. I remember getting this impression listening to a recent interview, but will quote this Nov 2018 article instead since many of his interviews don't have transcripts:

Machines are beneficial to the extent that their actions can be expected to achieve our objectives [...]

It turns out, however, that it is possible to define a mathematical framework leading to machines that are provably beneficial in this sense. That is, we define a formal problem for machines to solve, and, if they solve it, they are guaranteed to be beneficial to us. In its simplest form, it goes like this:

  • The world contains a human and a machine.
  • The human has preferences about the future and acts (roughly) in accordance with them.
  • The machine’s objective is to optimise for those preferences.
  • The machine is explicitly uncertain as to what they are. [...]

There are two primary sources of difficulty that we are working on right now: satisfying the preferences of many humans and understanding the preferences of real humans. [...]

Machines will need to “invert” actual human behaviour to learn the underlying preferences that drive it.

Does this not sound like a plan of running (C)IRL to get the one true utility function?

Does this not sound like a plan of running (C)IRL to get the one true utility function?

I do not think that is actually his plan, but I agree it sounds like it. One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

But really my answer is, the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.) I suspect he has very different empirical beliefs, such that you could reasonably say that he's working on a "different problem", in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

But really my answer is, the inferential distance between Stuart and the typical reader of this forum is very large. (The inferential distance between Stuart and me is very large.)

I would be interested to better understand Stuart Russell's perspective. What would you recommend that I read or watch in order to do that?

I suspect he has very different empirical beliefs, such that you could reasonably say that he’s working on a “different problem”, in the same way that MIRI and I work on radically different stuff mostly due to different empirical beliefs.

How many "different problems" would you say that people at CHAI are working on? (Are there more besides yours and Russell's?) How many people are working on each "different problem"?

I would be interested to better understand Stuart Russell's perspective. What would you recommend that I read or watch in order to do that?

Sadly I don't have any recommendations.

How many "different problems" would you say that people at CHAI are working on? (Are there more besides yours and Russell's?) How many people are working on each "different problem"?

That's... hard to answer. I feel like most graduate students at CHAI have a somewhat different opinion of what causes AI risk / what needs to be done to solve it, such that everyone is working on a "different problem". So really I should be trying to quantify how different they are... but that seems hard to do.

To be clear, I think we all basically agree on high-level aspects, for example that it would be worrying if we had a very intelligent agent that we couldn't understand, or that a true expected utility maximizer with some simple utility function would likely have convergent instrumental subgoals.

Sadly I don’t have any recommendations.

This seems like a strange state of affairs. If he thinks there's an important problem to be solved, and he has a unique perspective on what solving that problem involves, why hasn't he produced a paper or blog post or talk to explain what that perspective is? Is he expecting to solve the problem all by himself? Can you share your model of what's going on?

That’s… hard to answer. I feel like most graduate students at CHAI have a somewhat different opinion of what causes AI risk / what needs to be done to solve it, such that everyone is working on a “different problem”.

Same question here. Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem? Are they trying to work out among themselves which "different problem" is the real one?

Maybe one reason to not write up one's own "different problem" is that one doesn't expect to be able to convince anyone else to work on it or to receive useful feedback. If that's the main reason, I argue that it's still important to write it up in order to provide information to funders, strategists and policy makers about how much disagreement there is among AI safety researchers, and how many resources are needed to "cover all the bases" in technical AI safety research. If this seems like a reasonable argument, maybe you could help convey it to your professors and fellow students?

This seems like a strange state of affairs. If he thinks there's an important problem to be solved, and he has a unique perspective on what solving that problem involves, why hasn't he produced a paper or blog post or talk to explain what that perspective is? Is he expecting to solve the problem all by himself? Can you share your model of what's going on?

I mean, he has, see Research Priorities for Robust and Beneficial Artificial Intelligence, and the articles you quote. What he hasn't done is a) read the counterarguments from LessWrongers and b) responded to those counterarguments in particular. When I say I don't have any recommendations, I mean I don't have any recommendations of writing that give responses to typical LessWrong counterarguments.

My model is very simple -- he's very busy and LessWrongers are at best a small fraction of the people he's trying to coordinate with, so writing up a response is not worth his time.

For a perhaps easier-to-relate-to example, this is approximately my model for why Eliezer doesn't respond to critiques of his arguments (1, 2).

Another example: the actual view I wanted to get across with the Value Learning sequence is Chapter 3. Chapters 1 and 2, and parts of Chapter 3, were primarily written in anticipation of counterarguments from LessWrongers, and made the Value Learning sequence require significantly more effort on my part.

Same question here. Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem? Are they trying to work out among themselves which "different problem" is the real one?

There is Mechanistic Transparency. But overall I agree that there aren't many such writeups. I think there's a combination of factors:

  • Expecting a failure to communicate. For example, after I wrote the Value Learning sequence, one of the grad students told me that they learned something from it, because it pinpointed the reason why the argument "the AGI must have a utility function" didn't work -- they already knew that the argument was sketchy, but they couldn't point at a particular flaw before. If they had tried to write about the reasons for their choice of research, depending on how it was written I'd expect the response from LW would be "but none of this matters; superintelligent AI will be an expected utility maximizer", and the discussion would stall.
  • Relatedly, not expecting useful feedback because of differing assumptions.
  • Many intuitions about what research is useful to do are not easy to express explicitly. It's very possible to think that a particular area is worth investigating, without being able to explain exactly why you think it is worth investigating.
  • Some are probably still trying to figure out what they do / don't believe about AI safety, and so are working on things that other people think are important.
  • Ryan's point below that writing blog posts on LW is not great for career capital.
  • I've also previously sent you an email about why people at CHAI don't use the Alignment Forum as much; many of those reasons will apply. (Not copying them here because I didn't ask them for permission to post publicly.)

I mean, he has, see Research Priorities for Robust and Beneficial Artificial Intelligence,

Thanks for this reference, but it's co-authored with Daniel Dewey and Max Tegmark and seems to serve as an overview of AI safety research agendas that existed in 2015 rather than Stuart Russell's personal research priorities. (It actually seems to cite MIRI and Bostrom more than anyone else.)

and the articles you quote.

The ones I looked at all seemed to be written at a very high level for a general (not even ML/AI researchers) audience (and as you noted seem to be overly simplified compared to his actual views). What is the best reference for explaining his personal view of AI risk/safety? I'm happy to read something that's written for a non-LW research audience.

(EDIT: Removed part about grad students, as it seems more understandable at this point for them to not have written up their views yet.)

seems to serve as an overview of AI safety research agendas that existed in 2015 rather than Stuart Russell's personal research priorities.

Fair point (I just skimmed it again, I last read it over a year ago). In that case I don't think there is such a reference, which I agree is confusing. He is working on a book about AI safety that is supposed to be published soon, but I don't know any details about it.

Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem?

I think this is actually pretty strategically reasonable.

CHAI students would have high returns to their probability of attaining a top professorship by writing papers, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with this, nor with recruiting help with their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).

Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.

When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.

I'm sorry it sounded like a dig at CHAI's work, and you're right that "typically described" is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it's nearly complete--I don't think I've seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator's action might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human's actions instead of from a reward signal gets us any closer to safety, but I'm sorry to have misrepresented views. (And maybe it's worth mentioning that I'm fiddling with something that bears strong resemblance to Inverse Reward Design, so I'm definitely not that bearish on the whole idea).

Also, I don't agree that "see if an AIXI-like agent would be aligned" is the correct "gauntlet" to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.

I'm going to do my best to describe my intuitions around this.

Proposition 1: an agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn't have to converge all the way, but the KL-divergence from the true world-model to its world-model should reach the order of magnitude of the KL-divergence from the true world-model to a typical human world-model.

Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class does converge to the truth, so from Proposition 1, any competent agent's world-model will converge as close to the Bayesian world-model as it does to the truth.

Proposition 3: If the version of an "idea" that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is "based on that idea" will either a) not be competent, or b) roughly approximate the Bayesian version, and by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events will lead to a large deprioritization of dangerous plans).

Letting F be a failure mode that arises when an idea is implemented in the framework of a Bayesian agent with a model class including the truth, I expect, in the absence of arguments otherwise, that the same failure mode will appear in any competent agent which also implements the idea in some way. However, it can be much harder to spot there, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version, i.e. the agent it is approximating: a Bayesian agent with a model class including the truth. And then on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.
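One way Propositions 1 and 2 might be written out, as a sketch rather than the author's own formalization. All of the notation is an assumption introduced here: \mu is the true environment, \rho the agent's world-model, \rho_H a typical human world-model, \mathcal{M} a countable model class with prior weights w_\nu, and \xi the resulting Bayes mixture. The sketch also assumes passive sequence prediction, setting aside the complications that arise when the agent's actions affect what it observes.

```latex
% Assumed notation, not from the original comment:
%   \mu      true environment          \rho    agent's world-model
%   \rho_H   typical human world-model \xi     Bayes mixture over \mathcal{M}
\begin{align*}
\text{Prop. 1 (competence):}\quad
  & D_{\mathrm{KL}}(\mu \,\|\, \rho) \;\lesssim\; D_{\mathrm{KL}}(\mu \,\|\, \rho_H) \\
\text{Bayes mixture:}\quad
  & \xi(x_{1:n}) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x_{1:n}),
    \qquad \mu \in \mathcal{M},\; w_\mu > 0 \\
\text{Dominance:}\quad
  & \xi(x_{1:n}) \;\ge\; w_\mu \, \mu(x_{1:n}) \\
\text{Prop. 2 (convergence):}\quad
  & \mathbb{E}_\mu\!\left[ \ln \frac{\mu(x_{1:n})}{\xi(x_{1:n})} \right]
    \;\le\; \ln \frac{1}{w_\mu} \quad \text{for all } n
\end{align*}
```

The last line follows directly from dominance. Because the cumulative divergence is bounded uniformly in n, the expected per-step predictive divergence between \mu and \xi has a finite sum and so must shrink toward zero, which is the sense of "converges to the truth" assumed in this sketch.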

Maybe you can point me toward the steps that seem the most opaque/fishy.

Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into "goals", "world-model", and "planning" is the wrong way to be decomposing agents. I hope to write a post about this soon.

No, that's helpful. If it were the right way, do you think this reasoning would apply?

Edit: alternatively, if a proposal does decompose an agent into world-model/goals/planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?

... Plausibly? Idk, it's very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don't share. I don't see anything obviously wrong with it.

There's the usual issue that Bayesian reasoning doesn't properly account for embeddedness, but I don't think that would make much of a difference here.

IRL to get the one true utility function

I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function". (I'm also getting this from your comment below:

One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.

).

I can't figure out on my own a sense in which this dichotomy exists. To be uncertain about a utility function is to believe there is one correct one, while engaging in the process of updating probabilities about its identity.

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.

I think I'm understanding you to be conceptualizing a dichotomy between "uncertainty over a utility function" vs. "looking for the one true utility function".

Well, I don't personally endorse this. I was speculating on what might be relevant to Stuart's understanding of the problem.

I was trying to point towards the dichotomy between "acting while having uncertainty over a utility function" vs. "acting with a known, certain utility function" (see e.g. The Off-Switch Game). I do know about the problem of fully updated deference and I don't know what Stuart thinks about it.
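To make that dichotomy concrete, here is a minimal sketch of the kind of comparison made in the off-switch game, under simplifying assumptions that are mine: the robot holds a discrete belief over the utility of one proposed action, switching off is worth 0, and the human is perfectly rational, so they allow the action exactly when its utility is positive. The function name and the example beliefs are hypothetical.

```python
def off_switch_values(belief):
    """Expected value of each robot option.

    belief: list of (probability, utility) pairs describing the robot's
    uncertainty about the utility of its proposed action (hypothetical input).
    """
    # Act immediately, bypassing the human: get the utility whatever it is.
    act = sum(p * u for p, u in belief)
    # Switch itself off: assumed to be worth exactly 0.
    switch_off = 0.0
    # Defer: an (assumed) rational human permits the action only when u > 0.
    defer = sum(p * max(u, 0.0) for p, u in belief)
    return {"act": act, "switch_off": switch_off, "defer": defer}


# Robot is uncertain whether the action is good or bad.
UNCERTAIN = off_switch_values([(0.5, 2.0), (0.5, -1.0)])
# Robot is certain the action is good.
CERTAIN = off_switch_values([(1.0, 2.0)])

print(UNCERTAIN)  # deferring beats acting: 1.0 vs 0.5
print(CERTAIN)    # deferring and acting tie: 2.0 vs 2.0
```

Under these assumptions, deferring is strictly better only while the robot is uncertain about the sign of the utility. Once the belief collapses to a known utility function, the incentive to leave the off switch available disappears.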

Also, for what it's worth, in the case where there is an unidentifiability problem, as there is here, even in the limit, a Bayesian agent won't converge to certainty about a utility function.

Agreed, but I'm not sure why that's relevant. Why do you need certainty about the utility function, if you have certainty about the policy?
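A toy illustration of both halves of this exchange, offered as a sketch under assumptions that are not in the thread: a two-action setting, a Boltzmann-rational demonstrator, three hypothetical candidate reward functions, and made-up action counts. Two of the candidates differ only by a constant shift, so they induce the same policy and no amount of data can separate them.

```python
import math


def boltzmann_policy(rewards):
    """P(action) proportional to exp(reward); shifted by the max for stability."""
    m = max(rewards)
    weights = [math.exp(r - m) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def posterior(hypotheses, action_counts):
    """Posterior over reward hypotheses under a uniform prior."""
    log_post = []
    for rewards in hypotheses:
        policy = boltzmann_policy(rewards)
        log_post.append(sum(n * math.log(p) for n, p in zip(action_counts, policy)))
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical candidates: the second is the first plus a constant.
HYPOTHESES = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
# Hypothetical demonstration data: how often each action was observed.
ACTION_COUNTS = [731, 269]

POST = posterior(HYPOTHESES, ACTION_COUNTS)
# Posterior-predictive policy: mixture of each hypothesis's policy.
PREDICTED = [
    sum(w * boltzmann_policy(h)[a] for w, h in zip(POST, HYPOTHESES))
    for a in range(2)
]

print("posterior over rewards:", POST)
print("predicted policy:", PREDICTED)
```

In this setup the posterior stays split evenly between the two shifted reward functions while the third is ruled out, yet the predicted policy is pinned down: uncertainty about the utility function persists even though there is effectively none about behavior.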

Okay, maybe we don't disagree on anything. I was trying to make a different point with the unidentifiability problem, but it was tangential to begin with, so never mind.

A good starting point. I'm reminded of an old Kaj Sotala post (which then later provided inspiration for me writing a sort of similar post) about trying to ensure that the AI has human-like concepts. If the AI's concepts are inhuman, then it will generalize in an inhuman way, so that something like teaching a policy through demonstrations might not work.

But of course having human-like concepts is tricky and beyond the scope of vanilla IRL.