Planned summary for the Alignment Newsletter:
This post lays out a pathway by which an AI-induced existential catastrophe could occur. The author suggests that AGI will be built via model-based reinforcement learning: that is, given a reward function, we will learn a world model, a value function, and a planner / actor. These will learn online, that is, even after being deployed these learned models will continue to be updated by our learning algorithm (gradient descent, or whatever replaces it). Most research effort will be focused on learning these models, with relatively less effort applied to choosing the right reward function.
There are then two alignment problems: the _outer_ alignment problem is whether the reward function correctly reflects the designer's intent, and the _inner_ alignment problem is whether the value function accurately represents the expected reward obtained by the agent over the long term. On the inner alignment side, the value function may not accurately capture the reward for several reasons, including ambiguity in the reward signals (since you only train the value function in some situations, and many reward functions can then produce the same value function), manipulation of the reward signal, failures of credit assignment, ontological crises, and having mutually contradictory "parts" of the value function (similarly to humans). On the outer alignment side, we have the standard problem that the reward function may not reflect what we actually want (i.e. specification gaming or Goodhart's Law). In addition, it seems likely that many capability enhancements will be implemented through the reward function, e.g. giving the agent a curiosity reward, which increases outer misalignment.
Planned opinion:
While I disagree on some of the details, I think this is a good threat model to be thinking about. Its main virtue is that it has a relatively concrete model for what AGI looks like, and it provides a plausible story for both how that type of AGI could be developed (the development model) and how that type of AGI would lead to problems (the risk model). Of course, it is still worth clarifying the plausibility of the scenario, as updates to the story can have significant implications on what research we do. (Some of this discussion is happening in [this post](https://www.alignmentforum.org/posts/pz7Mxyr7Ac43tWMaC/against-evolution-as-an-analogy-for-how-humans-will-create).)
Great post!
Do you like football? Well “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that pattern.
This was a ‘click’ for me, thanks.
It seems to me that deliberation can expand the domain of the value function. If I don’t know of football per se, but I’ve played a sport before, then I can certainly imagine a new game and form opinions about it. so I’m not sure how large the minimal set of generator concepts is, or if that’s even well-defined.
(Much of this has been touched on already in our Discord conversation:)
Inner alignment problem: The value function might be different from the reward function.
In fact that’s an understatement: The value function will be different from the reward function. Why? Among other things, because they have different type signatures—they accept different input!
Surely this isn't relevant! We don't by any means want the value function to equal the reward function. What we want (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the actual world).
- The value function is a function of the latent variables in the world-model—thus, even abstract concepts like “differentiate both sides of the equation” are assigned values. The value function is updated by the reward signals, using (I assume) some generalization of TD learning (definition).
While something like this seems possible, it strikes me as a better fit for systems that do explicit probabilistic reasoning, as opposed to NNs. Like, if we're talking about predicting what ML people will do, the sentence "the value function is a function of the latent variables in the world model" makes a lot more sense than the clarification "even abstract concepts are assigned values". Because it makes more sense for the value to be just another output of the same world-model NN, or perhaps, to be a function of a "state vector" produced by the world-model NN, or maybe a function taking the whole activation vector of the world-model NN at a time-step as an input, as opposed to a value function which is explicitly creating output values for each node in the value function NN (which is what it sounds like when you say even abstract concepts are assigned values).
I assume that the learned components (world-model, value function, planner / actor) continue to be updated in deployment—a.k.a. online learning (definition). This is important for the risk model below, but seems very likely—indeed, unavoidable—to me:
This seems pretty implausible to me, as we've discussed. Like, yes, it might be a good research direction, and it isn't terribly non-prosaic. However, the current direction seems pretty focused on offline learning (even RL, which was originally intended specifically for online learning, has become a primarily offline method!!), and GPT-3 has convinced everyone that the best way to get online learning is to do massive offline training and rely on the fact that if you train on enough variety, learning-to-learn is inevitable.
- Online updating of the world-model is necessary for the AGI to have a conversation, learn some new idea from that conversation, and then refer back to that idea perpetually into the future.
I think my GPT-3 example adequately addresses the first two points, and memory networks adequately address the third.
- Online updating of the value function is then also necessary for the AGI to usefully employ those new concepts. For example, if the deployed AGI has a conversation in which it learns the idea of “Try differentiating both sides of the equation”, it needs to be able to assign and update a value for that new idea (in different contexts), in order to gradually learn how and when to properly apply it.
- Online updating of the value function is also necessary for the AGI to break down problems into subproblems. Like if “inventing a better microscope” is flagged by the value function as being high-value, and then the planner notices that “If only I had a smaller laser, then I’d be able to invent a better microscope”, then we need a mechanism for the value function to flag “inventing a smaller laser” as itself high-value.
These points are more interesting, but I think it's plausible that architectural innovations could deal with them w/o true online learning.
RE online learning, I acknowledge that a lot of reasonable people agree with you on that, and it's hard to know for sure. But I argued my position in Against evolution as an analogy for how humans will build AGI.
I have not properly read all of that yet, but my very quick take is that your argument for a need for online learning strikes me as similar to your argument against the classic inner alignment problem applying to the architectures you are interested in. You find what I call mesa-learning implausible for the same reasons you find mesa-optimization implausible.
Personally, I've come around to the position (seemingly held pretty strongly by other folks, eg Rohin) that mesa-learning is practically inevitable for most tasks.
if we're talking about predicting what ML people will do, the sentence "the value function is a function of the latent variables in the world model" makes a lot more sense than the clarification "even abstract concepts are assigned values".
OK sure, that's fair. Point well taken. I was thinking about more brain-like neural nets that parse things into compositional pieces. If I wanted to be more prosaic maybe I would say something like: "She is differentiating both sides of the equation" could have a different value than "She is writing down a bunch of funny symbols", even if both are coming from the exact same camera inputs.
Thanks!!
> The value function might be different from the reward function.
Surely this isn't relevant! We don't by any means want the value function to equal the reward function. What we want (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the actual world).
Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)
Like, here's one extreme: imagine that the "planner" does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the "value function" is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we do want the value function to converge to be (for all intents and purposes) a clone of the reward function.
On the opposite extreme, when you're not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be less like the reward function and more like "expected upcoming reward assuming the current policy", which I think is what you're saying.
Incidentally, I think the brain does both. Like, maybe I'm putting on my shoes because I know that this is the first step of a plan where I'll go to the candy store and buy candy and eat it. I'm motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I'll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy. But maybe eventually, after many such trips to the candy store, it becomes an ingrained habit. And then I'm motivated to put on my shoes because my brain has cached the idea that good things are going to happen as a result—i.e., I'm motivated even if I don't explicitly visualize myself eating candy soon.
I guess I spend more time thinking about the former (the value function is evaluating the eventual consequences of a plan) than the latter (the value function is tracking the value of immediate world-states and actions), because the former is the component that presents most of the x-risk. So that's what was in my head when I wrote that.
(It's not either/or; I think there's a continuum between those two poles. Like I can consequentialist-plan to get into a future state that has a high cached value but no immediate reward.)
As for prosaic RL systems, they're set up in different ways I guess, and I'm not an expert on the literature. In Human Compatible, if I recall, Stuart Russell said that he thinks the ability to do flexible hierarchical consequentialist planning is something that prosaic AI doesn't have yet, but that future AGI will need. If that's right, then maybe this is an area where I should expect AGI to be different from prosaic AI, and where I shouldn't get overly worried about being insufficiently prosaic. I dunno :-P
Well anyway, your point is well taken. Maybe I'll change it to "the value function might be misaligned with the reward function", or "incompatible", or something...
Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)
Sure, but given most reasonable choices, there will be an analogous variant of my claim, right? IE, for most reasonable model-based RL setups, the type of the reward function will be different from the type of the value function, but there will be a "solution concept" saying what it means for the value function to be correct with respect to a set reward function and world-model. This will be your notion of alignment, not "are the two equal".
Like, here's one extreme: imagine that the "planner" does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the "value function" is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we do want the value function to converge to be (for all intents and purposes) a clone of the reward function.
Well, there's still a type distinction. The reward function gives a value at each time step in the long rollout, while the value function just gives an overall value. So maybe you mean that the ideal value function would be precisely the sum of rewards.
But if so, this isn't really what RL people typically call a value function. The point of a value function is to capture the potential future rewards associated with a state. For example, if your reward function is to be high up, then the value of being near the top of a slide is very low (because you'll soon be at the bottom), even if it's still generating high reward (because you're currently high up).
So the value of a history (even a long rollout of the future) should incorporate anticipated rewards after the end of the history, not just the value observed within the history itself.
In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
On the opposite extreme, when you're not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be less like the reward function and more like "expected upcoming reward assuming the current policy", which I think is what you're saying.
It doesn't seem to me like there is any "more/less like reward" spectrum here. The value function is just different from the reward function. In an architecture where you have a "value function" which operates like a reward function, I would just call it the "estimated reward function" or something along those lines, because RL people invented the value/reward distinction to point at something important (namely the difference between immediate reward and cumulative expected reward), and I don't want to use the terms in a way which gets rid of that distinction.
Like, maybe I'm putting on my shoes because I know that this is the first step of a plan where I'll go to the candy store and buy candy and eat it. I'm motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I'll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy.
How is this "approximating the reward function"?? Again, if you feed both the value and reward function the same thing (the imagined history of going to the store and coming back and eating candy), you hope that they produce very different results (the reward function produces a sequence of individual rewards for each moment, including a high reward when you're eating the candy; the value function produces one big number accounting for the positives and negatives of the plan, including estimated future value of the post-candy-eating crash, even though that's not represented inside the history).
Well anyway, your point is well taken. Maybe I'll change it to "the value function might be misaligned with the reward function", or "incompatible", or something...
I continue to feel like you're not seeing that there is a precise formal notion of "the value function is aligned with the reward function", namely, that the value function is the solution to the value iteration equation (the bellman equation) wrt a given reward function and world model.
So maybe you mean that the ideal value function would be precisely the sum of rewards.
Yes, thanks, that's what I should have said.
In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).
For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a "reward function model", and the thing-that-predicts-summed-rewards the "value function", and then to change "the value function may be different from the reward function" to "the value function may be different from the expected sum of rewards". Something like that?
If so, I agree, you're right, I was wrong, I shouldn't be carelessly going back and forth between those things, and I'll change it.
For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a "reward function model", and the thing-that-predicts-summed-rewards the "value function", and then to change "the value function may be different from the reward function" to "the value function may be different from the expected sum of rewards". Something like that?
Ah, that wasn't quite my intention, but I take it as an acceptable interpretation.
My true intention was that the "reward function calculator" should indeed be directly accessible rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation (and more formally in Daniel Dewey's paper). Learning the reward function is asking for trouble.
Of course, hard-coding the reward function is also asking for trouble, so... *shrug*
Hi again, I finally got around to reading those links, thanks!
I think what you're saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called "the easy problem of wireheading".
So then the context was:
First you said, If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.
Then I replied, We don't have access to the reward function, because we can't perfectly predict what will happen in a complicated world.
Then you said, That's bad because that means we're not in the observation-utility paradigm.
But I don't think that's right, or at least not in the way I was thinking of it. We're using the current value function to decide which rollouts are good vs bad, and therefore to decide which action to take. So my "value function" is kinda playing the role of a utility function (albeit messier), and my "reward function" is kinda playing the role of "an external entity that swoops in from time to time and edits the utility function". Like, if the agent is doing terrible things, then some credit-assignment subroutine goes into the value function, looks at what is currently motivating the agent, and sets that thing to not be motivating in the future.
The closest utility function analogy would be: you're trying to make an agent with a complicated opaque utility function (because it's a complicated world). You can't write the utility function down. So instead you code up an automated utility-function-editing subroutine. The way the subroutine works is that sometimes the agent does something which we recognize as bad / good, and then the subroutine edits the utility function to assign lower / higher utility to "things like that" in the future. After many such edits, maybe we'll get the right utility function, except not really because of all the problems discussed in this post, e.g. the incentive to subvert the utility-function-editing subroutine.
So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.
Sorry if I'm misunderstanding what you were saying.
So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.
All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you're aware, I think we need stuff from my 'learning normativity' agenda to dodge these bullets.
In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit.
FWIW, I'm now thinking of your "value function" as expected utility in Jeffrey-Bolker terms. We need not assume a utility function to speak of expected utility. This perspective is nice in that it's a generalization of what RL people mean by "value function" anyway: the value function is exactly the expected utility of the event "I wind up in this specific situation" (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events.
So let's see if we can pop up the conversational stack.
I guess the larger topic at hand was: how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)?
Well, I think it boils down to whether the current value function makes "reliably good predictions" about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense).
If we think of the true value function as V(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is closer to V*(x) than that modification. (OK that's a bit lame, but hopefully you get the general direction I'm trying to point in.)
Something like that?
it's all a big mess
Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whether we like it or not."
FWIW, I'm now thinking of your "value function" as expected utility in Jeffrey-Bolker terms.
Thanks, that's helpful.
how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)?
One way I think I would frame the problem differently than you here is: I'm happy to talk about outer and inner alignment for pedagogical purposes, but I think it's overly constraining as a framework for solving the problem. For example, (Paul-style) corrigibility is I think an attempt to cut through outer and inner alignment simultaneously, as is interpretability perhaps. And like you say, rewards don't need to be the only type of feedback.
We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don't use it in time-sensitive situations! But that's fine for working with humans to build better-aligned AGIs.) So then the goal would be something like "every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)".
So then this could look like a pretty weirdly misaligned AGI but it has a really effective "may-lead-to-catastrophe (directly or indirectly) predictor circuit" attached. (The circuit asks "Does it pattern-match to murder? Does it pattern-match to deception? Does it pattern-match to 'things that might upset lots of people'? Does it pattern-match to 'things that respectable people don't normally do'?...") And the circuit magically never has any false-negatives. Anyway, in that case the framework of "how well are we approximating the intended value function?" isn't quite the right framing, I think.
I think we need stuff from my 'learning normativity' agenda to dodge these bullets.
Yeah I'm very sympathetic to the spirit of that. I'm a bit stumped on how those ideas could be implemented, but it's certainly in the space of things that I continue to brainstorm about...
Some nitpicks about your risk model slash ways in which my risk model differs from yours:
1. I think AIs are more likely to be more homogenous on Earth; even in a slow takeoff they might be all rather similar to each other. Partly for the reasons Evan discusses in his post, and partly because of acausal shenanigans. I certainly think that, unfortunately, given all the problems you describe, we should count ourselves lucky if any of the contending AI factions are aligned to our values. I think this is an important research area.
2. I am perhaps more optimistic than you that if at least one of the contending AI factions is aligned to our values, things will work out pretty well for us. I'm hopeful that the AI factions will negotiate and compromise rather than fight. (Even though, as you point out, the unaligned ones may have various advantages) I feel about 80% confident, how about you? I'd love to hear more about this, I think it's an important research area.
3. You speak of AGI's escaping into the wild. I think that's a possibility, but I'm somewhat more concerned about AGI's taking over the institutions that built them. Institutions that build AIs will presumably be trying to use them for some intellectual purpose, and presumably be somewhat optimistic (too optimistic, IMO) that the AI they built is aligned to them. So rather than escaping and setting up shop on some hacked server somewhere, I expect the most likely scenario to be something like "The AI is engaging and witty and sympathetic and charismatic, and behaves very nicely, and gradually the institution that built it comes to rely more and more on its suggestions and trust it more and more, until eventually it is the power behind the throne basically, steering the entire institution even as those inside think that they are still in charge."
Thanks!
For homogeneity, I guess I was mainly thinking that in the era of not-knowing-how-to-align-an-AGI, people would tend to try lots of different new things, because nothing so far has worked. I agree that once there's an aligned AGI, it's likely to get copied, and if new better AGIs are trained, people may be inclined to try to keep the procedure as close as possible to what's worked before.
I hadn't thought about whether different AGIs with different goals are likely to compromise vs fight. There's Wei Dai's argument that compromise is very easy with AGIs because they can "merge their utility functions". But at least this kind of AGI doesn't have a utility function ... maybe there's a way to do something like that with multiple parallel value functions, but I'm not sure that would actually work. There are also old posts about AGIs checking each other's source code for sincerity, but can they actually understand what they're looking at? Transparency is hard. And how do they verify that there isn't a backup stashed somewhere else, ready to jump out at a later date and betray the agreement? Also, humans have social instincts that AGIs don't, which pushes in both directions I think. And humans are easier to kill / easier to credibly threaten. I dunno. I'm not inclined to have confidence in any direction.
I agree that if a sufficiently smart misaligned AGI is running on a nice supercomputer somewhere, it would have every reason to try to stay right there and pursue its goals within that institution, and it would have every reason to try to escape and self-replicate elsewhere in the world. I guess we can be concerned about both. :-/
Great post! I think many of the things you say apply equally well to broader categories of scenario too, e.g. your AGI risk model stuff works (with some modification) for different AGI development models than the one you gave. I'd love to see people spell that out, lest skeptics read this post and reply "but that's not how AGI will be made, therefore this isn't a serious problem."
Assuming slow takeoff (again, fast takeoff is even worse), it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over a decade or two, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
This seems too optimistic to me. Even on slow takeoff, things won't take more than a decade. (Paul is Mr. Slow Takeoff and even he seems to think it would be more like a decade) Even if a slow takeoff takes more than a decade, the accidents wouldn't be spread out that much. Early AI systems will be too stupid to do anything that counts as an accident in the relevant sense (people will just think of it as like the tesla self-driving car crashes, or the various incidents of racial bias in image recognition AI) and later AI systems will be smart enough to be strategic, waiting to strike until the right moment when they can actually succeed instead of just causing an "accident." (They might do other, more subtle things prior to that time, but they would be subtle, and thus not "accidents" in the relevant sense. They wouldn't be fire alarms, for example.) Or maybe I am misunderstanding what you mean by accidents?
I haven't thought very much about takeoff speeds (if that wasn't obvious!). But I don't think it's true that nobody thinks it will take more than a decade... Like, I don't think Paul Christiano is the #1 slowest of all slow-takeoff advocates. Isn't Robin Hanson slower? I forget.
Then a different question is "Regardless of what other people think about takeoff speeds, what's the right answer, or at least what's plausible?" I don't know. A key part is: I'm hazy on when you "start the clock". People were playing with neural networks in the 1990s but we only got GPT-3 in 2020. What were people doing all that time?? Well mostly, people were ignoring neural networks entirely, but they were also figuring out how to put them on GPUs, and making frameworks like TensorFlow and PyTorch and making them progressively easier to use and scale and parallelize, and finding all the tricks like BatchNorm and Xavier initialization and Transformers, and making better teaching materials and MOOCs to spread awareness of how these things work, developing new and better chips tailored to these algorithms (and vice-versa), waiting on Moore's law, and on and on. I find it conceivable that we could get "glimmers of AGI" (in some relevant sense) in algorithms that have not yet jumped through all those hoops, so we're stuck with kinda toy examples for quite a while as we develop the infrastructure to scale these algorithms, the bag of tricks to make them run better, the MOOCs, the ASICs, and so on. But I dunno.
Or maybe I am misunderstanding what you mean by accidents?
Yeah, sorry, when I said "accidents" I meant "the humans did something by accident", not "the AI did something by accident".
Thanks! Yeah, there are plenty of people who think takeoff will take more than a decade--but I guess I'll just say, I'm pretty sure they are all wrong. :) But we should take care to define what the start point of takeoff is. Traditionally it was something like "When the AI itself is doing most of the AI research," but I'm very willing to consider alternate definitions. I certainly agree it might take more than 10 years if we define things in such a way that takeoff has already begun.
Yeah, sorry, when I said "accidents" I meant "the humans did something by accident", not "the AI did something by accident".
Wait, uhoh, I didn't mean "the AI did something by accident" either... can you elaborate? By "accident" I thought you meant something like "Small-scale disasters, betrayals, etc. caused by AI that are shocking enough to count as warning shots / fire alarms to at least some extent."
Oh sorry, I misread what you wrote. Sure, maybe, I dunno. I just edited the article to say "some number of years".
I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".
I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".
Ah, OK. We are on the same page then.
, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
I'm not seeing quite what the bad but not existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this.
AI created in lab. Its a fairly skilled programmer and hacker. Able to slowly self improve. Escapes from the lab, ideally without letting its creators know. Then there are several years where the AI hangs out on the internet, slowly self improving and gaining power. It tries to shut down other AI's if it can. It might be buying compute, or stealing it, or persuading people to run it. It is making sure its existence and malevolence isn't known to humans. Until finally it has the resources to wipe out humanity before we can respond.
It is much easier to contain something on one computer in a lab, than to catch it once its all over the internet.
Lying and cheating and power seeking behaviour are only a good idea if you can get away with them. If you can't break out the lab, you probably can't get away with much uncouragable behaviour.
There is a scenario where the AI escapes in a way that makes its escape "obvious". Or at least obvious to an AI researcher. Expect any response to be delayed, half-hearted, mired by accusations that the whole thing is a publicity stunt, and dragged down by people who don't want to smash their hard drives full of important important work just because there might be a rouge AI on them. The AI has an incentive to confuse and sabotage any step it can. And many human organizations seem good at confusing and sabotaging themselves in the face of a virus. The governments would have to coordinate the shutdown of prettymuch all the worlds computers, without computers to coordinate it. Even just a few hours delay for the researchers to figure out what the AI did, and get the message passed up through government machinery may be enough time for the AI to have got to all sorts of obscure corners of the web.
Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer of resources and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet C&C infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI but that's not what we're talking about.
But from my perspective, this isn't a decision-relevant argument. Either we're doomed in my scenario or we're even more doomed in yours. We still need to do the same research in advance.
Lying and cheating and power seeking behaviour are only a good idea if you can get away with them. If you can't break out the lab, you probably can't get away with much uncouragable behaviour.
Well, we can be concerned about non-corrigible systems that act deceptively (cf. "treacherous turn"). And systems that have close-but-not-quite-right goals such that they're trying to do the right thing in test environments, but their goals veer away from humans' in other environments, I guess.
Rohin Shah advocates a vigorous discussion of “Threat Models”, i.e. stories for how AGI is developed, what the AGI then looks like, and then what might go catastrophically wrong.
Ben Garfinkel likewise wants to see a “a picture of the risk...grounded in reality”. Richard Ngo recently had a go at answering this call with AGI Safety From First Principles, which is excellent and full of valuable insights, but less specific than what I have in mind. So here’s my story, going all the way from how we’ll make AGI to why it may cause catastrophic accidents, and what to do about it.
My intended audience for this post is “people generally familiar with ML and RL, and also familiar with AGI-risk-related arguments”. (If you’re in the first category but not the second, read Stuart Russell’s book first.) I'll try to hyperlink jargon anyway.
My AGI development model
I assume that we’ll wind up building an AGI that looks more-or-less like this:
Why do I think this model is likely? To make a long story short:
More details about the model
What about the reward function calculator?
The above discussion was all about the learning algorithm (the three boxes on the top of the diagram above). The other part of my presumed AGI architecture is the reward function calculator box at the bottom of the diagram. Here’s a currently-open research question:
(see Stuart Russell’s book). There are some small number of researchers working on this problem, including many people reading this post, and others, and me. Go us! Let’s figure it out! But in the grand scheme of things this remains a niche research topic, we have no clue whether this research program will succeed, and if it does eventually succeed, we have no clue how much time and effort is needed before we get there. Meanwhile, orders of magnitude more people are working on the other parts of the diagram, i.e. the three interconnected learning algorithms at the top.
So my assumption is that by default, by the time the “learning” parts of the AGI diagram above are getting really good, and scaling all the way to AGI, most people will still be putting very simple things into the “reward function calculator” box, things like “get correct answers on these math problems”.
(I make an exception for capability-advancing aspects of reward function engineering, like reward shaping, curiosity-like drives for novelty, etc. People already have a lot of techniques like that, and I assume those will continue to develop. I really meant to say: I’m assuming no breakthrough solution to the AGI-safety-relevant aspects of reward function engineering—more on which in the next section.)
To be clear, these simple reward functions will certainly lead to misaligned AIs (i.e., AIs trying to accomplish goals that no human would want them to accomplish, at least not if the AI is sufficiently competent). Such AIs will not be suitable for applications like robots and consumer products. But they will be very suitable to the all-important tasks of getting high-impact publications, getting funding, and continuing to improve the learning algorithms.
That said, sooner or later, more and more researchers will finally turn their attention to the question of what reward function to use, in order to reliably get an aligned / human-controllable system.
And then—unless we AI alignment researchers have a solution ready to hand them on a silver platter—I figure lots of researchers will mainly just proceed by trial-and-error, making things up as they go along. Maybe they’ll use reasonable-seeming reward functions like “get the human to approve of your output text”, or “listen to the following voice command, and whatever concepts it activates in your world-model, treat those as high-value”, etc. And probably also some people will play around with reward functions that are not even superficially safe, like “positive reward when my bank account goes up, negative reward when it goes down”. I expect a proliferation of dangerous experimentation. Why dangerous? That brings us to...
My AGI risk model
The AGI is by assumption making foresighted, strategic plans to accomplish goals. Those goals would be things flagged as high-value by its value function. Therefore, there are two “alignment problems”, outer and inner:
Inner alignment problem: The value function might be different from the (sum of the) reward function.
In fact that’s an understatement: The value function will be different from the (sum of the) reward function. Why? Among other things, because they have different type signatures—they accept different input!
The input to the reward function calculator is, well, whatever we program it to be. Maybe it would be a trivial calculation, that simply answers the question: “Is the remote control Reward Button currently being pressed?” Maybe it would look at the learning algorithm’s actions, and give rewards when it prints the correct answers to the math problems. Maybe it would take the camera and microphone data and run it through a trained classifier. It could be anything.
(In the brain, the reward function calculator includes a pain-detector that emits negative reward, and a yummy-food-detector that emits positive reward, and probably hundreds of other things, some of which may be quite complicated, and which may involve interpretability-like learned feature vectors, and so on.)
The input to the value function is specifically “the latent variables in the learned world-model”, as mentioned above.
Do you like football? Well “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that pattern.
“Nameless pattern in sensory input that you’ve never conceived of” is a case where something is in-domain for the reward function but (currently) out-of-domain for the value function. Conversely, there are things that are in-domain for your value function—so you can like or dislike them—but wildly out-of-domain for your reward function! You can like or dislike “the idea that the universe is infinite”! You can like or dislike “the idea of doing surgery on your brainstem in order to modify your own internal reward function calculator”! A big part of the power of intelligence is this open-ended ever-expanding world-model that can re-conceptualize the world and then leverage those new concepts to make plans and achieve its goals. But we cannot expect those kinds of concepts to be evaluable by the reward function calculator.
(Well, I guess “the idea that the universe is infinite” and so on could be part of the reward function calculator. But now the reward function calculator is presumably a whole AGI of its own, which is scrutinizing the first AGI using interpretability tools. Maybe there’s a whole tower of AGIs-scrutinizing-AGIs! That’s all very interesting to think about, but until we flesh out the details, especially the interpretability part, we shouldn’t assume that there’s a good solution along these lines.)
So leaving that aside, the value function and reward function are necessarily misaligned functions. How misaligned? Will the value function converge to a better and better approximation of the summed reward function (at least where the domains overlap)? I’m inclined to answer: “Maybe sometimes (especially with simple reward functions), but not reliably, and maybe not at all with the techniques we’ll have on hand by the time we’re actually doing this.” Some potential problems (which partially overlap) are:
Some of these problems are especially problematic problems because you don’t know when they will strike. For example, ontological crises: Maybe you’re seven years into deployment, and the AGI has been scrupulously helpful the whole time, and we've been trusting the AGI with more and more autonomy, and then the AGI then happens to be reading some new philosophy book, and it converts to panpsychism (nobody’s perfect!), and as it maps its existing values onto its reconceptualized world, it finds itself no longer valuing the lives of humans over the lives of ants, or whatever.
Outer alignment problem: The reward function might be different than the thing we want.
Here there are problems as well, such as:
By the way, in terms of solving the alignment problem, I’m not sure that splitting things up into “outer alignment” and “inner alignment” is actually that helpful! After all, the reward function will diverge from the thing we want, and the value function will diverge from the reward function. The most promising solution directions that I can think of seem to rely on things like interpretability, “finding human values inside the world-model”, corrigible motivation, etc.—things which cut across both layers, bridging all the way from the human’s intentions to the value function.
So then what happens? What’s the risk model?
I’ll go with a slow takeoff (definition) risk scenario. (If we’re doomed under slow takeoff then we’re even more doomed under fast takeoff.) A particularly bad case—which I see as plausible in all respects—would be something like this:
Assuming slow takeoff (again, fast takeoff is even worse), it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.
Well, maybe humans and/or aligned AGIs would be able to destroy the unaligned AGIs, but that would be touch-and-go under the best of circumstances (see Assumption 6)—and the longer into this period that it takes us to solve the alignment problem (if indeed we do at all), the worse our prospects get. I’d rather have a plan ready to go in advance! That brings us to...
If so, what now?
So that’s my AGI threat model. (To be clear, avoiding this problem is only one aspect of getting to Safe & Beneficial AGI—necessary but by no means sufficient.)
If you buy all that, then some of the implications include:
…And there you have it—that’s what I'm doing every day; that's my current research agenda in a nutshell!
Well, I’m doing that plus the meta-task of refining and discussing and questioning my assumptions. Hence this post! So leave a comment or get in touch. What do you think?