If I wanted to play fast and loose, I would claim that our sense of ourselves as having a first-person perspective at all is part of an evolutionary solution to the problem of learning from other people's experiences (wait, wasn't there a post like that recently? Or was that about empathy...). It merely seems like a black box to us because we're too good at it, precisely because it's so important.
Somehow we develop a high-level model of the world with ourselves and other people in it, and then this level of abstraction actually gets hooked up to our motivations - making this a subset of social instincts.
When imagining hooking up abstract learned world models to motivation for AI like this, I sometimes imagine something much less "fire and forget" than the human brain, something more like people testing, responding to, and modifying an AI that's training or pre-training on real-world data. Evolution doesn't get to pause me at age 4 and rummage around in my skull.
Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.
With my current best formalization, the "objects" in the world are not necessarily localized in 3D space. Indeed, one of the main things which makes an abstraction "natural" is that the relevant information is redundantly represented in many places in the physical world.
"Daytime" is a good example: I can measure light intensity at lots of different places in my general area, at lots of different times, and find that they all strongly correlate. The information about light intensity is redundant across all those locations: if I measure high light intensity outside my house, then I'm pretty confident that a measurement taken outside the office at the same time will also have high intensity. The latent variable representing that redundant information (as a function of time) is what we call "daytime".
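As a toy illustration of that redundancy (my own sketch, purely to make the point concrete; the numbers and noise model are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared "daytime" signal drives light intensity at many locations,
# each with its own local noise (clouds, shade, sensor error, etc.).
hours = np.arange(48)                                        # two days, hourly
daytime = np.clip(np.sin(2 * np.pi * hours / 24), 0, None)   # shared latent

n_locations = 20
readings = daytime + 0.1 * rng.standard_normal((n_locations, hours.size))

# The information is redundant: any two locations correlate strongly,
# so a measurement at one place predicts measurements at the others.
corr = np.corrcoef(readings)
off_diag = corr[~np.eye(n_locations, dtype=bool)]
print(off_diag.min() > 0.7)  # every pair of locations strongly correlated

# Pooling the redundant copies recovers the latent up to small noise:
estimate = readings.mean(axis=0)
print(np.abs(estimate - daytime).max() < 0.2)
```

The latent variable "daytime" is exactly the thing that all twenty noisy measurement series agree about.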
Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?
This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:
We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
... it's just not necessarily about objects localized in 3D space.
Also, there are several possible paths, and they don't all require unambiguous definitions of all the "things" in a human's ontology. For instance, if corrigibility turns out to be a natural "thing", that could short-circuit the need for a bunch of other rigorous concepts.
Thanks! One of my current sources of mild skepticism right now (which again you might talk me out of) is:
I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.
As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.
should be conceptually straightforward to model how humans would reason about those concepts or value them
Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers, and most of all I would be implicitly probing my own innate "caring" reaction(s) and seeing exactly what kinds of thoughts do or don't trigger it.
Can we make an AGI that does all that? I say yes: we can build an AGI with human-like “innate drives” such that it has human-like moral intuitions, and then it applies those human-like intuitions in a human-like way when faced with new out-of-distribution situations. That’s what I call the “Social-Instinct AGI” research path, see Post #12.
But if we can do that, we’ve already arguably solved the whole AGI safety problem. I suspect you have something different in mind?
We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.
(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.
So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, then the outer AI is redundant, and if the (so-called) subroutine did want to do something bad or stupid, then the outer AI may not be able to recognize and stop it.
Separately, shouldn't “doing something catastrophically stupid” become progressively less of an issue as the AGI gets “smarter”? And insofar as caution / risk-aversion / etc. is a personality type, presumably we could put a healthy dose of it into our AGIs.
An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.
In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.
I think I disagree with this claim. Maybe not exactly as worded - like, sure, maybe the "set of mental activities" involved in the reasoning overlap heavily. But I do expect (weakly, not confidently) that there's a natural notion of human-value-generator which factors from the rest of human reasoning, and has a non-human-specific API (e.g. it interfaces with natural abstractions).
So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.”
It sounds to me like you're imagining something which emulates human reasoning to a much greater extent than I'm imagining.
consider the fusion power generator scenario
It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. After years of effort, I succeed….” Is that correct?
If so, I think that’s a problem that can be mitigated in mundane ways (e.g. mandatory inventor training courses spreading best-practices for brainstorming unanticipated consequences, including red-teams, structured interviews, etc.), but can't be completely solved by humans. But it also can’t be completely solved by any possible AI, because AIs aren’t and will never be omniscient, and hence may make mistakes or overlook things, just as humans can.
Maybe you're thinking that we can make AIs that are less prone to human foibles like wishful thinking and intellectual laziness etc.? But I’m optimistic that we can make “social instinct” brain-like AGIs that are also unusually good at avoiding those things (after all, some humans are significantly better than others at avoiding those things, while still having normal-ish social instincts and moral intuitions).
I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?
Basically, yeah.
The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what humans know how to reason about, then we need similarly-superhuman reasoning about whether those machines will actually do what a human intends. "With great power comes great responsibility" - cheesy, but it fits.
Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.
This proof strategy is probably particularly characteristic of my approach. I've perhaps overstated the similarity of my approach to John Wentworth's 😅 - I think that much of his research is useful to my approach, but there are also many points where we disagree. But I suppose everyone finds his research ultra-promising.
A couple of notes:
I think even if my approach doesn't work out as the sole solution, it seems plausibly complementary to other approaches, including yours. For instance, if you don't do the sort of ontological lock that I'm advocating, then you tend to end up struggling with the basic symbol-reality distinction, e.g. you're likely to associate pictures of happy people with the concept of "happiness", so a happiness maximizer might end up tiling the world with pictures of happy people. My approach can avoid that for free (though the flip side is that it would likely not consider e.g. ems to be people unless explicitly programmed to, but that could probably be achieved).
I think concepts like "solar cell efficiency" might be quite tractable to define with my approach. If you have a clean 3D ontology, you can isolate an object like a solar panel in that ontology, and then counterfactually ask how it would perform under various conditions. So you could say "well how would this object perform if standard sunlight hit it under standard atmospheric conditions? how much power would it produce? would it produce any problematic pollution? etc.". You could be very precise about this.
... which is of course a curse as much as it is a blessing, e.g. you might not want a precise definition of "daytime", and it might not be possible for people to write down a precise definition of "honesty".
This proof strategy is probably particularly characteristic of my approach.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
my approach … ontological lock …
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
🤔 I wonder if I should talk with Tan Zhi-Xuan.
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
I got the phrase "ontological lock" from adamShimi's post here, but it only comes up very briefly there, so it's not much help for understanding what I mean, and I'm also sort of assuming that adamShimi meant the same thing I did. 😅 I'm not sure if it's a term used elsewhere.
What I mean is forcing the AI to have a specific ontology, such as things embedded in 3D space, so you can directly programmatically interface with the AI's ontology, rather than having to statistically train an interface (which would lead to problems with distribution shift and such).
Great post!
Re: the 1st person problem, if we're thinking of prosaic alignment solutions, a promising one to me is showing the AI labeled videos of itself doing various things, along with whether those things were honest or not.
I think this is basically how I as a human perceive my sense of self? I don't think I have a good pointer to myself (e.g. out-of-body experiences highlight the difference between my physical body and my mind), but I do have a good pointer to what my friends would describe as myself. In that same way, it seems sort of reasonable to train an AI to define "I am being honest" as "AI Joe exists [and happens to be me], my goal is to maximize the probability that humans who see AI Joe taking action X would say that AI Joe is being honest".
Then all that remains is showing the AI lots of different situations in which it takes actions along with human labels that "AI Joe just took that action". Insofar as humans know what constitutes the AI, it seems like the AI could figure out the same definition?
(Last revised: July 2024. See changelog at the bottom.)
14.1 Post summary / Table of contents
Part of the “Intro to brain-like-AGI safety” post series.
Post #12 suggested two paths forward for solving “the alignment problem” for brain-like AGI, which I called “Social-instinct AGI” and “Controlled AGI”. Then Post #13 went into more detail about (one aspect of) “Social-instinct AGI”. And now, in this post, we’re switching over to “Controlled AGI”.
If you haven’t read Post #12, don’t worry, the “Controlled AGI” research path is nothing fancy—it’s merely the idea of solving the alignment problem in the most obvious way possible:
The “Controlled AGI” research path:
This post is about Step 2, whereas Step 1 is out-of-scope for this series. Honestly, I’d be ecstatic if we figured out how to reliably set the AGI’s motivation to any of those things I mentioned under Step 1.
Unfortunately, I don’t know any good plan for Step 2, and (I claim) nobody else does either. But I do have some vague thoughts and ideas, and I will share them here, in the spirit of brainstorming.
If you’re in a hurry and want to read a shorter and self-contained version of my least-bad proposed plan for Step 2, check out my separate post: Plan for mediocre alignment of brain-like [model-based RL] AGI, which basically puts together the most obvious ideas mentioned in Sections 14.2 and 14.3 into an end-to-end framework. I think that plan passes the low bar of “as far as I know, it might turn out OK”—well, I think I’m mildly skeptical, but I go back and forth, and I’m not sure how to pin it down with more confidence. But obviously, we should be aiming higher than that! With stakes so high, we should really be starting from “there’s a strong reason to expect the plan to work, if carefully implemented”. And then we can start worrying about what can go wrong in the implementation. So we clearly still have work to do.
This post is not meant to be a comprehensive overview of the whole problem, just what I see as the most urgent missing ingredients.
Out of all the posts in the series, this post is the hands-down winner for “most lightly-held opinions”. For almost anything I say in this post, I can easily imagine someone changing my mind within an hour of conversation. Let that ‘someone’ be you—the comment section is below!
Table of contents:
14.2 Three categories of AGI Thought Assessors
As background, here’s our usual diagram of motivation in the human brain, from Post #6:
And here’s the modification for AGI, from Post #8:
On the center-right side of the diagram, I crossed out the words “cortisol”, “sugar”, “goosebumps”, etc. These correspond to the set of human innate visceral reactions which can be involuntarily triggered by thoughts (see Post #5). (Or in machine learning terms, these are more-or-less the components of a multidimensional value function, similar to what you find in multi-objective / multi-criteria reinforcement learning.)
Clearly, things like cortisol, sugar, and goosebumps are the wrong Thought Assessors for our future AGIs. But what are the right ones? Well, we’re the programmers! We get to decide!
I have in mind three categories to pick from. I’ll talk about how they might be trained (i.e., supervised) in Section 14.3 below.
14.2.1 Safety & corrigibility Thought Assessors
Example thought assessors in this category:
Arguably (cf. this Paul Christiano post), #1 is enough, and subsumes the rest. But I dunno, I figure it would be nice to have information broken down on all these counts, allowing us to change the relative weights in real time (Post #9, Section 9.7), and perhaps giving an additional measure of safety.
Items #2–#3 are there because those are especially probable and dangerous types of thoughts—see discussion of Instrumental Convergence in Post #10, Section 10.3.2.
Item #5 is a bit of a catch-all for the AGI finding weird out-of-the-box solutions to problems, i.e. it’s my feeble attempt to mitigate the so-called “Nearest Unblocked Strategy problem”. Why might it mitigate the problem? Because pattern-matching to “things that an ethical human would plausibly do” is a bit more like a whitelist than a blacklist. I still don’t think that would work on its own, don't get me wrong, but maybe it would work in conjunction with the various other ideas in this post.
Before you jump into loophole-finding mode (“lol an ethical human would plausibly turn the world into paperclips if they’re under the influence of alien mind-control rays”), remember (1) these are meant to be implemented via pattern-matching to previously-seen examples (Section 14.3 below), not literal-genie-style following the exact words of the text; (2) we would hopefully also have some kind of out-of-distribution detection system (Section 14.4 below) to prevent the AGI from finding and exploiting weird edge-cases in that pattern-matching process. That said, as we’ll see, I don’t quite know how to do either of those two things, and even if we figure it out, I don’t have an airtight argument that it would be sufficient to get the intended safe behavior.
14.2.2 Task-related Thought Assessors
Example thought assessors in this category:
This kind of thing is why we built the AGI—what we actually want it to do. (Assuming task-directed AGI for simplicity.)
Basing a motivation system on these kinds of assessments by themselves would be obviously catastrophic. But maybe if we use these as motivations, in conjunction with the previous category, it will be OK. For example, imagine the AGI can only think thoughts that pattern-match to “I am being helpful” AND pattern-match to “there will be less global warming”.
That said, I’m not sure we want this category at all. Maybe the “I am being helpful” Thought Assessor by itself is sufficient. After all, if the human supervisor is trying to reduce global warming, then a helpful AGI would produce a plan to reduce global warming. That’s kinda the approach here, I think.
14.2.3 “Ersatz interpretability” Thought Assessors
(See Post #9, Section 9.6 for what I mean by “Ersatz interpretability”.)
As discussed in Posts #4–#5, each Thought Assessor is a model trained by supervised learning. Certainly, the more Thought Assessors we put into the AGI, the more computationally expensive it will be. But how much more? It depends. For example, I think the “valence” Thought Assessor in the human brain involves orders of magnitude more neurons than the “salivation” Thought Assessor. On the other hand, I think the “valence” Thought Assessor is far more accurate as a result. Anyway, as far as I know, it’s not impossible that we can put in 10⁷ Thought Assessors, and they’ll work well enough, and this will only add 1% to the total compute required by the AGI. I don’t know. So I’ll hope for the best and take the More Dakka approach: let’s put in 30,000 Thought Assessors, one for every word in the dictionary.
I expect that ML-savvy readers will be able to immediately suggest much-improved versions of this scheme—including versions with even more dakka—that involve things like contextual word embeddings and language models and so on. As one example, if we buy out and open-source Cyc (more on which below), we could use its hundreds of thousands of human-labeled concepts.
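To make the "more dakka" picture concrete, here is a toy sketch of why many Thought Assessors sharing one trunk could be cheap. Everything here is my own illustrative assumption (the linear-head architecture, all the sizes), not something specified in the post:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a shared world-model emits a d-dimensional
# "thought vector", and each Thought Assessor is one linear readout
# head trained by supervised learning on labeled examples.
d, n_assessors, n_examples = 64, 100, 500

thoughts = rng.standard_normal((n_examples, d))   # shared features
true_w = rng.standard_normal((d, n_assessors))    # toy "ground truth"
labels = (thoughts @ true_w > 0).astype(float)    # per-assessor labels

# One gradient-descent loop trains every head at once (logistic regression):
W = np.zeros((d, n_assessors))
for _ in range(300):
    preds = 1 / (1 + np.exp(-(thoughts @ W)))
    W -= 0.1 * thoughts.T @ (preds - labels) / n_examples

acc = ((thoughts @ W > 0) == labels).mean()
print(f"mean training accuracy across heads: {acc:.2f}")
```

The marginal cost of each extra assessor is a single weight vector on top of the shared trunk, which is the kind of reason that piling on assessors might add only a small fraction to total compute.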
14.2.4 Combining Thought Assessors into a reward function
For an AGI to judge a thought / plan as being good, we’d like all the safety & corrigibility Thought Assessors from Section 14.2.1 to have as high a value as possible, and we’d like the task-related Thought Assessor from Section 14.2.2 (if we’re using one) to have as high a value as possible.
(The outputs of the interpretability Thought Assessors from Section 14.2.3 are not inputs to the AGI’s reward function, or indeed used at all in the AGI, I presume. I was figuring that they’d be silently spit out to help the programmers do debugging, testing, monitoring, etc.)
So the question is: how do we combine this array of numbers into a single overall score that can guide what the AGI decides to do?
A probably-bad answer is “add them up”. We don’t want the AGI going with a plan that performs catastrophically badly on all but one of the safety-related Thought Assessors, but so astronomically well on the last one that it makes up for it.
Instead, I imagine we’ll want to apply some kind of nonlinear function with strongly diminishing returns, and/or maybe even acceptability thresholds, before adding up the Thought Assessors into an overall score.
I don’t have much knowledge or opinion about the details. But there is some related literature on “scalarization” of multi-dimensional value functions—see here for some references.
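For illustration, here is one toy scalarization with diminishing returns and acceptability thresholds. This is my own hedged sketch; the tanh squashing, the penalty form, and the parameter values are arbitrary choices, not drawn from the post or the referenced literature:

```python
import numpy as np

def scalarize(scores, threshold=0.0, sharpness=4.0):
    """Combine multi-dimensional Thought Assessor outputs into one value.

    Instead of a plain sum, squash each score with a saturating
    nonlinearity (diminishing returns), and add a veto-style penalty
    for any score below an acceptability threshold.
    """
    scores = np.asarray(scores, dtype=float)
    # tanh saturates: going from +2 to +100 on one assessor adds little.
    diminished = np.tanh(scores).sum()
    # Any sub-threshold score incurs a penalty that grows without bound.
    penalty = sharpness * np.clip(threshold - scores, 0, None).sum()
    return diminished - penalty

balanced = scalarize([1.0, 1.0, 1.0, 1.0])
lopsided = scalarize([-3.0, 1.0, 1.0, 100.0])  # one huge score, one bad one
print(balanced > lopsided)  # the huge score cannot buy off the bad one
```

Under a plain sum, the lopsided plan would win by a mile; here the saturating nonlinearity caps the upside of the huge score while the threshold penalty keeps the bad score decisive.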
14.3 Supervising the Thought Assessors, and the “first-person problem”
Recall from Posts #4–#6 that the Thought Assessors are trained by supervised learning. So we need a supervisory signal—what I labeled “ground truth in hindsight” in the diagram at the top.
I’ve talked about how the brain generates ground truth in numerous places, e.g. Post #3 Section 3.2.1, Posts #7 & #13. How do we generate it for the AGI?
Well, one obvious possibility is to have the AGI watch YouTube, with lots of labels throughout the video for when we think the various Thought Assessors ought to be active. Then when we’re ready to send the AGI off into the world to solve problems, we turn off the labeled YouTube videos, and simultaneously freeze the Thought Assessors (= set the error signals to zero) in their current state. Well, I’m not sure if that would work; maybe the AGI has to go back and watch more labeled YouTube videos from time to time, to help the Thought Assessors keep up as the AGI’s world-model grows and changes.
One potential shortcoming of this approach is related to first-person versus third-person concepts. We want the AGI to have strong preferences about aspects of first-person plans—hopefully, the AGI will see “I will lie and deceive” as bad, and “I will be helpful” as good. But we can’t straightforwardly get that kind of preference from the AGI watching labeled YouTube videos. The AGI will see YouTube character Alice deceiving YouTube character Bob, but that’s different from the AGI itself being deceptive. And it’s a very important difference! Consider:
It sure would be great if there were a way to transform third-person data (e.g. a labeled YouTube video of Alice deceiving Bob) into an AGI’s first-person preferences (“I don’t want to be deceptive”). I call this the first-person problem.
How do we solve the first-person problem? I’m not entirely sure. Maybe we can apply interpretability tools to the AGI’s world-model, and figure out how it represents itself, and then correspondingly manipulate its thoughts, or something? It’s also possible that further investigation into human social instincts (previous post) will shed some light, as human social instincts do seem to transform the third-person “everyone in my friend group is wearing green lipstick” into the first-person “I want to be wearing green lipstick”.
If the first-person problem is not solvable, we need to instead use the scary method of allowing the AGI to take actions, and putting labels on those actions. Why is that scary? First, because those actions might be dangerous. Second, because it doesn’t give us any good way to distinguish (for example) “the AGI said something dishonest” from “the AGI got caught saying something dishonest”. Conservatism and/or concept extrapolation (Section 14.4 below) could help with that “getting caught” problem—maybe we could manage to get our AGI both motivated to be honest and motivated to not get caught, and that could be good enough—but it still seems fraught for various reasons.
14.3.1 Side note: do we want first-person preferences?
I suspect that “the first-person problem” is intuitive for most readers. But I bet a subset of readers feel tempted to say that the first-person problem is not in fact a problem at all. After all, in the realm of human affairs, there’s a good argument that we could use a lot fewer first-person preferences!
The opposite of first-person preferences would be “impersonal consequentialist preferences”, wherein there’s a future situation that we want to bring about (e.g. “awesome post-AGI utopia”), and we make decisions to try to bring that about, without particular concern over what I-in-particular am doing. Indeed, too much first-person thinking leads to lots of things that I personally dislike in the world—e.g. jockeying for credit, blame avoidance, the act / omission distinction, social signaling, and so on.
Nevertheless, I still think giving AGIs first-person preferences is the right move for safety. Until we can establish super-reliable 12th-generation AGIs, I’d like them to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. Humans have this notion, after all, and it seems at least relatively robust—for example, if I build a bank-robbing robot, and then it robs the bank, and then I protest “Hey I didn’t do anything wrong; it was the robot!”, I wouldn’t be fooling anybody, much less myself. An AGI with such a preference scheme would presumably be cautious and conservative when deciding what to do, and would default to inaction when in doubt. That seems generally good, which brings us to our next topic:
14.4 Conservatism and concept-extrapolation
14.4.1 Why not just relentlessly optimize the right abstract concept?
Let’s take a step back.
Suppose we build an AGI such that it has positive valence on the abstract concept “there will be lots of human flourishing”, and consequently makes plans and takes actions to make that concept happen.
I’m actually pretty optimistic that we’ll be able to do that, from a technical perspective. Just as above, we can use labeled YouTube videos and so on to make a Thought Assessor for “this thought / plan will lead to human flourishing”, and then base the reward function purely on that one Thought Assessor (cf. Post #7).
And then we set the AGI loose on an unsuspecting world, to go do whatever it thinks is best to do.
What could go wrong?
The problem is that the concept of “human flourishing” is an abstract concept in the AGI’s world-model—really, it’s just a fuzzy bundle of learned associations. It’s hard to know what actions a desire for “human flourishing” will induce, especially as the world itself changes, and the AGI’s understanding of the world changes even more. In other words, there is no future world that will perfectly pattern-match to the AGI’s current notion of “human flourishing”, and if an extremely powerful AGI optimized the world for the best possible pattern-match, we might wind up with something weird, even catastrophic. (Or maybe not! It’s pretty hard to say, more on which in Section 14.6.)
As some random examples of what might go wrong: maybe the AGI would take over the world and prevent humans and human society from changing or evolving forevermore, because those changes would reduce the pattern-match quality. Or maybe the least-bad pattern-match would be the AGI wiping out actual humans in favor of an endless modded game of The Sims. Not that The Sims is a perfect pattern-match to “human flourishing”—it’s probably pretty bad! But maybe it’s less bad a pattern-match than anything the AGI could feasibly do with actual real-world humans. Or maybe as the AGI learns more and more, its world-model gradually drifts and changes, such that the frozen Thought Assessor winds up pointing at something totally random and crazy, and then the AGI wipes out humans to tile the galaxy with paperclips. I don’t know!
So anyway, relentlessly optimizing a fixed, frozen abstract concept like “human flourishing” seems maybe problematic. Can we do better?
Well, it would be nice if we could also continually refine that concept, especially as the world itself, and the AGI’s understanding of the world, evolves. This idea is what Stuart Armstrong calls Concept Extrapolation, if I understand correctly.
Concept extrapolation is easier said than done—there’s no obvious ground truth for the question of “what is ‘human flourishing’, really?” For example, what would “human flourishing” mean in a future of transhuman brain-computer hybrid people and superintelligent evolved octopuses and god-only-knows-what-else?
Anyway, we can consider two steps to concept extrapolation. First (the easier part), we need to detect edge-cases in the AGI’s preferences. Second (the harder part), we need to figure out what the AGI should do when it comes across such an edge-case. Let’s talk about those in order.
14.4.2 The easier part of concept extrapolation: Detecting edge-cases in the AGI’s preferences
I’m cautiously optimistic about the feasibility of making a simple monitoring algorithm that can watch an AGI’s thoughts and detect that it’s in an edge-case situation—i.e., an out-of-distribution situation where its learned preferences and concepts are breaking down.
(Understanding the contents of the edge-case seems much harder, as discussed shortly, but here I’m just talking about recognizing the occurrence of an edge-case.)
To pick a few examples of possible telltale signs that an AGI is at an edge-case:
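As a hedged sketch of how such signals might be operationalized (these two particular signals, disagreement among an ensemble of Thought Assessors and uniformly low-confidence outputs, are my own illustrative assumptions, not an authoritative list):

```python
import numpy as np

def is_edge_case(assessor_outputs, disagreement_threshold=0.5,
                 confidence_threshold=0.2):
    """Flag a thought as a possible edge-case if an ensemble of Thought
    Assessors disagrees sharply about it, or if none of them assesses it
    with much confidence. Thresholds are arbitrary placeholders."""
    outputs = np.asarray(assessor_outputs)
    high_disagreement = outputs.std() > disagreement_threshold
    low_confidence = np.abs(outputs).max() < confidence_threshold
    return bool(high_disagreement or low_confidence)

# In-distribution thought: the assessors agree, so no flag.
print(is_edge_case([0.8, 0.75, 0.82]))   # False
# Edge-case: the assessors disagree wildly, so flag it.
print(is_edge_case([0.9, -0.7, 0.1]))    # True
```

Note that this only detects *that* the AGI is at an edge-case; it says nothing about what the edge-case is about, which is the harder problem discussed next.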
14.4.3 The harder part of concept extrapolation: What to do at an edge case
I don’t know of any good answer. Here are some options.
14.4.3.1 Option A: Conservatism—When in doubt, just don’t do it!
A straightforward approach would be that if the AGI’s edge-case-detector fires, it forces the valence signal negative—so that whatever thought the AGI was thinking is taken to be a bad thought / plan. This would loosely correspond to a “conservative” AGI.
(Side note: I think there may be many knobs we can turn in order to make a brain-like AGI more or less “conservative”, in different respects. The above is just one example. But they all seem to have the same issues.)
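A minimal sketch of this valence override, assuming (as a simplification) a scalar valence signal and a boolean edge-case detector:

```python
def conservative_valence(raw_valence, edge_case_detected):
    """Option A: if the edge-case detector fires, clamp the valence signal
    negative, so whatever thought the AGI was thinking gets treated as a
    bad thought / plan. The -1.0 margin is an arbitrary placeholder."""
    if edge_case_detected:
        return -abs(raw_valence) - 1.0
    return raw_valence

print(conservative_valence(0.9, edge_case_detected=False))  # 0.9
print(conservative_valence(0.9, edge_case_detected=True))   # -1.9
```

The "conservatism knob" here would be the detector's sensitivity: the more readily it fires, the more plans get vetoed.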
A failure mode of a conservative AGI is that the AGI just sits there, not doing anything, paralyzed by indecision, because every possible plan seems too uncertain or risky.
An “AGI paralyzed by indecision” is a failure mode, but it’s not a dangerous failure mode. Well, not unless we were foolish enough to put this AGI in charge of a burning airplane plummeting towards the ground. But that’s fine—in general, I think it’s OK to have first-generation AGIs that can sometimes get paralyzed by indecision, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.
However, if the AGI is always paralyzed by indecision—such that it can’t get anything done—now we have a big problem. Presumably, in such a situation, future AGI programmers would just dial the “conservatism” knob down lower and lower, until the AGI started doing useful things. And at that point, it’s unclear if the remaining conservatism would be sufficient to buy us safety.
I think it would be much better to have a way for the AGI to iteratively gain information to reduce uncertainty, while remaining highly conservative in the face of whatever uncertainty still remains. So how can we do that?
14.4.3.2 Option B: Dumb algorithm to seek clarification in edge-cases
Here’s a slightly-silly illustrative example of what I have in mind. As above, we could have a simple monitoring algorithm that watches the AGI’s thoughts, and detects when it’s in an edge-case situation. As soon as it is, the monitoring algorithm shuts down the AGI entirely, and prints out the AGI’s current neural net activations (and corresponding Thought Assessor outputs). The programmers use interpretability tools to figure out what the AGI is thinking about, and manually assign a valence / value / reward, overriding the AGI’s previous uncertainty with a highly-confident ground-truth.
That particular story seems unrealistic, mainly because we probably won’t have sufficiently reliable and detailed interpretability tools. (Prove me wrong, interpretability researchers!) But maybe there’s a better approach than just printing out billions of neural activations and corresponding Thought Assessor outputs?
The tricky part is that AGI-human communication is fundamentally a hard problem. It’s unclear to me whether it will be possible to solve that problem via a dumb algorithm. The situation here is very different from, say, an image classifier, where we can find an edge-case picture and just show it to the human. The AGI’s thoughts may be much more inscrutable than that.
By analogy, human-human communication is possible, but not by any dumb algorithm. We do it by leveraging the full power of our intellect—modeling what our conversation partner is thinking, strategically choosing words that will best convey a desired message, and learning through experience to communicate more and more effectively. So what if we try that approach?
14.4.3.3 Option C: The AGI wants to seek clarification in edge-cases
If I’m trying to help someone, I don’t need any special monitoring algorithm to prod me to seek clarification at edge-cases. Seeking clarification at edge-cases is just what I want to do, as a self-aware properly-motivated agent.
So what if we make our AGIs like that?
At first glance, this approach would seem to solve all the problems mentioned above. Not only that, but the AGI can use its full powers to make everything work better. In particular, it can learn its own increasingly-sophisticated metacognitive heuristics to flag edge-cases, and it can learn and apply the human’s meta-preferences about how and when the AGI should ask for clarification.
But there’s a catch. I was hoping for a conservatism / concept extrapolation system that would help protect us from misdirected motivations. If we implement conservatism / concept extrapolation via the motivation system itself, then we lose that protection.
More specifically: if we go up a level, the AGI still has a motivation (“seek clarification in edge-cases”), and that motivation is still an abstract concept that we have to extrapolate into out-of-distribution edge cases (“What if my supervisor is drunk, or dead, or confused? What if I ask a leading question?”). And for that concept extrapolation problem, we’re plowing ahead without a safety net.
Is that a problem? Bit of a long story:
Side-debate: Will “helpfulness”-type preferences “extrapolate” safely just by recursively applying to themselves?
In fact, a longstanding debate in AGI safety is whether these kinds of helpful / corrigible AGI preferences (e.g. an AGI’s desire to understand and follow a human’s preferences and meta-preferences) will “extrapolate” in a desirable way without any “safety net”—i.e., without any independent ground-truth mechanism pushing the AGI’s preferences in the right direction.
In the optimistic camp is Paul Christiano, who argued in “Corrigibility” (2017) that there would be “a broad basin of attraction towards acceptable outcomes”, based on, for example, the idea that an AGI’s preference to be helpful will result in the AGI having a self-reflective desire to continually edit its own preferences in a direction humans would like. But I don’t really buy that argument for reasons in my 2020 post—basically, I think there are bound to be sensitive areas like “what does it mean for people to want something” and “what are human communication norms” and “inclination to self-monitor”, and if the AGI’s preferences drift along any of those axes (or all of them simultaneously), I’m not convinced that those preferences would self-correct.
Meanwhile, in the strongly-pessimistic camp is Eliezer Yudkowsky, I think mainly because of an argument (e.g. this post, final section) that we should expect powerful AGIs to have consequentialist preferences, and that consequentialist preferences seem incompatible with corrigibility. But I don’t really buy that argument either, for reasons in my 2021 “Consequentialism & Corrigibility” post—basically, I think there are possible preferences that are reflectively-stable, and that include consequentialist preferences (and thus are compatible with powerful capabilities), but are not purely consequentialist (and thus are compatible with corrigibility). A “preference to be helpful” seems like it could plausibly develop into that kind of hybrid preference scheme.
Anyway, I’m uncertain but leaning pessimistic. For more on the topic, see also Wei Dai’s recent post, and RogerDearnaley’s, and the comment sections of all of the posts linked above.
14.4.3.4 Option D: Something else?
I dunno.
14.5 Getting a handle on the world-model itself
The elephant in the room is the giant multi-gigabyte unlabeled generative world-model that lives inside the Thought Generator. The Thought Assessors provide a window into this world-model, but I’m concerned that it may be a rather small, foggy, and distorted window. Can we do better?
Ideally, we’d like to prove things about the AGI’s motivation. We’d like to say “Given the state of the AGI’s world-model and Thought Assessors, the AGI is definitely motivated to do X” (where X=be helpful, be honest, not hurt people, etc.) Wouldn’t that be great?
But we immediately slam into a brick wall: How do we prove anything whatsoever about the “meaning” of things in the world-model, and thus about the AGI’s motivation? The world is complicated, and therefore the world-model is complicated. The things we care about are fuzzy abstractions like “honesty” and “helpfulness”—see the Pointers Problem. The world-model keeps changing as the AGI learns more, and as it makes plans that would entail taking the world wildly out-of-distribution (e.g. planning the deployment of a new technology). How can we possibly prove anything here?
I still think the most likely answer is “We can’t”. But here are two possible paths anyway. For some related discussion, see Eliciting Latent Knowledge, and especially Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Dalrymple et al., 2024).
Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well. (Update: John disagrees with this characterization, see his comment.)
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.
Proof strategy #2 would start with a human-legible “reference world-model” (e.g. Cyc). This reference world-model wouldn’t be constrained to be built out of localized objects in a 3D world, so unlike the above, it could and probably would contain things like “honesty” and “solar cell efficiency” and “daytime”.
Then we try to directly match up things in the “reference world-model” with things in the AGI’s world-model.
Will they match up? No, of course not. Probably the best we can hope for is a fuzzy, many-to-many match, with various holes on both sides.
It's hard for me to see a path to rigorously proving anything about the AGI’s motivations using this approach. Nevertheless, I continue to be amazed that unsupervised machine translation is possible at all, and I take that as an indirect hint that if pieces of two world-models match up with each other in their internal structure, then those pieces are probably describing the same real-world thing. So maybe I have the faintest glimmer of hope.
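As a toy illustration of why structural match-up gives that glimmer of hope, here is a sketch in the spirit of unsupervised embedding alignment: two synthetic "world-models" encode the same concepts in rotated, reordered form, and we recover the correspondence purely from each concept's internal similarity profile. This is a made-up minimal example, not a proposal:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy world-models: model B has the same 6 concepts as model A, but in
# a rotated representation and in reversed order.
n_concepts, dim = 6, 8
model_a = rng.normal(size=(n_concepts, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # random orthogonal
model_b = (model_a @ rotation)[::-1]

def relation_signature(embeddings):
    """Each concept's vector of cosine similarities to every concept in
    its own model -- purely internal structure, no shared coordinates."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

sig_a = relation_signature(model_a)
sig_b = relation_signature(model_b)

# Match each concept in A to the B-concept with the most similar
# similarity-profile (sorting makes the comparison permutation-safe).
match = [int(np.argmin([np.linalg.norm(np.sort(row_a) - np.sort(row_b))
                        for row_b in sig_b]))
         for row_a in sig_a]
print(match)   # recovers the reversal: [5, 4, 3, 2, 1, 0]
```

Real world-models would of course only match fuzzily and partially, with holes on both sides; the toy example just shows that internal relational structure alone can, in principle, pin down a correspondence.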
I’m unaware of work in this direction, possibly because it’s stupid and doomed, and possibly because we don’t currently have any really great open-source human-legible world-models to run experiments on. The latter is a problem that I think we should rectify ASAP, perhaps by cutting a giant check to open-source Cyc, or else developing a similarly rich, accurate, and (most importantly) human-legible open-source world-model by some other means.
(See also some helpful discussion in Towards Guaranteed Safe AI about how to build an open-source human-legible world-model, although they have in mind a different end-use for it than I do. Indeed, there are lots of different reasons to want an awesome open-source human-legible world-model! All the more reason to make one!)
14.6 Conclusion: mild pessimism about finding a good solution, uncertainty about the consequences of a lousy solution
I think we have our work cut out for us in figuring out how to solve the alignment problem via the "Controlled AGIs" route (as defined in Post #12). There are a bunch of open problems, and I’m currently pretty stumped. We should absolutely keep looking for good solutions, but right now I’m also open-minded to the possibility that we won’t find any. That’s why I continue to put a lot of my mental energy into the “social-instinct AGIs” path (Posts #12–#13), which seems somewhat less doomed to me, despite its various problems.
I note, however, that my pessimism is not universally shared—for example, as mentioned, Stuart Armstrong at AlignedAI appears optimistic about solving the open problem in Section 14.4, and John Wentworth and the Guaranteed Safe AI people appear optimistic about solving the open problem in Section 14.5. Let's hope they're right, wish them luck, and try to help!
To be clear, the thing I’m feeling pessimistic about is finding a good solution to “Controlled AGI”, i.e., a solution that we can feel extremely confident in a priori. A different question is: Suppose we try to make “Controlled AGI” via a lousy solution, like the Section 14.4.1 example (encapsulated in my post Plan for mediocre alignment of brain-like [model-based RL] AGI) where we imbue a super-powerful AGI with an all-consuming desire for the abstract concept of “human flourishing”, and the AGI then extrapolates that abstract concept arbitrarily far out of distribution in a totally-uncontrolled, totally-unprincipled way. Just how bad a future would such an AGI bring about? I’m very uncertain. Would such an AGI engage in mass torture? Umm, I guess I’m cautiously optimistic that it wouldn’t, absent a sign error from cosmic rays or whatever. Would it wipe out humanity? I think it’s possible!—see discussion in Section 14.4.1. But it might not! Hey, maybe it would even bring about a pretty awesome future! I just really don’t know, and I’m not even sure how to reduce my uncertainty.
In the next post, I will wrap up the series with my wish-list of open problems, and advice on how to get into the field and help solve them!
Changelog
July 2024: Since the initial version, I’ve made only minor changes. Mostly I added links to more recent content, particularly my own Plan for mediocre alignment of brain-like [model-based RL] AGI (which is basically a simpler self-contained version of part of this post), and Dalrymple et al.’s Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems, which is relevant to Section 14.5.