Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually "not really about alignment" in some sense.
Roughly speaking, if I ask a system for something, and then the result is not really what I wanted, but the system "could have" given the result I wanted in some sense, then that's an alignment problem regardless of whether the system is a superintelligent AI or Google Maps. Whether it's a simple system with a bad user interface, or a giant ML system with an unfriendly mesa-optimizer embedded in it, the conceptual core of the problem isn't all that different.
The difference is mainly in how-bad-it-is for the system to be misaligned (for a given degree-of-misalignment). That does have important implications for how we think about AI safety - e.g. we can try to create systems which are reasonably safe without really solving the alignment problem. But I think it's useful to distinguish safety vs alignment here - e.g. a proposal to make an AI safe by making sure it doesn't do anything very far out of the training distribution might be a reasonable safety proposal without really saying much about the alignment problem.
Similarly, proposals along the lines of "simulate a human working on the alignment problem for a thousand years" are mostly safety proposals, and pass the buck on the alignment parts of the problem. (Which is not necessarily bad!)
The distinction matters because, roughly speaking, alignment advances should allow us to leverage more-capable systems while maintaining any given safety level. On the other hand, safety-without-alignment mostly chooses a point on the safety-vs-capabilities pareto surface without moving that surface. (Obviously this is a severe oversimplification of a problem with a lot more than two dimensions, but I still think it's useful.)
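The frontier picture can be made concrete with a toy model. This is only a sketch under a made-up assumption: that safety falls off as `alignment / (alignment + capability)`. The function names and the functional form are hypothetical and chosen purely for illustration; nothing here is a claim about real systems.

```python
def safety(capability: float, alignment: float) -> float:
    # Toy assumption: safety falls as capability grows and rises with alignment.
    return alignment / (alignment + capability)


def max_capability(safety_floor: float, alignment: float) -> float:
    # Invert the toy frontier: the largest capability that still meets the floor.
    return alignment * (1.0 - safety_floor) / safety_floor


# Safety-without-alignment: slide along one fixed frontier (alignment = 1.0).
cautious = max_capability(safety_floor=0.9, alignment=1.0)
bold = max_capability(safety_floor=0.5, alignment=1.0)

# Alignment progress: keep the same safety floor, but move the frontier outward.
aligned = max_capability(safety_floor=0.9, alignment=4.0)

print(f"cautious={cautious:.3f} bold={bold:.3f} aligned={aligned:.3f}")
```

In this toy, moving from `cautious` to `bold` buys capability only by giving up safety, while raising `alignment` buys capability at an unchanged safety floor.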
I think this is a reasonable definition of alignment, but it's not the one everyone uses.
I also think that for reasons like the "ability to understand itself" thing, there are pretty interesting differences in the alignment problem as you're defining it between capability levels.
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.
Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence: avoiding negative side effects, safe exploration...
I don't understand why you think that negative side effect avoidance belongs on that list.
A sufficiently intelligent system will probably be able to figure out when it's having negative side effects. This does not mean that it will—as a matter of fact—avoid having these side effects, and it does not mean that its NegativeSideEffect? predicate is accessible. A paperclip maximizer may realize that humans consider extinction to be a "negative side effect." This consideration does not move it. Increasing agent intelligence does not naturally solve the problem of getting the agent to not do catastrophically impactful things while optimizing its objective.
In contrast, once an agent realizes that an exploration strategy is unsafe, the agent will be instrumentally motivated to find a better one. Increasing agent intelligence naturally solves the problem of safe exploration.
it will massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.
Presumably you meant to say "it will be able to massively outperform..."? (I think you did, since you mention a similar consideration under "Ability to understand itself.") A competent agent will understand, but will only act accordingly if so aligned (for either instrumental or terminal reasons).
Re the negative side effect avoidance: Yep, you're basically right, I've removed side effect avoidance from that list.
And you're right, I did mean "it will be able to" rather than "it will"; edited.
That was my reaction when reading the competence subsection too. I'm really confused, because that's quite basic Orthogonality Thesis, so it should be quite obvious to the OP. Maybe the way the post was written implies some things the OP didn't mean?
There’s a related dynamic that came up in a convo I just had.
Alice: My current work is exploring if we can solve value loading using reward learning.
Bob: Woah, isn’t that obviously doomed? Didn’t Rohin write a whole sequence on this?
Alice: Well, I don’t want to solve the whole problem for arbitrary difficulty. I just want to know whether we can build something that gets the basics right in distributions that a present day human can understand. For example I reckon we may be able to teach an AI what murder is today, even if we can’t teach it what murder is post-singularity.
Bob: I see. That’s more reasonable. However, there is a school of thought that I suspect you may be a part of, called “slow takeoff”, in which it looks like over the course of maybe 10 years the world will become increasingly reliant on ML systems whose internals we don’t understand, and whose actions we don’t understand (we just see the metrics go up). That world is already out of distribution for what a present day human can understand.
Alice: That’s true. I guess I’m interested to know how far we can push it and still build a system that helps us take pivotal acts and doesn’t destroy everything.
(The above is my paraphrase. Both said different things than the above and also I incorporated some things I said.)
The success story you have in mind determines what problem you’re trying to solve, and to some extent which capability regime you’re thinking about.
But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.
Interestingly, this is already a well known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the least human-interpretable. It makes intuitive sense: any signal that can be understood by a human is at risk of being copied by a human, so if you insist that your trading decisions have to be interpretable, you'll pay for that insistence in alpha.
I'd imagine this kind of issue is already top-of-mind for folks who are working on the various transparency agendas, but it does imply that there's a very strong optimization pressure directly against interpretability in many economically relevant contexts. In fact, it could hardly be stronger: your forcing function is literally "Want to be a billionaire? Then you'll have to trade exclusively on the most incomprehensible signals you can find."
(Of course this isn't currently true of all hedge funds, only a few specialized ones.)
Planned summary for the Alignment Newsletter:
One reason that researchers might disagree on what approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes on which the “type” of alignment problem can differ. First, you may consider AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which this leads to bad outcomes, where possible mechanisms include <@the second species problem@>(@AGI safety from first principles@) (where AIs seize control of the future from us), the “missed opportunity” problem (where we fail to use AIs as well as we could have, but the AIs aren’t themselves threatening us), and a grab bag of other possibilities (such as misuse of AI systems by bad actors).
Depending on where you land on these axes, you will get to rely on different assumptions that change what solutions you would be willing to consider:
1. **Competence.** If you assume that the AI system is human-level or superintelligent, you probably don’t have to worry about the AI system causing massive problems through incompetence (at least, not to a greater extent than humans do).
2. **Ability to understand itself.** With wildly superintelligent systems, it seems reasonable to expect them to be able to introspect and answer questions about their own cognition, which could be a useful ingredient in a solution that wouldn’t work in other regimes.
3. **Inscrutable plans or concepts.** With sufficiently competent systems, you might be worried about the AI system making dangerous plans you can’t understand, or reasoning with concepts you will never comprehend. Your alignment solution must be robust to this.
Planned opinion:
When I talk about alignment, I am considering the second species problem, with AI systems whose capability level is roughly human-level or more (including “wildly superintelligent”).
I agree with [this comment thread](https://www.alignmentforum.org/posts/HHunb8FPnhWaDAQci/the-alignment-problem-in-different-capability-regimes?commentId=mz6hYiWqNMTwHNrhC) that the core _problem_ in what-I-call-alignment stays conserved across capability levels, but the solutions can change across capability levels. (Also, other people mean different things by “alignment”, such that this would no longer be true.)
Similarly to johnswentworth: My current impression is that core alignment problems are the same and manifest at all levels. Often the sub-human version just looks like a toy version of the scaled-up problem, and the main difference is that, in the sub-human version of the problem, you can often solve it for practical purposes by plugging in a human at some strategic spot. (While I don't think there are deep differences in the alignment problem space, I do think there are differences in the "alignment solutions" space, where you can use non-scalable solutions, or in risk space, where dangers are small because the systems are stupid.)
I'm also unconvinced about some of the practical claims about differences for wildly superintelligent systems.
One crucial concern related to "what people want" is this seems underdefined, un-stable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases. By this line of reasoning, if the wildly superintelligent system is able to answer these sorts of questions for me "in a way I want", it very likely must already be aligned. So it feels like part of the worry was assumed away. Paraphrasing the questions about human values again, one may ask "how did you get to the state where you have this aligned wildly superintelligent system which is able to answer questions about human values, as opposed to e.g. overwriting what humans believe about themselves with its own non-human-aligned values?".
Ability to understand itself seems a special case of competence: I can imagine systems which are wildly superhuman in their ability to understand the rest of the world, but pretty mediocre at understanding themselves, e.g. due to some problems with recursion, self-references, reflections, or different kinds of computations being used at various levels of reasoning. As a result, it seems unclear whether the ability to clearly understand itself is a feature of all wildly super-human systems. (Toy counterexample: imagine a device which would connect someone in ancient Greece with our modern civilization, and our civilization dedicating about 10% of global GDP to answering questions from this guy. I would argue this device is for most practical purposes wildly superhuman compared to this individual guy in Greece, but at the same time bad at understanding itself)
Fundamentally inscrutable thoughts seem like something which you can study with present day systems as toy models. E.g., why does AlphaZero believe something is a good go move? Why does a go grand-master believe something is a good move? What counts as a 'true explanation'? Who is the recipient of the explanation? Are you happy with an explanation of the algorithm like 'upon playing myriad games, my general function approximator estimates that the expected value of this branch of an unimaginably large choice tree is larger than for other branches'? If yes, why? If no, why not?
Inscrutable influence-seeking plans also seem to be a present problem. E.g., if there are already some complex influence-seeking patterns now, how would we notice?
One crucial concern related to "what people want" is this seems underdefined, un-stable in interactions with wildly superintelligent systems, and prone to problems with scaling of values within systems where intelligence increases.
This is what I was referring to with
by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values
The superintelligence can answer any operationalizable question about human values, but as you say, it's not clear how to elicit the right operationalization.
If you’re trying to align wildly superintelligent systems, you don’t have to worry about any concern related to your system being incompetent.
In general, this seems false. The thing you don't have to worry about is subhuman competence. You may still have to worry about incompetence relative to some highly superhuman competence threshold. (it may be fine to say that this isn't an alignment problem - but it's a worry)
One concern is reaching [competent at X] before [competent at operating safely when [competent at X]].
Here it'd be fine if the system had perfect knowledge of the risks of X, or perfect calibration of its uncertainty around such risks. Replace "perfect" with "wildly superhuman", and you lose the guarantee. If human-level-competence would be wildly unsafe at the [...operating safely...] task, then knowing the system will do better isn't worth much. (we're wildly superchimp at AI safety; this may not be good enough)
I think it can sometimes be misleading to think/talk about issues "in the limit of competence": in the limit you're throwing away information about relative competence levels (at least unless you're careful to take limits of all the important ratios etc. too).
E.g. take two systems:
Alice: [$x$ power, $x^2$ wisdom]
Bob: [$x^2$ power, $x$ wisdom]
We can let $x$ tend to infinity and say they're both arbitrarily powerful and arbitrarily wise, but I'd still trust Alice a whole lot more than Bob at any given time (for safe exploration, and many other things).
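To spell out what the limit hides, here is a worked version under one illustrative choice of growth rates. The specific exponents are an assumption made for this sketch: Alice's wisdom grows like $x^2$ against power $x$, and Bob's the other way around.

```latex
\begin{align*}
\text{Alice:}\quad & \text{power}_A(x) = x, \quad \text{wisdom}_A(x) = x^2,
  & \lim_{x\to\infty} \frac{\text{wisdom}_A(x)}{\text{power}_A(x)} &= \lim_{x\to\infty} x = \infty, \\
\text{Bob:}\quad & \text{power}_B(x) = x^2, \quad \text{wisdom}_B(x) = x,
  & \lim_{x\to\infty} \frac{\text{wisdom}_B(x)}{\text{power}_B(x)} &= \lim_{x\to\infty} \frac{1}{x} = 0.
\end{align*}
```

All four quantities diverge, so "arbitrarily powerful and arbitrarily wise" describes both systems equally well; only the ratio of wisdom to power tells them apart, and that is exactly the information a naive limit discards.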
I don't think it's enough to say "Bob will self-modify to become Alice-like (in a singleton scenario)". The concerning cases are where Bob has insufficient wisdom to notice or look for a desirable [self-modify to Alice] style option.
It's conceivable to me that this is a non-problem in practice: that any system with only modestly super-human wisdom starts to make Wei Dai look like a reckless megalomaniac, regardless of its power. Even if that's true, it seems important to think about ways to train systems such that they acquire this level of wisdom early.
Perhaps this isn't exactly an alignment problem - but it's the kind of thing I'd want anyone aligning wildly superintelligent systems to worry about (unless I'm missing something).
I think the alignment problem looks different depending on the capability level of systems you’re trying to align. And I think that different researchers often have different capability levels in mind when they talk about the alignment problem. I think this leads to confusion. I’m going to use the term “regimes of the alignment problem” to refer to the different perspectives on alignment you get from considering systems with different capability levels.
(I would be pretty unsurprised if these points had all been made elsewhere; the goal of this post is just to put them all in one place. I’d love pointers to pieces that make many of the same points as this post. Thanks to a wide variety of people for conversations that informed this. If there’s established jargon for different parts of this, point it out to me and I’ll consider switching to using it.)
Different regimes:
Two main causes that lead to differences in which regime people focus on:
Different mechanisms by which AI misalignment could be bad from a longtermist perspective:
Here are some aspects of the alignment problem that are different in different regimes. I would love to make a longer list of these.
For example, any concern related to the system not understanding what people will want. Human values are complex, but by assumption the superintelligence will be able to answer any question you’re able to operationalize about human values. Eg it will be able to massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.
Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence include safe exploration and maybe competence-related aspects of robustness to distributional shift (see Concrete Problems in AI Safety).
Obviously there will be things that your system doesn’t know, and it will be prone to mistakes sometimes. But it will be superhumanly good at being calibrated re these things. For example, it should know how vulnerable to adversarial inputs it is, and it should know when it doesn’t understand something about human values, and so on.
I think that this basically definitionally isn’t a problem unless your system is roughly human-level--it will probably have inscrutable plans at lower capability levels, but it probably won’t be powerful enough to surreptitiously gain power. It’s somewhat of a problem in the human-level regime and becomes a much clearer problem when the system is much more capable than humans.
Sub-AGI systems probably will have inscrutable thoughts. But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.
The “competence” and “ability to understand itself” properties make the problem easier; the latter two make the problem harder.
I’m currently most interested in the second species problem and the missed opportunity problem. These days I usually think about the “wildly superintelligent” and “roughly human-level” regimes. I normally think about alignment research relevant to the situations where systems are arbitrarily competent, are prone to inscrutable influence-seeking plans, and don’t have inscrutable thoughts (mostly because I don’t know of good suggestions for getting around the inaccessible information problem).