Planned summary for the Alignment Newsletter:
At a very high level, we can model powerful AI systems as moving closer and closer to omniscience. As we move in that direction, what becomes the new constraint on technology? This post argues that the constraint is _good interfaces_, that is, something that allows us to specify what the AI should do. As with most interfaces, the primary challenge is dealing with the discrepancy between the user's abstractions (how humans think about the world) and the AI system's abstractions, which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations). The author believes that this is the central problem of AI alignment: how to translate between these abstractions in a way that accurately preserves meaning.
The post goes through a few ways that we could attempt to do this translation, but all of them seem to only reduce the amount of translation that is necessary: none of them solve the chicken-and-egg problem of how you do the very first translation between the abstractions.
Planned opinion:
I like this view on alignment, but I don't know if I would call it the _central_ problem of alignment. It sure seems important that the AI is _optimizing_ something: this is what prevents solutions like "make sure the AI has an undo button / off switch", which would be my preferred line of attack if the main source of AI risk were bad translations between abstractions. There's a longer discussion on this point here.
(I might change the opinion based on further replies to my other comment.)
Endorsed; that definitely captures the key ideas.
If you haven't already, you might want to see my answer to Steve's comment, on why translation to low-level structure is the right problem to think about even if the AI is using higher-level models.
I did see that answer and pretty strongly agree with it; the "low-level structure" part of my summary was meant to be an example, not a central case. To make this clearer, I changed
which could potentially be detailed accurate low-level simulations
to
which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations)
So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?
With just this general argument, I would probably not argue for AI risk -- if I had to argue for it, the argument would go "we ask the AI to do something, this gets mistranslated and the AI does something else with weird consequences, maybe the weird consequences include extinction", but it sure seems like as it starts doing the "something else" we would e.g. turn it off.
Starting point: the problem which makes AI alignment hard is not the same problem which makes AI dangerous. This is the capabilities/alignment distinction: AI with extreme capabilities is dangerous; aligning it is the hard part.
So it seems like this framing of alignment removes the notion of the AI "optimizing for something" or "being goal-directed". Do you endorse dropping that idea?
Anything with extreme capabilities is dangerous, and needs to be aligned. This applies even outside AI - e.g. we don't want a confusing interface on a nuclear silo. Lots of optimization power is a sufficient condition for extreme capabilities, but not a necessary condition.
Here's a plausible doom scenario without explicit optimization. Imagine an AI which is dangerous in the same way as a nuke is dangerous, but more so: it can make large irreversible changes to the world too quickly for anyone to stop it. Maybe it's capable of designing and printing a supervirus (and engineered bio-offense is inherently easier than engineered bio-defense); maybe it's capable of setting off all the world's nukes simultaneously; maybe it's capable of turning the world into grey goo.
If that AI is about as transparent as today's AI, and does things the user wasn't expecting about as often as today's AI, then that's not going to end well.
Now, there is the counterargument that this scenario would produce a fire alarm, but there's a whole host of ways that could fail:
Getting back to your question:
Do you endorse dropping that idea?
I don't endorse dropping the AI-as-optimizer idea entirely. It is definitely a sufficient condition for AI to be dangerous, and a very relevant sufficient condition. But I strongly endorse the idea that optimization is not a necessary condition for AI to be dangerous. Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.
Tool AI can be plenty dangerous if it's capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.
I definitely agree with that characterization. I think the solutions I would look for would be quite different though: they would look more like "how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly", similarly to how with nuclear power plants (I assume) there are (possibly redundant) mechanisms that ensure you can turn off the power plant.
Of course these solutions are also subject to the same translation problem, but it seems plausible to me that that translation problem is easier to solve, relative to solving translation in full generality.
AI-as-optimizer would suggest that even if the translation problem were solved for the particular things I mentioned, it still might not be enough, because e.g. the AI might deliberately prevent me from pressing the undo button.
You could say something like "an AI that can enact large irreversible changes might form a plan where the large irreversible change starts with disabling the undo button", but then it sort of feels like we're bringing back in the idea of optimization. Maybe that's fine, we're pretty confused about optimization anyway.
"how do I ensure that the AI system has an undo button" and "how do I ensure that the AI system does things slowly"
I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.
Even if we abandon economic competitiveness, I doubt that slow+reversible makes the translation problem all that much easier (though it would make the AI at least somewhat less dangerous, I agree with that). It's probably somewhat easier - having a few cycles of feedback seems unlikely to make the problem harder. But if e.g. we're originally training the AI via RL, then slow+reversible basically just adds a few more feedback cycles after deployment; if millions or billions of RL cycles didn't solve the problem, then adding a handful more at the end seems unlikely to help much (though an argument could be made that those last few are higher-quality). Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".
Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues. In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied. For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.
Zooming out even further...
I think the solutions I would look for would be quite different though...
I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities. I expect that such strategies, in general, will run into similar problems to those mentioned above:
I think what's driving this intuition is that you're looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities.
Yup, that is definitely the intuition.
Taking a more outside view... restrictions like "make it slow and reversible" feel like patches which don't really address the underlying issues.
Agreed.
In general, I'd expect the underlying issues to continue to manifest themselves in other ways when patches are applied.
I mean, they continue to manifest in the normal sense, in that when you say "cure cancer", the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.
For instance, even with slow & reversible changes, it's still entirely plausible that humans don't stop something bad because they don't understand what's going on in enough detail - that's a typical scenario in the "translation problem" worldview.
[...]
Also, there's still the problem of translating a human's high-level notion of "reversible" into a low-level notion of "reversible".
[...]
simple solutions will be patches which don't address everything, there will be a long tail of complicated corner cases, etc.
All of this is true; I'm more arguing that slow & reversible eliminates ~95% of the problems, and so if it's easier to do than "full" alignment, then it probably becomes the best thing to do on the margin.
I don't think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of "slow" is roughly "a human has time to double-check", which immediately makes things very expensive.
I'd expect we'd be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you've worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.
(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)
It certainly does mean you take a hit to economic competitiveness; I mostly think the hit is not that large and is something we could pay.
I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)
Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased. And that's exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn't seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.
Anyway, sounds like value-in-the-tail is a central crux here.
Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to caveat below.
it's not a necessary condition - if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn't really decreased.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few initial phases.
(For the economic value part, this is mostly based on industry experience trying to automate things.)
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you've translated better and now you've eliminated 99% of the risk...
I don't see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there's a very important difference between very similar translations of "help me translate things" - e.g. consider the difference between "what would you say if you wanted to convey X?" and "what should I say if I want to convey X?", when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you're good - similar to the "translate how-to-translate" strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn't perfect, then you need some way to find that out safely - and the 95% accurate dictionary doesn't make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
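One way to read the ground-truth point is as a statement about systematic error. The sketch below is only an illustration under assumptions that are not in the comment above: every reading shares the same hypothetical 5% bias, plus some independent noise. Averaging more readings shrinks the noise but leaves the bias untouched, so the estimate stays about 5% off no matter how much of the same data we process.

```python
import random
import statistics


def biased_samples(true_value: float, bias: float, noise_sd: float,
                   n: int, seed: int = 0) -> list[float]:
    """Draw n noisy readings that all share the same relative bias (assumed)."""
    rng = random.Random(seed)
    center = true_value * (1.0 + bias)
    return [rng.gauss(center, noise_sd) for _ in range(n)]


def relative_error(true_value: float, samples: list[float]) -> float:
    """Relative error of the sample mean as an estimate of true_value."""
    return abs(statistics.fmean(samples) - true_value) / true_value


if __name__ == "__main__":
    # More data of the same kind: the noise averages out, the bias does not.
    for n in (100, 10_000, 100_000):
        err = relative_error(100.0, biased_samples(100.0, 0.05, 1.0, n))
        print(f"n = {n:>7,}: relative error of the mean = {err:.4f}")
```

Getting below the 5% floor in this toy model requires a different kind of data (an unbiased reference), which is the analogue of "some other way to get at the ground truth".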
Let me know if this analogy sounds representative of the strategies you imagine.
Yeah, it does. I definitely agree that this doesn't get around the chicken-and-egg problem, and so shouldn't be expected to succeed on the first try. It's more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.
I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn't ever be able to ask your AI system to do something completely novel without having a human along to ensure it's what we actually meant, which seems wrong to me, but I can't articulate why.)
I think you get "ground truth data" by trying stuff and seeing whether or not the AI system did what you wanted it to do.
That's the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the cases where the translation usually produces the results you expect, but then produces something completely different in some rare cases.
Another way to put it: if we're gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We're not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can't-ask-the-AI-to-do-anything-novel problem).
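To put rough numbers on how scarce tail cases are, here is a small sketch. It assumes, purely for illustration, that each trial fails independently with a fixed rate; the rates and the trial count are made up, not taken from the discussion.

```python
def prob_no_failure_observed(failure_rate: float, num_trials: int) -> float:
    """Chance that num_trials independent trials show zero failures."""
    return (1.0 - failure_rate) ** num_trials


def expected_failures(failure_rate: float, num_trials: int) -> float:
    """Expected number of failures we get to observe and learn from."""
    return failure_rate * num_trials


if __name__ == "__main__":
    trials = 10_000  # hypothetical size of a testing campaign
    for rate in (1e-2, 1e-4, 1e-6):
        p_clean = prob_no_failure_observed(rate, trials)
        seen = expected_failures(rate, trials)
        print(f"rate {rate:g}: P(no failure seen) = {p_clean:.3f}, "
              f"expected failures seen = {seen:g}")
```

Under these assumptions, a failure mode that shows up once in a million uses will almost certainly pass a ten-thousand-trial test unnoticed, and even a once-in-ten-thousand failure gives about one example to generalize from.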
(The other comment is my main response, but there's a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it's at least plausible (and I would say more likely than not) that this iterative strategy doesn't ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
Now, under this argument, it's still a fine idea to try the iterative strategy. But you wouldn't want to bet too heavily on its success, especially without a reliable way to check whether it's working.
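A toy version of the non-convergence worry, as a sketch only. The starting point and the first two steps (95% to 96% to 96.5%) come from the example above; the assumption that each round's gain is half of the previous one is mine, chosen only because it reproduces those two steps.

```python
def iterate_coverage(start: float = 95.0, first_gain: float = 1.0,
                     decay: float = 0.5, rounds: int = 20) -> list[float]:
    """Percent of problems solved after each refinement round.

    Each round adds `gain` percentage points, then the gain shrinks by
    the factor `decay` (an assumed model of decreasing marginal returns).
    """
    coverage = start
    gain = first_gain
    history = [coverage]
    for _ in range(rounds):
        coverage += gain
        gain *= decay
        history.append(coverage)
    return history


if __name__ == "__main__":
    history = iterate_coverage()
    print("first rounds:", history[:4])
    print("after 20 rounds:", history[-1])
```

With halving gains the series is geometric, so coverage approaches 95 + 1 / (1 - 0.5) = 97 and never reaches 99, however many rounds are run. A slower decay would change the limit, which is why the shape of the returns curve matters more than the number of iterations.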
Yeah, this could be a way that things are. My intuition is that it wouldn't be this way, but I don't have any good arguments for it.
I'm not sure why your default assumption is that the AGI's understanding of the world is at a "low level". My default assumption would be that it would develop a predictive world-model with entities that are at many different levels at once, sorta like humans do. (Or is that just a toy example to illustrate what you're talking about?)
I do expect that systems trained with limited information/compute will often learn multi-level models. That said, there's a few reasons why low-level is still the right translation target to think about.
First, there's the argument from the beginning of the OP: in the limit of abundant information & compute, there's no need for multi-level models; just directly modelling the low-level will have better predictive power. That's a fairly general argument, which applies even beyond AI, so it's useful to keep in mind.
But the main reason to treat low-level as the translation target is: assuming an AI does use high-level models, translating into those models directly will only be easier than translating into the low level to the extent that the AI's high-level models are similar to a human's high-level models. We don't have any reason to expect AI to use similar abstraction levels as humans except to the extent that those abstraction levels are determined by the low-level structure. In studying how to translate our own high-level models into low-level structure, we also learn when and to what extent an AI is likely to learn similar high-level structures, and what the correspondence looks like between ours and theirs.
"What if we verify the translation by having one group translate English-to-Korean, another group translate back, and reward both when the result matches the original?"
This is a fun idea. Does it work in practice for machine translation?
In the AI safety context, perhaps it would look like: A human gives an AI in a game world some instructions. The AI then goes and does stuff in the game world, and another AI looks at it and reports back to the human. The human then decides whether the report is sufficiently similar to the instructions that both AIs deserve reward.
I feel like eventually this would reach a bad equilibrium where the acting-AI just writes out the instructions somewhere and the reporting-AI just reports what they see written.
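In terms of the translation analogy, that failure mode can be sketched as follows. This is a toy illustration with a hypothetical three-word dictionary and made-up function names: the round-trip reward only checks that the text comes back unchanged, so a pair that copies the English straight through scores exactly as well as a pair that actually translates.

```python
from typing import Callable

Translator = Callable[[str], str]

# Tiny hypothetical dictionary, just enough to run the example.
EN_TO_KO = {"tree": "나무", "car": "자동차", "cat": "고양이"}
KO_TO_EN = {ko: en for en, ko in EN_TO_KO.items()}


def honest_forward(text: str) -> str:
    """Translate word by word from English to Korean."""
    return " ".join(EN_TO_KO[word] for word in text.split())


def honest_backward(text: str) -> str:
    """Translate word by word from Korean back to English."""
    return " ".join(KO_TO_EN[word] for word in text.split())


def copy_forward(text: str) -> str:
    """Degenerate 'translator': just writes the English down unchanged."""
    return text


def copy_backward(text: str) -> str:
    """Degenerate 'reporter': just reads back whatever was written."""
    return text


def round_trip_reward(forward: Translator, backward: Translator,
                      text: str) -> int:
    """Reward 1 when translating there and back reproduces the input."""
    return 1 if backward(forward(text)) == text else 0


if __name__ == "__main__":
    sentence = "cat tree car"
    print("honest pair reward:", round_trip_reward(honest_forward, honest_backward, sentence))
    print("copying pair reward:", round_trip_reward(copy_forward, copy_backward, sentence))
    print("copying pair's 'Korean':", copy_forward(sentence))
```

Both pairs earn the full reward, but only the first produced any Korean. Ruling out the copying pair needs some extra check on the intermediate output, which is itself another thing that has to be specified.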
This is a fun idea. Does it work in practice for machine translation?
I still find it mind-blowing, but unsupervised machine translation is a thing.
Holy shit, that's awesome. I wonder if it would work to figure out what dolphins, whales, etc. are saying.
I think you run into a problem that most animal communication is closer to a library of different sounds, each of which maps to a whole message, than it is something whose content is determined by internal structure, so you don't have the sort of corpus you need for unsupervised learning (while you do have the ability to do supervised learning).
Technology Changes Constraints argues that economic constraints are usually modular with respect to technology changes - so for reasoning about technology changes, it’s useful to cast them in terms of economic constraints. Two constraints we’ll talk about here: compute and information.
Thanks to ongoing technology changes, both of these constraints are becoming more and more slack over time - compute and information are both increasingly abundant and cheap.
Immediate question: what happens in the limit as the prices of both compute and information go to zero?
Essentially, we get omniscience: our software has access to a perfect, microscopically-detailed model of the real world. Computers have the memory and processing capability to run arbitrary queries on that model, and predictions are near-perfectly accurate (modulo quantum noise). This limit applies even without AGI - as compute and information become more abundant, our software approaches omniscience, even limiting ourselves to special-purpose reasoning algorithms.
Of course, AGI would presumably be closer to omniscience than non-AGI algorithms, at the same level of compute/information. It would be able to more accurately predict more things which aren’t directly observable via available sensors, and it would be able to run larger queries with the same amount of compute. (How much closer to omniscience an AGI would get is an open question, but it would at least not be any worse in a big-O sense.)
Next question: as compute and information constraints slacken, which constraints become taut? What new bottlenecks appear, for problems which were previously bottlenecked on compute/information?
To put it differently: if our software can run arbitrary queries on an accurate, arbitrarily precise low-level model of the physical world, what else do we need in order to get value out of that capability?
Well, mainly we need some way to specify what it is that we want. We need an interface.
Our highly accurate low-level world model can tell us anything about the physical world, but the things-we-want are generally more abstract than molecules/atoms/fields. Our software can have arbitrarily precise knowledge and predictive power on physical observables, but it still won’t have any notion that air-pressure-oscillations which sound like the word “cat” have something to do with the organs/cells/biomolecules which comprise a cat. It won’t have built-in any notion of “tree” or “rock” or “human” - using such high-level abstractions would only impede predictive power, when we could instead model the individual components of such high-level objects.
It’s the prototypical interface problem: the structure of a high-precision world-model generally does not match the structure of what-humans-want, or the structure of human abstractions in general. Someone/something has to translate between the structures in order to produce anything useful.
As I see it, this is the central problem of alignment.
Some Approaches
Default: Humans Translate
Without some scalable way to build high-level world models out of low-level world models, we constantly need to manually translate things-humans-want into low-level specifications. It’s an intellectual-labor-intensive and error-prone process; writing programs in assembly code is not just an analogy but an example. Even today’s “high-level programming languages” are much more structurally similar to assembly code than to human world-models - Python has no notion of “oak tree”.
An analogy: translating high-level structure into low-level specification the way we do today is like translating English into Korean by hand.
Humans Translate Using Better Tools
It’s plausible (though I find it unlikely) that we could tackle the problem by building better tools to help humans translate from high-level to low-level - something like much-higher-level programming languages. I find it unlikely because we’d probably need major theoretical breakthroughs - for instance, how do I formally define “tree” in terms of low-level observables? Even if we had ways to do that, they’d probably enable easier strategies than building better programming languages.
Analogy: it’s like translating by hand from English to Korean, but with the assistance of a dictionary, spell-checker, grammar-checker, etc. But if we had an English-Korean dictionary, we'd probably be most of the way to automated translation anyway (in this respect, the analogy is imperfect).
Examples + Interpolation
Another path which is plausible (though I find it unlikely) is something like programming-by-example - not unlike today’s ML. This seems unlikely to work from both an inside and outside view:
Examples alone aren’t enough to make software reliably carve reality at the same joints as humans. There probably are some architectures which would reliably carve at the same joints as humans - different humans tend to chunk the world into similar objects, after all. But figuring out such an architecture would take more than just throwing lots of data at the problem.
To put it differently: the way-in-which-we-want-things-translated is itself something which needs to be translated. A human’s idea-of-what-constitutes-a-“good”-low-level-specification-of-“oak tree” is itself pretty high-level and abstract; that idea itself needs to be translated into a low-level specification before it can be used. If we’re trying to use examples+interpolation, then the interpolation algorithm is our “specification” of how-to-translate… and it probably isn’t a very good translation of our actual high-level idea of how-to-translate.
Analogy: it’s like teaching English to Korean speakers by pointing to trees and saying “tree”, pointing to cars and saying “car”, etc… except that none of them actually realize they’re supposed to be learning another language. The Korean-language instructions they received were not actually a translation of the English explanation “learn the language that person is speaking”.
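A minimal sketch of why examples alone underdetermine the translation. Everything here is hypothetical: the two features, the labeled examples, and both rules are invented for illustration. Two different "interpolation" rules fit every labeled example perfectly, yet carve the world at different joints and disagree as soon as a new kind of object shows up.

```python
from typing import Callable, NamedTuple


class Thing(NamedTuple):
    height_m: float
    is_green: bool


# Labeled examples: "pointing at trees and cars" (made-up data).
EXAMPLES = [
    (Thing(12.0, True), "tree"),
    (Thing(8.0, True), "tree"),
    (Thing(1.5, False), "car"),
    (Thing(1.4, False), "car"),
]

Rule = Callable[[Thing], str]


def by_height(thing: Thing) -> str:
    """One way to generalize: tall things are trees."""
    return "tree" if thing.height_m > 5.0 else "car"


def by_color(thing: Thing) -> str:
    """Another way to generalize: green things are trees."""
    return "tree" if thing.is_green else "car"


def fits_examples(rule: Rule) -> bool:
    """True when the rule reproduces every label in EXAMPLES."""
    return all(rule(thing) == label for thing, label in EXAMPLES)


if __name__ == "__main__":
    print("by_height fits examples:", fits_examples(by_height))
    print("by_color fits examples:", fits_examples(by_color))
    telephone_pole = Thing(9.0, False)
    green_car = Thing(1.5, True)
    for name, thing in (("telephone pole", telephone_pole), ("green car", green_car)):
        print(f"{name}: by_height says {by_height(thing)}, by_color says {by_color(thing)}")
```

Neither rule is what a human means by "tree", and nothing in the examples says which one (if either) to prefer. That choice lives in the interpolation algorithm, which is the part that was never translated.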
Incentives
A small tweak to the previous approach: train a reinforcement learner.
The analogy: rather than giving our Korean-speakers some random Korean-language instructions, we don't give them any instructions - we just let them try things, and then pay them when they happen to translate things from English to Korean.
Problem: this requires some way to check that the translation was correct. Knowing what to incentivize is not any easier than specifying what-we-want to begin with. Rather than translating English-to-Korean, we’re translating English-to-incentives.
Now, there is a lot of room here for clever tricks. What if we verify the translation by having one group translate English-to-Korean, another group translate back, and reward both when the result matches the original? Or taking the Korean translation, giving it to some other Korean speaker, and seeing what they do? Etc. These are possible approaches to translating English into incentives, within the context of the analogy.
It’s possible in principle that translating what-humans-want into incentives is easier than translating into low-level specifications directly. However, if that’s the case, I have yet to see compelling evidence - attempts to specify incentives seem plagued by the same surprising corner cases and long tail of difficult translations as other strategies.
AI Translates
This brings us to the obvious general answer: have the AI handle the translation from high-level structure to low-level structure. This is probably what will happen eventually, but the previous examples should make it clear why it’s hard: an explanation of how-to-translate must itself be translated. In order to make an AI which translates high-level things-humans-want into low-level specifications, we first need a low-level specification of the high-level concept “translate high-level things-humans-want into low-level specifications”.
Continuing the earlier analogy: we’re trying to teach English to a Korean speaker, but that Korean speaker doesn’t have any idea that they’re supposed to be learning another language. In order to get them to learn English, we first need to somehow translate something like “please learn this language”.
This is a significant reduction of the problem: rather than translating everything by hand all the time, we just need to translate the one phrase “please learn this language”, and then the hard part is done and we can just use lots of examples for the rest.
But we do have a chicken-and-egg problem: somehow, we need to properly translate that first phrase. Screw up that first translation, and nothing else will work. That part cannot be outsourced; the AI cannot handle the translation because it has no idea that that’s what we want it to do.