It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?
I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?
Bah! :D It's sad to hear he's updated away from ambitious value learning towards corrigibility-like targets. Eliezer's second-hand argument sounds circular to me; suppose that corrigibility as we'd recognize it isn't a natural abstraction - then generic AIs wouldn't use it to align child agents (instead doing something like value learning, or something even more direct), and so there wouldn't be a bunch of human-independent examples, so it wouldn't show up as a natural abstraction to those AIs.
I think the pointer “the thing I would do if I wanted to make a second AI that would be the best one I could make at my given intelligence” is what is being updated in favor of, since this does feel like a natural abstraction, given how many agents would think this (it also seems very similar to the golden rule: “I will do what I would want a successor AI to do if the successor AI was actually the human’s successor AI”, or “treat others (the human) how I’d like to be treated (by a successor AI), (and abstracting one meta-level upwards)”). Whether this turns out to be value learning or something else 🤷. This seems a different question from whether or not it is indeed a natural abstraction.
Interesting. What is it that potentially makes "treat the human like I would like to be treated if I had their values" easier than "treat the human like they would like to be treated"?
John usually does not make his plans with an eye toward making things easier. His plan previously involved values because he thought they were strictly harder than corrigibility. If you solve values, you solve corrigibility. Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.
I don’t know all the details of John’s model here, but it may go something like this: If you solve corrigibility, and then find out corrigibility isn’t sufficient for alignment, you may expect your corrigible agent to help you build your value aligned agent.
Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.
In what way do you think solving abstraction would solve shard theory?
Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?
When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there are a lot of degrees of freedom to work with.
Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?
We'll need qualitatively different methods. But that's not new; interpretability researchers already come up with qualitatively new methods pretty regularly.
Any updates to your model of the socioeconomic path to aligned AI deployment? Namely:
I expect there to be no major updates, but seems worthwhile to keep an eye on this.
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.
I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning.
Primarily, "Do What I Mean" is about translation. Entity 1 compresses some problem specification defined over Entity 1's world-model into a short data structure — an order, a set of values, an objective function, etc. — then Entity 2 uses some algorithm to de-compress that data structure and translate it into a problem specification defined over Entity 2's world-model. The problem of alignment via Do What I Mean, then, is the problem of ensuring that Entity 2 (which we'll assume to be bigger) decompresses a specific type of compressed data structures using the same algorithm that was used to compress them in the first place — i. e., interprets orders the way they were intended/acts on our actual values and not the misspecified proxy/extrapolates our values from the crude objective function/etc.
This potentially has the nice property of collapsing the problem of alignment to the problem of ontology translation, and so unifying the problem of interpreting an NN and the problem of aligning an NN into the same problem.
In addition, it's probably a natural concept, in the sense that "how do I map this high-level description onto a lower-level model" seems like a problem any advanced agent would be running into all the time. There'll almost definitely be concepts and algorithms about that in the AI's world-model, and they may be easily repluggable.
Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?
Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)
My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better than the models of the last 2-3 years as those models are than pre-transformer models, then I'd expect it to be at least human-level. That does not mean that nets will get to human level immediately after that transformer-level shift comes along; e.g. with transformers it still took ~2-3 years before transformer models really started to look impressive.
So the most important update from deep learning over the past year has been the lack of any transformer-level paradigm shift in algorithms, architectures, etc.
There are of course other potential paths to human-level (or higher) which don't route through a transformer-level paradigm shift in deep learning. One obvious path is to just keep scaling; I expect we'll see a paradigm shift well before scaling alone achieves human-level AGI (and this seems even more likely post-Chinchilla). The main other path is that somebody wires together a bunch of GPT-style AGIs in such a way that they achieve greater intelligence by talking to each other (sort of like how humans took off via cultural accumulation); I don't think that's very likely to happen near-term, but I do think it's the main path by which 5-year timelines would happen without a paradigm shift. Call it maybe 5-10%. Finally, of course, there's always the "unknown unknowns" possibility.
Back around 2014 or 2015, I was visiting my alma mater, and a professor asked me what I thought about the deep learning wave. I said it looked pretty much like all the previous ML/AI hype cycles: everyone would be very excited for a while and make grand claims, but the algorithms would be super finicky and unreliable. Eventually the hype would die down, and we’d go into another AI winter. About ten years after the start of the wave someone would show that the method (in this case large CNNs) was equivalent to some Bayesian model, and then it would make sense when it did/didn’t work, and it would join the standard toolbox of workhorse ML algorithms. Eventually some new paradigm would come along, and the hype cycle would start again.
… and in hindsight, I think that was basically correct up until transformers came along around 2017. Pre-transformer nets were indeed very finicky, and were indeed shown equivalent to some Bayesian model about ten years after the excitement started, at which point we had a much better idea of what they did and did not do well. The big difference from previous ML/AI hype waves was that the next paradigm - transformers - came along before the previous wave had died out. We skipped an AI winter; the paradigm shift came in ~5 years rather than 10-15.
… and now it’s been about five years since transformers came along. Just naively extrapolating from the two most recent data points says it’s time for the next shift. And we haven’t seen that shift yet. (Yes, diffusion models came along, but those don’t seem likely to become a transformer-level paradigm shift; they don’t open up whole new classes of applications in the same way.)
So on the one hand, I'm definitely nervous that the next shift is imminent. On the other hand, it's already very slightly on the late side, and if another 1-2 years go by I'll update quite a bit toward that shift taking much longer.
Also, on an inside view, I expect the next shift to be quite a bit more difficult than the transformers shift. (I don't plan to discuss the reasons for that, because spelling out exactly which technical hurdles need to be cleared in order to get nets to human level is exactly the sort of thing which potentially accelerates the shift.) That inside view is a big part of why my timelines last year were 10-15 years, and not 5. The other main reasons my timelines were 10-15 years were regression to the mean (i.e. the transformers paradigm shift came along very unusually quickly, and it was only one data point), general hype-wariness, and an intuitive sense that unknown unknowns in this case will tend to push toward longer timelines rather than shorter on net.
Put all that together, and there's a big blob of probability mass on ~5 year timelines; call that 20-30% or so. But if we get through the next couple years without a transformer-level paradigm shift, and without a bunch of wired-together GPTs spontaneously taking off, then timelines get a fair bit longer, and that's where my median world is.
Still on the "figure out agency and train up an aligned AGI unilaterally" path?
"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.
One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is too dysfunctional as a group to implement the plan even if all the individuals wanted to, because that is how approximately 100% of organizations of more than 10 people operate.) In cognitive terms, the plan is pretending that lots of other people's actions are choosable/controllable, when in fact those other people's actions are not choosable/controllable, at least relative to the planner's actual capabilities.
The simplest and most robust counter to this failure mode is to always make unilateral plans.
But to counter the failure mode, plans don't need to be completely unilateral. They can involve other people doing things which those other people will actually predictably do. So, for instance, maybe I'll write a paper about natural abstractions in hopes of nerd-sniping some complex systems theorists to further develop the theory. That's fine; the actions which I need to counterfact over in order for that plan to work are actions which I can in fact take unilaterally (i.e. write a paper). Other than that, I'm just relying on other people acting in ways in which they'll predictably act anyway.
Point is: in order for a plan to be a "real plan" (as opposed to e.g. a fabricated option, or a de-facto applause light), all of the actions which the plan treats as "under the planner's control" must be actions which can be taken unilaterally. Any non-unilateral actions need to be things which we actually expect people to do by default, not things we wish they would do.
Coming back to the question: my plans certainly do not live in some children's fantasy world where one or more major AI labs magically become the least-dysfunctional multiple-hundred-person organizations on the planet, and then we all build an aligned AGI via the magic of Friendship and Cooperation. The realistic assumption is that large organizations are mostly carried wherever the memetic waves drift. Now, the memetic waves may drift in a good direction - if e.g. the field of alignment does indeed converge to a paradigm around decoding the internal language of nets and expressing our targets in that language, then there's a strong chance the major labs follow that tide, and do a lot of useful work. And I do unilaterally have nonzero ability to steer that memetic drift - for instance, by creating public knowledge of various useful lines of alignment research converging, or by training lots of competent people.
That's the sort of non-unilaterality which I'm fine having in my plans: relying on other people to behave in realistic ways, conditional on me doing things which I can actually unilaterally do.
Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?
Basically no.
I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning. ...
I basically buy your argument, though there's still the question of how safe a target DWIM is.
Someone working full-time on an approach to the alignment problem that they feel optimistic about, and writing annual reflections on their work, is something that has been sorely lacking. +4
Nice update!
On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.
While I don't think of these as alignment targets per se (as I understand the term to be used), I strongly support discussing the internal language of the neural net and moving away from convoluted inner/outer schemes.
I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.
In the interest of making my abstract intuition here more precise, a few weird questions:
Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.
What does your picture of (realistically) ideal outcomes from theory work look like? Is it more giving interpretability researchers a better frame to reason under (like a more mathematical notion of optimization that we have to figure out how to detect in large nets against adversaries) or something even more ambitious that designs theoretical interpretability processes that Just Work, leaving technical legwork (what ELK seems like to me)?
While they definitely share core ideas of ontology mismatch, it feels like the approaches are pretty different in that you prioritize mathematical definitions a lot and ARC is heuristic. Do you think the mathematical stuff is necessary for sufficient deconfusion, or just a pretty tractable way to arrive at the answers we want?
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
I'm not really convinced that even if corrigibility is A Thing (I agree that it's plausible it is, but I think it could also just be trivially part of another Thing given more clarity), it's as good as other medium-term targets. Corrigibility as stated doesn't feel like it covers a large chunk of the likely threat models, and a broader definition seems like it's just rephrasing a bunch of the stuff from Do What I Mean or inner alignment. What am I missing about why it might be as good a target?
So, how’s The Plan going?
Pretty well!
In last year’s writeup of The Plan, I gave “better than a 50/50 chance” that it would work before AGI kills us all (and my median AI timelines were around 10-15 years). That was an outside view, accounting for planning fallacy and the inevitable negative surprises. My inside view was faster - just based on extrapolating my gut feel of the rate of progress, I privately estimated that The Plan would take around 8 years. (Of those 8, I expected about 3 would be needed to nail down the core conceptual pieces of agent foundations, and the other 5 would be to cross the theory-practice gap. Of course those would be intermingled, though with the theory part probably somewhat more front-loaded.)
Over the past year, my current gut feel is that progress has been basically in line with the inside-view 8 year estimate (now down to 7, since a year has passed), and maybe even a little bit faster than that.
So, relative to my outside-view expectation that things always go worse than my gut expects, things are actually going somewhat better than expected! I’m overall somewhat more optimistic now, although the delta is pretty small. It’s only been a year, still lots of time for negative surprises to appear.
Any high-level changes to The Plan?
There have been two main high-level changes over the past year.
First: The Plan predicted that, sometime over the next 5 (now 4) years, the field of alignment would “go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset”. Over the past year, I tentatively think the general shape of that paradigm has become visible, as researchers converge from different directions towards a common set of subproblems.
Second: I’ve updated away from thinking about ambitious value learning as the primary alignment target. Ambitious value learning remains the main long-term target, but I’ve been convinced that e.g. corrigibility is worth paying attention to as a target for early superhuman AGI. Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.
Convergence towards a paradigm sounds exciting! So what does it look like?
Exciting indeed! Gradual convergence toward a technical alignment paradigm has probably been the most important update from the past year.
On the theoretical side, Paul Christiano, Scott Garrabrant, and myself had all basically converged to working on roughly the same problem (abstraction, ontology identification, whatever you want to call it) by early 2022. That kind of convergence is a standard hallmark of a proto-paradigm.
Meanwhile, within the past year-and-a-half or so, interpretability work has really taken off; Chris Olah’s lab is no longer head-and-shoulders stronger than everyone else. And it looks to me like the interpretability crowd is also quickly converging on the same core problem of abstraction/ontology-identification/whatever-you-want-to-call-it, but from the empirical side rather than the theoretical side.
That convergence isn’t complete yet - I think a lot of the interpretability crowd hasn’t yet fully internalized the framing of “interpretability is primarily about mapping net-internal structures to corresponding high-level interpretable structures in the environment”. In particular I think a lot of interpretability researchers have not yet internalized that mathematically understanding what kinds of high-level interpretable structures appear in the environment is a core part of the problem of interpretability. You have to interpret the stuff-in-the-net as something, and it’s approximately-useless if the thing-you-interpret-stuff-in-the-net-as is e.g. a natural-language string without any legible mathematical structure attached, or an ad-hoc mathematical structure which doesn’t particularly cut reality at the joints. But interpretability researchers have a very strong feedback loop in place, so I expect they’ll iterate toward absorbing that frame relatively quickly. (Though of course there will inevitably be debate about the frame along the way; I wouldn’t be surprised if it’s a hot topic over the next 1-2 years. And also in the comment section of this post.)
Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.
As that shift occurs, I expect we’ll also see more discussion of end-to-end alignment strategies based on directly reading and writing the internal language of neural nets. (Retargeting The Search is one example, though it makes some relatively strong assumptions which could probably be relaxed quite a bit.) Since such strategies very directly handle/sidestep the issues of inner alignment, and mostly do not rely on a reward signal as the main mechanism to incentivize intended behavior/internal structure, I expect we’ll see a shift of focus away from convoluted training schemes in alignment proposals. On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.
Assuming this paradigm formation extrapolation is roughly correct, it’s great news! This sort of paradigm formation is exactly why The Plan was so optimistic about being able to solve alignment in the next 10-15 (well, now 9-14) years. And, if anything, it currently looks like the paradigm is coming together somewhat faster than expected.
Why the update about corrigibility?
Let’s start with why I mostly ignored corrigibility before. Mainly, I wasn’t convinced that “corrigibility” was even a coherent concept. Lists of desiderata for corrigibility sounded more like a grab-bag of tricks than like a set of criteria all coherently pointing at the same underlying concept. And MIRI’s attempts to formalize corrigibility had found that it was incompatible with expected utility maximization. That sounds to me like corrigibility not really being “a thing”.
Conversely, I expect that some of the major benefits which people want from corrigibility would naturally come from value learning. Insofar as humans want their AGI to empower humans to solve their own problems, or try to help humans do what the humans think is best even if it seems foolish to the AGI, or… , a value-aligned AI will do those things. In other words: value learning will produce some amount of corrigibility, because humans want their AGI to be corrigible. Therefore presumably there's a basin of attraction in which we get values "right enough" along the corrigibility-relevant axes.
The most interesting update for me was when Eliezer reframed the values-include-some-corrigibility argument from the opposite direction (in an in-person discussion): insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values. In other words, the key mathematical challenges of corrigibility are themselves robust subproblems of alignment, which need to be solved even for value learning. (Note: this is my takeaway from that discussion, not necessarily the point Eliezer intended.)
That argument convinced me to think some more about MIRI’s old corrigibility results. And... they're not very impressive? Like, people tried a few hacks, and the hacks didn't work. Fully Updated Deference is the only real barrier they found, and I don't think it's that much of a barrier - it mostly just shows that something is wrong with the assumed type-signature of the child agent, which isn't exactly shocking.
(Side note: fully updated deference doesn't seem like that much of a barrier in the grand scheme of things, but it is still a barrier which will probably block whatever your first idea is for achieving corrigibility. There are probably ways around it, but you need to actually find and use those ways around.)
While digging around old writing on the topic, I also found an argument from Eliezer that “corrigibility” is a natural concept:
Now that sounds like the sort of thing which is potentially useful! Shame that previous attempts to formulate corrigibility started with kinda-ad-hoc desiderata, rather than from an AI building a sub-AI while being prone to various sorts of errors. (Pro tip for theory work: when you’re formalizing a concept, and you have some intuitive argument for why it’s maybe a natural concept, start from that argument!)
So my overall takeaway here is:
So are you targeting corrigibility now?
No. I’ve been convinced that corrigibility is maybe A Thing; my previous reasons for mostly-ignoring it were wrong. I have not been convinced that it is A Thing; it could still turn out not to be.
But the generalizable takeaway is that there are potentially-useful alignment targets which might turn out to be natural concepts (of which corrigibility is one). Which of those targets actually turn out to be natural concepts is partially a mathematical question (i.e. if we can robustly formulate it mathematically then it’s definitely natural), and partially empirical (i.e. if it ends up being a natural concept in an AI’s internal ontology then that works too).
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.
How has broadening alignment target changed your day-to-day research?
It hasn’t. The reason is explained in Plans Are Predictions, Not Optimization Targets. Briefly: the main thing I’m working on is becoming generally less confused about how agents work. While doing that, I mostly aim for robust bottlenecks - understanding abstraction, for instance, is robustly a bottleneck for many different approaches (which is why researchers converge on it from many different directions). Because it’s robust, it’s still likely to be a bottleneck even when the target shifts, and indeed that is what happened.
What high-level progress have you personally made in the past year? Any mistakes made or things to change going forward?
In my own work, theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory becoming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
As of The Plan, by six months ago I was hoping to have efficient algorithms for computing natural abstractions in simulated environments, and that basically didn’t happen. I did do a couple interesting experiments (which haven’t been written up):
… but mostly I ended up allocating time to other things. The outputs of those experiments were what I need for now; I’m back to being bottlenecked on theory. (Which is normal - running a computational experiment and exploring the results in detail takes a few days or maybe a couple weeks at most, which is far faster than an iteration cycle on theory development, so of course I spend most of my time bottlenecked on theory.)
On the theory side, progress has zoomed along surprisingly quickly despite spending less time on it than I expected as of late last year. The Basic Foundations sequence is the main publicly-visible artifact of that progress so far; behind the scenes I’ve also continued to streamline the math of natural abstraction, and lately I’ve been working to better unify it with thermodynamic-style arguments and phase changes. (In particular, my current working hypothesis is that grokking is literally a phase change in the thermodynamic sense, induced by coupling to the environment via SGD. On that hypothesis, understanding how such coupling-induced phase changes work is the main next step to mapping net-internal structures to natural abstractions in the environment. But that’s the sort of hypothesis which could easily go out the window in another few weeks.) The main high-level update from the theory work is that, while getting abstraction across the theory-practice gap continues to be difficult, basically everything else about agent foundations is indeed way easier once we have a decent working operationalization of abstraction.
So I’ve spent less time than previously expected on both theory and crossing the theory-practice gap. Where did all that time go?
First, conferences and workshops. I said “yes” to basically everything in the first half of 2022, and in hindsight that was a mistake. Now I’m saying “no” to most conferences/workshops by default.
Second, training people (mostly in the MATS program), and writing up what I’d consider relatively basic intro-level arguments about alignment strategies which didn’t have good canonical sources. In the coming year, I’m hoping to hand off most of the training work; at this point I think we have a scalable technical alignment research training program which at least picks the low-hanging fruit (relative to my current ability to train people). In particular, I continue to be optimistic that (my version of) the MATS program shaves at least 3 years off the time it takes participants to get past the same first few bad ideas which everyone has and on to doing potentially-useful work.
What’s the current status of your work on natural abstractions?
In need of a writeup. I did finally work out a satisfying proof of the maxent form for natural abstractions on Bayes nets, and it seems like every week or two I have an interesting new idea for a way to use it. Writing up the proofs as a paper is currently on my todo list; I’m hoping to nerd-snipe some researchers from the complex systems crowd.
Getting it across the theory-practice gap remains the next major high-level step. The immediate next step is to work out and implement the algorithms implied by the maxent form.