Strong +1s to many of the points here. Some things I'd highlight:
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I'd have actively agreed with the claims.
Things that don't meet that bar:
General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don't necessarily have the same views, primarily because I've engaged much less with Eliezer and so don't have confident Eliezer-models. (They all seem plausible to me, except where I've specifically noted disagreements below.)
Agreement 14: Not sure exactly what this is saying. If it's "the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value" I agree.
Agreement 16: I agree with the general point but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of "just being nice", and even with the details I expect I'd feel pretty uncertain which was more likely.
Agreement 17: It seems totally fine to focus your attention on a specific subset of "easy-alignment" worlds and to ensure that those worlds survive, which could be described as "assuming there's a hope". That being said, there's something in this vicinity I agree with: in trying to solve alignment, people sometimes make totally implausible assumptions about the world; this is a worse strategy for reducing x-risk than working on the worlds you actually expect and giving them another ingredient that, in combination with a "positive model violation", could save those worlds.
Disagreement 10: I don't have a confident take on the primate analogy; I haven't spent enough time looking into it for that.
Disagreement 15: I read Eliezer as saying something different in point 11 of the list of lethalities than Paul attributes to him here; something more like "if you trained on weak tasks either (1) your AI system will be too weak to build nanotech or (2) it learned the general core of intelligence and will kill you once you get it to try building nanotech". I'm not confident in my reading though.
Disagreement 18: I find myself pretty uncertain about what to expect in the "breed corrigible humans" thought experiment.
Disagreement 22: I was mostly in agreement with this, but "obsoleting human contributions to alignment" is a pretty high bar if you take it literally, and I don't feel confident that happens before superintelligent understanding of the world (though it does seem plausible).
On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.
I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.
Points on both lists:
I won't try to list all of the things that Paul mentioned which weren't on my list, but some of the most useful (for me) were:
Finally, a few points which were on my list and not Paul's, and which I feel like writing out:
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here's one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever found to exist. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their "love" value to configurations of atoms? If it's really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:
19... More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no.19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment.) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.
I haven't made a public post out of my document on shard theory yet, because idea inoculation. Apparently, the document isn't yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don't know how to credibly communicate that the theory is at the level of actually really important to evaluate & critique ASAP, because time is slipping away. But I'll keep trying anyways.
I'm attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader's patience for improved future communication attempts.
Like
1. "Human beings tend to bind their terminal values to their model of reality", or
2. "Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don't stop caring about their family because they can model the world in terms of complex amplitudes."
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models, not just over the senses.
But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
I addressed this distinction previously, in one of the links in the OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external at all, even granting that any external target would do. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
The important disanalogy
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.
for humans there is no principal - our values can be whatever
Huh? I think I misunderstand you. I perceive you as saying: "There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values."
If so, I strongly disagree. Like, in the world where that is true, wouldn't parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not "whatever"; human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food.
Or if you take evolution as the principal, the alignment problem wasn't solved.
The linked theory makes it obvious why evolution couldn't have possibly solved the human alignment problem. To quote:
Since human values are generally defined over the learned human WM, evolution could not create homo inclusive-genetic-fitness-maximus.
If values form because reward sends reinforcement flowing back through a person's cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-reduction process and thus are easy to form values around.
But if there aren't local mutations which make a person more likely to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can't instill this value, even if the descendants of that person will later be able to think thoughts about fitness.
On the other hand, under this theory, human values (by their nature) usually involve concepts which are easy to form shards of value around... Shard theory provides a story for why we might succeed at shard-alignment, even though evolution failed.
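To make the quoted mechanism concrete, here is a toy sketch in code. This is my own illustration of the reward-reinforces-active-thoughts story, not anything from the linked shard theory document; the concept names and learning rate are made up:

```python
from collections import defaultdict

# Strength of the value shard associated with each learned concept.
shard_strength = defaultdict(float)

def update_shards(active_concepts, reward, learning_rate=0.1):
    """Credit assignment: reinforce shards around whichever concepts were
    active in the cognition that (putatively) led to the reward."""
    for concept in active_concepts:
        shard_strength[concept] += learning_rate * reward

# Convergently early-learned abstractions (lollipops, caregivers) are active
# when reward arrives, so value shards form around them...
update_shards(["lollipop", "caregiver"], reward=1.0)

# ...whereas a concept the person never thinks about before reward arrives,
# like "inclusive genetic fitness", never accumulates any shard strength.
print(shard_strength["lollipop"])                   # 0.1
print(shard_strength["inclusive genetic fitness"])  # 0.0
```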
(Edited to expand my thoughts)
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":
(Clearly each one is strictly harder than the previous.) I recognise that Lethality 19 concerns feat 3, though it is worded as if it were about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.
Hm. I feel confused about the importance of 3b as opposed to 3a. Here's my first guess: Because we need to target the AI's motivation in particular ways in order to align it with particular desired goals, it's important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards "dog" or "rock" or "cheese wheels" or "cooperating with humans."
Is this close?
RE discussion of gradual-ness, continuity, early practice, etc.:
FWIW, here’s how I currently envision AGI developing, which seems to be in a similar ballpark as Eliezer’s picture, or at least closer to it than most people’s, I think? (Mostly presented without argument.)
There’s a possible R&D path that leads to a model-based RL AGI. It would be very agent-y, and have some resemblance to human brain algorithms (I claim), and be able to “figure things out” and “mull things over” and have ideas and execute on them, and understand the world and itself, etc., akin to how humans do all those things.
Large language models (LLMs) trained mainly by self-supervised learning (SSL), as built today, are not that path (although they might include some ingredients which would overlap with that path). In my view, those SSL systems are almost definitely safer, and almost definitely much less capable, than the agent-y model-based RL path. For example, I don’t think that the current SSL-LLM path is pointing towards “The last invention that man need ever make”. I won’t defend that claim here.
But meanwhile, like it or not, lots of other people are as we speak racing down the road towards the more brain-like, more agent-y, model-based RL AGI. We should presume that they’ll eventually succeed. We could try to stop them, but doing so seems impossible right now. In the future, the SSL-LLM path will be producing more impressive AI models than today, but I don’t expect that fact to change the world very much in relevant ways, such that we’ll still be in roughly the same situation of not having any way (AFAICT) to stop researchers from inventing agent-y model-based RL AGI. So our only choice is to figure out how to navigate the world in which people eventually build agent-y model-based RL AGI.
(Note—Since “SSL-LLM safety” doesn’t coincide with “agent-y model-based RL safety”, a natural consequence is Eliezer [uncharitably] describing some people’s work as ‘not engaging with the core difficulties’ or whatever.)
Anyway, model-based RL algorithms can already do neat things like play computer games, but they can’t yet conquer humanity, and I think part of that is related to model-based RL algorithmic innovations that have yet to happen. So during some period of time, those future algorithmic innovations will happen, and meanwhile people will be scaling up and optimizing and hardware-accelerating the algorithms and architectures.
(Based on human brain compute requirements, I think training from scratch to at least human-level intelligence will probably eventually be possible with relatively modest amounts of chips and money, see here.)
A key question is: how long is this period between “This specific model-based RL technological path is receiving a very large share of overall attention and investment by the ML research community” [reworded for clarification, see later comment] and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?
Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible. Whereas even “10 years” seems implausibly high to me, I think.
So I find all the arguments in this post related to slow-takeoff, gradual-ness, continuity, etc. to be not so reassuring.
A key question is: how long is this period between “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about” and “This specific model-based RL technological path can produce an out-of-control AGI that could destroy the world”?
Hard to say, but “a couple years” seems entirely plausible to me, and even “zero years (because, until the leading team worked out the kinks, their results weren’t great compared to other very different approaches, and few people were paying attention)” seems plausible. Whereas even “10 years” seems implausibly high to me, I think.
I don't think Paul would disagree with you about "a couple years" being plausible, based on Agreements #8 from his post (bold mine):
8. The broader intellectual world seems to wildly overestimate how long it will take AI systems to go from “large impact on the world” to “unrecognizably transformed world.” This is more likely to be years than decades, and there’s a real chance that it’s months. This makes alignment harder and doesn’t seem like something we are collectively prepared for.
At first I read Paul's post as having very gradualist assumptions all around. But he clarified to me in this comment and the back-and-forth we had in replies that he expects a relatively long initial period before AI has a large impact on the world (similar to your “This specific model-based RL technological path is producing the AIs that everyone is using and everyone is talking about”), which he pegs at ~40% by 2040. After that point, he predicts a pretty speedy progression to "unrecognizably transformed world", which I think includes the possibility of catastrophe.
This is a thread for anyone who wants to give a high-level take or reaction that isn't contributing much to the discussion (and thus isn't worth a top-level comment).
I broadly agree with this much more than with Eliezer's post, and think it did a good job of articulating a bunch of my fuzzy sense that "this seems off". Most notably: Eliezer underrating the importance and tractability of interpretability, and overrating the discontinuity of AI progress.
Do you think that some of my disagreements should change if I had shorter timelines?
(As mentioned last time we talked, but readers might not have seen: I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040.)
I think most of your disagreements on this list would not change.
However, I think if you conditioned on 50% chance of singularity by 2030 instead of 15%, you'd update towards faster takeoff, less government/societal competence (and thus things more likely to fail at an earlier, less dignified point), more unipolar/local takeoff, lower effectiveness of coordination/policy/politics-style strategies, less interpretability and other useful alignment progress, less chance of really useful warning shots... and of course, significantly higher p(doom).
To put it another way, when I imagine what (I think) your median future looks like, it's got humans still in control in 2035, sitting on top of giant bureaucracies of really cheap, really smart proto-AGIs that fortunately aren't good enough at certain key skills (like learning-to-learn, or concept formation, or long-horizon goal-directedness) to be an existential threat yet, but are definitely really impressive in a bunch of ways, are reshaping the world economy and political landscape, and are causing various minor disasters here and there that serve as warning shots. So the whole human world is super interested in AI stuff; policymakers are all caught up on the arguments for AI risk; risks are generally taken seriously instead of dismissed as sci-fi; there are probably international treaties and such. Meanwhile the field of technical alignment has had 13 more years to blossom, lots of progress has probably been made on interpretability and ELK and whatnot, and there are 10x more genius researchers in the field with 5+ years of experience already... and even in this world, singularity is still 5+ years away, and probably there are lots of expert forecasters looking at awesome datasets of trends on well-designed benchmarks, predicting with some confidence when it will happen and what it'll look like.
This world seems pretty good to me: it's one where there is definitely still lots of danger, but I feel like there's a >50% chance things will be OK. Alas, it's not the world I expect, because I think things will probably happen sooner and go more quickly than that, with less time for the world to adapt and prepare.
For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list.
I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties. I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously. I realize I haven't done enough of this myself, but if you've already written up the component pieces, that can make it easier to collect the bullet list.
I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."
If you are currently looking for the list of difficulties: see the long footnote.
If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state at the very top. Then there are a series of sections organized around possible solutions and the problems with those solutions, which highlight many of the general difficulties. I don't intuitively feel like a bulleted list of difficulties would have been a better way to describe the difficulties.
Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments.
A few comments on the disagreements:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”
An in-between position would be to argue that even if we're maximally competent at the institutional problem, and can extract all the information we possibly can through experimentation before the first critical try, that just prevents the really embarrassing failures. Irrecoverable failures that we just could not have been prepared for could still pop up every once in a while after entering the critical regime, unless we have a full True Name of alignment. I think the crux here depends on your view on the Murphy-constant of the world (i.e. how likely we are to get unknown-unknown failures), and how long you think we need to spend in the critical regime before our automated alignment research assistants solve alignment.
By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advanced the state of human R&D.
For what it's worth, I think the level of tech needed to overpower humans in more boring ways is a substantial part of my "doom cinematic universe" (and I usually assume "nanobots" is meant metaphorically). In particular, I think it's plausible that the "slightly-less-impressive-looking" systems which come before the first x-risk AI won't look obviously one-step-before-x-risk any more than current scary capability advances do, because of uncertainty over the exact angle of attack (related to the Murphy crux above) plus the discontinuous jumps in specific capabilities we currently see in ML.
if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.
SGD is definitely far from perfect optimization, and it seems plausible that if concealment against SGD is a thing at all, then it would be due to some kind of instrumental thing that a very large fraction of powerful AI systems converge on.
Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research
I think there's a lot of different cruxes hiding inside the question of how AI acceleration of alignment research interacts with P(doom), including how hard alignment is, and whether AGI labs will pivot to focus on alignment (some earlier thoughts here), even assuming we can align the weak systems used for this. Overall I feel very uncertain about this.
Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects
Explicitly registering agreement with this prediction.
Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface.
Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.
Yup. You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.
The question is when you get a misaligned mesa-optimizer relative to when you get superhuman behavior.
I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.
I don't think you've made much argument about when the transition occurs. Existing language models strongly appear to be "imitation upstream of optimization." For example, it is much easier to get optimization out of them by having them imitate human optimization, than by setting up a situation where solving a hard problem is necessary to predict human behavior.
I don't know when you expect this situation to change; if you want to make some predictions then you could use empirical data to help support your view. By default I would interpret each stronger system with "imitation upstream of optimization" to be weak evidence that the transition will be later than you would have thought. I'm not treating those as failed predictions by you or anything, but it's the kind of evidence that adjusts my view on this question.
(I also think the chimp->human distinction being so closely coupled to language is further weak evidence for this view. But honestly the bigger thing I'm saying here is that 50% seems like a more reasonable place to be a priori, so I feel like the ball's in your court to give an argument. I know you hate that move, sorry.)
Epistemic status: some of these ideas only crystallized today, normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think this is true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability, because the ANNs currently in use are not big enough to contain a brain, but we're gradually getting closer[1].
More precisely, as ANNs grow in size we're approaching a regime I dubbed "pseudorealizability": on the one hand, the brain is in the prior[2]; on the other hand, its description complexity is pretty high and therefore its prior probability is pretty low. Moreover, a more sophisticated agent (e.g. infra-Bayesian RL / Turing RL / infra-Bayesian physicalist) would be able to use the rest of the world as useful evidence to predict some features of the human brain (i.e. even though human brains are complex, they are not random; there are reasons they came to be the way they are, if you understand the broader context, e.g. evolutionary history). But the latter kind of inference does not take the form of having a (non-mesa-optimizing) complete cartesian parsimonious model of the world in which brains are a particular piece, because (i) such a model would be too computationally expensive (non-realizability) and (ii) bridge rules add a lot of complexity.
Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI. Why agents rather than some kind of non-agentic epistemic engines? Because IB and IBP suggest that this level of epistemic sophistication requires some entanglement between epistemic rationality and instrumental rationality: in these frameworks, it is not possible to decouple the two entirely.
From the perspective of the outer loop, we can describe the situation as: "I woke up, expecting to see a world that is (i) simple and (ii) computationally cheap. At first glance, the world seemed like, not that. But, everything became clear when I realized that the world is generated by a relatively-simple-and-cheap 'deity' who made the world like this on purpose because it's beneficial for it from its own strange epistemic vantage point."
Coming back to the question of when to expect the transition from imitation-upstream-of-optimization to imitation-downstream-of-optimization: by the above line of argument, we should expect this transition to happen before the AI succeeds at any task which requires reasoning at least as sophisticated as the kind of reasoning that allows inferring properties of human brains from understanding the broader context of the world. Unfortunately, I cannot at the moment cash this out into a concrete milestone, but (i) it seems very believable that current language models are not there and (ii) maybe if we think about it more, we can come up with such a milestone.
Cotra's report is a relevant point of reference, even though "having as many parameters as the brain according to some way to count brain-parameters" is ofc not the same as "capable of representing something which approximates the brain up to an error term that behaves like random noise". ↩︎
Assuming the training protocol is sufficiently good at decoupling the brain from the surrounding (more complex) world and pointing the AI at only trying to imitate the brain. ↩︎
Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more "epistemically sophisticated" than the outer loop of the AI.
In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.
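To spell out the contrast (these are the standard definitions of the two priors, not anything specific to the parent comments): for a hypothesis/program p with description length ℓ(p) and runtime t(p),

```latex
P_{\text{simplicity}}(p) \;\propto\; 2^{-\ell(p)},
\qquad
P_{\text{speed}}(p) \;\propto\; 2^{-\left(\ell(p) + \log t(p)\right)}.
```

An inner search procedure like gradient descent has small ℓ(p) but very large t(p), so the simplicity prior barely penalizes it while the speed prior penalizes it heavily; that's the sense in which the speed prior seems like the more relevant yardstick here.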
Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? This kind of solution is worth looking into, but I'm not too optimistic about it atm. First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can shape the prior while still having feasible learning, and I suspect that such an extreme speed penalty would not cut it. Third, depending on the setup, an extreme speed penalty might harm generalization[1]. But we definitely need to understand it more rigorously.
The most appealing version is Christiano's "minimal circuits", but that only works for inputs of fixed size. It's not so clear what the variable-input-size ("transformer") version of that would be. ↩︎
No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming:
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers)
But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.
The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here (w/o any claim to originality).
This seems like a good thing to keep in mind, but also sounds too pessimistic about the ability of gradient descent to find inference algorithms that update more efficiently than gradient descent.
I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more effective than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimisation, but few compute-cheap ways to convert additional complexity to optimisation.
"Faster than gradient descent" is not a selective pressure, at least if we're considering typical ML training procedures. What is a selective pressure is regularization, which functions much more like a complexity prior than a speed prior.
So (again sticking to modern day ML as an example, if you have something else in mind that would be interesting) of course there will be a cutoff in terms of speed, excluding all algorithms that don't fit into the neural net. But among algorithms that fit into the NN, the penalty on their speed will be entirely explainable as a consequence of regularization that e.g. favors circuits that depend on fewer parameters, and would therefore be faster after some optimization steps.
If examples of successful parameters were sparse and tended to just barely fit into the NN, then this speed cutoff will be very important. But in the present day we see that good parameters tend to be pretty thick on the ground, and you can fairly smoothly move around in parameter space to make different tradeoffs.
I broadly agree with this post on most points of disagreement with Eliezer, and also agree with many of the points of agreement.
A few points where I sort of disagree with both, although this is sometimes unclear:
1.
Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.
I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.
For people who doubt this, I'd point to variance in initial governmental-level response to COVID19, which ranged from "highly incompetent" (eg. early US) to "quite competent" (eg Taiwan). (I also have some intuitions around this based on non-trivial amounts of first-hand experience with how governments actually work internally and make decisions - which you certainly don't need to trust, but if you are highly confident in the inability of governments to act, or to do reasonable things, you should at least be somewhat less confident.)
2.
AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.
While I do agree there likely won't be strong technological hurdles, I think "right around the human level" is the point where it seems most likely that some regulatory hurdles can be erected, or the human coordination landscape can change, or resources spent on alignment research could grow extremely fast, or, generally, weird things can happen. While I generally agree weird bad things can happen, I also think weird good things can happen, and this likewise seems like a potential period of increased leverage.
3.
There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.
I do agree that the pressures exist, and it would be bad if they caused many people working on the pessimistic-assumptions side to switch to working on e.g. corporate governance; on the other hand, I don't agree it's just a distraction. Given the previous two points, I think the overall state of power / coordination / conflict can have significant trajectory-shaping influence.
Also, this dynamic will likely bring many more people to work on alignment-adjacent topics, and I think there is some chance to steer part of this attention toward productive work on important problems; I think this is more likely if at least some alignment researchers bother to engage with this influx of attention (as opposed to ignoring it as a random distraction).
This response / increase in attention seems, in some sense, like the normal way humanity solves problems, and it may be easier to steer it than to e.g. find and convince random people to care about technical alignment problems.
For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).
Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that the typical governmental response to AI would be more akin to Taiwan's than to the US's.
Australia seems to have suffered a lot more from the pandemic than the U.S., paying much more in the cost of lockdown than even a relatively conservative worst-case estimate would have been for the costs of an uncontrolled COVID pandemic. I don't know about the others, but given that you put Australia on this list, I don't currently trust the others to have acted sensibly.
I'm not sure if you actually read carefully what you are commenting on. I emphasized early response, or initial governmental-level response, in both comments in this thread.
Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.
Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf. GDP costs; Italy had the closest thing to an uncontrolled pandemic). (Btw, the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people to "live as normal" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way.)
If Australia was pursuing a strategy of "lock down irrespective of cost", then I don't think it makes sense to describe the initial response as competent. It just happened to be right in this case, but in order for the overall response to be helpful, it has to be adaptive to the actual costs. I agree that the early response on its own would have indicated a potentially competent decision-making algorithm, but the later followup showed that the algorithm seems to have mostly been correct by accident, not on purpose.
I do appreciate the link to the GDP cost article. I would have to look into the methodology more to comment on that, but it certainly seems like an interesting analysis and a suggestive result.
I doubt that’s the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.
What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.
My main claim isn’t about what a median response would be, but something like “the difference between the median early covid governmental response and an actually good early covid response was something between 1 and 2 sigma; this suggests a bad response isn’t over-determined, and sensible responses are within human reach”.
This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the former is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put down here anyway so as to not waste it.
What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.
What seemed to make a difference
In my view, 1. and 4. could go better than in covid, 2. is irrelevant, and 3. and 5. seem like broad parameters which can develop in different directions. Imagine you somehow become the main advisor to the US president when the situation becomes really weird, and she follows your advice closely - my rough impression is that in most situations you would be able to move the response to be moderately sane.
- it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures
Empirically, this often wasn't true. Humans had mildly confused ideas about the micro-level, but often highly confused ideas about the exponential macro-dynamics. (We created a whole educational game on that, and have some feedback that for some policymakers it was the thing that helped them understand... after a year in the pandemic)
- previous human experiences with pandemics, including very similar ones like SARS
- there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
- COVID isn't agenty and can't fight back intelligently
- potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)
One factor which may make governments more responsive to AI risk is that covid wasn't exactly threatening to states. Covid was pretty bad for individual people, and some businesses, but in some cases the relative power of states even grew during covid. In contrast, in some scenarios it may be clear that AI is an existential risk for states as well.
It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).
Here are some extremely rambling thoughts on point 3.
I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to mostly point to differences in "which problems related to AI are we trying to solve?" We could think about technical or institutional or economic approaches/aspects of any problem.
With respect to "which problem are we trying to solve?": I also think potential undesirable effects of AI on the balance of power are real and important, both because it affects our long term future and because it will affect humanity's ability to cope with problems during the transition to AI. I think that problem is at least somewhat less important than alignment, but will probably get much more attention by default. I think this is especially true from a technical perspective, because technical work plays a totally central work for alignment, and a much more unpredictable and incidental role for affecting the balance of power.
I'm not sure how alignment researchers should engage with this kind of alignment-adjacent topic. My naive guess would be that I (and probably other alignment researchers) should:
I am somewhat concerned that general blurring of the lines between alignment and other concerns will tend to favor topics with more natural social gravity. That's not enough to make me think it's clearly net negative to engage, but is at least enough to make me feel ambivalent. I think it's very plausible that semi-approvingly citing Eliezer's term "the last derail" was unwise, but I don't know. In my defense, the difficulty of talking about alignment per se, and the amount of social pressure to instead switch to talking about something else, is a pretty central fact about my experience of working on alignment, and leaves me protective of spaces and norms that let people just focus on alignment.
(On the other hand: (i) I would not be surprised if people on the other side of the fence feel the same way, (ii) there are clearly spaces---like LW---where the dynamic is reversed, though they have their own problems, (iii) the situation is much better than a few years ago and I'm optimistic that will continue getting better for a variety of reasons, not least that the technical problems in AI alignment become increasingly well-defined and conversations about those topics will naturally become more focused.)
I'm not convinced that the dynamic "we care a lot about who ends up with power, and more important topics are more relevant to the distribution of power" is a major part of how humanity solves hard human vs nature problems. I do agree that it's an important fact about humans to take into account when trying to solve any problem though.
A not-very-coherent response to #3. Roughly:
With prioritization, I think it would be good if someone made some sort of curated list of "who is working on which problems, and why" - my concern with part of the "EAs figuring out what to do" process is that many people are doing some sort of expert-aggregation on the wrong level. (Like, if someone basically averages your and Eliezer Yudkowsky's conclusions, giving 50% weight to each, I don't think the result is a useful and coherent model.)
This is a great complement to Eliezer's "List of Lethalities", in particular because, on the points of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, together with many others expressing their views in comments and other posts, helped make the beliefs within AI safety more transparent.
I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), form an oversimplified model of it in which there is some unified "safety" camp vs. the "optimists".
Also, I think this demonstrates that "just stating your beliefs" as a moderately-dimensional projection can be a useful type of post, even without much justification.
Curated. Eliezer's List of Lethalities post has received an immense amount of attention, rightly so given the content, and I am extremely glad to see this response go live, since Eliezer's views do not reflect a consensus, and it would be sad to have only one set of views getting all the attention when I do think many of the questions are non-obvious.
I am very pleased to see public back-and-forth on questions of not just "how and whether we are doomed", but the specific gears behind them (where things will work vs. cannot work). These questions bear on the enormous resources being poured into AI safety work right now. Ensuring those resources get allocated in a way that actually improves the odds of our success is key.
I hope that others continue to share and debate their models of the world, Alignment, strategy, etc. in a way that is both on record and easily findable by others. Hopefully, we can look back in 10, 20, 50, etc years and reflect on how well we reasoned in these cloudy times.
RE Disagreement 5: Some examples where the aligned AIs will not consume the “free energy” of an out-of-control unaligned AI are:
1. Exploiting the free energy requires humans trusting the AIs more than they actually do. For example, humans with a (supposedly) aligned AGI may not trust the AGI to secure their own nuclear weapons systems, or to hack into its enemies’ nuclear weapons systems, or to do recursive self-improvement, or to launch von Neumann probes that can never be called back. But an out-of-control AGI would presumably be willing to do all those things.
2. Exploiting the free energy requires violating human laws, norms, Overton Windows, etc., or getting implausibly large numbers of human actors to agree with each other, or suffering large immediate costs for uncertain benefits, etc., such that humans don’t actually let their aligned AGIs do that. For example, maybe the only viable gray goo defense system consists of defensive nanobots that go proliferate in the biosphere, harming wildlife and violating national boundaries. Would people + aligned AGIs actually go and deploy that system? I’m skeptical. Likewise, if there’s a neat trick to melt all the non-whitelisted GPUs on the planet, I find it hard to imagine that people + aligned AGIs would actually do anything with that knowledge, or even that they would go looking for that knowledge in the first place. But an out-of-control unaligned AGI wouldn’t hesitate.
3. Exploiting the free energy accomplishes a goal that no human would want to accomplish, e.g. the removal of all oxygen from the atmosphere. Here, the attacking and defending AIs are trying to do two different things. Destroying a power grid may be much easier or much harder than preventing a power grid from being destroyed; a gray goo defense system may be much easier or much harder to create than gray goo, etc. I don’t have great knowledge about attack-defense balance in any of these domains, but I’m concerned by the disjunctive nature of the problem—an out-of-control AGI would presumably attack in whatever way had the worst attack-defense imbalance.
(Above is somewhat redundant with Paul’s strategy-stealing post; like Zvi I thought it was a nice post but I drew the opposite conclusion.)
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community.
I divide those into "takeoff speeds", "attitude towards prosaic alignment", and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).
AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold, AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
One different way I've been thinking about this issue recently is that humans have fundamental cognitive limits, e.g. brain size, that AGI wouldn't have. There are possible biotech interventions to fix these, but the easiest ones (e.g. just increasing skull size) still require decades to start up. AI, meanwhile, could be improved (by humans and AIs) on much faster timescales. (How important something like brain size is depends on how much intellectual progress is explained by max intelligence rather than total intelligence; a naive reading of intellectual history would say max intelligence is important, given that a high percentage of relevant human knowledge follows from <100 important thinkers.)
This doesn't lead me to assign high probability to "takeoff in 1 month", my expectation is still that AI improving AI will be an extension of humans improving AI (and then centaurs improving AI), but the iteration cycle time could be a lot faster due to AIs not having fundamental human cognitive limits.
My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.
Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior.
I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such as by everyone wanting to donate to sperm/egg banks".
But evolution does not seem to maximize fitness in that sense, where the fitness of a species would be a distinct thing-in-the-world that could be directly observed and optimized for. Something like "docileness" or "size", as used in animal breeding, would be a much better analogy, since those things are something that you can directly observe and optimize for - and human breeders do.
And... if humans had been explicitly bred for friendliness and corrigibility for a while, it seems to me that they likely would want to do the analogous thing of maximizing-their-donations-to-sperm/egg-banks. After all, we can already see that people who are high on either end of some personality trait such as altruism/selfishness, dominance/submission, openness/conservatism, etc., are likely to view that trait as a virtue (as long as nothing in the environment too overwhelmingly disproves this) and seek to become even more like that.
Altruistic people often want to become even more altruistic, selfish people to eliminate their altruistic "weaknesses", dominant people to become more dominant, submissive people to make it easier for themselves to submit (this has some strong counterforces in our culture, where submissiveness is generally considered undesirable, but you can still see it valued in e.g. workplace cultures where workers resent reforms that would give them more autonomy, preferring bosses to "just tell them what to do"), open people to become more open to experience, and so on.
Probably, if people high on such traits were offered chances to self-modify to become even more so - which seems analogous to the sperm/egg bank thing, since it's the cognitive-optimization form of the instinctive thing - quite a few of them would.
What about selecting for "moderation in all things"? Is that not virtue?
Aristotle invented quantification, you heard it here first.
I don’t think surviving worlds have a plan in the sense Eliezer is looking for.
This seems wrong to me, could you elaborate? Prompt: Presumably you think we do have a plan, it just doesn't meet Eliezer's standards. What is that plan?
Eliezer said:
Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan.
... Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else's job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later - there's a lot of people suggesting those, in the worlds destined to live, and they don't have a special status in the field, it's just what normal geniuses there do - they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.'
I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them and at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed. I'm guessing you agree that it's bad to meet suggestions for lethal problems with "how can you be sure / we'll have to wait / shrug" and that instead it's good for people to start thinking about those problems and designing solutions now.
I guess there's also the "It is a written plan. It is not secret" part. I for one would feel noticeably better if we had a written, non-secret plan.
I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.
Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.
I think most worlds, surviving or not, don't have a plan in the sense that Eliezer is asking about.
I do agree that in the best worlds, there are quite a lot of very good plans and extensive analysis of how they would play out (even if it's not the biggest input into decision-making). Indeed, I think there are a lot of things that the best possible world would be doing that we aren't, and I'd give that world a very low probability of doom even if alignment was literally impossible-in-principle.
ETA: this is closely related to Richard's point in the sibling.
I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them
I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.
I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do think it's reasonably likely to agree with some existing plan at a very high level. I think that's also true even in the much better worlds that do have tons of plans.
at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed
I wouldn't say there is "a plan" to do that.
Many people have that hope, and have thought some about how we might establish sufficient consensus about risk to delay AGI deployment for 0.5-2 years if things look risky, and how to overcome various difficulties with implementing that kind of delay, or what kind of more difficult moves might be able to delay significantly longer than that.
Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).
So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produces a low one, or vice-versa. In other words, adversarial inputs exist. This is true even when the function is simple enough that the network certainly has more than enough Bayesian evidence to pin down the function. As far as I know, we haven't yet discovered a way to really fix this problem, though there are certainly ways to make the adversarial inputs a little more rare/harder to find.
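To illustrate the kind of adversarial-input search being described, here is a minimal sketch (assuming PyTorch; `value_net` is a hypothetical trained network standing in for the learned value function, not anything from the discussion above). Gradient ascent in input space typically finds points the network scores highly even when the intended value function would score them low:

```python
import torch

def find_adversarial_input(value_net, x_start, steps=200, lr=0.05):
    """Search input space for a point that the learned value network rates
    highly, regardless of what the intended value function would say."""
    x = x_start.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -value_net(x).mean()  # ascend the network's predicted value
        loss.backward()
        optimizer.step()
    return x.detach()

# Usage sketch: start from noise (or a benign input) and check whether the
# resulting input is something the intended value function would rate low.
# x_adv = find_adversarial_input(value_net, torch.randn(1, input_dim))
```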
Paul also mentions that high intelligence isn't a separate regime that the AI needs to generalize to, but rather that the AI can be trained continuously as its intelligence increases. I agree with this, but I don't think it constitutes a valid objection, since the regimes that we actually want the AI to generalize between are cooperation and betrayal. Generally these would look pretty different, with betrayal plans involving the AI tiling space with adversarial examples, etc. And we'd generally expect a discontinuous switch to betrayal only when the AI is confident it can win, so there's not really an opportunity to train the AI on betrayal examples beforehand.
Thanks for writing this!
Typo: "I see this kind of thinking from Eliezer a lot but it seems misleading or long" should be "...or wrong"
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
Agreements
Disagreements
(Mostly stated without argument.)
and obsolete human contributions to alignment (retracted) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.
My take on Eliezer's takes
Ten examples off the top of my head, of which I think about half overlap with the list of lethalities, and where I think the discussions in the ELK doc are if anything more thorough than the discussions in the list of lethalities: