some thoughts on the short timeline agi lab worldview. this post is the result of taking capabilities people's world models and mashing them into alignment people's world models.
I think there are roughly two main likely stories for how AGI (defined as able to do any intellectual task as well as the best humans, specifically those tasks relevant for kicking off recursive self improvement) happens:
while I usually think about story 1, this post is about taking story 2 seriously.
it seems basically true that current AI systems are mostly aligned, and certainly not plotting our downfall. like you get stuff like sycophancy but it's relatively mild. certainly if AI systems were only ever roughly this misaligned we'd be doing pretty well.
the story is that once you have AGI, it builds and aligns its successor, which in turn builds and aligns its successor, etc. all the way up to superintelligence.
the problem is that at some link in the chain, you will have a model that can build its successor but not align it.
why is this the case? because progress on alignment is harder to verify than progress on capabilities, and this only gets more true as you ascend in capabilities. you can easily verify that superintelligence is superintelligent - ask it to make a trillion dollars (or put a big glowing X on the moon, or something). even if it's tricked you somehow, like maybe it hacked the bank, or your brain, or something, it also takes a huge amount of capabilities to trick you on these things. however, verifying that it's aligned requires distinguishing cases where it's tricking you from cases where it isn't, which is really hard, and only gets harder as the AI gets smarter.
though if you think about it, capabilities is actually not perfectly measurable either. pretraining loss isn't all we care about; o3 et al might even be a step backwards on that metric. neither are capabilities evals; everyone knows they get goodharted to hell and back all the time. when AI solves all the phd level benchmarks nobody really thinks the AI is phd level. ok, so our intuition for capabilities measurement being easy is true only in the limit, but not necessarily on the margin.
we have one other hope, which is that maybe we can just allocate more of the resources to solving alignment. it's not immediately obvious how to do this if the fundamental bottleneck is verifiability - even if you (or to be more precise, the AI) keep putting in more effort, if you have no way of telling what is good alignment research, you're kind of screwed. but one thing is that you can demand things that are strictly stronger than alignment, that are easier to verify. if this is possible, then you can spend a larger fraction of your compute on alignment to compensate.
in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there's an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty. and it's plausible that there are alignment equivalents to "make a trillion dollars" for capabilities that are easy to verify, strictly imply alignment, and extremely difficult to get any traction on (and with it, a series of weakenings of such a metric that are easier to get traction on but also less-strictly imply alignment). one hope is maybe this looks something like an improved version of causal scrubbing + a theory of heuristic arguments, or something like davidad's thing.
takeaways (assuming you take seriously the premise of very short timelines where AGI looks basically like current AI): first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment. second, it updates me to being less cynical about work making current models aligned - I used to be very dismissive of this work as "not real alignment" but it does seem decently important in this world.
certainly if AI systems were only ever roughly this misaligned we'd be doing pretty well.
I think this is an important disagreement with the "alignment is hard" crowd. I particularly disagree with "certainly."
The question is "what exactly is the AI trying to do, and what happens if it magnified its capabilities a millionfold and it and its descendants were running open-endedly?", and are any of the instances catastrophically bad?
Some things you might mean that are raising your position to "certainly" (whereas I'd say "most likely not, or, it's too dumb to even count as 'aligned' or 'misaligned'")
Were any of those what you meant? Or are you thinking about it an entirely different way?
I would naively expect, if you took LLM agents' current degree of alignment, and ran a lotta copies trying to help you with end-to-end alignment research with dialed up capabilities, at least a couple instances would end up trying to subtly sabotage you and/or escape.
what i meant by that is something like:
assuming we are in this short-timelines-no-breakthroughs world (to be clear, this is a HUGE assumption! not claiming that this is necessarily likely!), to win we need two things: (a) base case: the first AI in the recursive self improvement chain is aligned, (b) induction step: each AI can create and align its successor.
i claim that if the base case AI is about as aligned as current AI, then condition (a) is basically either satisfied or not that hard to satisfy. like, i agree current models sometimes lie or are sycophantic or whatever. but these problems really don't seem nearly as hard to solve as the full AGI alignment problem. like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
importantly, under our assumptions, we already have AI systems that are basically analogous to the base case AI, so prosaic alignment research on systems that exist today right now is actually just lots of progress on aligning the base case AI, and in my mind a huge part of the difficulty of alignment in the longer-timeline world is because we don't yet have the AGI/ASI, so we can't do alignment research with good empirical feedback loops.
like tbc it's also not trivial to align current models. companies are heavily incentivized to do it and yet they haven't succeeded fully. but this is a fundamentally easier class of problem than aligning AGI in longer-timelines world.
Mmm nod. (I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?)
There's a version of Short Timeline World (which I think is more likely? but, not confidently) which is: "the current paradigm does basically work... but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'."
In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.
I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?
I'm pretty sure it is not that. When people say this it is usually just asking the question: "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".[1] What they use this to argue for can then vary:
I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.
In Leo's case in particular I don't think he's using the observation for much, it's mostly just a throwaway claim that's part of the flow of the comment, but inasmuch as it is being used it is to say something like "current AIs aren't trying to subvert our control, so it's not completely implausible on the face of it that the first automated alignment researcher to which we delegate won't try to subvert our control", which is just a pretty weak claim and seems fine, and doesn't imply any kind of extrapolation to superintelligence. I'd be surprised if this was an important disagreement with the "alignment is hard" crowd.
There are demos of models doing stuff like this (e.g. blackmail) but only under conditions selected highly adversarially. These look fragile enough that overall I'd still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
E.g. One naive threat model says "Orthogonality says that an AI system's goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned". Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.
I'm claiming something like 3 (or 2, if you replace "given tremendous uncertainty, our best guess is" with "by assumption of the scenario") within the very limited scope of the world where we assume AGI is right around the corner and looks basically just like current models but slightly smarter
It seems like the reason Claude's level of misalignment is fine is because its capabilities aren't very good, and there's not much/any reason to assume it'd be fine if you held alignment constant but dialed up capabilities.
Do you not think that?
(I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it)
it'd be fine if you held alignment constant but dialed up capabilities.
I don't know what this means so I can't give you a prediction about it.
I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it
I just named three reasons:
- Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right")
- Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
- Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of "how hard is aligning a superintelligence"? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to "how much should I defer to doomers"? Yes absolutely (see e.g. #1).
the premise that i'm trying to take seriously for this thought experiment is, what if the "claude is really smart and just a little bit away from agi" people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don't think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say "claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the 'dangerous-core-of-generalization' threshold, so that's also when it becomes super dangerous." it's way stronger a claim than "claude is far away from being agi, we're going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude." or, like, sure, the agi threshold is a pretty special threshold, so it's reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i'd tell about how this happens, it just feels like i'm starting from the bottom line first, and the stories don't feel like the strongest part of my argument.
(also, i'm generally inclined towards believing alignment is hard, so i'm pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i'm not trying to argue that alignment is easy. or like i guess i'm arguing X->alignment is easy, which if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn't accept the argument, but you know what i mean. i think X is probably false but it's plausible that it isn't and importantly a lot of evidence will come in over the next year or so on whether X is true)
nod. I'm not sure I agreed with all the steps there but I agree with the general premise of "accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step."
I think you are saying something that shares at least some structure with Buck's comment that
It seems like as AIs get more powerful, two things change:
- They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
- They get better, so their wanting to kill you is more of a problem.
I don't see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you're pointing at a different two sets of properties that may not arise at the same time)
I'm actually not sure I get what the two properties you're talking about are, though. Seems like you're contrasting "claude++ crosses the agi (= can kick off rsi) threshold" with "crosses the 'dangerous-core-of-generalization' threshold"
I'm confused because I think the word "agi" basically does mean "cross the core-of-generalization threshold" (which isn't immediately dangerous, but, puts us into 'things could quickly get dangerous at any time" territory)
I do agree "able to do a loop of RSI doesn't intrinsically mean 'agi' or 'core-of-generalization'," there could be narrow skills for doing a loop of RSI. I'm not sure if you more meant "non-agi RSI" or, you see something different between "AGI" and "core-of-generalization." Or think there's a particular "dangerous core-of-generalization" separate from AGI.
(I think "the sharp left turn" is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization but also could come after either narrow-introspection + adhoc agency, or, might just take a while for it to notice)
((I can't tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably loadbearing))
i guess so? i don't know why you say "even as capability levels rise" - after you build and align the base case AI, humans are no longer involved in ensuring that the subsequent more capable AIs are aligned.
i'm mostly indifferent about what the paradigms look like up the chain. probably at some point up the chain things stop looking anything human made. but what matters at that point is no longer how good we humans are at aligning model n, but how good model n-1 is at aligning model n.
i think of the idealized platonic researcher as the person who has chosen ultimate (intellectual) freedom over all else. someone who really cares about some particular thing that nobody else does - maybe because they see the future before anyone else does, or maybe because they just really like understanding everything about ants or abstract mathematical objects or something. in exchange for the ultimate intellectual freedom, they give up vast amounts of money, status, power, etc.
one thing that makes me sad is that modern academia is, as far as I can tell, not this. when you opt out of the game of the Economy, in exchange for giving up real money, status, and power, what you get from Academia is another game of money, status, and power, with different rules, and much lower stakes, and also everyone is more petty about everything.
at the end of the day, what's even the point of all this? to me, it feels like sacrificing everything for nothing if you eschew money, status, and power, and then just write a terrible irreplicable p-hacked paper that reduces the net amount of human knowledge by adding noise and advances your career so you can do more terrible useless papers. at that point, why not just leave academia and go to industry and do something equally useless for human knowledge but get paid stacks of cash for it?
ofc there are people in academia who do good work but it often feels like the incentives force most work to be this kind of horrible slop.
Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone, it was just that my subconscious picked a different strategy than most (which eventually proved quite successful).
at the end of the day, what’s even the point of all this?
I think it's probably a result of most humans not being very strategic, or their subconscious strategizers not being very competent. Or zooming out, it's also a consequence of academia being suboptimal as an institution for leveraging humans' status and other motivations to produce valuable research. That in turn is a consequence of our blind spot for recognizing status as an important motivation/influence for every human behavior, which itself is because not explicitly recognizing status motivation is usually better for one's status.
learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)
VAEs:
a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z.
with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.
because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each single x. we call the ideal encoder p(z|x) - the thing that would perfectly invert our decoder p(x|z). unfortunately, we obviously don't have access to this thing. so we have to train an encoder network q(z|x) to approximate it. to make our encoder output a distribution, we have it output a mean vector and a stddev vector for a gaussian. at runtime we sample a random vector eps ~ N(0, 1), multiply it by the stddev vector, and add the mean vector, giving us a sample from N(mu, std).
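this sampling step (the reparameterization trick) is just z = mu + std * eps. a minimal numpy sketch, where the function name and shapes are mine for illustration:

```python
import numpy as np

def reparameterize(mu, log_std, rng):
    # z = mu + std * eps with eps ~ N(0, I): the sample is a deterministic,
    # differentiable function of (mu, log_std) given the external noise eps,
    # which is what lets gradients flow through the encoder.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_std) * eps

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
log_std = np.full(3, -20.0)  # tiny std, so z should land almost exactly on mu
z = reparameterize(mu, log_std, rng)
```

(parameterizing the network's output as log std rather than std is a common convenience so it can be any real number; nothing above depends on it.)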
to train this thing, we would like to optimize the following loss function:
-log p(x) + KL(q(z|x)||p(z|x))
where the terms optimize the likelihood (how good is the VAE at modelling data, assuming we have access to the perfect z distribution) and the quality of our encoder (how good is our q(z|x) at approximating p(z|x)). unfortunately, neither term is tractable - the former requires marginalizing over z, which is intractable, and the latter requires p(z|x) which we also don't have access to. however, it turns out that the following is mathematically equivalent and is tractable:
-E z~q(z|x) [log p(x|z)] + KL(q(z|x)||p(z))
the former term is just the likelihood of the real data under the decoder distribution given z drawn from the encoder distribution (which is equivalent to the MSE up to constants, because it's the log of a gaussian pdf with identity covariance). the latter term can be computed analytically, because both distributions are gaussians with known mean and std. (the distribution p is determined in part by the decoder p(x|z), but that doesn't pin down the entire distribution; we still have a degree of freedom in how we pick p(z). so we typically declare by fiat that p(z) is a N(0, 1) gaussian. then, p(z|x) is implied to be equal to p(x|z) p(z) / ∫ p(x|z') p(z') dz')
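putting the two tractable terms together, here's a minimal numpy sketch of the loss for a single datapoint, with the gaussian assumptions from above baked in (the function name is mine, and a real implementation would average over a minibatch and backprop through everything):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_std):
    # reconstruction term: -log p(x|z) for a unit-variance gaussian decoder
    # is 0.5 * ||x - x_recon||^2 plus an additive constant we drop.
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || p(z)) between N(mu, std^2) and N(0, I) in closed form:
    # 0.5 * sum(mu^2 + std^2 - 1 - log std^2), with log std^2 = 2 * log_std.
    var = np.exp(2 * log_std)
    kl = 0.5 * np.sum(mu ** 2 + var - 1 - 2 * log_std)
    return recon + kl

# sanity check: perfect reconstruction with q(z|x) = p(z) gives zero loss
x = np.zeros(3)
loss = vae_loss(x, x, mu=np.zeros(3), log_std=np.zeros(3))
```

note the loss bottoms out at zero only when the encoder exactly matches the prior; in practice the two terms trade off against each other, which is the usual reconstruction-vs-regularization tension in VAE training.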
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, it is necessary but not sufficient for good outcomes that alignment be ahead by the time capabilities reaches the finish line: if alignment doesn't make it there first, then we automatically lose, but even if it does, if alignment doesn't continue to improve proportional to capabilities, we might also fail at some later point. However, I think it's plausible we're not even on track for the necessary condition, so I'll focus on that within this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there's a worryingly large chance that we just won't have the alignment progress needed at the critical juncture.
I also think it's plausible that at some point before when x-risks are possible, capabilities will advance to the point that the majority of AI research will be done by AI systems. The worry is that after this point, both capabilities and alignment will be similarly benefitted by automation, and if alignment is behind at the point when this happens, then this lag will be "locked in" because an asymmetric benefit to alignment research is needed to overtake capabilities if capabilities is already ahead.
There are a number of areas where this model could be violated:
However, I don't think these violations are likely, for the following respective reasons:
I think exploring the potential model violations further is a fruitful direction. I don't think I'm very confident about this model.
one man's modus tollens is another man's modus ponens:
"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible"
"in some cases (i.e. close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."
I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
A few axes along which to classify optimizers:
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means that the model can figure out plans that you never intended for it to learn (something not very capabilities robust would just never learn how to deceive if you don't show it). This feels like the critical controller/search-process difference: controller generalization across states is dependent on the generalization abilities of the model architecture, whereas search processes let you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I'm anticipating.
Real world objectives are definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we're worried about other causes of nonmyopia too? not sure tbh). I'm actually not sure how I feel about generality: on the one hand, it feels intuitive that systems that are only able to represent one objective have got to be in some sense less able to become more powerful just by thinking more; on the other hand I don't know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, and so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
Examples of where things fall on these axes: