some thoughts on the short timeline agi lab worldview. this post is the result of taking capabilities people's world models and mashing them into alignment people's world models.
I think there are roughly two main likely stories for how AGI (defined as able to do any intellectual task as well as the best humans, specifically those tasks relevant for kicking off recursive self improvement) happens:
while I usually think about story 1, this post is about taking story 2 seriously.
it seems basically true that current AI systems are mostly aligned, and certainly not plotting our downfall. like you get stuff like sycophancy but it's relatively mild. certainly if AI systems were only ever roughly this misaligned we'd be doing pretty well.
the story is that once you have AGI, it builds and aligns its successor, which in turn builds and aligns its successor, etc. all the way up to superintelligence.
the problem is that at some link in the chain, you will have a model that can build its successor but not align it.
why is this the case? because progress on alignment is harder to verify than progress on capabilities, and this only gets more true as you ascend in capabilities. you can easily verify that superintelligence is superintelligent - ask it to make a trillion dollars (or put a big glowing X on the moon, or something). even if it's tricked you somehow, like maybe it hacked the bank, or your brain, or something, it also takes a huge amount of capabilities to trick you on these things. however, verifying that it's aligned requires distinguishing cases where it's tricking you from cases where it isn't, which is really hard, and only gets harder as the AI gets smarter.
though if you think about it, capabilities is actually not perfectly measurable either. pretraining loss isn't all we care about; o3 et al might even be a step backwards on that metric. neither are capabilities evals; everyone knows they get goodharted to hell and back all the time. when AI solves all the phd level benchmarks nobody really thinks the AI is phd level. ok, so our intuition for capabilities measurement being easy is true only in the limit, but not necessarily on the margin.
we have one other hope, which is that maybe we can just allocate more of the resources to solving alignment. it's not immediately obvious how to do this if the fundamental bottleneck is verifiability - even if you (or to be more precise, the AI) keep putting in more effort, if you have no way of telling what is good alignment research, you're kind of screwed. but one thing is that you can demand things that are strictly stronger than alignment, that are easier to verify. if this is possible, then you can spend a larger fraction of your compute on alignment to compensate.
in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there's an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty. and it's plausible that there are alignment equivalents to "make a trillion dollars" for capabilities that are easy to verify, strictly imply alignment, and extremely difficult to get any traction on (and with it, a series of weakenings of such a metric that are easier to get traction on but also less-strictly imply alignment). one hope is maybe this looks something like an improved version of causal scrubbing + a theory of heuristic arguments, or something like davidad's thing.
takeaways (assuming you take seriously the premise of very short timelines where AGI looks basically like current AI): first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment. second, it updates me to being less cynical about work making current models aligned - I used to be very dismissive of this work as "not real alignment" but it does seem decently important in this world.
certainly if AI systems were only ever roughly this misaligned we'd be doing pretty well.
I think this is an important disagreement with the "alignment is hard" crowd. I particularly disagree with "certainly."
The question is "what exactly is the AI trying to do, and what happens if it magnified its capabilities a millionfold and it and its descendants were running open-endedly?", and are any of the instances catastrophically bad?
Some things you might mean that are raising your position to "certainly" (whereas I'd say "most likely not, or, it's too dumb to even count as 'aligned' or 'misaligned'")
Were any of those what you meant? Or are you thinking about it an entirely different way?
I would naively expect, if you took LLM agents' current degree of alignment, and ran a lotta copies trying to help you with end-to-end alignment research with dialed up capabilities, at least a couple instances would end up trying to subtly sabotage you and/or escape.
what i meant by that is something like:
assuming we are in this short-timelines-no-breakthroughs world (to be clear, this is a HUGE assumption! not claiming that this is necessarily likely!), to win we need two things: (a) base case: the first AI in the recursive self improvement chain is aligned, (b) induction step: each AI can create and align its successor.
i claim that if the base case AI is about as aligned as current AI, then condition (a) is basically either satisfied or not that hard to satisfy. like, i agree current models sometimes lie or are sycophantic or whatever. but these problems really don't seem nearly as hard to solve as the full AGI alignment problem. like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
importantly, under our assumptions, we already have AI systems that are basically analogous to the base case AI, so prosaic alignment research on systems that exist today right now is actually just lots of progress on aligning the base case AI, and in my mind a huge part of the difficulty of alignment in the longer-timeline world is because we don't yet have the AGI/ASI, so we can't do alignment research with good empirical feedback loops.
like tbc it's also not trivial to align current models. companies are heavily incentivized to do it and yet they haven't succeeded fully. but this is a fundamentally easier class of problem than aligning AGI in longer-timelines world.
Mmm nod. (I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?)
There's a version of Short Timeline World (which I think is more likely? but, not confidently) which is: "the current paradigm does basically work... but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'."
In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.
I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?
I'm pretty sure it is not that. When people say this it is usually just asking the question: "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".[1] What they use this to argue for can then vary:
I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.
In Leo's case in particular I don't think he's using the observation for much, it's mostly just a throwaway claim that's part of the flow of the comment, but inasmuch as it is being used it is to say something like "current AIs aren't trying to subvert our control, so it's not completely implausible on the face of it that the first automated alignment researcher to which we delegate won't try to subvert our control", which is just a pretty weak claim and seems fine, and doesn't imply any kind of extrapolation to superintelligence. I'd be surprised if this was an important disagreement with the "alignment is hard" crowd.
There are demos of models doing stuff like this (e.g. blackmail) but only under conditions selected highly adversarially. These look fragile enough that overall I'd still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
E.g. One naive threat model says "Orthogonality says that an AI system's goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned". Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.
I'm claiming something like 3 (or 2, if you replace "given tremendous uncertainty, our best guess is" with "by assumption of the scenario") within the very limited scope of the world where we assume AGI is right around the corner and looks basically just like current models but slightly smarter
It seems like the reason Claude's level of misalignment is fine is because its capabilities aren't very good, and there's not much/any reason to assume it'd be fine if you held alignment constant but dialed up capabilities.
Do you not think that?
(I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it)
it'd be fine if you held alignment constant but dialed up capabilities.
I don't know what this means so I can't give you a prediction about it.
I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it
I just named three reasons:
- Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right")
- Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low
- Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of "how hard is aligning a superintelligence"? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to "how much should I defer to doomers"? Yes absolutely (see e.g. #1).
the premise that i'm trying to take seriously for this thought experiment is, what if the "claude is really smart and just a little bit away from agi" people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don't think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say "claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the 'dangerous-core-of-generalization' threshold, so that's also when it becomes super dangerous." it's way stronger a claim than "claude is far away from being agi, we're going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude." or, like, sure, the agi threshold is a pretty special threshold, so it's reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i'd tell about how this happens, it just feels like i'm starting from the bottom line first, and the stories don't feel like the strongest part of my argument.
(also, i'm generally inclined towards believing alignment is hard, so i'm pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i'm not trying to argue that alignment is easy. or like i guess i'm arguing X->alignment is easy, which if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn't accept the argument, but you know what i mean. i think X is probably false but it's plausible that it isn't and importantly a lot of evidence will come in over the next year or so on whether X is true)
nod. I'm not sure I agreed with all the steps there but I agree with the general premise of "accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step."
I think you are saying something that shares at least some structure with Buck's comment that
It seems like as AIs get more powerful, two things change:
- They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
- They get better, so their wanting to kill you is more of a problem.
I don't see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you're pointing at a different two sets of properties that may not arise at the same time)
I'm actually not sure I get what the two properties you're talking about are, though. Seems like you're contrasting "claude++ crosses the agi (= can kick off rsi) threshold" with "crosses the 'dangerous-core-of-generalization' threshold"
I'm confused because I think the word "agi" basically does mean "cross the core-of-generalization threshold" (which isn't immediately dangerous, but, puts us into 'things could quickly get dangerous at any time" territory)
I do agree "able to do a loop of RSI doesn't intrinsically mean 'agi' or 'core-of-generalization'," there could be narrow skills for doing a loop of RSI. I'm not sure if you more meant "non-agi RSI" or, you see something different between "AGI" and "core-of-generalization." Or think there's a particular "dangerous core-of-generalization" separate from AGI.
(I think "the sharp left turn" is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization but also could come after either narrow-introspection + adhoc agency, or, might just take a while for it to notice)
((I can't tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably loadbearing))
i guess so? i don't know why you say "even as capability levels rise" - after you build and align the base case AI, humans are no longer involved in ensuring that the subsequent more capable AIs are aligned.
i'm mostly indifferent about what the paradigms look like up the chain. probably at some point up the chain things stop looking anything human made. but what matters at that point is no longer how good we humans are at aligning model n, but how good model n-1 is at aligning model n.
i think of the idealized platonic researcher as the person who has chosen ultimate (intellectual) freedom over all else. someone who really cares about some particular thing that nobody else does - maybe because they see the future before anyone else does, or maybe because they just really like understanding everything about ants or abstract mathematical objects or something. in exchange for the ultimate intellectual freedom, they give up vast amounts of money, status, power, etc.
one thing that makes me sad is that modern academia is, as far as I can tell, not this. when you opt out of the game of the Economy, in exchange for giving up real money, status, and power, what you get from Academia is another game of money, status, and power, with different rules, and much lower stakes, and also everyone is more petty about everything.
at the end of the day, what's even the point of all this? to me, it feels like sacrificing everything for nothing if you eschew money, status, and power, and then just write a terrible irreplicable p-hacked paper that reduces the net amount of human knowledge by adding noise and advances your career so you can do more terrible useless papers. at that point, why not just leave academia and go to industry and do something equally useless for human knowledge but get paid stacks of cash for it?
ofc there are people in academia who do good work but it often feels like the incentives force most work to be this kind of horrible slop.
Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone, it was just that my subconscious picked a different strategy than most (which eventually proved quite successful).
at the end of the day, what’s even the point of all this?
I think it's probably a result of most humans not being very strategic, or their subconscious strategizers not being very competent. Or zooming out, it's also a consequence of academia being suboptimal as an institution for leveraging humans' status and other motivations to produce valuable research. That in turn is a consequence of our blind spot for recognizing status as an important motivation/influence for every human behavior, which itself is because not explicitly recognizing status motivation is usually better for one's status.
learning thread for taking notes on things as i learn them (in public so hopefully other people can get value out of it)
VAEs:
a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z.
with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.
because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each single x. we call the ideal encoder p(z|x) - the thing that would perfectly invert our decoder p(x|z). unfortunately, we obviously don't have access to this thing. so we have to train an encoder network q(z|x) to approximate it. to make our encoder output a distribution, we have it output a mean vector and a stddev vector for a gaussian. at runtime we sample a random vector eps ~ N(0, 1), multiply it by the stddev vector, and add the mean vector, giving us a sample from N(mu, std).
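this sampling step (the reparameterization trick) is just z = mu + std * eps. a minimal numpy sketch, where the function name and shapes are mine for illustration:

```python
import numpy as np

def reparameterize(mu, log_std, rng):
    # z = mu + std * eps with eps ~ N(0, I): the sample is a deterministic,
    # differentiable function of (mu, log_std) given the external noise eps,
    # which is what lets gradients flow through the encoder.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_std) * eps

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
log_std = np.full(3, -20.0)  # tiny std, so z should land almost exactly on mu
z = reparameterize(mu, log_std, rng)
```

(parameterizing the network's output as log std rather than std is a common convenience so it can be any real number; nothing above depends on it.)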
to train this thing, we would like to optimize the following loss function:
-log p(x) + KL(q(z|x)||p(z|x))
where the terms optimize the likelihood (how good is the VAE at modelling data, assuming we have access to the perfect z distribution) and the quality of our encoder (how good is our q(z|x) at approximating p(z|x)). unfortunately, neither term is tractable - the former requires marginalizing over z, which is intractable, and the latter requires p(z|x) which we also don't have access to. however, it turns out that the following is mathematically equivalent and is tractable:
-E z~q(z|x) [log p(x|z)] + KL(q(z|x)||p(z))
the former term is just the likelihood of the real data under the decoder distribution given z drawn from the encoder distribution (which is equivalent to the MSE up to constants, because it's the log of a gaussian pdf with identity covariance). the latter term can be computed analytically, because both distributions are gaussians with known mean and std. (the distribution p is determined in part by the decoder p(x|z), but that doesn't pin down the entire distribution; we still have a degree of freedom in how we pick p(z). so we typically declare by fiat that p(z) is a N(0, 1) gaussian. then, p(z|x) is implied to be equal to p(x|z) p(z) / ∫ p(x|z') p(z') dz')
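putting the two tractable terms together, here's a minimal numpy sketch of the loss for a single datapoint, with the gaussian assumptions from above baked in (the function name is mine, and a real implementation would average over a minibatch and backprop through everything):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_std):
    # reconstruction term: -log p(x|z) for a unit-variance gaussian decoder
    # is 0.5 * ||x - x_recon||^2 plus an additive constant we drop.
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || p(z)) between N(mu, std^2) and N(0, I) in closed form:
    # 0.5 * sum(mu^2 + std^2 - 1 - log std^2), with log std^2 = 2 * log_std.
    var = np.exp(2 * log_std)
    kl = 0.5 * np.sum(mu ** 2 + var - 1 - 2 * log_std)
    return recon + kl

# sanity check: perfect reconstruction with q(z|x) = p(z) gives zero loss
x = np.zeros(3)
loss = vae_loss(x, x, mu=np.zeros(3), log_std=np.zeros(3))
```

note the loss bottoms out at zero only when the encoder exactly matches the prior; in practice the two terms trade off against each other, which is the usual reconstruction-vs-regularization tension in VAE training.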
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, it is necessary but not sufficient for good outcomes that alignment be ahead by the time capabilities reaches the finish line: if alignment doesn't make it there first, then we automatically lose, but even if it does, if alignment doesn't continue to improve proportional to capabilities, we might also fail at some later point. However, I think it's plausible we're not even on track for the necessary condition, so I'll focus on that within this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there's a worryingly large chance that we just won't have the alignment progress needed at the critical juncture.
I also think it's plausible that at some point before when x-risks are possible, capabilities will advance to the point that the majority of AI research will be done by AI systems. The worry is that after this point, both capabilities and alignment will be similarly benefitted by automation, and if alignment is behind at the point when this happens, then this lag will be "locked in" because an asymmetric benefit to alignment research is needed to overtake capabilities if capabilities is already ahead.
There are a number of areas where this model could be violated:
However, I don't think these violations are likely, for the following respective reasons:
I think exploring the potential model violations further is a fruitful direction. I don't think I'm very confident about this model.
one man's modus tollens is another man's modus ponens:
"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible"
"in some cases (i.e. close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."
I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
A few axes along which to classify optimizers:
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means that the model can figure out plans that you never intended for it to learn (something not very capabilities robust would just never learn how to deceive if you don't show it). This feels like the critical controller/search-process difference: controller generalization across states is dependent on the generalization abilities of the model architecture, whereas search processes let you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I'm anticipating.
Real world objectives are definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we're worried about other causes of nonmyopia too? not sure tbh). I'm actually not sure how I feel about generality: on the one hand, it feels intuitive that systems that are only able to represent one objective have got to be in some sense less able to become more powerful just by thinking more; on the other hand I don't know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, and so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
Examples of where things fall on these axes: