Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
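The "shift labor until marginal returns are equal" argument can be made concrete with a toy model. The sketch below is purely illustrative: the square-root returns curve, the `weight` parameters, and the function names are assumptions that do not come from the post, chosen only because they exhibit diminishing returns. It searches for the split of a fixed labor pool at which moving a unit of labor in either direction no longer helps.

```python
import math


def risk_reduction(labor: float, weight: float) -> float:
    """Hypothetical diminishing-returns curve: weight * sqrt(labor)."""
    return weight * math.sqrt(labor)


def best_split(total: float, w_alignment: float, w_control: float,
               steps: int = 10_000) -> float:
    """Grid-search the fraction of labor on control that maximizes
    the combined (assumed) risk reduction from both areas."""
    best_frac, best_value = 0.0, -math.inf
    for i in range(steps + 1):
        frac = i / steps
        value = (risk_reduction(total * (1 - frac), w_alignment)
                 + risk_reduction(total * frac, w_control))
        if value > best_value:
            best_frac, best_value = frac, value
    return best_frac


# Identical assumed curves: the optimum is an even split.
even = best_split(100.0, w_alignment=1.0, w_control=1.0)
# Control assumed twice as productive per sqrt-unit: the optimum shifts toward control.
skewed = best_split(100.0, w_alignment=1.0, w_control=2.0)
```

An even split only falls out when the two assumed curves are identical. Where the real equilibrium sits depends entirely on the actual returns curves, which this toy model does not estimate.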
For this post, my definitions are roughly:
Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don't update about what my words imply about my beliefs.):
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don't think that's a good plan, this plan seems even worse.
> Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
> [...]
> Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that marginal AI control research I'm excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).
My understanding is that your concerns on downside are maybe some combination of:
I'm very skeptical of (1) for marginal control research. It feels like this has to imply incredibly strong levels of inefficiency and usefulness of control research. Perhaps you're imagining a smaller acceleration effect, but think the benefits are also tiny relative to this small acceleration?
I'm somewhat sympathetic to things being somewhat inefficient such that if the field of people who might do control research instead did capabilities, that would be a notable acceleration (maybe 5%), but it seems like control research should be a massive reduction in acceleration relative to this.
On (2), this sounds like a weird intermediate level of competence where they are actually sensitive to full-on scheming risks at an earlier point (when control works), but end up thinking this will scale despite it being relatively obvious it won't. I agree that control might end up being an excuse for scaling, but I don't think that it will be a counterfactual excuse (some other excuse would have been found) and I think we should just make sure to argue on the merits at that point.
I do think that (2) points toward focusing on advancing evaluation methodology more than techniques such that it would be relatively clear when things fail if people are being unreasonable and it is relatively easy to call out unreasonableness.
I think not doing control due to (3) can put you in pretty scary situations: how can we know when you'll get scary warning shots vs full AI takeover? Further, AIs presumably want to avoid warning shots, though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level I don't love arguments of the form "actually, having bad things happen will actually be good, so we shouldn't try to prevent bad things which are pretty close to the actual bad things we're worried about". All that said, I'm sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn't apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)
> When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs.
As in, your view is that:
I think there is a lot of very helpful alignment work which isn't conceptually bottlenecked and agendas with fewer conceptual barriers could go very far.
Further, I'm skeptical about evaluation being so hard.
Beyond that, I'm only imagining part of the theory of change of control work being to "solve alignment" or work on alignment.
Alternatives:
Additional time with very powerful AIs seems useful for studying them even if we can't get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more close actors and more incentive), but it might also be radically easier to make the case for pausing relative to today. So, it overall seems easier to arrange and time at the brink also seems much more useful all around.
(I also think that pausing looks less good if it isn't part of a well implemented, thoughtful and extremely high effort approach which e.g. actively intervenes with hardware and is aggressive on international efforts. So, I'm not very into pausing "for the wrong reasons".)
(Edited to add)
I'd like to argue that there is a lot of helpful stuff which isn't conceptually bottlenecked.
Concretely, let's imagine targeting aligning AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we're happy to totally defer to them on really tricky questions like "what should be our high level approach for handling risks for further AI systems"[2]. I think this is a pretty good target for a rational actor; then you can actually hand off decision making to such AIs. (They might decide to work on alignment of smarter systems or pursue totally different routes.)
It seems difficult though not impossible that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high level interp (both white and black box) to get a bit of a sense for what is going on in AI systems in general, and verifying that AIs seem to do a good job in held out conceptual domains where we happen to know the answers). And, I think prosaic approaches with huge amounts of labor could also substantially increase chances of hitting this target (e.g. 4x risk reduction) even if we can't hit this level of confidence.
This doesn't really look like "solve alignment", but in practice it reduces risk a bunch.
It's also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I'm skeptical. ↩︎
The AIs might ask us questions or whatever to figure out our preferences. ↩︎
Obviously, 95% confidence is objectively unreasonable, but it is a notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment. ↩︎
People interested in a discussion about control with someone who is maybe closer to Wei Dai's perspective might be interested in my dialogue with habryka.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I separately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
(I go into this and various related things in my dialogue with Ryan on control)
FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is.
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
On (a) and (b), we describe this at a high level here.
We don't really have anything written on (c) or (d). (c) really depends a lot on effort, so I'd probably prefer to talk mostly about (d), including what evaluations would be needed at various points etc.
For (a), I think we potentially care about all of:
It's likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.
On (b) we plan on talking more about this soon. (Buck's recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)
Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.
Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if a sequence of actions results in a large change, no individual action needs to be “pivotal”
This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
I would say that your mindset, when approaching AI safety, seems to be an epsilon fallacy.
Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80/20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it's approximately useless.
(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it's approximately useless.)
Of importance, tractability and neglectedness, importance is the most important. In practice, it is usually better to have a thousand people trying to solve a major bottleneck each with low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.
I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.
The "epsilon fallacy" can be committed in both directions: both in that any negative derivative is worth working on, and that any extremely large number is worth taking a chance to try to improve.
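The distinction between magnitude and derivative can be illustrated with a toy calculation. The numbers and names below are made up for illustration and do not come from the thread: each term has a current size and an assumed amount by which one unit of effort shrinks it. Picking the term with the largest size and picking the term with the largest marginal reduction can point at different places.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str
    size: float                # current contribution to the total (hypothetical)
    reduction_per_unit: float  # shrinkage per unit of effort (hypothetical)


terms = [
    Term("big_but_stuck", size=100.0, reduction_per_unit=0.1),
    Term("small_but_tractable", size=10.0, reduction_per_unit=2.0),
]


def largest_term(ts: list[Term]) -> Term:
    """Select by absolute magnitude."""
    return max(ts, key=lambda t: t.size)


def steepest_term(ts: list[Term]) -> Term:
    """Select by marginal reduction per unit of effort (the 'derivative')."""
    return max(ts, key=lambda t: t.reduction_per_unit)


def total_after(ts: list[Term], target: Term, effort: float) -> float:
    """Total of all terms after spending `effort` on `target` only.
    A term cannot shrink below zero."""
    total = 0.0
    for t in ts:
        if t.name == target.name:
            total += max(t.size - effort * t.reduction_per_unit, 0.0)
        else:
            total += t.size
    return total


by_size = total_after(terms, largest_term(terms), effort=3.0)
by_slope = total_after(terms, steepest_term(terms), effort=3.0)
```

With these made-up numbers, spending the effort on the steeper term leaves a smaller total than spending it on the larger term. This only shows that the two selection rules can diverge; whether real safety work looks more like the first term or the second is exactly what the thread disagrees about.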
I also separately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a "bottleneck" is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by "thinking", as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by "derisking" locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location if they fail. (kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of "deceptive alignment" as a possible way to get pessimal generalization, and thus a probabilistic "bottleneck" to various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment related problems (although my day job consists of trying to get "benign generalization" out of ML, and thus does in fact address that particular bottleneck imo).
I also separately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) are doing, and fully support such activities, although I think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the "natural latents" approach seems likely to shed at least some light on what's going on with e.g. the ability of agents to communicate at all.
I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.
Hilariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.
I agree it is better to work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.
see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
I think 2. is arguably the most promising strategy for 1., but I've occasionally noticed myself conflating them more than I should.
1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what's really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree and regret focusing as much as we did on 2. in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.
Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).
Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots).
Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.
> This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me.
Agreed. (And I made a comment to this effect on Mark's draft of this short form.) I don't think it is that productive to classify research directions, the central thing we're trying to point at is a classification for reasons why you might think that you're fine where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).
I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)
The claims I'd like to make are:
> Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.
You're potentially thinking about footnote 4 in our post on control:
> As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you're about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk.

This footnote is a bit confusingly worded, but I think the situation we were trying to describe is "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control."
This is different than research in the run up.
I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions and some interventions don't nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work which isn't well classified as either and you can work on high level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)
I would suggest 50% of researchers working on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment).
This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?
AI safety researchers might be allocated too heavily to Anthropic compared to Google DeepMind
Some considerations:
(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.
Some possible counterpoints:
Great post. I'm on GDM's new AI safety and alignment team in the Bay Area and hope readers will consider joining us!
> I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
What evidence is there that working at a scaling lab risks creating a "corrupted" perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
I think two major cruxes for me here are:
My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that DeepMind is just too intractably far gone.
I think people should be hesitant to work at any scaling lab, but I think it might be possible to make Anthropic "the one actually good scaling lab", and I don't currently expect that to be tractable at DeepMind. I think "having at least one" seems good for the world (although it's a bit hard for me to articulate why at the moment).
I am interested in hearing details about DeepMind that anyone thinks should change my mind about this.
This viewpoint is based on having spent at least 10s of hours trying to learn about and influence both orgs' cultures, at various times.
In both cases, I don't get the sense that people at the orgs really have a visceral sense that "decisionmaking processes can be fake". I think they will be fake by default and the org is better modeled as following general incentives, and DeepMind has too many moving people and moving parts at a low enough density that it doesn't seem possible to fix. For me to change my mind about this, I would need someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume @Rohin Shah and @Neel Nanda can't really say anything publicly that's capable of changing my mind, for various confidentiality and political reasons, but, like, that's my crux.
(convincing me in more general terms "Ray, you're too pessimistic about org culture" would hypothetically somehow work, but you have a lot of work to do given how thoroughly those pessimistic predictions came true about OpenAI)
I think Anthropic also has this problem, but it is close enough to the threshold of almost-aligned leadership and actually-pretty-aligned people that it feels at least possible to me for them to fix it. The main things that would persuade me that they are over the critical threshold are if they publicly spent social capital on clearly spelling out why the x-risk problem is hard, and made explicit plans to not merely pause for a bit when they hit an RSP threshold, but (at least in some circumstances) advocate strongly for global government shutdown for like 20+ years.
The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams
Thanks for writing the post!
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)