1 min read

1

This is a special post for quick takes by Mark Xu. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
34 comments, sorted by Click to highlight new comments since:
[-]Mark Xu6833

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!

For this post, my definitions are roughly:

  • AI alignment is the task of ensuring the AIs “do what you want them to do”
  • AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)

Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don't update about what my words imply about my beliefs.):

  • Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned to they need to be to use them for stuff”, so we should also focus on the 2nd thing.
  • If you were a hedge fund, and your strategy for preventing people from stealing your data was and starting new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also make it harder for them to defect.
  • I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
  • Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausibly looking scenarios and try to “game them out” under certain plausibly seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
    • Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
    • It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
  • If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
  • The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solve alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
  • There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
[-]Wei Dai2439

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify parts.

Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn't.

So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I'm afraid that some people just don't feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.

BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don't think that's a good plan, this plan seems even worse.

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

[...]

Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.

It currently seems unlikely to me that marginal AI control research I'm excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.

I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).

My understanding is that your concerns on downside are maybe some combination of:

  1. AI control research will be economically very useful (e.g. accelerates AI by 6 months conditional on 5 year timelines).
  2. Control will result in irrational risk compensation style actions where reducing earlier risks provides false confort about later risks (as control isn't scalable) and this will ultimately make the situation worse.
  3. AI control prevents earlier warning shots which would have created more will for costly actions that save us later. (I'm not sure if this is actually your concern, though some statements pattern matched to this.)

I'm very skeptical of (1) for marginal control research. It feelsl like this has to imply incredibly strong levels of inefficiency and usefulness of control research. Perhaps you're imagining a smaller acceleration effect, but think the benefits are also tiny relative to this small acceleration?

I'm somewhat sympathetic to things being somewhat inefficient such that if the field of people who might do control research instead did capabilities, that would be a notable acceleration (maybe 5%), but it seems like control research should be a massive reduction in acceleration relative to this.

On (2), this sounds like a weird intermediate level of competence where they are actually sensitive to full on scheming risks at an earlier point (when control works), but end up thinking this will scale despite it being relatively obvious it won't. I agree that control might end up being an excuse for scaling, but I don't think that it will be a counterfactual excuse (some other excuse would have been found) and I think we should just make sure to argue on the merits at the point.

I do think that (2) points toward focusing on advancing evaluation methodology more than techniques such that it would be relatively clear when things fail if people are being unreasonable and it is relatively easy to call out unreasonableness.

I think not doing control due to (3) can put in you in pretty scary situations: how can we know when you'll get scary warning shots vs full AI takeover? Further, AIs presumably want to avoid warning shots though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level I don't love arguments of the form "actually, having bad things happen will actually be good, so we shouldn't try to prevent bad things which are pretty close to the actual bad things we're worried about". All that said, I'm sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn't apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)


When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs.

As in, your view is that:

  • Greatly accelerating all alignment work which isn't somewhat philosophically/conceptually confusing won't be very useful for solving alignment. (Because of conceptual bottlenecks.)
  • Philosophy/conceptual work is sufficiently hard to check when an AI is trying to screw you over that control can't force scheming AIs to notably help with this work. (Note that even if evaluation in this regime is much harder than generation, that isn't clearly a blocker, because we potentially only have a to supervise a small subset of outputs.) (Note that this is distinct from the view that AIs won't be capable enough to help with this work as is currently the case.)

I think there is a lot of very helpful alignment work which isn't conceptually bottlenecked and agendas with fewer conceptual barriers could go very far.

Further, I'm skeptical about evaluation being so hard.

Beyond that, I'm only imaging part of the theory of change of control work being to "solve alignment" or work on alignment.

Alternatives:

  • Buy time by both making it cheaper for actors to stall and also making risks better understood[1].
  • Also buy time by pushing AI takeover later in time.
  • Work on alternative exit conditions like emulated minds (difficult but maybe doable).

Additional time with very powerful AIs seems useful for studying them even if we can't get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more close actors and more incentive), but it might also be radically easier to make the case for pausing relative to today. So, it overall seems easier to arrange and time at the brink also seems much more useful all around.

(I also think that pausing looks less good if it isn't part of a well implemented, thoughtful and extremely high effort approach which e.g. actively intervenes with hardware and is aggressive on international efforts. So, I'm not very into pausing "for the wrong reasons".)


(Edited to add)

I'd like to argue that there is a lot of helpful stuff which isn't conceptually bottlenecked.

Concretely, let's imagine targeting aligning AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we're happy to totally defer to them on really tricky questions like "what should be our high level approach for handling risks for further AI systems"[2]. I think this is a pretty good target for a rational actor, then you can actually hand off decision making to such AIs. (They might decides to work on alignment of smarter systems or pursue totally different routes.)

It seems difficult though not impossible that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high level interp (both white and black box) to get a bit of a sense for what is going on in AI systems in general, and verifying that AIs seem to do a good job in held out conceptual domains where we happen to know the answers). And, I think prosaic approaches with huge amounts labor could also substantially increase chances of hitting this target (e.g. 4x risk reduction) even if we can't hit this level of confidence.

This doesn't really look like "solve alignment", but in practice it reduces risk a bunch.


  1. It's also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I'm skepitical. ↩︎

  2. The AIs might ask us questions or whatever to figure out our preferences. ↩︎

  3. Obviously, 95% confidence is objectively unreasonable, but it is notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment. ↩︎

People interested in a discussion about control with someone who is maybe closer to Wei Dai's perspective might be interested in my dialogue with habyrka.

My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.

I seperately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).

The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solve alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.

Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?

I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem. 

When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven't seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.

I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.

(I go into this and various related things in my dialogue with Ryan on control)

FWIW, I don't think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I've seriously considered doing the latter).

(I do feel more optimistic than you about being able to use controlled AIs for something useful.)

@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is. 

Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.

On (a) and (b), we describe this at a high level here.

We don't really have anything written on (c) or (d). (c) really depends a lot on effort, so I'd probably prefer to talk mostly about (d) including what evalutions would be needed at various points etc.

For (a), I think we potentially care about all of:

  1. Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren't higher due to human bottlenecks). (On a nearcast, we'd expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
  2. Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I'd guess notably, but not wildly weaker and compensating in various ways.) Such systems likely some domains/properties in which they are much better than any human or nearly any human.
  3. Systems which are qualitatively smarter than any human by a small amount.

It's likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.

On (b) we plan on talking more about this soon. (Buck's recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)

Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.

Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.

yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.

quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”

This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.

A couple other points in a similar direction:

  • A thing I noticed in our discussion on the model delta with Christiano post: your criterion for useful AI safety work seems to be roughly "this will contribute some marginal value" as opposed to "this will address a bottleneck".
  • Right at the top of this thread, you say: "I think Redwood Research’s recent work on AI control really 'hits it out of the park', and they have identified a tractable and neglected intervention that can make AI go a lot better". Note what's conspicuously missing there: tractable and neglected, but you don't claim importance.

I would say that your mindset, when approaching AI safety, seems to be an epsilon fallacy.

Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80/20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it's approximately useless.

(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it's approximately useless.)

Of importance, tractability and neglectedness, importance is the most important. In practice, it is usually better to have a thousand people trying to solve a major bottleneck each with low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.

I agree it is better work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.

I think I disagree with your model of importance. If your goal is the make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.

The "epsilon fallacy" can be committed in both directions: both in that any negative dervative is worth working on, and that any extremely large number is worth taking a chance to try to improve.

I also seperately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a "bottleneck" is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlnecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by "thinking", as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by "derisking" locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location if it fails. (kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of "deceptive alignment" as a possible way to get pessimal generalization, and thus a proabalistic "bottleneck" to various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment related problems (although my day job consists of trying to get "benign generalization" our of ML, and thus does infact address that particular bottleneck imo).

I also seperately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) is doing, and fully support such activities, although think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the "natural latents" approach seems likely to shed at some light on whats going on with e.g. the ability of agents to communicate at all.

I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.

Hiliariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.

I directionally agree with this (and think it's good to write about this more, strongly upvoted!)

For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:

  1. "ensuring that if the AIs are not aligned [...], then you are still OK" (which I think is the main meaning of "AI control")
  2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood's existing work on control).

I think 2. is arguably the most promising strategy for 1., but I've occasionally noticed myself conflating them more than I should.

1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)

I also think (non-confidently) that 2. is what's really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.

But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!

I agree and regret focusing as much as we did 2 in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ...

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).

Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots).

Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me.

Agreed. (And I made a comment to this effect on Mark's draft of this short form.) I don't think it is that productive to classify research directions, the central thing we're trying to point at is a classification for reasons why you might think that you're fine where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).

I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)

The claims I'd like to make is:

  • There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I'd guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like "get AIs to be more useful at alignment research and get better at DC eval methodology" which should also be subsidized from a control perspective but seem non-central.)
  • When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should consider sometimes explicitly think about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
  • People thinking about safety plans for near term powerful AI should strongly consider control.

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

You're potentially thinking about footnote 4 in our post on control:

As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you're about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk.

This footnote is a bit confusingly worded, but I think the situation we were trying to say is "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control."

This is different than research in the run up.


I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions and some interventions don't nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work which isn't well classified as either and you can work on high level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)

I would suggest 50% of researchers working on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment). 

This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?

[-]Mark Xu4024

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

  • Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
  • Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or where ever they're renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
  • ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".
  • GDM only recently started a bay area based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
  • I've heard that many safety researchers join ANT without considering working for GDM, which seems like an error, although I don't have 1st hand evidence for this being true.
  • ANT vs GDM is probably a less important consideration than “scaling lab” (ANT, OAI, GMD, XAI, etc.) vs “non scaling lab” (USAISIUKAISIRedwoodARCPalisadeMETRMATS, etc. (so many...)). I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” [edit: I mean viewed as corrupted  by the broader world in situations where e.g. there is a non-existential AI disaster or there is rising dislike of the way AI is being handled by coorperations more broadly, e.g. similar to how working for an oil company might result in various climate people thinking you're corrupted, even if you were trying to get the oil company to reduce emissions, etc. I personally do not think GDM or ANT safety people are "corrupted"] (in addition to strengthening them, which I expect people to spend more time thinking about by default).
  • Because ANT has a stronger safety culture, doing safety at GDM involve more politics and navigating around buerearcracy, and thus might be less productive. This consideration applies most if you think the impact of your work is mostly through the object level research you do, which I think is possible but not that plausible. 

(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)

[-]kave112

ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".

I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.

Some possible counterpoints:

  • Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback.
  • My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I'm not positive of this).
  • I think the object-level research productivity concerns probably dominate, but if you're thinking about influence instead, it's still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.

Great post. I'm on GDM's new AI safety and alignment team in the Bay Area and hope readers will consider joining us!

I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”


What evidence is there that working at a scaling lab risks creating a "corrupted" perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:

  • Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute.
  • Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute.
  • Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).

I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is percieved as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).

I think two major cruxes for me here are:

  • is it actually tractable to affect Deepmind's culture and organizational decisionmaking
  • how close to the threshold is Anthropic for having a good enough safety culture?

My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone. 

I think people should be hesitant to work at any scaling lab, but, I think Anthropic might be possible to make "the one actually good scaling lab", and I don't currently expect that to be tractable at Deepmind and I think "having at least one" seems good for the world (although it's a bit hard for me to articulate why at the moment)

I am interested in hearing details about Deepmind that anyone thinks should change my mind about this.

This viewpoint is based on having spent at least 10s of hours trying to learn and about influence both org's culture, at various times.

In both cases, I don't get the sense that people at the orgs really have a visceral sense that "decisionmaking processes can be fake", I think they will be fake by default and the org is better modeled as following general incentives, and DeepMind has too many moving people and moving parts at a low enough density that it doesn't seem possible to fix. For me to change my mind about this, I would need to someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume it's hard for @Rohin Shah and @Neel Nanda can't really say anything publicly that's capable of changing my mind for various confidentiality and political reasons, but, like, that's my crux.

(conving me in more general terms "Ray, you're too pessimistic about org culture" would hypothetically somehow work, but, you have a lot of work to do given how thoroughly those pessimistic predictions came true about OpenAi)

I think Anthropic also has this problem, but the threshold of almost-aligned-leadership and actually-pretty-aligned people that it feels at least possible to me for the to fix it. The main things that would persuade me that they are over the critical threshold is if they publicly spent social capital on clearly spelling out why the x-risk problem is hard, and made explicit plans to not merely pause for a bit when they hit an RSP threshold, but (at least in some circumstances) advocate strongly for global government shutdown for like 20+ years.

The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams

Thanks for writing the post!

> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”

Does this mean something like:

1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or
 

2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon

or something else entirely?

Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)