IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) poor marketing to the ML community, (2) a lack of benchmarks and datasets, (3) the need to use human subjects in experiments, and (4) the decent amount of compute required, which was out of reach, perhaps until recently.
As someone who has really not been a fan of a lot of the recent conversations on LessWrong that you mentioned, I thought this was substantially better in an actually productive way with some really good analysis.
Also, if you or anyone else has a good concrete idea along these lines, feel free to reach out to me and I can help you get support, funding, etc. if I think the idea is a good one.
(Moderation note: added to the Alignment Forum from LessWrong.)
I'd be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you're not a fan of?
My guess is that Evan dislikes the apocalyptic/panicky conversations that people have recently been having on LessWrong.
That's my guess also, but I'm more asking just in case that's not the case, and he disagrees with (for example) the Pragmatic AI Safety sequence, in which case I'd like to know why.
I was referring to stuff like this, this, and this.
I haven't finished it yet, but I've so far very much enjoyed the Pragmatic AI Safety sequence, though I certainly have disagreements with it.
Getting the AI industry to take AI risk seriously is a necessary and sufficient condition for survival.
I'm going to play devil's advocate against this claim.
First: the AI industry taking AI risk seriously is not obviously a sufficient condition for survival. In the long run, the hard technical problems would still have to be solved in order to safely use AI. And there would still be a timer before someone built an unsafe AI: FLOPs would presumably still keep getting cheaper, and publicly-available algorithms and insights would keep accumulating (even if somewhat less quickly). Even with the whole AI industry on board, sooner or later some hacker would build an unsafe AI in their basement.
Getting the whole AI industry on board would buy time. It would not, in itself, be a win condition.
Second: getting the AI industry to take AI risk seriously is not obviously a necessary condition. It is necessary that people working on alignment have a capabilities lead. However, as you mention in the post:
Moreover, I don't think alignment and capabilities are orthogonal. I think they're very much positively correlated.
It is true that today's alignment researchers do not have any significant capabilities edge (or at least aren't showing it). But today's alignment researchers are also not even close to solving the alignment problem. I expect that an alignment research group which was able to solve the hard parts of alignment would also be far ahead of the mainstream on capabilities, because the two are so strongly correlated. I very much doubt that one could figure out how to robustly align general intelligence without also figuring out how to build it efficiently.
Strong positive correlation between alignment and capabilities research problems means that non-alignment researchers win the capabilities race mainly in worlds where the alignment researchers aren't able to solve the alignment problem anyway.
Small request: given that it's plausible that a bunch of LW material on this topic will end up quoted out of context, would you mind changing the headline example in section 5 to something less bad-if-quoted-out-of-context?
There are some good thoughts here; I like this enough that I am going to comment on the effective strategies angle. You state that
The wider AI research community is an almost-optimal engine of apocalypse.
and
AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.
I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace of progress in alignment research.
This level of disagreement, which is almost inevitable as it involves estimates about the future, has important implications for the problem of convincing people:
As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced.
I'd say that you would indeed be facing an uphill battle, if you'd want to convince most researchers and managers that the recent late-stage Yudkowsky estimates about the inevitability of an AI apocalypse are correct.
The effective framing you are looking for, even if you believe yourself that Yudkowsky is fully correct, is that more work is needed on reducing long-term AI risks. Researchers and managers in the AI industry might agree with you on that, even if they disagree with you and Yudkowsky about other things.
Whether these researchers and managers will change their whole career just because they agree with you is a different matter. Most will not. This is a separate problem, and should be treated as such. Trying to solve both problems at once by making people deeply afraid about the AI apocalypse is a losing strategy.
The wider AI research community is an almost-optimal engine of apocalypse. The primary metric of a paper's success is how much it improves capabilities along concrete metrics, publish-or-perish dynamics supercharge that, the safety side of things is neglected to the tune of a 1:49 ratio of safety to other research, and most results are made public so as to give everyone else in the world a fair shot at ending it too.
It doesn't have to be this way. The overwhelming majority of the people involved do not actually want to end the world. There must exist an equilibrium in which their intentions match their actions.
Even fractionally shifting the status quo towards that equilibrium would have massive pay-offs, as far as timelines are concerned. Fully overturning it may well constitute a sufficient condition for humanity's survival. Yet I've seen precious little work done in this area, compared to the technical questions of AI alignment. It seems to be picking up in recent months, though — and I'm happy to contribute.
This post is an attempt at a comprehensive high-level overview of the tactical and strategic options available to us.
1. Rationale
Why is it important? Why is it crucial?
First. We, uh, need to make sure that if we figure alignment out people actually implement it. Like, imagine that tomorrow someone comes up with a clever hack that robustly solves the alignment problem... but it increases the compute necessary to train any given ML model by 10%, or it's a bit tricky to implement, or something. Does the wider AI community universally adopt that solution? Or do they ignore it? Or do the industry leaders, after we extensively campaign, pinky-swear to use it the moment they start training models they feel might actually pose a threat, then predictably and fatally misjudge that?
In other words: When the time comes, we'll need to convince people that safety is important enough to fuss around a bit for its sake. But if we can't convince them to do that now, what would change then?
I suppose having a concrete solution, instead of vague prophecies of doom, would give us more credibility. But... would it, really? And what if we won't have a concrete solution even then, just a bunch of weird heuristics that may nonetheless measurably improve our odds?
The latter seems reasonably likely, too. As this excellent post points out, most of the contemporary deep-learning progress comes from "messy bottom-up atheoretical empirical tinkering". If AGI comes from DL, it's plausible that, even if we arrive at the solution to alignment from mathematical foundations, the actual implementation will take the form of messy hacks. Ones that will probably need to be fine-tuned for any given model architecture. And given the no-fire-alarm principle[1], to be safe, we'll need to ensure that any sufficiently big model is only run with these hacks built-in.
If any given AI researcher is still not taking alignment seriously by then, how will we make them bother with all of that every time they run an experiment? How will we ensure they don't half-ass it even once?
Second. If that figure I quoted, the 49:1 ratio, is even remotely correct, there's plenty of room to upscale our research efforts. Imagine if every researcher started spending 2% more of their time thinking about alignment. That'd double the researcher-hours spent on the problem!
Which doesn't directly translate into 2x progress, I'll grant. Given the field's pre-paradigmatic status, the returns to scale might be relatively small... but by no means negligible. Even if we don't necessarily have research directions clearly outlined, having many more people stumbling around in the dark still increases the chances of bumping into something useful.
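The doubling claim above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: it reads the 1:49 figure as a split of total researcher-hours, assumes every researcher works the same number of hours, and reads "2% more of their time" as two percentage points of each researcher's total time. The function names are hypothetical, chosen for illustration.

```python
from fractions import Fraction


def safety_share(safety_parts: int, other_parts: int) -> Fraction:
    """Fraction of all researcher-hours spent on safety, given a safety:other ratio."""
    return Fraction(safety_parts, safety_parts + other_parts)


def scale_up(baseline: Fraction, extra_points: Fraction) -> Fraction:
    """Growth factor in safety hours if everyone shifts `extra_points` of their total time to safety.

    Assumption: all researchers work equal hours, so shares of time add directly.
    """
    return (baseline + extra_points) / baseline


# Assumed reading of the post's figure: 1 part safety to 49 parts everything else.
baseline = safety_share(1, 49)
# Assumed reading of "2% more of their time": two percentage points of total time.
factor = scale_up(baseline, Fraction(2, 100))

print(f"baseline safety share: {baseline}")
print(f"growth factor in safety researcher-hours: {factor}")
```

Under those assumptions the baseline share is 1/50 and the growth factor is exactly 2. This covers researcher-hours only and says nothing about how much progress those hours buy.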
Another argument I've seen is that upscaling may increase the capabilities gain too much. I don't find this convincing:
(The progress on mechanistic interpretability, in particular, can absolutely be usefully upscaled. One of the main bottlenecks there is understanding individual circuits, which is a (relatively) low-skill task that's 1) mainly bottlenecked by researcher-hours, not top-researcher-hours, 2) can be easily parallelized. It's probably not the only research direction like this, too — we just need to look for them.)
Third, slowing progress down. This one is straightforward enough, I suppose.
2. Existing Work
I'm sure I've missed a lot of things, but this seems like a good start.
3. Types of Interventions
I would broadly categorize them into the following:
If you feel that any of this is very ill-advised or icky, GOTO 6.
3.0. Effective Strategies for Changing Public Opinion
The titular paper is very relevant here. I'll summarize a few points.
It should really go without saying, but epistemic hygiene stays important here. We're not "aiming to explain, not persuade" anymore, we're very much aiming to persuade, but that in itself is not unethical. It's no excuse to slip into an arguments-are-soldiers employ-all-the-fallacies lie-your-head-off mindset. (If you need a "pragmatic" reason not to do that, GOTO 5.)
3.1. Straightforward Appeals to Insiders
As per above, we'd be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren't convinced. Arguments that recontextualize AI risk, AI, or existential risks in general, are likely to be more effective than attempts to tell them things they already know. They're more likely to misprioritize safety, rather than be totally ignorant of the need for it.
An in-depth overview of the best ways to craft an argument is beyond the scope of this post (though this might be a good place to look). Two important meta-principles to keep in mind:
There are two tacks to take here: macro-level and micro-level.
Macro. Broad appeals to the entire industry, with the aim of changing the agreed-upon social reality, de-stigmatizing AI Safety, and so on. Concrete projects may look like this.
Micro. Targeted efforts to convince industry leaders. As per, 50% of AI progress is made by fewer than 50% of the researchers; orders of magnitude fewer than that. Similarly, getting the leadership of DeepMind and OpenAI fully on our side would have an outsized impact. In theory, a project here may go all the way down to "find effective arguments to convince this specific very important person" levels of fidelity.
I'm more optimistic about the second tack, and generally about activism that has precise focused short-term objectives whose success or failure can be clearly evaluated, and which we can quickly iterate on.
One of the flaws of the "micro" approach is that our victories may be washed away by a paradigm shift. Most of the top GOFAI researchers didn't keep their positions into the ML era, and the top ML researchers may not survive into the next one. I expect this isn't much of a problem, though. If we manage to convince the leading researchers, their views should quickly trickle down to the rest of the field, and the field's structure is likely to survive an upheaval.
3.2. Sideways Appeals to Insiders
There's one dimension along which we can broaden our standards for persuasion.
When trying to influence people — either individually or en masse — we usually argue that addressing existential risks is necessary because, duh, the looming end of humanity. The importance of that work should be self-evident to any moral person. They'd agree with us if we can only make them recognize the existential threat for the existential threat it is. No, it isn't just sci-fi! Yes, working out these weird math problems really can save the world!
But it's not the only reason someone might decide to work on AI Safety. People's career choices are motivated by all kinds of things:
And so on. This approach is relatively neglected, I suspect, because it's steeped in ulterior motives. There's a very specific reason we'd be making these arguments, and it's not because we want our interlocutors to have fun/get rich/etc.
But there's no reason we can't be open about these motives, which would take all under-handedness out of it. And there are many legitimate reasons to prefer AI Safety over "mainstream" capabilities research:
Overall, I don't expect this approach to work on the top-tier researchers, for obvious reasons. But it might work to attract the people entering the field, potentially en masse. It might also work as a good supplement to straightforward appeals: when we're trying to convince someone not to work on X, it's good to have a ready offer for what to do instead. Even better if that offer is more lucrative than their current job.
Another thing to keep in mind: trivial inconveniences. Making entering or transitioning to AI Safety 10% easier might have disproportionate effects, like doubling the number of entrants.
Concrete projects in this area may involve creating organizations working on reducing AI risk that make competitive job offers, providing funding or career assistance to ML specialists, explaining how to start, advertising personally appealing features of working on AI risk, encouraging people to have fun, and lowering the barrier to entry by distilling research. On that note, Chris Olah's Distill project is also a good example of a "macro-level" intervention of this type, though it's on hiatus/potentially failed.
3.3. Appeals to Outsiders
Any effective work along this dimension requires answering an exciting question: how do you put out a flame using a flamethrower?
Perhaps that's a bit harsh. Perhaps even counter-productively harsh, given my previous calls for treating audiences with respect. But let's not kid ourselves: we've seen how the world handled COVID-19.
An initiative that pushes for X might convince people or governments to do anti-X instead. If we convince them to do X after all, they might do extremely ineffective things that accomplish nothing, or even somehow do things that actually make anti-X happen. And conversely, activism completely unrelated to X might make it happen!
Good news, though: COVID-19 has shown us just how badly things are broken. Keeping the Simulacra Levels and the autopsies of the failures in mind, it might be possible to find interventions that have the effects we want.
That's explicitly what we'd be doing, though: deciding what effect we want to cause, then searching for an action that would cause it, once propagated through the broken pathways of our civilization. For that reason, I'm not making the distinction between "straightforward" and "sideways" appeals here: surface-level efforts to achieve something aren't strongly correlated with that thing happening, even given their surface-level success. All appeals are sideways appeals.
Having a good model of realpolitik is a necessity here.
The general principles of "know your audience" and "maintain epistemic hygiene" still apply, though. The inference gap is much larger, but that has its advantages: direct persuasion would be more effective, on average.
Useful consequences in this area may include:
Again, causing these consequences is not as simple as pushing for them on the object-level. Lobbying for laws that regulate AI is likely to lead to poorly-targeted bans with lots of loopholes that just burn our political capital; mandating AI Safety oversight is useless if the safety committees will consist of yes-men; trying to rile up the public might well see their ire redirected our way.
Some inspirations here might be what happened to nuclear energy or human cloning, or how the woke movement managed to infiltrate the corporate/academic culture. Anti-corporate movements might be good allies of convenience here in general. An anti-inspiration, what not to do, is what's happened to cryonics.
Warning: This entire space of interventions has elevated levels of background radiation. Most interventions of this type are deceptively ineffective, and if you're aiming for impact first and foremost, it'd be very easy to slip into bad epistemic habits or unethical practices.
Moreover, it's necessarily antagonistic with regards to AI researchers and major AI labs. Any success here would worsen the public or legal landscape for them, and they'd be less likely to listen to straightforward appeals from us.
I've not completely despaired of this approach; the payoffs are significant, and I'm sure there are some interventions that are at once effective and reliable and ethical and only scorch our bridges. Furthermore, if straightforward appeals and other cooperative ideas won't work, pivoting to this is an obvious Plan B.
As far as directly useful consequences are concerned, I would emphasize well-targeted interventions with easily evaluated victory/failure conditions, even more so than in 3.1.
However, the best plays here might be tactically useful interventions, aimed at what Pragmatic AI Safety calls diffuse factors. Such interventions don't directly decrease AI risk, but they create favourable conditions for other interventions. E. g., causing the public to be vaguely concerned about AI won't directly help, but a world in which the public is concerned is a world in which we're better positioned to influence the AI industry in other ways. (I'll come back to that in Part 4.)
3.4. Joining the Winning Side
In some sense, the easiest way to accomplish our goal is not to try to change the AI industry's incentive structures, but to ride them. The industry as a whole is agnostic with regards to alignment. It cares about:
Current alignment research directions are none of these things. Progress is difficult to measure, the problem doesn't cleanly break down into sub-problems you could work on in isolation, and the results don't translate to e. g. more powerful and marketable ML models. Reducing the alignment problem to this sort of "digestible" form is non-trivial — that's the whole problem with our lack of an established paradigm.
But. I think there are certain potential avenues of alignment research that are relatively amenable to being transformed into a form the AI industry would find digestible, while also being pretty effective and practical ways to make progress on alignment.
And herein lies my disagreement with Pragmatic AI Safety. They suggest that alignment research should aim for zero capabilities advancement. I think the metric to keep in mind here, instead, is expected return on capabilities for an hour of research.
It pays to play along with the current industry incentives, most notably the profit motive. In addition, the idea of aiming for minimal capabilities externalities seems deeply unnatural to me for other reasons:
All in all, I'm optimistic about the existence of research projects that are at once 1) quickly reachable, 2) would find traction with the current status quo of the AI industry, 3) efficiently progress alignment. Crucially, finding such a research direction would nearly guarantee that the alignment solution is implemented in any future major model.
The most obvious candidate is mechanistic interpretability, of course (and it's no coincidence that it seems to be the most popular AI Safety direction outside our circle), but I have a few other ideas that I hope to post soon.
3.5. Influencing the Research Culture
All of the other approaches attempt to influence the AI industry through intermediaries: through the research projects it pursues, through the people it's implemented on, through the wider social environment it's embedded in. But perhaps there is room for a more direct intervention?
The industry is a social construct. The qualities that make a project a good one, the tastes the researchers have, the incentives they operate under — all of this is, to some extent, arbitrary. It has a ground-truth component, but the current configuration is not uniquely determined by the ground truth of the research subject. Rather, it's defined by weights that this social construct currently assigns to different features of the ground truth.
The current AI industry prefers tinkering to empiricism, and capabilities to safety. How can we shift this?
There have been two proposals that I've already mentioned:
I think something like 2) is worth implementing. I'm unclear on how to evaluate 1); I'm guessing mechanistic interpretability just hasn't progressed that far yet. If we generalize from those two, though...
We want to synthesize a construct C with the following properties:
What is a fact such that, if the researchers knew of it and knew other researchers or the general public or the grant-makers also knew of it, would make them pursue alignment research? What is a fact that would change the sociopolitical landscape such that the incentives shift, even fractionally, towards the things we want?
"Sufficiently powerful non-aligned AI is going to end the world and aligning AI is difficult" as C evidently doesn't satisfy the second criterion. Logan's idea is to force it into the industry's social reality by a monetary injection. Chris Olah's idea has a more niche definition of AI risk as C, but either 2) is still difficult, or 3) didn't work out.
Both approaches take C for granted, then attempt to find paths to satisfying 2). I think there's some promise in directly mining the concept-space for a C that'd have the desired properties "out of the box". It's essentially the "sideways appeals" approach writ large.
I suspect studying the history of science for cases where research tastes/standards changed would be useful here. An example of such a shift might be the replication crisis in psychology (in the broadest of terms).
I suspect this approach hasn't been exhausted, given that it's pretty non-intuitive.
I suspect this idea is overly clever in a very stupid way and will not actually work.
Still, if it does work, there are probably some low-hanging fruits there. And just intuitively... If you go looking for a reason, you generally find one, don't you?
4. What You Can Do
The logistics graph that leads to a superintelligent AI's deployment has many bottlenecks, and controlling any one of them would be sufficient. Taking over the researcher supply, or the money supply, or the compute supply, or the research project supply, or the reputation supply, or the supply of any other crucial resource I'm not thinking of, would ensure excellent conditions for a safe advanced AI to emerge.
But the path to this doesn't look like a concentrated push along the corresponding dimension. As Pragmatic AI Safety points out, diversification is key.
There are interdependencies everywhere: success at one thing affects the probabilities of success of all other projects. Finding an appealing research direction would make it easier to attract people our way. Putting social pressure on major AI labs would make safety-adjacent research directions more appealing. Shifting research tastes in a subfield would make it easier to change people's minds. And so on.
Moreover, it's not obvious what bottleneck would be the easiest to gain control of, without the benefit of hindsight. Future events and novel discoveries may shift any part of the landscape in unpredictable ways, open or close doors for us.
Improving AI Safety's future position means pursuing a strategy that is robust to such random environmental fluctuations. It means maximizing our far-away action space. We need to have a diversified portfolio of plans; we need to be improving our position all across the board, always looking for what new opportunities have arisen.
In theory, it would be great to have central coordination. Some organization or resource which tracks the feasibility of various interventions across the entire gameboard, and pursues/recommends those that move the gameboard into the most advantageous states while spending the least resources, and also you should put me in charge.
In practice, this sort of coordination is both difficult and fragile, with a single point of failure. We're not a single organization, either, but a diverse conglomerate of organizations, movements, groups and individuals.
But we can approximate central coordination.
It's often pointed out that impact in the modern world has a tail-heavy distribution. In some areas, it's effective to have many separate groups putting their full strength behind diverse high-variance projects. Many of them will fail, but some will succeed massively.
The project of advancing AI Safety is, to a large extent, one such area.[3] My general advice would be as follows:
There's a caveat here, though: without central coordination, how can we ensure that none of these disparate projects trip each other up? As I'd mentioned, successfully appealing to governments might mess up our relations with AI researchers, and failed persuasion attempts (macro- and micro- both) often make subsequent ones much harder.
Some amount of that is probably inevitable. Part of it can be mitigated by trying out minimal/small-scale versions of any projects that might result in net-negative impact on failure. But a much larger chunk of it could be mitigated by ensuring that we—
5. Avoid Thermonuclear Ideas
You likely know what I'm talking about. The class of ideas that includes lying and manipulation as its most tame members, and expands to cover some much worse extremes.
I know some of these ideas may seem very clever and Appropriately Drastic, and the stakes — literally astronomical — could not be higher. We're accelerating directly into a wall, and our attempts to swerve away seem ineffectual. It may feel emotionally resonant to resolve to Stop Being Nice and Pull Out All the Stops and solve the problem in some gravely decisive fashion, By Any Means Necessary.
But it will not work in the real world, outside fantasies. It will not solve the problem in the long term, and in the meantime it will crash and burn, and hurt people, and ruin our PR, and tank the chances of other, more productive and realistic approaches. Even if you think your idea will definitely succeed, you're failing to think at scale. What would you expect to work better: a policy under which some of us pursue plans that blow up so hard they set us collectively back a few years, or a policy under which our plans only ever compound on each other's successes?
Following the first policy is a defection, not just against the rest of society, but against all our other risk-mitigation initiatives. We're better than this.
As a rule-of-thumb, you can use something like Shannon's maxim. If whatever clever plan you're considering and the entire causal chain that led to it became common knowledge, would it fail and destroy our credibility and our other plans? If yes, this is a radioactive plan, get it away.
Things that seem like ruthless pragmatism are frequently not actually ruthlessly pragmatic. They're just excuses to indulge your base instincts.
Be cool, in general. Find ways to be cool about this mess. We have resources for that and everything.
6. The Thin Line
I concur with lc's post and the people in that post's comments: we have a slight taboo against the sort of full-scale activism I'm arguing for. It's exemplified by this sort of sentiment. I suspect it's a combination of two things:
It makes sense that it exists. One of the foundations of this movement is "raising the sanity waterline" — and approaching interactions with people outside the movement with less rigor is not how you set an example. It's also easier to enforce the same norms upon yourself and each other in all situations, instead of switching between different sets depending on context.
I'm tempted to say that we've overcorrected here; that we can or must relax our standards somewhat, in the light of shortening timelines and in the face of our slow progress.
But I'm not sure. Relaxing the standards is absolutely a slippery slope. This decision might be "meta-radioactive", in the sense that it will see us accelerating straight into the epicenter of a nuclear explosion.
I don't know how to strike the right balance here. It definitely seems like we can opt to be more effective without inching towards self-defeating Stupid Evil, but maybe the mere act of acknowledging that possibility would shift our social reality in undesirable ways? Maybe John's position is right, and we should call out our epistemically suspect behavior even as we agree that it's the right thing to do.
7. Closing Thoughts
The recent months have seen increasing amounts of alarm and doom-saying in our circles. AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace. There are optimistic voices, but the general disposition seems quite grim.
Well. If alignment is really so hard, maybe we should quit trying to solve it?
In hindsight, I'm a bit baffled that field-building wasn't our main focus this entire time. Getting the AI industry to take AI risk seriously is a necessary and sufficient condition for survival. Solving alignment by ourselves is... neither. If the technical problems are truly insurmountable in the time we have left — and I don't yet know that they are, but I can certainly imagine it — we should just shift our focus to social-based solutions.
The goal, I should note, is not outreach. Convincing a few, or many, AI researchers to switch to alignment won't solve the problem where we have a multi-billion dollar industry stockpiling uranium in the hopes of spontaneously assembling a nuclear reactor. The aim should be to shift that status quo. Changing people's minds is a fine instrumental goal, but the terminal one is to influence the robust agent-agnostic process itself.
I'd like to suggest that there might be a snowball effect involved — that a 10% progress at this task would make the subsequent 90% easier, and so on. There might, indeed, be. I'm not that optimistic, though. I expect it'll be an uphill battle all the while, because the sort of carefulness we'd like to cultivate has the tendency to rot away, as organizations become corrupted and people value-drift.
It's possible that this is also impossible. That we can't change the AI industry in time, any more than we can independently solve alignment in time. But it seems less impossible to me.
And if we keep looking for approaches that are less and less impossible, perhaps we'll find one that isn't impossible at all.
[1] Which may or may not have been recently confirmed by this.
[2] Very important. See point 7 here.
[3] But not AI Safety itself, of course, only the project of spreading it. AI is very much Scott's Distribution 1, and the fact that our civilization is treating it as a Distribution 2 is the entire bloody problem.