I call "alignment strategy" the high-level approach to solving the technical problem[1]. For example, value learning is one strategy, while delegating alignment research to AI is another. I call "alignment metastrategy" the high-level approach to converging on solving the technical problem in a manner which is timely and effective. (Examples will follow.)
In a previous article, I summarized my criticism of prosaic alignment. However, my analysis of the associated metastrategy was too sloppy. I will attempt to somewhat remedy that here, and also briefly discuss other metastrategies, to serve as points of contrast and comparison.
The conservative metastrategy follows the following algorithm:
This is my own favorite metastrategy. The main reason it can fail is if the unconservative research we failed to stop creates unaligned TAI before we can deploy an AI defense system (currently, we have a long way to go to complete step 2).
I think that it's pretty clear that a competent civilization would follow this path, since it seems like the only one which leads to a good long-term outcome without taking unnecessary risks[2]. Of course, in itself that is an insufficient argument to prove that, in our actual civilization, the conservative metastrategy is the best for those concerned with AI risk. But, it is suggestive.
Beyond that, I won't lay out the case for the conservative metastrategy here. The interested reader can turn to 1 2 3 4 5.
The incrementalist metastrategy follows the following algorithm:
This is (more or less) the metastrategy favored by adherents of prosaic alignment. In particular, this is what the relatively safety-conscious actors involved with leading AI labs present as their plan.
There are 3 main mutually reinforcing problems with putting our hopes in this metastrategy, which I discuss below. There are 2 aspects to each problem: the "design" aspect, which is what would happen if the best version of the incrementalist metastrategy was implemented, and the "implementation" aspect, which is what happens in AI labs in practice (even when they claim to follow incrementalist metastrategy).
If a new AI capability is found in step 1, and the knowledge is allowed to propagate, then irresponsible actors will continue to compound it with additional advances before the alignment problem is solved on the new level. Ideally, either the new capability should remain secret at least until the entire iteration is over, or government policy should prevent any actor from subverting the metastrategy, or some sufficient combination of both needs to be in place. Implementing the above involves major institutional and political challenges, which undermines the feasibility of the entire metastrategy.
Now, one could argue that the same problem exists with the conservative metastrategy: after all, some theoretical discoveries can also be used by irresponsible actors for dangerous AI capability advances. However, when it comes to theoretical research, the problem only shows up in some cases and usually requires additional steps by irresponsible actors (going from theory to practice). Moreover, inasmuch as dangerous AI capability advances only happen when they're grounded at solid theory, there is more hope that the risks will be transparent, predictable and avoidable.
Hence, while the conservative metastrategy also calls for regulation against irresponsible actors, the entire approach allows for graceful degradation when this part fails. On the other hand, the incrementalist metastrategy calls for working directly on the sort of research which presents the most risk: opaque AI designs discovered by trial-and-error and selected only for their capabilities. This is comparable to the dubious rationale behind gain-of-function research.
In practice, even "safety conscious" AI labs share a lot of technical details about their research. Even when they don't, their employees are not selected for outstanding trustworthiness and understanding of the risks. Many of them might ultimately continue to other organizations and disseminate the knowledge, even when nominally bound by NDA[3]. I also doubt they maintain a very high standard of cybersecurity. All this is in sharp contrast to organizations which actually take secrecy seriously (military and government bodies dealing with classified information).
See also: Six Dimensions of Operational Adequacy
Step 3 in the incrementalist metastrategy requires validating alignment. But, how can we tell whether an AI is aligned? What does it even mean for it to be aligned? One possibility is the economic criterion: "the AI is aligned when it behaves close enough to the designer's intent to be a profitable product". This must be appealing to company executives. However, under this definition there might be abrupt changes in alignment as a function of capability progress (or just time) that undermine the entire case for incrementalism.
The latter is true even if capability progress is perfectly smooth. For example, imagine agentic AIs that go along with our wishes because they know that they are within our power, right until they aren't. We gradually hand over the entire economy to AIs, then one day the AIs decide it's time to get rid of the humans. Even if some problems become apparent in advance, at some point it will be very hard to stop/slow the tide in the face of economic incentives.
Even with the best intentions, measuring alignment is hard. One difficulty is deceptive alignment: agentic AI that strategically searches for ways to appear aligned because this appearance is correlated with utility (for obvious reasons) under its internal world-model. But even if the AI is not deceiving us on purpose, we might still select an AI design whose behavior locally appears aligned but internal reasoning is very different from what we expect, or we can end up with behaviors that are too complicated to interpret. See Wentworth's article for more discussion.
A satisfactory solution to the measurement problem requires interpretability tools that can detect any unintended motivation inside the AIs reasoning, before you can complete even a single iteration. However, we are far from having such tools. Moreover, it's not clear for what kind of AI designs such tools exist even in principle. Worse, even if we had candidate tools, how would we know they work correctly? Without either a theoretical foundation or an alternative reliable method of measurement?
AFAICT, AI labs just follow the economic criterion. We have no idea what kind of reasoning processes happen inside e.g. large language models and have no tools to determine that. When problems are found, they are patched as best as the companies know how, but some problems (jailbreaks, hallucinations) still don't have a general solution. I'm pointing this out not because these specific problems are important in themselves, but because it shows the companies don't actually follow the incrementalist metastrategy at all: despite the fact their products are demonstrably misaligned in some sense, they continue to compound capability advances.
What happens if progress in AI capabilities fails to be smooth and predictable? (Which I believe will happen.) It creates two problems:
The combination of unpredictable progress with information security failures can create an extremely dangerous situation where responsible actors need to play difficult catch-up with irresponsible actors, even when the irresponsible actors are too weak to "win" the race otherwise. This consideration is in addition to the risk of a "responsible" actor ending the world by a lab experiment.
I don't distinguish between "design" and "implementation" here, because the incrementalist metastrategy has only been defended by AI labs (even nominally) for a few years, and there haven't been many large and unexpected innovations during this time[4].
The butlerian metastrategy is: postpone TAI indefinitely. Essentially, it requires creating an AI defense system without the help of TAI[5].
This metastrategy doesn't have many advocates, but I still find it useful to have a name for it. Most metastrategies exist on a continuum where the butlerian metastrategy is an (extreme) endpoint, depending on how bullish they are on stopping irresponsible forms of AI research and for how long.
The major problems with this metastrategy are:
The transhumanist metastrategy follows the following algorithm:
This metastrategy is advocated by Eliezer Yudkowsky.
There are 3 reasons I'm pessimistic about this metastrategy:
Inasmuch as augmented humans are safe, this metastrategy has the advantage that (in contrast to incrementalism) it doesn't burn the commons by advancing AI capability. Therefore, it might make sense to work on this and the conservative approach in parallel.
There is some arbitrary choice about what constitutes "high-level". It would be more precise to say that there is are multiple layers or even a spectrum between "general direction" and "full detailed solution". For example, Physicalist Superimitation is much more detailed than just value learning, but is still insufficiently detailed for actual implementation.
Some critics doubt that the sort of mathematical theory I pursue is possible at all. To those I would point out that a competent civilization would at least make a serious attempt to create that theory before trying something else.
I think that enforcing an NDA is pretty hard: the company doesn't have an easy way to know the NDA is violated, and even if it suspects that, proving it is hard, especially given that the technical details mutate along the way.
As far as I know. If there have been but they were kept secret, then, kudos!
In general, an AI defense system can be preventative i.e. aimed at preventing unaligned AI from being created rather than disabling it after it is created. It would definitely have to be preventative in this case.