What if we just simulate a bunch of alignment researchers, and have them solve the problem for us?

Of all the Dumb Alignment Ideas, this one is easily the best. Simple argument in favor: well, it’s not going to do any worse than the researchers would have done. In other words, it will probably do at least as well as we would have done without it, and possibly better, insofar as it can run faster than realtime.

Another angle: human mimicry is a simple objective to train against, and is about as outer-aligned as the humans being mimicked. Which isn’t necessarily perfect, but it’s as aligned as our alignment researchers were going to be anyway (assuming inner alignment issues are handled, which we will indeed assume for the entirety of this post).

Those are pretty good arguments. But man, there are some subtle devils in the details.

Simulation vs Prediction

The ideal version of human mimicry is mind uploads: directly simulate our researchers in a stable, research-friendly environment for a long time.

The operationalization which people usually actually have in mind is to train an ML system to predict research outputs - e.g. I might prompt GPT for a johnswentworth post from the year 2050.

Even setting aside inner alignment issues, these two are radically different.

Generalization Problems

In order for GPT to generate a realistic johnswentworth post from the year 2050, it has to generalize way out of distribution.

… Well, ok, maybe I turn into one of those old researchers who just repeats the same things over and over again for decades, and then GPT doesn’t need to generalize way out of distribution. But in that case it isn’t very helpful to prompt for one of my posts from 2050 anyways, and we should prompt for something else instead (Thane Ruthenis has been writing great stuff lately, maybe try him?). The whole point of asking for future research write-ups is to see useful stuff we have not yet figured out; that means generalizing way out of the distribution of writing we already have.

But if the system generalizes too well out of distribution, then it correctly guesses that AGI will take over the world before 2050, and my attempt to prompt for a johnswentworth post from 2050 will instead return predicted writings from a ridiculously superhuman future AGI pretending to be johnswentworth. And those writings presumably try to influence the reader in ways which bring about the AGI’s takeover.

So in order to do useful work, our GPT-style system has to generalize out of distribution, but not too far out of distribution. We don’t know how wide the window is between generalizing enough and generalizing too much, or if the window is wide enough to be useful at all.

One thing we can guess: prompting for research outputs in the very near future is probably much safer than prompting for dates further out. johnswentworth post from 2025 is a safer prompt than johnswentworth post from 2050. The less far out of distribution we go, the safer we are. Similarly, the more likely we are to solve the alignment problem and avoid AI takeover, the less likely it is that prompting GPT for future research outputs is dangerous, and the more likely it is to work.

The closer we are to solving alignment already, and the more likely we are to make it, the less dangerous it is to predict future research outputs. In other words: predicting future research outputs can only safely buy us a relatively small number of bits; we have to already be reasonably close to surviving in order for it to work.

So What’s Different About Simulation?

Simulating researchers in a stable, research-friendly environment for a long time does not have the “predict outputs of a future AGI” problem. Why? What’s the key difference?

The key is the “stable, research-friendly environment” part. Our simulated researchers are in a simulated environment where AGI is not going to take over. It’s a counterfactual world very different from our own.

Alas, querying counterfactual worlds is fundamentally not a thing one can do simply by prompting GPT. Conceptually, prompts just do Bayesian conditioning on the modeled text distribution (i.e. condition the text on starting with the prompt); counterfactuals move us to an entirely different distribution. To generate a counterfactual query, we’d have to modify the system’s internals somehow. And in fact, there has recently been some cool work which demonstrates decent performance on counterfactual queries by modifying GPT’s internals! I don’t think it’s to the point where we could counterfact on something as complicated as “world in which AGI doesn’t take over and our alignment researchers successfully solve the problem”, and I don’t think it’s robust enough to put much weight on it yet, but the basic version of the capability does exist.

General Principle: Human Mimicry Buys A Limited Number Of Bits

Suppose GPT-12, with its vast training data and compute, internally concludes that humanity has a 1-in-32 chance of aligning/surviving AGI on our current trajectory. Then humanity would need 5 bits of optimization pressure in order to make it.

The more bits of optimization pressure humanity needs, the less likely human mimicry is to save us; we have to already be reasonably close to surviving in order for it to work. We already talked about this principle in the context of accidentally prompting GPT to return writing from a future AGI, but the principle is much more general than that.

The Weird Shit Problem

Suppose we need 20 bits of optimization pressure (i.e. on our current trajectory we have only a ~1-in-a-million chance of avoiding AGI takeover). We train GPT, and counterfact on its internals to a world where AGI doesn’t take over. But if our chances of avoiding takeover were that low (under GPT's model), then they’re probably dominated by weird shit, things which have probabilities on the order of 1-in-a-million or less. Maybe we nuke ourselves to death or get hit by a big damn asteroid. Maybe aliens decide that humanity’s AGI is about to become a problem to the rest of the galaxy and they need to take over rather than just letting us develop. Maybe time travel turns out to be a thing and weird time travel bullshit happens. Most likely it’s something weird enough that I won’t think of it.

Those weird futures vary in how safe they are to query (time travel would probably be on the very short list of things as dangerous as AGI), and in how likely they are to return anything useful at all (asteroid extinction tends to cut off blog post writing). But approximately zero of them involve our researchers just doing their research in a stable, research-friendly environment for a long time.

So when we need a lot of bits, it’s not enough to just counterfact on a high-level thing like “AGI doesn’t take over” and then let GPT pick the most probable interpretation of that world. We need pretty detailed, low-level counterfactuals.

Generalization Again

Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize. Even setting aside actively dangerous queries, our simulator/predictor has to generalize out of distribution in order to return anything useful. In practice, the system will probably only be able to generalize so far, which limits how many bits of optimization we can get from it.

Expect More Problems

We’ve now been through a few different arguments all highlighting the idea that human mimicry can only buy so many bits of optimization (future AGI problem, weird shit problem, generalization). I expect the principle to be more general than these arguments. In other words, even if we patch the specific failure modes which these particular arguments talk about, trying to pump lots of bits of optimization out of human mimicry is still likely to be dangerous in ways we have not yet realized.

This is just an instance of the general principle that optimization becomes more dangerous as we crank up the optimization power - the very principle for why AGI is dangerous-by-default in the first place.

Takeaways

This post mostly talked about the limitations of human mimicry, but I want to emphasize: this is the best of the Dumb Alignment Ideas to date. If GPT-style models reach human or superhuman general intelligence next month, and we can't realistically delay its release, and you are the person sitting in front of the prompt wondering what to do, then prompting for future alignment research is absolutely what you should do. (And start not very far in the future, and read the damn outputs before going further, they'll hopefully contain additional warnings or new plans which you should do instead of prompting for more future research.) At that point it's not very likely to work, but we don't have a better immediate plan.

Good interpretability tools can buy us more bits in two ways:

  • They potentially allow us to directly handle inner alignment issues by retargeting the search
  • They potentially allow us to counterfact on the system's internal model, so we can predict researchers working in an environment less full of threats

Insofar as the tools are available, this is the thing to aim for if AGI is imminent.

... But the general principle is that human mimicry can buy only a limited number of bits. We definitely want to have the interpretability tools to implement the best version of human mimicry we can, but at the end of the day we'll mostly improve our chances by getting closer to solving the full alignment problem ourselves.

New Comment
4 comments, sorted by Click to highlight new comments since:

The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion help a lot, because instead of simulating a process going deep into the external timeline's future, you're simulating a "groundhog day" where the researcher wakes up over and over at the same external time (more realistically, the restart time is drifting forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive probability of unaligned AI takeover in the present (i.e. during the time interval of the simulated loop), but it's a milder problem. It can be further ameliorated if the AI has enough information about the external world to make confident predictions about the possibility of unaligned takeover during this period. The out-of-distribution problem is also less severe: the AI can occasionally query the real researcher to make sure its predictions are still on track.

Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize.

I think it's useful to distinguish between the amount of optimization we ask the model to do versus the unlikelihood of the world we ask it to simulate.

For instance, I can condition on something trivial like "the weather was rainy on 8/14, sunny on 8/15, rainy on 8/16...". This specifies a very unlikely world, but so long as the pattern I specify is plausible it doesn't require much optimization on the part of the model or take me far out of distribution. There can be many, many plausible patterns like this because the weather is a chaotic system and so intrinsically has a lot of uncertainty, so there's actually a lot of room to play here.

That's a silly example, but there are more useful ones. Suppose I condition on a sequence of weather patterns (all locally plausible) that affect voter turnout in key districts such that politicians get elected who favor policies that shift the world towards super-tight regulatory regimes on AI. That let's me push down the probability that there's a malicious AI in the simulated world without requiring the model itself to perform crazy amounts of optimization.

Granted, when the model tries to figure out what this world looks like, there's a danger that it says "Huh, that's a strange pattern. I wonder if there's some master-AGI engineering the weather?" and simulates that world. That's possible, and the whole question is about whether the things you conditioned on pushed down P(bad AGI controls the world) faster than they made the world-writ-large unlikely.

This is great, but also misses the loopyness. If GPT12 looks at the future, surely most of that future is massively shaped by GPT12. We are in fixed point, self fulfilling prophecy land. (Or, if you somehow condition on its current output being nothing, then the next slightly different attempt with GPT13. ) If GPT-n doubles the chance of success, the only fixed point is success. 

Yup, I intentionally didn't go into the whole stable fixed-point thing in this post, it's a whole complicated can of worms which applies in multiple different ways to multiple different schemes.