Pursuing convergent instrumental subgoals on the user's behalf doesn't always require good priors

jessicata

I recommend reading Scalable AI control before reading this post.

In particular, in the "Hard values, easy values" section of that post, Paul writes:

My statement of the control problem is only really meaningful because there are instrumental subgoals that are shared (or are extremely similar) between many different values, which let us compare the efficacy with which agents pursue those different values. Performance on these very similar subgoals should be used as the performance metric when interpreting my definition of AI control problem.

In fact even if we only resolved the problem for the similar-subgoals case, it would be pretty good news for AI safety. Catastrophic scenarios are mostly caused by our AI systems failing to effectively pursue convergent instrumental subgoals on our behalf, and these subgoals are by definition shared by a broad range of values.

Convergent instrumental subgoals are mostly about gaining power. For example, gaining money is a convergent instrumental subgoal. If some individual (human or AI) has convergent instrumental subgoals pursued well on their behalf, they will gain power. If the most effective convergent instrumental subgoal pursuit is directed towards giving humans more power (rather than giving alien AI values more power), then humans will remain in control of a high percentage of power in the world.

If the world is not severely damaged in a way that prevents any agent (human or AI) from eventually colonizing space (e.g. severe nuclear winter), then the percentage of the cosmic endowment that humans have access to will be roughly equal to the percentage of power that humans have control of at the time of space colonization. So the most relevant factors for the composition of the universe are (a) whether anyone at all can take advantage of the cosmic endowment, and (b) the long-term balance of power between different agents (humans and AIs).

I expect that ensuring that the long-term balance of power favors humans constitutes most of the AI alignment problem, and that other parts of the AI alignment problem (e.g. ensuring AIs are beneficial in the short term, ensuring that AI systems don't cause global catastrophic risks that cause the cosmic endowment to become unavailable to any agent) will be easier to solve after thinking about this part of the problem. So I'm going to focus on power acquisition for now.

Priors and multidimensional power

Convergent instrumental subgoals aren't totally convergent, since power is multidimensional, and some types of power are more useful for some values.

Suppose the sun is either going to turn green or blue in 10 years. No one knows which color it will turn; people disagree, and it seems like their beliefs about the sun are irreconcilable because they result from different priors. The people who predict the sun will turn green (or equivalently, care more about futures in which the sun is green) buy more green-absorbing solar panels, while those who predict the sun will turn blue will buy more blue-absorbing solar panels. How could we measure how much power different people have?

In situations like this, it seems wrong to reduce power to a single scalar; there are at least 2 scalars involved in this situation (how much power someone has in futures where the sun turns green, versus in futures where the sun turns blue).

For the AI to gain power on the user's behalf, it should gain the kind of power the user cares about. If the user thinks the sun will turn green, then the AI should buy green-absorbing solar panels.

What if the user hasn't made up their mind about which color the sun will be, or it's hard for the AI to elicit the user's beliefs for some other reason? Then the AI could pursue a conservative strategy, in which the user does not lose power in either possible world. In the case of solar panels, if 60% of the solar panels that everyone buys absorb green, then the AI should invest 60% of the user's solar panel budget in green-absorbing solar panels and 40% in blue-absorbing solar panels. This way, the user generates the same percentage of the energy in each possible world, and thus has the same amount of relative power. This is suboptimal compared to what the user could achieve with better-defined beliefs about the sun, but the user isn't any worse off than they were before, so this seems fine.
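To make the arithmetic concrete, here is a minimal sketch of the conservative strategy. The function names, the budget of 100, and the market holdings of 600 and 400 are all hypothetical, chosen only to match the 60%/40% split above; the sketch also assumes that in each possible world only the panels absorbing that color produce energy.

```python
def conservative_allocation(budget, market_holdings):
    """Split `budget` across panel types in proportion to what everyone else holds."""
    total = sum(market_holdings.values())
    if total <= 0:
        raise ValueError("market holdings must sum to a positive amount")
    return {kind: budget * held / total for kind, held in market_holdings.items()}


def relative_power(user_holdings, market_holdings):
    """Return the user's share of all energy generated in each possible world.

    Assumption: in the world where the sun turns a given color, only panels
    absorbing that color produce energy, so each world is scored separately.
    """
    return {
        kind: user_holdings[kind] / (user_holdings[kind] + market_holdings[kind])
        for kind in market_holdings
    }


# Hypothetical numbers: everyone else holds 60% green and 40% blue panels.
market = {"green": 600.0, "blue": 400.0}
user = conservative_allocation(100.0, market)
power = relative_power(user, market)

print(user)   # how the user's budget is split between the two panel types
print(power)  # the user's share of energy in the green world and the blue world
```

Under these assumptions the user's share of energy comes out the same in the green world and the blue world, which is the sense in which power is a pair of scalars here and the conservative strategy preserves both.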

I think this is an important observation! It means that it isn't always necessary for an AI system to have good priors about hard-to-verify facts (such as the eventual color of the sun), as long as it's possible to estimate the effective priors of the agents who the user wants to be competitive with. In particular, if there is some "benchmark" unaligned AI system, and it is possible to determine that AI system's effective prior over facts like the color of the sun, then it should be possible to build an aligned AI system to use a similar prior and thereby be competitive with the unaligned AI system in all possible futures.

This applies not only to priors but also to things like discount rates (which are kind of like "priors about which times matter") and preferences about which parts of the universe are best to colonize (which are kind of like "priors about which locations matter"). More broadly, "estimating what types of power a benchmark system will try acquiring and then designing an aligned AI system that acquires the same types of power for the user" seems like a general strategy for making an aligned AI system that is competitive with a benchmark unaligned AI system.

I'm not sure if I understand this correctly. Suppose there's an unaligned AI that has a high prior that P=NP is true, or that a halting oracle exists in the universe, so it pours all its resources into searching for a polynomial time algorithm for NP-complete problems, or trying to find the halting oracle. Assuming the user doesn't understand the question or have an opinion, would your AI match the unaligned AI's prior and do the same thing?

Also, do you have a suggestion of how the idea would work when there are multiple unaligned AIs with different priors, and power shifts between them as various uncertainties are resolved in favor of some and against others?

In the case of P=NP, if both the aligned AI and the unaligned AI it's competing with pour the same percentage of their resources into searching for a polynomial time algorithm for NP-complete problems, this seems fine. The aligned AI could have gained more power in expectation by not searching for these algorithms, but at least it didn't lose any relative power. At some future point (as Paul points out) humans are going to change the AI's design, which might include changing its prior on P=NP.

Here's a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can't consult its human about S's prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.

It does seem like there's something right about your idea, but I'm confused about what that is.

Here's my story for this. A gets to design an agent A' (which has a weird prior), then we design an agent B' to be competitive with A'. But actually this isn't possible using the strategy in the post, since A' believes it will win, so B' will also believe that A' will win.

What we should have done was to design an agent B that is competitive with A directly. That is, when doing this reasoning about priors and such, A has its own beliefs about things (and doesn't necessarily expect to win against B); B should inherit those priors rather than those of A'.

My point was that originally B did inherit its prior from A, but when A hands its resources to A', B would detect a power shift, and change its prior to match A' (as suggested by Paul, which I assumed you agree with). Do you have a different idea about what to do when power shifts between unaligned AIs? Or do you propose to have the aligned AI distinguish between different kinds of power shift somehow?

Also, it seems easy for A to hide its original prior and pretend that it was A' all along. How would you deal with that?

I see what you're arguing. I'm actually not sure what to do about ongoing power shifts; I'm currently thinking of the problem as "we have a benchmark system A and want to design a system B that is competitive with A". The benchmark system should be the original AI system (before it does things like hiding its prior). Of course we can't do that if A already has weird beliefs, so in this case we have to do something like tracing back to the process that produced A. Hopefully this results in a system that has "symmetric" weird beliefs (e.g. if A falsely thinks it owns a lot of resources, then we design B to also falsely think it owns a lot of resources).

At some meta level, we could see any process that produces an AI (e.g. some AI research/development/deployment strategy implemented by humans) as an agent A and then attempt to design an aligned competitive version B of this process. This high-level picture looks pretty sketchy at the moment.

This is intended as an interim solution, i.e. you would expect to transition to using a "correct" prior before accessing most of the universe's resources (say within 1000 years). The point of this approach is to avoid losing influence during the interim period.

If there are multiple unaligned AIs with different beliefs, you would take a weighted average of their beliefs using their current influence. As their influence changed, you would update the weighting.

(This might result in an incoherent / dutch-bookable set of beliefs, in which case you are free to run the dutch book and do even better.)
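As a rough illustration of the influence-weighted average described above, here is a minimal sketch for a single proposition. The function name and all of the numbers are hypothetical, and it assumes each unaligned AI's influence and effective probability can be estimated as plain numbers, which is the hard part in practice.

```python
def influence_weighted_belief(agents):
    """Average the agents' probabilities for one proposition, weighted by influence.

    `agents` is an iterable of (influence, probability) pairs. Both values are
    assumed to be estimates supplied by the caller.
    """
    agents = list(agents)
    total = sum(influence for influence, _ in agents)
    if total <= 0:
        raise ValueError("total influence must be positive")
    return sum(influence * prob for influence, prob in agents) / total


# Hypothetical example: two unaligned AIs that disagree about a proposition.
before = influence_weighted_belief([(3.0, 0.9), (1.0, 0.1)])

# Influence later shifts toward the second AI, so the weighting is updated.
after = influence_weighted_belief([(1.0, 0.9), (3.0, 0.1)])

print(before, after)
```

When the weights change, the averaged belief moves toward whichever AI has gained influence, without any new evidence about the proposition itself.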

This seems like the right general way of thinking about the problem.

It doesn't seem like we can always copy what other people do, so I suspect we can't dodge the problem in this way (e.g. we are still in trouble if we don't have a benign universal prior).

For example, consider an extreme case where each agent is interested in predicting different facts, and where it is possible to distinguish "predictions that are important for humans" from "predictions that are important for an AI with inhuman values." (Maybe an agent with source code A needs to predict f(A) for a complex function f.) Then a malign component of our prior might decide to make bad predictions only on questions that are important to humans. This seems to leave us back at square one.

Good point; a prior that favors some values over others is going to be a problem, and this is true of the universal prior. The way I'm thinking about it, some of the agents we're going to be competing with are the malign consequentialists in the universal prior. Of course, figuring out what prior these agents have is going to require more analysis.
