Some clarifications I got from Evan (evhub) on MIRIxDiscord:
I think Evan has incorporated (or will incorporate) some of these clarifications into the post itself, but this may still be helpful for people who read the original post.
I really like this view.
An additional frame of interest is: signed neutrality (just remove the absolute value) as a measure of opportunity cost propensity. That is, highly non-neutral policies lead to polarizing opportunity costs. For example, consider a maze in which half your possible destinations lie through the one-way door on the left, and half through the one-way door on the right. All policies which go anywhere are highly “polarizing” / non-neutral.
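For concreteness, here's a toy numeric version of that maze (destination names and attainable values are made up; attainable value is 1 if a destination is still reachable under a policy and 0 once it's been foreclosed):

```python
import statistics

# Toy maze: half the destinations lie behind the left one-way door,
# half behind the right. Each destination is an objective y; V_y(pi) is
# 1 if the destination is still reachable after following pi, else 0.
left_destinations = ["L1", "L2"]
right_destinations = ["R1", "R2"]
destinations = left_destinations + right_destinations

def attainable_value(policy, dest):
    if policy == "no-op":
        return 1.0  # haven't committed to either one-way door yet
    if policy == "go-left":
        return 1.0 if dest in left_destinations else 0.0
    if policy == "go-right":
        return 1.0 if dest in right_destinations else 0.0
    raise ValueError(policy)

for policy in ["go-left", "go-right"]:
    diffs = [attainable_value(policy, d) - attainable_value("no-op", d)
             for d in destinations]
    print(policy,
          "signed mean:", statistics.mean(diffs),   # opportunity cost propensity
          "stdev:", statistics.pstdev(diffs))       # how polarizing / non-neutral
# Both policies print signed mean -0.5 and stdev 0.5: any policy that goes
# anywhere forecloses half the destinations, so it's highly non-neutral.
```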
I agree that this moment of neutrality is also a facet of the “power/impact” phenomenon. However, I’m not sure I follow this part:
We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption.
Avoiding deactivation is good for almost all goals, so there isn’t much stdev under almost any Y? Or maybe you’re using “objective impact” in a slightly different sense here? In any case, I think I get what you’re pointing at.
You're right, I think the absolute value might actually be a problem—you want the policy to help/hurt all values relative to no-op equally, not hurt some and help others. I just edited the post to reflect that.
As for the connection between neutrality and objective impact, I think this is related to a confusion that Wei Dai pointed out, which is that I was sort of waffling between two different notions of strategy-stealing, those being:

1. Strategy-stealing as a property of the world: agents acting on behalf of our values can copy ("steal") whatever strategies agents acting on behalf of other values use, so the overall distribution of resources among values doesn't shift.
2. Strategy-stealing as an alignment property of a single AI: the AI is equally good at optimizing any of the different values we might want it to optimize.
If you believe that most early AGIs will be quite similar in their alignment properties (as I generally do, since I believe that copy-and-paste is quite powerful and will generally be preferred over designing something new), then these two notions of strategy-stealing match up, which was why I was waffling between them. However, conceptually they are quite distinct.
In terms of the connection between neutrality and objective impact, I think there I was thinking about strategy-stealing in terms of notion 1, whereas for most of the rest of the post I was thinking about it in terms of notion 2. In terms of notion 1, objective impact is about changing the distribution of resources among all the agents in the world.
Oh, a thing I forgot to mention about the proposed formalization: if your distribution over utility functions includes some functions that are amenable to change via optimization (e.g. number of paperclips) and some that are not amenable to change via optimization (e.g. number of perpetual motion machines), then any optimization algorithm, including ones we'd naively call "value-neutral", would lead to distributions of changes in attainable utility with large standard deviation. It might be possible to fix this through some sort of normalization scheme, though I'm not sure how.
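As a toy illustration of that point (all numbers made up), a minimal sketch:

```python
import statistics

# Hypothetical changes in attainable utility from handing each objective the
# same generic optimizer, relative to no-op. Paperclip-style objectives are
# very amenable to optimization; perpetual-motion-style objectives are not.
gains = {
    "paperclips": 1e6,
    "staples": 9e5,
    "perpetual_motion_machines": 0.0,
}

print(statistics.pstdev(gains.values()))
# ~4.5e5: a huge neutrality score, even though the optimizer isn't favoring
# any objective; the spread comes purely from how optimizable each objective is.

# A naive normalization -- dividing each gain by the largest gain any optimizer
# could achieve for that objective -- fails here, since that denominator is 0
# for perpetual motion machines, which is one way of seeing why the right
# normalization scheme isn't obvious.
```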
Planned summary:
So far, most <@uses of impact formalizations@> don't help with inner alignment, because we simply add impact to the (outer) loss function. This post suggests that impact formalizations could also be adapted to verify whether an optimization algorithm is _value-neutral_ -- that is, no matter what objective you apply it towards, it provides approximately the same benefit. In particular, <@AUP@> measures the _expectation_ of the distribution of changes in attainable utilities for a given action. You could get a measure of the value-neutrality of an action by instead computing the _standard deviation_ of this distribution, since that measures how different the changes in utility are. (Evan would use policies instead of actions, but conceptually that's a minor difference.) Verifying value-neutrality could be used to ensure that the <@strategy-stealing assumption@> is true.
Planned opinion:
I continue to be confused about the purpose of the strategy-stealing assumption, so I don't have a strong opinion about the importance of value-neutrality verification. I do think that the distribution of changes to attainable utilities is a powerful mathematical object, and it makes sense that there are other properties of interest that involve analyzing it.
I think this comment by Wei Dai does a good job of clarifying what's going on with the strategy-stealing assumption. I know Wei Dai was also confused about the purpose of the strategy-stealing assumption for a while until he wrote that comment.
I understand the point made in that comment; the part I'm confused about is why the two subpoints in that comment are true:
If "strategy-stealing assumption" is true, we can get most of what we "really" want by doing strategy-stealing. (Example of how this can be false: (Logical) Time is of the essence)
It's not too hard to make "strategy-stealing assumption" true.
Like... why? If we have unaligned AI but not aligned AI, then we have failed to make the strategy-stealing assumption true. If we do succeed in building aligned AI, why are we worried about unaligned AI, since we presumably won't deploy it (and so strategy-stealing is irrelevant)? I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don't start out with too many resources, but this doesn't seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.
I could imagine that some people mistakenly think that unaligned AI is actually aligned and so build it, or that some malicious actors build AI aligned with them, and the strategy-stealing assumption means that this is basically fine as long as they don't start out with too many resources, but this doesn't seem like the mainline scenario to worry about: it seems much more relevant whether we can align AI or not.
That's not the scenario I'm thinking about when I think about strategy-stealing. I mentioned this a bit in this comment, but when I think about strategy-stealing I'm generally thinking about it as an alignment property that may or may not hold for a single AI: namely, the property that the AI is equally good at optimizing all of the different things we might want it to optimize. If this property doesn't hold, then you get something like Paul's going out with a whimper where our easy-to-specify values win out over our other values.
Furthermore, I agree with you that I generally expect basically all early AGIs to have similar alignment properties, though I think you push a lot under the rug when you say they'll all either be "aligned" or "unaligned." In particular, I generally imagine producing an AGI that is corrigible in that it's trying to do what you want, but isn't necessarily fully aligned in the sense of figuring out what you want for you. In such a case, it's very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.
Also, value-neutrality verification isn't just about strategy-stealing: it's also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.
Hmm, I somehow never saw this reply, sorry about that.
you get something like Paul's going out with a whimper where our easy-to-specify values win out over our other values [...] it's very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.
Why can't we tell it not to overoptimize the aspects that it understands until it figures out the other aspects?
value-neutrality verification isn't just about strategy-stealing: it's also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.
As you (now) know, my main crux is that I don't expect to be able to cleanly separate optimization and objectives, though I also am unclear whether value-neutral optimization is even a sensible concept taken separately from the environment in which the agent is acting (see this comment).
Very interesting!
Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn't give us much assurance about the behavior of the model.
Note that the model's output isn't what's relevant for the neutrality measure; it's the algorithm it's internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it's important to have some sort of myopia guarantee.
Recently, I've been reading and enjoying Alex Turner's Reframing Impact sequence, but I realized that I have some rather idiosyncratic views regarding impact measures that I haven't really written up much yet. This post is my attempt at trying to communicate those views, as well as a response to some of the ideas in Alex's sequence.
What can you do with an impact measure?
In the "Technical Appendix" to his first Reframing Impact post, Alex argues that an impact measure might be "the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things—without assuming anything about the objective."
Personally, I am quite skeptical of this use case for impact measures. As it is phrased—and especially including the link to Robust Delegation—Alex seems to be implying that an impact measure could be used to solve inner alignment issues arising from a model with a mesa-objective that is misaligned relative to the loss function used to train it. However, the standard way in which one uses an impact measure is by including it in said loss function, which doesn't do very much if the problem you're trying to solve is your model not being aligned with that loss.[1]
That being said, using an impact measure as part of your loss could be helpful for outer alignment. In my opinion, however, it seems like that requires your impact measure to capture basically everything you might care about (if you want it to actually solve outer alignment), in which case I don't really see what the impact measure is buying you anymore. I think this is especially true for me because I generally see amplification as being the right solution to outer alignment, which I don't think really benefits at all from adding an impact measure.[2]
Alternatively, if you had a way of mechanistically verifying that a model behaves according to some impact measure, then I would say that you could use something like that to help with inner alignment. However, this is quite different from the standard procedure of including an impact measure as part of your loss. Instead of training your agent to behave according to your impact measure, you would instead have to train it to convince some overseer that it is internally implementing some algorithm which satisfies some minimal impact criterion. It's possible that this is what Alex actually has in mind in terms of how he wants to use impact measures, though it's worth noting that this use case is quite different than the standard one.
That being said, I'm skeptical of this use case as well. In my opinion, developing a mechanistic understanding of corrigibility seems more promising than developing a mechanistic understanding of impact. Alex mentions corrigibility as a possible alternative to impact measures in his appendix, though he notes that he's currently unsure what exactly the core principle behind corrigibility actually is. I think my post on mechanistic corrigibility gets at this somewhat, though there's definitely more work to be done there.
So, I've explained why I don't think impact measures are very promising for solving outer alignment or inner alignment—does that mean I think they're useless? No. In fact, I think a better understanding of impact could be extremely helpful, just not for any of the reasons I've talked about above.
Value-neutrality verification
In Relaxed adversarial training for inner alignment, I argued that one way of mechanistically verifying an acceptability condition might be to split a model into a value-neutral piece (its optimization procedure) and a value-laden piece (its objective). If you can manage to get such a separation, then verifying acceptability just reduces to verifying that the value-laden piece has the right properties[3] and that the value-neutral piece is actually value-neutral.
Why is this sort of a separation useful? Well, not only might it make mechanistically verifying acceptability much easier, it might also make strategy-stealing possible in a way which it otherwise might not be. In particular, one of the big problems with making strategy-stealing work under an informed-oversight-style scheme is that some strategies which are necessary to stay competitive might nevertheless be quite difficult to justify to an informed overseer. However, if we have a good understanding of the degree to which different algorithms are value-laden vs. value-neutral, then we can use that to short-circuit the normal evaluation process, enabling your agent to pursue any strategies which it can definitely demonstrate are value-neutral.
This is all well and good, but what does it even mean for an algorithm to be value-neutral, and how would a model ever actually be able to demonstrate that? Well, here's what I want out of a value-neutrality guarantee: I want to consider some optimization procedure f to be value-neutral if, relative to some set of objectives Y, it doesn't tend to advantage any subset of those objectives over any other. In particular, if I start with some distribution of resources/utility/etc. over the different objectives y∈Y, I don't want that distribution to change if I give each y∈Y access to the optimization process f. This guarantees that the given optimization process is compatible with strategy-stealing: if we deploy a corrigible AI running such an optimization process in service of many different values in Y, it won't systematically advantage some over others.
Interestingly, however, what I've just described is quite similar to Attainable Utility Preservation (AUP), the impact measure put forward by Turner et al. Specifically, AUP measures the extent to which an algorithm, relative to some set of objectives, advantages those objectives compared to doing nothing. This is slightly different from what I want, but it's quite similar in a way which I think is no accident. In particular, I think it's not hard to extend the math of AUP to apply to value-neutrality verification. That is, let f:Y→(X→A) be some optimization procedure over objectives Y, states X, and actions A. Then, we can compute f's value-neutrality by calculating
$$\mathrm{neutrality}(f, Y) = \operatorname{stdev}\left(\left\{\, V_y(f(y)) - V_y(\text{no-op}) \;\middle|\; y \in Y \,\right\}\right)$$
where Vy(π) measures the expected future discounted utility for some policy π:X→A,[4] no-op is some null policy, and stdev is the operator that finds the standard deviation of the given set. What's being measured here is precisely the extent to which f, if given to each y∈Y, would enable some y to get more value relative to others. Now, compare this to the AUP penalty term, which, for a state x∈X and action a∈A is calculated as
$$\mathrm{impact}(x, a, Y) = \frac{1}{C} \sum_{y \in Y} \left| Q_y(x, a) - Q_y(x, \text{no-op}(x)) \right|$$
where Qy(x,a) measures the expected future discounted utility under the optimal policy after having taken action a in state x and C is some scaling constant.
Comparing these two equations, we can see that there are many similarities between impact and neutrality, but also some major differences. First, neutrality as presented here is a function of an agent's entire policy, whereas impact is only a function of an agent's actions.[5] Conceptually, I don't think this is a real distinction—I think this just comes from the fact that I want neutrality to be an algorithmic/mechanistic property, whereas AUP was developed as something you could use as part of an RL loss. Second—and I think this is a real distinction—neutrality takes a standard deviation, whereas impact takes a mean. This lets us think of both neutrality and impact as effectively being moments of the same distribution—it's just that impact is the first moment and neutrality is the second. Third, neutrality drops the absolute value present in impact, since we care about benefiting all values equally, not just impacting them equally.[6] Outside of those differences, however, the two equations are quite similar—in fact, I wrote neutrality just by straightforwardly adapting the AUP penalty to the value-neutrality verification case.
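To make the comparison concrete, here is a minimal sketch of both quantities under strong simplifying assumptions: a small finite Y, and evaluators V and Q treated as given black boxes (estimating Vy over the deployment distribution is exactly the kind of open problem noted below). It is meant only to exhibit the first-moment/second-moment relationship, not to be a practical implementation.

```python
from statistics import pstdev

def neutrality(f, Y, V, no_op):
    """Second moment: standard deviation over Y of the *signed* change in
    attainable utility when each objective y is handed the policy f(y)."""
    diffs = [V(y, f(y)) - V(y, no_op) for y in Y]
    return pstdev(diffs)

def aup_impact(x, a, Y, Q, no_op, C):
    """First moment (AUP penalty): scaled sum over Y of the *absolute* change
    in attainable utility from taking action a in state x rather than no-op."""
    return sum(abs(Q(y, x, a) - Q(y, x, no_op(x))) for y in Y) / C
```

If C is taken to be |Y|, impact is the mean of the absolute changes while neutrality is the standard deviation of the signed ones, which is the sense in which they are the first and second moments of (essentially) the same distribution; the remaining difference is that neutrality evaluates whole policies while impact evaluates individual actions.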
This is why I'm optimistic about impact measurement work: not because I expect it to greatly help with alignment via the straightforward methods in the first section, but because I think it's extremely applicable to value-neutrality verification, which I think could be quite important to making relaxed adversarial training work. Furthermore, while I think a lot of the current impact measure work is, as I said, quite applicable to value-neutrality verification, I would be even more excited to see more work on impact measurement specifically from this perspective. (EDIT: I think there's a lot more work to be done here than just my writing down of neutrality. Some examples of future work: removing the need to compute Vy of an entire policy over a distribution (the deployment distribution) that we can't even sample from, removing the need to have some set Y which contains all the values that we care about, translating other impact measures into the value-neutrality setting and seeing what they look like, more exploration of what these sorts of neutrality metrics are really doing, actually running RL experiments, etc.)
Furthermore, not only do I think that value-neutrality verification is the most compelling use case for impact measures, I also think that objective impact specifically can be understood as being about value-neutrality. In "The Gears of Impact" Alex argues that "objective impact, instrumental convergence, opportunity cost, the colloquial meaning of 'power'—these all prove to be facets of one phenomenon, one structure." In my opinion, value-neutrality should be added to that list. We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption. Thus, even if you disagree with me that value-neutrality verification is the most compelling use case for impact measures, I still think you should believe that if you want to understand objective impact, it's worth trying to understand strategy-stealing and value neutrality, because I think they're all secretly talking about the same thing.
This isn't entirely true, since changing the loss might shift the loss landscape sufficiently such that the easiest-to-find model is now aligned, though I am generally skeptical of that approach, as it seems quite hard to ever know whether it's actually going to work or not. ↩︎
Or, if it does, then if you're doing things right the amplification tree should just compute the impact itself. ↩︎
On the value-laden piece, you might verify some mechanistic corrigibility property, for example. ↩︎
Also suppose that Vy is normalized to have comparable units across objectives. ↩︎
This might seem bad—and it is if you want to try to use this as part of an RL loss—but if what you want to do instead is verify internal properties of a model, then it's exactly what you want. ↩︎
Thanks to Alex Turner for pointing out that the absolute value bars don't belong in neutrality. ↩︎