Richard, I still don't get it, and I think my objections in the comments of the initial post (1, 2), alongside those of rif a. sauros, remain correct. More specifically, there seems to be a very misleading equivocation going on regarding what "simpler" means. I think it's crucial to emphasize that "simpler" is a 2-place word, but your argument (at least when written in non-rigorous, non-mathematical terms) treats it as if it were a 1-place word, and this is what is causing the confusion.
Consider an agent that gets a "boost" from an ontology O₁ with the fuzzy-boundary representation of possible belief/goal pairs (B, G) to an ontology O₂ with a new set of (still probably fuzzy-boundary) pairs (f(B), f(G)), where f denotes the translation from O₁ to O₂, such that O₂ corresponds to more "intelligence", meaning it compresses map representations of the underlying territory, in accordance with Prediction = Compression.
The first section of this post argues that, despite the simplicity-speed tradeoff and other related problems, this change will nonetheless likely compress the beliefs, meaning that any belief B will be mapped to a belief f(B) that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity: K(f(B)) < K(B). I think this is correct.
The second section argues that, because there is no clear belief/goal boundary and because the returns to compression remain as relevant for goals as they are for beliefs, the same will happen to the goals. This means that any goal G will likely be mapped to a goal f(G) that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity: K(f(G)) < K(G). I think this is also correct.
Finally, the third section argues that this monotonically decreasing process will likely not get stuck in local optima and should instead converge to as small a representation size as possible. I'm not fully convinced of this, but I will accept it for now.
Alright, so we've established that K(f(G)) will get really small, and this means that the goal is really compressed and simple. That sounds like a squiggle-maximizer (as you wrote, AIs that attempt to fill the universe with some very low-level pattern that's meaningless to humans, e.g., "molecular squiggles" of a certain shape), right?
No. This is where the equivocation comes in. The simplicity of a goal is inherently dependent on the ontology you use to view it through: while K(f(G)) < K(G) is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple in the new ontology, but not because the "essence" of the goal simplifies; instead, it's merely because the agent gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom. If you try to view f(G) in O₁ instead of O₂, meaning you look at the preimage f⁻¹[f(G)], this should be approximately the same as G: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller. As I wrote earlier:
The "representations," in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the "goal" lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology. [...] That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality.
Put differently, the mapping f from the initial ontology to the final, more "compressed" ontology does not shrink the representation of the goal before or after mapping it; it simply maps it. If it all (approximately) adds up to normality, meaning that the new ontology is capable of replicating (perhaps with more detail, granularity, or degrees of freedom) the observations of the old one [4], I expect the "relative measure" of the goal representation to stay approximately [5] the same. And more importantly, I expect the "inverse transformation" from the new ontology to the old one to map the new representation back to the old one (since the new representation is supposed to be more compressed, i.e., informationally richer, than the old one; in mathematical terms, I would expect the preimage of the new representation to be approximately the old one, i.e., f⁻¹[f(G)] ≈ G).
[4] Such as how the small-mass, low-velocity limit of General Relativity replicates standard Newtonian mechanics.
[5] I say "approximately" because of potential issues due to stuff analogous to Wentworth's "Pointers problem" and the way in which some (presumably small) parts of the goal in its old representation might be entirely incoherent and impossible to rescue in the new one.
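To put the distinction in symbols (my own shorthand, not notation from your post): write K₁ and K₂ for the ontology-specific complexity analogues relative to O₁ and O₂. What the argument establishes is roughly

K₂(f(G)) < K₁(G)

i.e., the goal's representation is simpler relative to the new ontology. But the claim needed for the squiggle-maximizer picture is something like

K₁(f⁻¹[f(G)]) ≪ K₁(G)

i.e., the goal is simpler even when held up against a fixed ontology. My position is that instead f⁻¹[f(G)] ≈ G, and hence K₁(f⁻¹[f(G)]) ≈ K₁(G): measured against the old ontology, the goal is about as complex as it ever was.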
Imagine the following scenario for illustrative purposes: a (dumb) AI has in front of it the integers from 1 to 10, and its goal is to select a single number among them that is either 2, 4, 6, 8, or 10. Now the AI gets the "ontology boost" and its understanding of its goal gets more compressed and simpler: it needs to select one of the even numbers. Is this a simpler goal?
Well, from one perspective, yes: the boosted AI has in its world-model a representation of the goal that requires fewer bits. But from the more important perspective, no: the goal hasn't changed, and if you map "evenness" back into the more primitive ontology of the unboosted AI, you get the same goal. So, from the perspective of the unboosted AI, the goal of the boosted one is not any simpler; the boosted AI is just smart enough to represent that same goal with fewer bits.
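Here is a minimal sketch of that toy scenario (my own illustration; the string-length comparison is just a crude stand-in for "bits needed to identify the goal"): the unboosted representation lists the acceptable numbers extensionally, the boosted one uses the evenness predicate, and the two pick out exactly the same set.

```python
# Toy sketch of the 1-to-10 example above (my own illustration; the encodings
# and the length comparison are arbitrary stand-ins for "bits to identify").

OPTIONS = range(1, 11)  # the integers 1..10 the AI chooses among

# "Unboosted" ontology: the goal is represented extensionally, element by element.
goal_unboosted = {2, 4, 6, 8, 10}

# "Boosted" ontology: the goal is represented intensionally, via a short predicate.
def goal_boosted(n):
    return n % 2 == 0

# The description got shorter (and the gap grows quickly as the range grows)...
print(len("{2, 4, 6, 8, 10}"))  # 16 characters
print(len("n % 2 == 0"))        # 10 characters

# ...but the goal itself -- the set of acceptable choices -- did not change:
picked_unboosted = {n for n in OPTIONS if n in goal_unboosted}
picked_boosted   = {n for n in OPTIONS if goal_boosted(n)}
assert picked_unboosted == picked_boosted == {2, 4, 6, 8, 10}
```

Whether the goal counts as "simpler" depends entirely on which representation you measure; its extension, the thing the preimage recovers, is untouched.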
So goals that seem simple to humans (in our faulty ontology), or goals that seem like they would be relatively simpler compared to the rest in a more advanced ontology (like the squiggle-maximizer's goal), are a completely different kind of "simple" from what your argument shows: the AI doesn't look through the set of goals to pick the one that is simplest (beware the Orthogonality Thesis, as in our previous exchange); it just simplifies the representation of ~everything. That kind of goal simplification says more about the ontology than it does about the goal.
You also said earlier, in response to my comment:
And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.
This still equivocates in the same way between the different meanings of "simple", but let's set that aside for now. I would be curious to hear your reply to what rif a. sauros and I said in response:
sunwillrise: The thing about this is that you don't seem to be currently undergoing the type of ontological crisis or massive shift in capabilities that would be analogous to an AI getting meaningfully more intelligent due to algorithmic improvements or increased compute or data (if you actually are, godspeed!)
So would you argue that this type of goal simplification and compression happens organically and continuously even in the absence of such a "phase transition"? I have a non-rigorous feeling that this argument would prove too much by implying more short-term modification of human desires than we actually observe in real life.
Relatedly, would you say that your moral goals are simpler now than they were, say, back when you were a child? I am pretty sure that the answer, at least for me, is "definitely not," and that basically every single time I have grown "wiser" and had my belief system meaningfully altered, I came out of that process with a deeper appreciation for the complexity of life and for the intricacies and details of what I care about.
rif a. sauros: As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also and importantly, more intelligent agents represent things that less intelligent agents don't represent at all. I'm more intelligent than a mouse, but I wouldn't say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can't represent at all, so the overall complexity of his representations in total is plausibly higher.
Why wouldn't the same thing happen for goals? I'm perfectly willing to say I'm smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog's goals are more complex than the paramecium's, and mine are more complex than the dog's. Any given fixed goal might have a more compressed representation in the more intelligent animal (I'm not sure it does, but that's the premise so let's accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don't even understand the statement of, and these seem like complex goals to me.
Relevant post by Richard Ngo: "Moral Strategies at different capability levels". Crucial excerpt:
Let’s consider three ways you can be altruistic towards another agent:
- You care about their welfare: some metric of how good their life is (as defined by you). I’ll call this care-morality - it endorses things like promoting their happiness, reducing their suffering, and hedonic utilitarian behavior (if you care about many agents).
- You care about their agency: their ability to achieve their goals (as defined by them). I’ll call this cooperation-morality - it endorses things like honesty, fairness, deontological behavior towards others, and some virtues (like honor).
- You care about obedience to them. I’ll call this deference-morality - it endorses things like loyalty, humility, and respect for authority.
[...]
- Care-morality mainly makes sense as an attitude towards agents who are much less capable than you, and/or can't make decisions for themselves - for example animals, future people, and infants.
[...]
- Cooperation-morality mainly makes sense as an attitude towards agents whose capabilities are comparable to yours - for example others around us who are trying to influence the world.
[...]
- Deference-morality mainly makes sense as an attitude towards trustworthy agents who are much more capable than you - for example effective leaders, organizations, communities, and sometimes society as a whole.
To clarify, I don't think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in a more compact ontology coming from a more intelligent agent, ideas/configurations etc that were different in the old ontology get mapped to the same thing in the new ontology (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros:
Nonetheless, I still expect f⁻¹[f(G)] (viewed as the preimage of f(G) under the f mapping) and G to only differ very slightly.
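For concreteness, here is a minimal sketch of what I mean (the five-state "old ontology" and the coarse-graining map f are invented purely for illustration): f is neither injective nor surjective, and the preimage f⁻¹[f(G)] picks up one extra state that f merged into the goal's image, but it stays very close to G.

```python
# Toy sketch: a non-injective, non-surjective map f between two made-up
# "ontologies" (finite state spaces), illustrating why f^-1[f(G)] ~ G.

O1 = {"a1", "a2", "b", "c", "d"}    # old ontology: 5 coarse states
O2 = {"A", "B", "C", "D", "E_new"}  # new ontology ("E_new" has no old counterpart)

f = {                               # not injective: a1 and a2 both map to A
    "a1": "A", "a2": "A",
    "b": "B", "c": "C", "d": "D",
}                                   # not surjective: nothing maps to "E_new"

def image(old_states):
    return {f[s] for s in old_states}

def preimage(new_states):
    return {s for s in O1 if f[s] in new_states}

G = {"a1", "b"}           # the goal, as a set of old-ontology states

fG = image(G)             # the goal's representation in the new ontology
G_back = preimage(fG)     # pulling it back into the old ontology

print(fG)          # {'A', 'B'}
print(G_back)      # {'a1', 'a2', 'b'}: G plus one extra state merged in by f
print(G_back - G)  # {'a2'}: a small discrepancy, not a radically simpler goal
```

The extra state here is exactly the non-injectivity from the paragraph above: two old-ontology items that the smarter agent treats as the same thing on a deeper level.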