The conversation
Martín:
Any write-up on the thing that "natural abstractions depend on your goals"? As in, pressure is a useful abstraction because we care about / were trained on a certain kind of macroscopic patterns (because we ourselves are such macroscopic patterns), but if you cared about "this exact particle's position", it wouldn't be.[1]
John:
Nope, no writeup on that.
And in the case of pressure, it would still be natural even if you care about the exact position of one particular particle at a later time, and are trying to predict that from data on the same gas at an earlier time. The usual high level variables (e.g. pressure, temperature, volume) are summary stats (to very good approximation) between earlier and later states (not too close together in time), and the position of one particular particle is a component of that state, so (pressure, temperature, volume) are still summary stats for that problem.
The main loophole there is that if e.g. you're interested in the 10th-most-significant bit of the position of a particular particle, then you just can't predict it any better than the prior, so the empty set is a summary stat and you don't care about any abstractions at all.
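[To make those two claims concrete, here is a minimal toy sketch of my own (not John's formal setup): non-interacting particles bouncing in a 1D box, with mean kinetic energy standing in for the (P, V, T) summary. Across an ensemble of gases, the macro summary at an early time pins down the macro summary at a late time, while a tagged particle's late position is unpredictable from anything coarse, and its fine-grained bits are essentially coin flips.]

```python
# Toy model (my construction): ideal-gas-like ensemble of non-interacting
# particles in a 1D box of length 1. Illustrates that a macro summary
# (mean kinetic energy, a temperature proxy) carries forward in time, while a
# tagged particle's position decorrelates and its fine bits look like fair coins.
import numpy as np

rng = np.random.default_rng(0)
n_gases, n_particles, t_late = 2000, 100, 50.0

def fold(x):
    """Reflect free-particle positions back into the box [0, 1] (triangle wave)."""
    x = np.mod(x, 2.0)
    return np.where(x > 1.0, 2.0 - x, x)

# Give each gas its own temperature, so the macro summary varies across the ensemble.
temps = rng.uniform(0.5, 2.0, size=n_gases)
x0 = rng.uniform(0.0, 1.0, size=(n_gases, n_particles))
v = rng.normal(0.0, np.sqrt(temps)[:, None], size=(n_gases, n_particles))
x_late = fold(x0 + v * t_late)

temp_early = (v ** 2).mean(axis=1)   # macro summary at t = 0
temp_late = (v ** 2).mean(axis=1)    # exactly conserved by elastic wall bounces in this toy
tagged_early, tagged_late = x0[:, 0], x_late[:, 0]

print("corr(early macro summary, late macro summary):",
      np.corrcoef(temp_early, temp_late)[0, 1])              # = 1 in this toy
print("corr(tagged particle's early position, late position):",
      np.corrcoef(tagged_early, tagged_late)[0, 1])          # ~ 0
bit10 = np.floor(tagged_late * 2 ** 10).astype(int) % 2      # 10th binary digit of late position
print("mean of that 10th bit (a fair coin would give 0.5):", bit10.mean())
```

[Real gases add chaos from collisions on top of this toy, which is where John's point about chaos below comes in.]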
Martín:
Wait, I don't buy that:
The gas could be in many possible microstates. Pressure partitions them into macrostates in a certain particular way. That is, every possible numerical value of pressure is a different macrostate, which could be instantiated by many different microstates (with the particle in question in very different positions).
Say instead of caring about this partition, you care about the macrostate partition which tracks where that one particle is.
It seems like these two partitions are orthogonal, meaning that conditioning on a pressure level gives you no information about where the particle is (because the system is symmetric with respect to all particles, or something like that).
[This could be false due to small effects like "higher pressure makes it less likely all particles are near the center" or whatever, but I don't think that's what we're talking about. Ignore them for now, or assume I care about a partition which is truly orthogonal to pressure level.]
So tracking the pressure level partition won't help you.
It's still true that "the position of one particular particle is a component of the (micro)state", but we're discussing which macrostates to track, and pressure is a summary stat only for some macrostates (variables), not for others.
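[A quick numerical check of this orthogonality intuition, in the same kind of toy model as above; with the caveat already noted that in this toy the independence is essentially built into the sampling, which is exactly the claim that the two partitions are close to orthogonal.]

```python
# Toy check (my construction): mutual information between a coarse "pressure"
# macrostate and a coarse "where is particle 1" macrostate is ~0 bits,
# while the position macrostate itself carries ~3 bits.
import numpy as np

rng = np.random.default_rng(1)
n_gases, n_particles, n_bins = 20000, 200, 8

x = rng.uniform(0.0, 1.0, size=(n_gases, n_particles))
v = rng.normal(0.0, 1.0, size=(n_gases, n_particles))

pressure_proxy = (v ** 2).mean(axis=1)              # ideal-gas pressure ~ N<v^2>/V
p_bin = np.digitize(pressure_proxy,
                    np.quantile(pressure_proxy, np.linspace(0, 1, n_bins + 1)[1:-1]))
x_bin = np.digitize(x[:, 0], np.linspace(0, 1, n_bins + 1)[1:-1])

def mutual_information(a, b, k=n_bins):
    """Plug-in estimate (in bits) of mutual information between two discrete variables."""
    edges = np.arange(k + 1) - 0.5
    p = np.histogram2d(a, b, bins=[edges, edges])[0]
    p /= p.sum()
    pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pa @ pb)[nz])).sum())

print("I(pressure bin ; particle-1 position bin) ≈",
      round(mutual_information(p_bin, x_bin), 3), "bits")      # ~ 0
print("H(particle-1 position bin) ≈",
      round(mutual_information(x_bin, x_bin), 3), "bits")      # ~ log2(8) = 3
```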
John:
Roughly speaking, you don't get to pick which macrostates to track. There are things you're able to observe, and those observations determine what you're able to distinguish.
You do have degrees of freedom in what additional information to throw away from your observations, but for something like pressure, the observations (and specifically the fact that observations are not infinite-precision) already pick out (P, V, T) as the summary stats; the only remaining degree of freedom is to throw away even more information than that.
Applied to the one-particle example in particular: because of chaos, you can't predict where the one particle will be (any better than P, V, T would) at a significantly later time without extremely-high-precision observations of the particle states at an earlier time.
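[To put a number on the precision requirement, here is a stand-in demonstration of my own, using a chaotic map rather than an actual gas simulation: the logistic map x → 4x(1−x) loses roughly one bit of state information per step, so each extra step of prediction horizon costs roughly one extra bit of initial precision.]

```python
# Chaos demo (stand-in system, not a gas): how long two trajectories of the
# logistic map stay close, as a function of how finely the initial condition
# was measured. The predictable horizon grows roughly linearly with the
# number of bits of initial precision (~1 bit per step for this map).
import numpy as np

def predictable_horizon(x0, precision_bits, tol=0.01, max_steps=200):
    """Steps until a trajectory started at x0 and one perturbed by 2^-precision_bits
    differ by more than tol."""
    x, y = x0, x0 + 2.0 ** (-precision_bits)
    for t in range(max_steps):
        if abs(x - y) > tol:
            return t
        x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    return max_steps

rng = np.random.default_rng(2)
for bits in (10, 20, 30, 40, 50):
    horizons = [predictable_horizon(rng.uniform(0.1, 0.9), bits) for _ in range(200)]
    print(f"{bits} bits of initial precision -> predictable for ~{np.mean(horizons):.0f} steps")
```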
Martín:
Okay, so there is a limited set of macrostate partitions I can track (because of how my sensory receptors are arranged); call this set S. My only choice is which information from S to throw out (due to computational constraints) while still being left with approximately good models of the environment.
(P, V, T) is considered a natural abstraction in this situation because it contains almost all information from S (or almost all information relevant to a certain partition I care about, which I can track by using all of S). That's the definition of natural abstraction (being a summary stat). And so, natural abstractions are observer-dependent: a different observer with different sensory receptors would have a different S, and so possibly different summary stats. (And possibly also goal-dependent, if instead of defining them as "summary stats for this whole S", you define them as "summary stats for this concrete variable (that can be tracked using S)".)
And if, as an agent with access only to S, I want to track a partition which is not deducible from S, then I'm just screwed.
[And we can even make the argument that, due to how evolved agents like humans come about, it is to be expected that our goals are controlling partitions that we can deduce from S. Otherwise our mechanisms pursuing these goals would serve no purpose.]
Any disagreements?
John:
Yup, that all sounds right.
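[One minimal way to write down the picture just agreed on (my notation, not necessarily the official natural-abstractions formalism): let $X_t$ be the microstate at time $t$, $S(X_t)$ the finite-precision observations available to a given observer, and $Y_{t'}$ (for a much later $t'$) the variables that observer wants to predict or control. Then $A$ is a natural abstraction for that observer roughly when $I\big(Y_{t'} \,;\, S(X_t) \,\big|\, A(S(X_t))\big) \approx 0$, i.e. when $A(S(X_t))$ retains essentially all the information in the observations that is relevant for prediction. Observer-dependence enters through $S$ (which partitions you can see at all), and goal-dependence through $Y_{t'}$ (which later partitions you care about); in the gas example, $A = (P, V, T)$ works for most choices of $Y_{t'}$, but for "the 10th bit of one particle's position" the only achievable summary is the empty one.]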
Martín:
So, to the extent your main plan is "use our understanding of Natural Abstractions to look inside an AI and retarget the search", why aren't you worried about natural abstractions being goal-dependent?
Natural abstractions being sensory-receptor-dependent might not be that much of a worry, because if necessary we can purposefully train the AI on the kinds of macroscopic tasks we want.
But if you buy the high internal decoupling (inner misalignment) story, maybe by the time you look into the trained AI it has already developed a goal about molecular squiggles, and correspondingly some of its natural abstractions have reshaped to better track those macrostates (and you don't understand them).
Probably the plan is closer to "we check natural abstractions continuously during training, and so we'll be able to notice if something like this starts happening". Even then, a high enough goal-dependence might make the plan unworkable (because you would need a trillion checks during training). But intuitively this doesn't seem likely (opinions?).
[Also note this is in the vicinity of an argument for alignment-by-default: maybe since its natural abstractions have been shaped by our Earthly macroscopic tasks, it would be hard for it to develop a very alien goal. This could hold either because internal concepts track our kind of macrostates and this makes it less likely for an alien goal to crystallize, or because alien goals do crystallize but then the agent doesn't perform well (since it doesn't have natural abstractions to correctly pursue those) and so the alien goal is dissolved in the next training steps. I expect you to believe something like "goal-dependence is not so low that you'd get alignment by default, but it's low enough that if you know what you're looking for inside the AI you don't need a trillion checks".]
John:
Good question.
Short answer: some goals incentivize general intelligence, which incentivizes tracking lots of abstractions and also includes the ability to pick up and use basically-any natural abstractions in the environment at run-time.
Longer answer: one qualitative idea from the Gooder Regulator Theorem is that, for some goals in some environments, the agent won't find out until later what its proximate goals are. As a somewhat-toy example: imagine playing a board game or video game in which you don't find out the win conditions until relatively late into the game. There's still a lot of useful stuff to do earlier on - instrumental convergence means that e.g. accumulating resources and gathering information and building general-purpose tools are all likely to be useful for whatever the win condition turns out to be.
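[To put a number on that, here is a tiny toy of my own construction: the "win condition" is a target cell on a line, revealed only a few moves before the end. There is still a clearly best thing to do beforehand, namely move to the middle and keep your options open, which is the instrumental-convergence point in miniature.]

```python
# Toy game (my construction): the target cell is revealed late, leaving only a
# few moves to reach it. Waiting in a central cell maximizes the probability of
# success over all possible targets -- a miniature of instrumental convergence.
import numpy as np

n_cells = 21                # cells 0..20; the target is uniform over these
moves_after_reveal = 5      # moves left once the target is revealed (1 cell per move)

cells = np.arange(n_cells)
# For each cell you might be waiting in, probability of reaching a random target in time.
reach_prob = np.array([np.mean(np.abs(cells - c) <= moves_after_reveal) for c in cells])

best = cells[reach_prob == reach_prob.max()]
print("P(success) by waiting cell:", np.round(reach_prob, 2))
print("best cells to wait in (the central band):", best)
print(f"P(success) there = {reach_prob.max():.2f}, vs {reach_prob[0]:.2f} at the edge")
```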
That's the sort of goal (and environment) which incentivizes general intelligence.
In terms of cognitive architecture, that sort of goal incentivizes learning lots of natural abstractions (since they're likely to be useful for many different goals), and being able to pick up new abstractions on the fly as one gains new information about one's goals.
I claim that humans have that sort of "general intelligence". One implication is that, while there are many natural abstractions which we don't currently track (because the world is big, and I can't track every single object in it), there basically aren't any natural abstractions which we can't pick up on the fly if we need to. Even if an AI develops a goal involving molecular squiggles, I can still probably understand that abstraction just fine once I pay attention to it.
I also claim that the kind of AI we're interested in, the kind which is dangerous and highly useful, will also have that sort of "general intelligence". The implication is that, while there will be many natural abstractions which such an AI isn't tracking at any given time, there basically aren't any natural abstractions which it can't pick up on the fly if it needs to. Furthermore, it will probably already track the "most broadly relevant" abstractions in the environment, which likely includes most general properties of humans (as opposed to properties of specific humans, which would be more locally-specific).
Martín:
Okay, so maybe the formal version is the following:
For some kinds of agents and some kinds of environments, while it remains strictly true that they only have direct observations of the macrostate partitions in S (my eyes cannot see into the infrared), they can (and are incentivized to) use these in clever ways (like building a thermal camera) to come to correct hypotheses about the other partitions (assuming a certain simple conceptual structure governs and binds all partitions, like the laws of physics). This will de facto mean that the agent develops internal concepts (natural abstractions, summary stats) that efficiently track these additional macrostates S’, almost as natively as if it could directly observe them.
Two agents who have gotten past the point of doing that will continue building more general cognitive tools to expand S’ further, because having more macrostate partitions is sometimes more useful: maybe whether a human survives is determined by the position of an electron, and so I start caring about controlling that macrostate as well (even if it wasn’t initially observable by me). Thus the particularity of their abstractions (first set by their perceptors and goals) will dissolve into more general all-purpose mechanisms, and they’ll end up with pretty similar S’. There might still be some path-dependence here, but we quantitatively expect it to be low. In particular, you expect the abstractions we track and study (inside the human S’) to be enough to understand those within an AI (or at least, to be enough of a starting point to build more abstractions and end up achieving that).
[That expectation sounds reasonable, although I guess eventually I would like more quantitative arguments to that effect.]
John:
Yep!
Some afterthoughts
I think this is a philosophically important point that most people don't have in mind when thinking about natural abstractions (nor had I seen it addressed in John's work): we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (independent of the macrostates we are trying to track), and that's not quite true.
As discussed, it's unclear whether this philosophical point poses a pragmatic problem, or any hindrance to the realistic implementation of John's agenda. My intuition is no, but this is a subtle question.[2]
This discussion is very similar to the question of whether agency is observer-dependent. There again I think the correct answer is the intentional stance: an agent is whatever is useful for me to model as intention-driven. And so, since different observers will find different concepts useful (due to their goals or sensors or computational capabilities), different observers will define agency differently.
And as above, this true philosophical point doesn't prohibit the possibility that, in practice, observers who have been shaped enough by learning (and have learned to shape themselves further and improve their abstractions) might agree on which concepts and computations (including agency-related heuristics) are most useful to deal with physical systems under different constraints, because this is a property of physics/math.
Related post: Why does generalization work?
I'm not sure I independently discovered this consideration; maybe a similar one was floating around some Agent Foundations workshop, possibly voiced by Sam Eisenstat. But the way it came up more recently was in my thinking about why generalization works in our universe (post coming soon).
A way to get rid of the observer- (and goal-) dependence is to integrate over all observers (and goals), or some subclass of them; the new definition of natural abstraction could then be "this property of reality is on average a good summary stat for most observers (and goals)" (for some definition of "on average", which would probably be tricky to decide on). But if different observers have quite different natural abstractions, this won't be very useful. So the important question is, quantitatively, how much convergence there is.
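One way to make that averaging precise (a sketch only; the choice of observer distribution is exactly the tricky part just mentioned): writing $S_o$ for observer $o$'s sensors and $Y_o$ for the later variables $o$ cares about, score a candidate abstraction $A$ by the fraction of goal-relevant information it retains on average,
$$ \mathrm{NA}(A) \;=\; \mathbb{E}_{o \sim \mathcal{D}}\!\left[ \frac{I\big(Y_o \,;\, A(S_o(X))\big)}{I\big(Y_o \,;\, S_o(X)\big)} \right], $$
for some distribution $\mathcal{D}$ over observers (and goals). The convergence question then becomes: how sharply is this score peaked on the same few $A$'s for very different reasonable choices of $\mathcal{D}$?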