This was a cool post, I found the core point interesting. Very similar to gradient hacker design.
As a general approach to avoiding value drift, it does have a couple of very big issues (which I'm guessing TurnTrout already understands, but which I'll point out for others). First very big issue: it requires that the agent basically decouple its cognition from reality when the relevant reward is applied. That's only useful if the value-drift-inducing events occur only once in a while and are very predictable. If value drift just occurs continuously due to everyday interactions, or if it occurs unpredictably, then the strategy probably can't be implemented without making the agent useless.
Second big issue: it only applies to reward-induced value drift within an RL system. That's not the only setting in which value drift is an issue - for instance, MIRI's work on value drift focused mainly on parent-child value drift in chains of successor AIs. Value drift induced by gradual ontology shifts is another example.
"As a general approach to avoiding value drift"
One interpretation of this phrase is that we want AI to generally avoid value drift -- to get good values in the AI, and then leave it. (This probably isn't what you meant, but I'll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely-good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce the human-aligned coalitional weight.
I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understood shard theory correctly, shards do not have their own world model; they can only bid actions up or down according to the consequences those actions might have under the world model that is available to all shards. Please correct me if I am wrong about this point.
So I don’t see how the Rick-shard could really "trick" the atheism-shard via rationalisation.
If the Rick-shard sees that "church-going for respect-reasons" will lead to conversion, then the atheism-shard has to see that too, because they query the same world model. So the atheism-shard should bid against that plan just as heavily as against "going to church for conversion reasons".
I think there is something else going on here. I think the Rick-shard does not trick the atheism-shard, but rather the conscious part that is not described by shard theory.
I think your comment highlights an important uncertainty of mine. Here's my best guess:
I think planning involves world-model invocations (i.e. the predictive machinery which predicts relevant observables for plan stubs, like "get in my car"). It seems to me that there is subconscious planning, to some degree. If true, you wouldn't notice the world model being invoked because it's subconscious. Insofar as "you" are in part composed of some set of shards or some algorithm which aggregates shard outputs, it's therefore true that the world-model invocations aren't globally visible. Therefore, it's possible for certain kinds of WM invocations to not be visible to certain shards, even though those shards usually "hook into the WM" (e.g. check the number of diamonds the plan leads to).
Separately, I'd guess that shards can be shaped to invoke the world model (e.g. "if this plan gets considered, will it be executed?") without themselves being agents.
I don't think that shards are distinct - neither physically nor logically, so they can't hide stuff in the sense of keeping it out of view of the other shards.
Also, I don't think "querying for plans" is a good summary of what goes on in the brain.
I'm coming more from a brain-like AGI lens, and my account of what goes on would be a bit different. I'm trying to phrase this in shard theory terminology.
First, a prerequisite: Why do Alice's shards generate thoughts that value Rick's state, to begin with? The Rick-shard has learned that actions that make Rick happy result in states of Alice that are reinforced (Alice being happy/healthy).
Given that, I see the process as follows:
In short: There is no top-down planning but bottom-up action generation. All planning is constructed out of plan fragments that are compatible with all (existing) shards.
Thanks for detailing it. I understand you to describe ~iterative filtering and refinement of a crude proto-plan (church-related thoughts and impulses) which filter down into a more detailed plan, where each piece is selected to be amenable to all relevant shards (without explicit planning).
I think it doesn't sound quite right to me, still, for a few reasons I'll think about more.
I regret that this post doesn't focus on practical advice derived from shard theory. Instead, I mostly focused on a really cool ideal-agency trick ("pretend really hard to wholly fool your own credit assignment"), which is cool but impracticable for real people (joining the menagerie currently inhabited by e.g. logical inductors, value handshakes, and open-source game theory).
I think that shard theory suggests a range of practical ways to improve your own value formation and rationality. For example, suppose I log in and see that my friend John complimented this post. This causes a positive reward event. By default, I might (subconsciously) think "this feels good because John complimented me." Which causes me to be more likely to act to make John (and others) approve of me.
However, that's not how I want to structure my motivation. Instead, in this situation, I can focus on the cognition I want reinforced:
this feels good because John complimented me, which happened because I thought carefully this spring and came up with new ideas, and then communicated them clearly. I'm glad I thought carefully, that was great. I noticed confusion when Quintin claimed (IIRC) that wireheading always makes you more of a wireheader—I stopped to ask whether that was actually true. What do I think I know, and why do I think I know it? Noticing that confusion was also responsible for this moment.
I'm basically repeating and focusing the parts I want to be reinforced. While I don't have a tight first-principles argument that this conscious attention does in fact redirect my credit assignment in the right way, I really think it should, and so I've started this practice on that hunch.
I use the shard theory of human values to clarify what value drift is, how it happens, and how it might be avoided by a highly intelligent agent—even if that agent doesn't have any control over its future experiences. Along the way, I give a shard theory account of rationalization.
Defining "value drift"
Recapitulating part of shard theory. Reward is that which reinforces. Considering the case of reinforcement learning in humans, reward causes your brain’s credit assignment algorithms[1] to reinforce the actions and thoughts which led to that reward, making those actions and thoughts more likely to be selected in the future.
For example, suppose you recognize a lollipop, and move to pick it up, and then lick the lollipop. Since the lollipop produces reward, these thoughts will be reinforced and you will be more likely to act similarly in such situations in the future. You become more of the kind of person who will move to pick up a lollipop when you recognize lollipops, and who will navigate to lollipop-containing locations to begin with.
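As a rough illustration of "reward is that which reinforces", here is a minimal Python sketch. Everything in it is an assumption made for illustration rather than a claim about how brains implement credit assignment: the softmax selection rule, the additive update, the learning rate, the reward value, and the names `action_probabilities` and `reinforce` are all hypothetical.

```python
import math

def action_probabilities(logits):
    """Softmax over action logits: a higher logit means the action is more likely to be selected."""
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def reinforce(logits, executed, reward, lr=0.5):
    """Toy credit assignment (an assumed rule): nudge every thought/action that
    was active before the reward by lr * reward. Returns a new dict."""
    updated = dict(logits)
    for name in executed:
        updated[name] += lr * reward
    return updated

# Hypothetical dispositions in the mental context "I see a lollipop".
logits = {"pick up lollipop": 0.0, "ignore lollipop": 0.0}
before = action_probabilities(logits)

# Licking the lollipop produces positive reward (+1.0 is an arbitrary value).
logits_after = reinforce(logits, executed=["pick up lollipop"], reward=1.0)
after = action_probabilities(logits_after)

print(before["pick up lollipop"], after["pick up lollipop"])
```

In this toy, the only thing the update does is make the reinforced action more probable the next time the same context comes up, which is the sense of "you become more of the kind of person who will pick up a lollipop" used here.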
With that in mind, I think that shard theory offers a straightforward definition of "value drift":
For example, consider the classic "example" of taking a pill which makes you enjoy killing people. Under shard theory, this change would be implemented as a murder-shard that activates in a wide range of contexts in order to steer planning towards murder, and therefore starts steering your decision-making substantially differently.
But it's better to try to explain phenomena which, you know, are known to actually happen in real life. Another simple example of value drift is when someone snorts cocaine. At a (substantial) gloss, the huge hit of reward extremely strongly upweights the decision to do cocaine; the strength of the reward leads to an unusually strong cocaine-shard which activates in an unusually wide range of situations.
Here's a more complicated example of value drift. I'll give one possible mechanistic story for the "value drift" which occurs to an atheist (Alice) dating a religious person (Rick), and why that situation might predictably lead to Alice converting or Rick deconverting. I'll consider a scenario where Alice converts.
First, reinforcement events cause Alice to develop shards of value around making Rick happy and making Rick like her. Alice's new shards (non-introspectively-apparently) query her world model for plans which make Rick happier and which make Rick like her more. Obviously, if Alice converted, they would have more in common, and Rick would be happy. Since these plans lead to Rick being happy and liking Alice more, these shards bid for those plans.
Only, the plan is not bid for directly in an introspectively obvious manner. That would provoke opposition from Alice's other values (which oppose deliberately changing her religious status just to make Rick happy). Alice's self-model predicts this opposition, and so her Rick-happiness- and Rick-approval-shards don't bid for the "direct" conversion plan, because it isn't predicted to work (and therefore won't lead to a future where Rick is happier and approves of Alice more). No, instead, these two shards rationalize internally-observable reasons why Alice should start going to Rick's church: "it's respectful", "church is interesting", "if I notice myself being persuaded I can just leave", "I'll get to spend more time with Rick."[2]
Here, then, is the account:
Her Rick-shards plan to induce value drift, and optimize the plan to make sure that it's hard for her other shards to realize the implicitly-planned outcome (Alice converting) and bid against it. This is one kind of decision-making algorithm which rationalizes against itself.
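The dynamic in this account can be caricatured in a few lines of Python. This is only a sketch under a strong assumption, one the post itself is unsure about: that Alice's other values react to a plan's introspectively visible framing rather than to its full predicted consequences. The `Plan` fields, the payoff numbers, and the functions `opposed` and `rick_shard_bid` are hypothetical names invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    visible_framing: frozenset   # reasons that are introspectively observable
    rick_happiness: float        # Rick-shards' predicted payoff if the plan runs
    leads_to_conversion: bool    # implicit outcome, absent from the framing

def opposed(plan):
    """ASSUMPTION: Alice's other values veto only plans whose visible framing
    is 'convert to please Rick'; they do not inspect implicit outcomes."""
    return "convert to please Rick" in plan.visible_framing

def rick_shard_bid(plan):
    """Rick-shards bid by expected payoff. A plan that would be vetoed is
    predicted not to execute, so it is worth nothing to them."""
    return 0.0 if opposed(plan) else plan.rick_happiness

plans = [
    Plan("convert outright", frozenset({"convert to please Rick"}), 10.0, True),
    Plan("attend Rick's church",
         frozenset({"it's respectful", "I'll get to spend more time with Rick"}),
         6.0, True),
    Plan("stay home", frozenset({"relax"}), 0.0, False),
]

chosen = max(plans, key=rick_shard_bid)
print(chosen.name, chosen.leads_to_conversion)
```

Under these assumptions the winning plan is the church visit: it is not the plan with the highest raw payoff, but it is the best one that survives the predicted opposition, and it still carries the implicit conversion outcome. If the opposing values could see `leads_to_conversion` directly, the toy would no longer produce rationalization, which is the objection raised in the comments.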
Under shard theory, rationality is sometimes hard because "conscious-you" has to actually fight deception by other parts of yourself.
One simple trick for avoiding value drift
Imagine you’ve been kidnapped by an evil, mustache-twirling villain who wants to corrupt your value system. They tie you to a chair and prepare to stimulate your reward circuitry. They want to ruin your current values by making you into an addict and a wireheader.
Exercise: How do you come out of the experience with your values intact?
In principle, the answer is simple. You just convince yourself you’re experiencing a situation congruent with your endorsed values, in a sufficiently convincing way that your brain’s credit assignment algorithm reinforces your pretend-actions when the brain stimulation reward occurs!
Consider that the brain does not directly observe the outside world. The outside world’s influence on your thinking is screened off by the state of your brain. The state of the brain constitutes the mental context. If you want to determine the output of a brain circuit, the mental context[3] screens off the state of the world. In particular, this applies to the value updating process by which you become more or less likely to invoke certain bundles of heuristics (“value shards”) in certain mental contexts.
For example, suppose you lick a red lollipop, but that produces a large negative reward (maybe it was treated with awful-tasting chemicals). Mental context: “It’s Tuesday. I am in a room with a red lollipop. It looks good. I’m going to lick it. I think it will be good.” The negative reward reshapes your cognition, making you less likely to think similar thoughts and take similar actions in similar future situations.
Of the thoughts which were thunk before the negative reward, the credit assignment algorithm somehow identifies the relevant thoughts to include “It looks good”, “I’m going to lick it”, “I think it will be good”, and the various motor commands. You become less likely to think these thoughts in the future. In summary, the reason you become less likely to think these thoughts is that you thought them while executing the plan which produced negative reward, and credit assignment identified them as relevant to that result.
Credit assignment cannot and will not penalize thoughts[4] which do not get thunk at all, or which it deems “not relevant” to the result at hand. Therefore, in principle, you could just pretend really hard that you’re in a mental context where you save a puppy’s life. When the electrically stimulated reward hits, the altruism-circuits get reinforced in the imagined mental context. You become more altruistic overall.
Of course, you have to actually dupe the credit assignment algorithm into ignoring the latent “true” mental context. But your credit assignment is not infinitely clever. And if it were, well, you could (in principle) add an edge-case for situations like this. So there is, in principle, a way to do it.
Therefore, your values can always be safe in your own mind, if you’re clever, foresightful, and have enough write access to fool credit assignment. Even if you don’t have control over your own future observations.
If this point still does not seem obvious, consider a scenario where you are blindfolded, and made to believe that you are about to taste a lollipop. Then, your captors fake the texture and smell and feel of a lollipop in your mouth, while directly stimulating your taste buds in the same way the lollipop would have. They remove the apparatus, and you go home. Do you think you have become reshaped to value electrical stimulation of your tongue? No. That is impossible, since your brain has no idea about what actually happened. Credit assignment responds to reward depending on the mental context, not on the external situation.
Misunderstanding this point can lead to confusion. If you have a wire stuck in your brain’s reward center, surely that reward reinforces having a wire stuck in your brain! Usually so, but not logically so. Your brain can only reward based on its cognitive context, based on the thoughts it actually thought which it identifies as relevant to the achievement of the reward. Your brain is not directly peering out at reality and making you more likely to enter that state in the future.
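The core claim of this section, that the update depends on the mental context and not on the external situation, can be sketched as a toy model. This is a minimal illustration under stated assumptions, not a model of real neural credit assignment: shards are reduced to labels, the update is additive, and `credit_assignment` and `undergo` are hypothetical names.

```python
def credit_assignment(strengths, mental_context, reward, lr=1.0):
    """Toy update rule (assumed). Note the signature: there is no world-state
    argument. Only shards active in the mental context are updated."""
    updated = dict(strengths)
    for shard in mental_context:
        updated[shard] = updated.get(shard, 0.0) + lr * reward
    return updated

def undergo(world_state, mental_context, strengths, reward):
    """world_state is accepted but deliberately unused: in this toy it can
    matter only through whatever mental context it happens to produce."""
    return credit_assignment(strengths, mental_context, reward)

start = {"altruism": 1.0, "wireheading": 0.0}
wire = "wire in reward center"

# Same external situation, two different mental contexts.
naive = undergo(wire, ["wireheading"], start, reward=5.0)
pretending = undergo(wire, ["altruism"], start, reward=5.0)

# Different external situation, same mental context as `pretending`.
real_rescue = undergo("actually saving a puppy", ["altruism"], start, reward=5.0)

print(naive)
print(pretending)
print(pretending == real_rescue)
```

In this sketch the villain's stimulation strengthens whichever shard was active when the reward arrived, and a perfectly convincing pretend-rescue is indistinguishable from a real one as far as the update is concerned. The hard part, which the toy assumes away, is actually getting the pretend context past your own credit assignment.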
Conclusion
Value drift occurs when your values shift. In shard theory, this means that your internal decision-making influences (i.e. shards) are rebalanced by reinforcement events. For example, if you try cocaine, that causes your brain's credit assignment to strongly upweight decision-making which uses cocaine and which pursues rewarding activities.
Value drift is caused by credit assignment. Credit assignment can only depend on its observable mental context, and can't directly peer out at the world to objectively figure out what caused the reward event. Therefore, you can (in theory) avoid value drift by tricking credit assignment into thinking that the reward was caused by a decision to e.g. save a puppy's life. In that case, credit assignment would reinforce your altruism-shard. While humans probably can't dupe their own credit assignment algorithm to this extent, AIs can probably add edge cases to their own updating process. But knowing how value drift works (on this theory, via "unendorsed" reinforcement events) seems practically helpful for avoiding or navigating value-risky situations (like gaining lots of power or money).
Thanks to Justis Mills for proofreading.
[1] These credit assignment algorithms may be hardcoded and/or learned.
[2] I feel confused about how, mechanistically, other shards wouldn't fully notice the proto-deceptive plan being evaluated by the self-model, but presently think this "partial obfuscation" happens in shard dynamics for human beings. I think the other shards do somewhat observe the proto-deception, and this is why good rationalists can learn to rationalize less.
[3] In The shard theory of human values, we defined the "mental context" of a circuit to be the inputs to that circuit which determine whether it fires or not. Here, I use "mental context" to also refer to the state of the entire brain, without considering a specific circuit. I think both meanings are appropriate and expect the meaning will be clear from the context.
[4] "Credit assignment penalizes thoughts" seems like a reasonable frame to me, but I'm flagging that this could misrepresent the mechanistic story of human cognition in some unknown-to-me way.