I definitely agree that the player vs character distinction is meaningful, although I would define it a bit differently.
I would identify it with cortical vs subcortical, a.k.a. neocortex vs everything else. (...with the usual footnotes, e.g. the hippocampus counts as "cortical" :-D)
(ETA: See my later post Inner alignment on the brain for a better discussion of some of the below.)
The cortical system basically solves the following problem:
Here is (1) a bunch of sensory & other input data, in the form of spatiotemporal patterns of spikes on input neurons, (2) occasional labels about what's going on right now (e.g. "something good / bad / important is happening"), (3) a bunch of outgoing neurons. Your task is to build a predictive model of the inputs, and use that to choose signals to send into the outgoing neurons, to make more good things happen.
The result is our understanding of the world, our consciousness, imagination, memory, etc. Anything we do that requires understanding the world is done by the cortical system. This is your "character".
The subcortical system is responsible for everything else your brain does to survive, one of which is providing the "labels" mentioned above (that something good / bad / important / whatever is happening right now).
For example, take the fear-of-spiders instinct. If there is a black scuttling blob in your visual field, there's a subcortical vision system (in the superior colliculus) that pattern-matches that moving blob to a genetically-coded template, and thus activates a "Scary!!" flag. The cortical system sees the flag, sees the spider, and thus learns that spiders are scary, and it can plan intelligent actions to avoid spiders in the future.
I have a lot of thoughts on how to describe these two systems at a computational level, including what the neocortex is doing, and especially how the cortical and subcortical systems exchange information. I am hoping to write lots more posts with more details about the latter, especially about emotions.
even the reward and optimization mechanisms themselves may end up getting at least partially rewritten.
Well, there is such a thing as subcortical learning, particularly for things like fine-tuning motor control programs in the midbrain and cerebellum, but I think most or all of the "interesting" learning happens in the cortical system, not subcortical.
In particular, I'm not really expecting the core emotion-control algorithms to be editable by learning or thinking (if we draw an appropriately tight boundary around them).
More specifically: somewhere in the brain is an algorithm that takes a bunch of inputs and calculates "How guilty / angry / happy / smug / etc. should I feel right now?" The inputs to this algorithm come from various places, including from the body (e.g. pain, hunger, hormone levels), and from the cortex (what emotions am I expecting or imagining or remembering?), and from other emotion circuits (e.g. some emotions inhibit or reinforce each other). The inputs to the emotion calculation can certainly change, but I don't expect that the emotion calculation itself changes over time.
It feels like emotion-control calculations can change, because the cortex can be a really dominant input to those calculations, and the cortex really can change, including by conscious effort. Why is the cortex such a dominant input? Think about it: the emotion-calculation circuits don't know whether I'm likely to eat tomorrow, or whether I'm in debt, or whether Alice stole my cookie, or whether I just got promoted. That information is all in the cortex! The emotion circuits get only tiny glimpses of what's going on in the world, particularly through the cortex predicting & imagining emotions, including in empathetic simulation of others' emotions. If the cortex is predicting fear, well, the amygdala obliges by creating actual fear, and then the cortex sees that and concludes that its prediction was right all along! There's very little "ground truth" that the emotion circuits have to go on. Thus, there's a wide space of self-reinforcing habits of thought. It's a terrible system! Totally under-determined. Thus we get self-destructive habits of thought that linger on for decades.
Anyway, I have this long-term vision of writing down the exact algorithm that each of the emotion-control circuits is implementing. I think AGI programmers might find those algorithms helpful, and so might people trying to pin down "human values". I have a long way to go in that quest :-D
there's also a sense in which the player doesn't have anything that we could call values ...
I basically agree; I would describe it by saying that the subcortical systems are kinda dumb. Sure, the superior colliculus can recognize scuttling spiders, and the emotion circuits can "dislike" pain. But any sophisticated concept like "flourishing", "fairness", "virtue", etc. can only be represented in the form of something like "Neocortex World Model Entity ID #30962758", and these things cannot have any built-in relationship to subcortical circuits.
So the player's "values" are going to (1) simple things like "less pain is good", and (2) things that don't have an obvious relation to the outside world, like complicated "preferences" over the emotions inside our empathetic simulations of other people.
If a "purely character-level" model of human values is wrong, how do we incorporate the player level?
Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P
But either way, I'm all for trying to get a better understanding of how I (the character / cortical system) am "built" by the player / subcortical system. :-)
Great comment, thanks!
Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P
Well, to make up a silly example, let's suppose that you have a conscious belief that you want there to be as much cheesecake as possible. This is because you are feeling generally unsafe, and a part of your brain has associated cheesecakes with a feeling of safety, so it has formed the unconscious prediction that if only there was enough cheesecake, then you would finally feel good and safe.
So you program the AI to extract your character-level values, it correctly notices that you want to have lots of cheesecake, and goes on to fill the world with cheesecake... only for you to realize that now that you have your world full of cheesecake, you still don't feel as happy as you were on some level expecting to feel, and all of your elaborate rational theories of how cheesecake is the optimal use of atoms start feeling somehow hollow.
There is a missmatch in saying cortex=charcter and subcortex=player.
If I understand the player-character model right, then uncosuios coping strategies would be player level tactic. But these are learned behaviours, and would therfore be part of cortex.
In Kaj's example, the idea that cheescake will make the bad go away exist in the cortex's world model.
According to Steven's model of how the brain works (which I think is probably ture), the subcortex is part of the game the player is playing. Specificcally, the subcortex provides the reward signal, and some other importat game stats (stamina level, hit-points, etc). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evoulution) thinks might be usefull, and occational cut scenes (acting out pre-programed behaviour).
ML comparasion:
* The character is the pre trained nerual net
* The player is the backprop
* The cortex is the neural net and backprop
* Subcortex is the reward signarl and sometimes supervisory signal.
Also, I don't like the the player-character model much. Like all models it is at best a simplification, and it does catch some of what is going on, but I think it is more wrong than right and I think something like multi-agent model is much better. I.e. there are coping mechanmisms and other less consious strategies living in your brains side by side with who you think you are. But I don't think these are compleetly invissible the way the player is invissible to the character. They are predictive models (e.g. "cheescake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.
I have been thinking about Stuart Armstrong's preference synthesis research agenda, and have long had the feeling that there's something off about the way that it is currently framed. In the post I try to describe why. I start by describing my current model of human values, how I interpret Stuart's implicit assumptions to conflict with it, and then talk about my confusion with regard to reconciling the two views.
The two-layer/ULM model of human values
In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:
I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:
Rephrasing these posts in terms of each other, in a person's brain "the player" is the underlying learning machinery, which is searching the space of programs (brains) in order to find a suitable configuration; the "character" is whatever set of emotional responses, aesthetics, identities, and so forth the learning program has currently hit upon.
Many of the things about the character that seem fixed, can in fact be modified by the learning machinery. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes a number of such updates, such as - in these terms - the ULM eliminating subprograms blocking confidence after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.
Another example of this kind of a thing was the framework that I sketched in Building up to an Internal Family Systems model: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology's social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then level seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of a model is compatible with an IFS/ULM style model, where the learning machinery sets the goal of preventing particular outcomes, and then applies the "reasoning module" for that purpose.
Qiaochu Yuan notes that once you see people being upset at their coworker for criticizing them and you do therapy approaches with them, and this gets to the point where they are crying about how their father never told them that they were proud of them... then it gets really hard to take people's reactions to things at face value. Many of our consciously experienced motivations, actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing, The Elephant in the Brain, The Intelligent Social Web.)
Preference synthesis as a character-level model
While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences, I have a serious concern about it which is best described as: everything in it is based on the character level, rather than the player/ULM level.
For example, in "Our values are underdefined, changeable, and manipulable", Stuart - in my view, correctly - argues for the claim stated in the title... except that, it is not clear to me to what extent the things we intuitively consider our "values", are actually our values. Stuart opens with this example:
From this, Stuart suggests that people's values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions "values" in the first place is misleading.
My preferred framing would rather be that people's values - in the sense of some deeper set of rewards which the underlying machinery is optimizing for - are in fact underdetermined, but that is not what's going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of a consideration. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when it was choosing its initial strategy, causing different strategic choices.
Stuart's research agenda does talk about incorporating meta-preferences, but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions "I want to be more generous" and "I want to have consistent preferences" as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like "the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes" and "the learning system has formulated consistency as a generally valuable heuristic and one which affirms the 'logical thinker' identity, which in turn is being optimized because of its predicted social outcomes".
My confusion about a better theory of values
If a "purely character-level" model of human values is wrong, how do we incorporate the player level?
I'm not sure and am mostly confused about it, so I will just babble & boggle at my confusion for a while, in the hopes that it would help.
The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.
For example, the NEDERA model claims that there exist nine negative core feelings whose avoidance humans are optimizing for: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:
So - assuming for the sake of argument that these findings are correct - one might think something like "okay, here are the things the brain is trying to avoid, we can take those as the basic human values".
But not so fast. After all, emotions are all computed in the brain, so "avoidance of these emotions" can't be the only goal any more than "optimizing happiness" can. It would only lead to wireheading.
Furthermore, it seems like one of the things that the underlying machinery also learns, is situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot and stick scheme, in which the system comes to predict that if it will feel persistently bad, this will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.
Also, we are not only trying to avoid negative feelings. Empirically, it doesn't look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.
So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:
So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it's unpleasant). But this too risks running into the wireheading issue.
A problem here is that while it might make intuitive sense to say "okay, if the character's values aren't the real values, let's use the player's values instead", the split isn't actually anywhere that clean. In a sense the player's values are the real ones - but there's also a sense in which the player doesn't have anything that we could call values. It's just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like "existential risk" or "avoiding wireheading" or necessarily even "personal survival" - thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.
For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player's values seem like they could be even less defined and even more mutable. It's not clear to me that we can call them values in the first place, either - any more than it makes meaningful sense to say that a neuron in the brain "values" firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.
To the extent that there exists something that intuitively resembles what we call "human values", it feels like it exists in some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be very clearly distinguished from each other in the first place.
Or something. I'm confused.