I have been thinking about Stuart Armstrong's preference synthesis research agenda, and have long had the feeling that there's something off about the way it is currently framed. In this post, I try to describe why. I start by describing my current model of human values, then explain how I interpret Stuart's implicit assumptions as conflicting with that model, and finally talk about my confusion over how to reconcile the two views.

The two-layer/ULM model of human values

In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:

In many games, such as Magic: The Gathering, Hearthstone, or Dungeons and Dragons, there’s a two-phase process. First, the player constructs a deck or character from a very large sample space of possibilities.  This is a particular combination of strengths and weaknesses and capabilities for action, which the player thinks can be successful against other decks/characters or at winning in the game universe.  The choice of deck or character often determines the strategies that deck or character can use in the second phase, which is actual gameplay.  In gameplay, the character (or deck) can only use the affordances that it’s been previously set up with.  This means that there are two separate places where a player needs to get things right: first, in designing a strong character/deck, and second, in executing the optimal strategies for that character/deck during gameplay. [...]
The idea is that human behavior works very much like a two-level game. [...] The player determines what we find rewarding or unrewarding.  The player determines what we notice and what we overlook; things come to our attention if it suits the player’s strategy, and not otherwise.  The player gives us emotions when it’s strategic to do so.  The player sets up our subconscious evaluations of what is good for us and bad for us, which we experience as “liking” or “disliking.”
The character is what executing the player’s strategies feels like from the inside.  If the player has decided that a task is unimportant, the character will experience “forgetting” to do it.  If the player has decided that alliance with someone will be in our interests, the character will experience “liking” that person.  Sometimes the player will notice and seize opportunities in a very strategic way that feels to the character like “being lucky” or “being in the right place at the right time.”
This is where confusion often sets in. People will often protest “but I did care about that thing, I just forgot” or “but I’m not that Machiavellian, I’m just doing what comes naturally.”  This is true, because when we talk about ourselves and our experiences, we’re speaking “in character”, as our character.  The strategy is not going on at a conscious level. In fact, I don’t believe we (characters) have direct access to the player; we can only infer what it’s doing, based on what patterns of behavior (or thought or emotion or perception) we observe in ourselves and others.

I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:

The universal learning hypothesis proposes that all significant mental algorithms are learned; nothing is innate except for the learning and reward machinery itself (which is somewhat complicated, involving a number of systems and mechanisms), the initial rough architecture (equivalent to a prior over mindspace), and a small library of simple innate circuits (analogous to the operating system layer in a computer).  In this view the mind (software) is distinct from the brain (hardware).  The mind is a complex software system built out of a general learning mechanism. [...]
An initial untrained seed ULM can be defined by 1.) a prior over the space of models (or equivalently, programs), 2.) an initial utility function, and 3.) the universal learning machinery/algorithm.  The machine is a real-time system that processes an input sensory/observation stream and produces an output motor/action stream to control the external world using a learned internal program that is the result of continuous self-optimization. [...]
The key defining characteristic of a ULM is that it uses its universal learning algorithm for continuous recursive self-improvement with regards to the utility function (reward system).  We can view this as second (and higher) order optimization: the ULM optimizes the external world (first order), and also optimizes its own internal optimization process (second order), and so on.  Without loss of generality, any system capable of computing a large number of decision variables can also compute internal self-modification decisions.
Conceptually the learning machinery computes a probability distribution over program-space that is proportional to the expected utility distribution.  At each timestep it receives a new sensory observation and expends some amount of computational energy to infer an updated (approximate) posterior distribution over its internal program-space: an approximate 'Bayesian' self-improvement.
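To make the quoted definition a bit more concrete, here is a minimal toy sketch (every name and the tiny two-program "program space" are my own illustrative inventions, not anything from the ULM post): a distribution over candidate programs is multiplicatively reweighted toward programs that would have earned utility, a crude stand-in for the approximate Bayesian update over program-space.

```python
import random

def ulm_step(weights, programs, observation, reward_fn, lr=0.5):
    """One timestep: act from the current mixture of programs, then
    reweight each program by the utility it would have earned."""
    # Act: sample a program in proportion to its current posterior weight.
    total = sum(weights.values())
    r = random.uniform(0, total)
    acc = 0.0
    for name, program in programs.items():
        acc += weights[name]
        if acc >= r:
            action = program(observation)
            break
    # Update: multiplicative reweighting toward high-utility programs,
    # a crude stand-in for the posterior update over program-space.
    for name, program in programs.items():
        weights[name] *= (1 + lr) ** reward_fn(observation, program(observation))
    return action, weights

# A trivial environment: reward 1 when the action equals the observation.
programs = {"copy": lambda obs: obs, "negate": lambda obs: -obs}
weights = {"copy": 1.0, "negate": 1.0}
reward = lambda obs, act: 1.0 if act == obs else 0.0

random.seed(0)
for t in range(20):
    _, weights = ulm_step(weights, programs, observation=t % 3, reward_fn=reward)

# Nearly all posterior mass ends up on the program that actually earns reward.
print(weights["copy"] > weights["negate"])  # → True
```

Note that nothing here is the "character's" conscious deliberation; the reweighting operates on whole behavioral programs from the outside, which is the sense in which the player selects the character.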

Rephrasing these posts in terms of each other, in a person's brain "the player" is the underlying learning machinery, which is searching the space of programs (brains) in order to find a suitable configuration; the "character" is whatever set of emotional responses, aesthetics, identities, and so forth the learning program has currently hit upon.

Many of the things about the character that seem fixed can in fact be modified by the learning machinery. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes a number of such updates, such as - in these terms - the ULM eliminating subprograms blocking confidence after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.

Another example of this kind of a thing was the framework that I sketched in Building up to an Internal Family Systems model: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology's social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then level seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of a model is compatible with an IFS/ULM style model, where the learning machinery sets the goal of preventing particular outcomes, and then applies the "reasoning module" for that purpose.

Qiaochu Yuan notes that once you see people being upset at their coworker for criticizing them, do therapy approaches with them, and get to the point where they are crying about how their father never told them that he was proud of them... then it gets really hard to take people's reactions to things at face value. Many of our consciously experienced motivations actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing, The Elephant in the Brain, The Intelligent Social Web.)

Preference synthesis as a character-level model

While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences, I have a serious concern about it which is best described as: everything in it is based on the character level, rather than the player/ULM level.

For example, in "Our values are underdefined, changeable, and manipulable", Stuart - in my view, correctly - argues for the claim stated in the title... except that it is not clear to me to what extent the things we intuitively consider our "values" are actually our values. Stuart opens with this example:

When asked whether "communist" journalists could report freely from the USA, only 36% of Americans surveyed in 1950 agreed. A follow-up question about American journalists reporting freely from the USSR got 66% agreement. When the order of the questions was reversed, 90% were in favour of American journalists - and an astounding 73% in favour of the communist ones.

From this, Stuart suggests that people's values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions "values" in the first place is misleading.

My preferred framing would rather be that people's values - in the sense of some deeper set of rewards which the underlying machinery is optimizing for - are in fact underdetermined, but that is not what's going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of a consideration. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when it was choosing its initial strategy, causing different strategic choices.

Stuart's research agenda does talk about incorporating meta-preferences, but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions "I want to be more generous" and "I want to have consistent preferences" as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like "the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes" and "the learning system has formulated consistency as a generally valuable heuristic and one which affirms the 'logical thinker' identity, which in turn is being optimized because of its predicted social outcomes".

My confusion about a better theory of values

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

I'm not sure and am mostly confused about it, so I will just babble & boggle at my confusion for a while, in the hope that it helps.

The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.

For example, the NEDERA model claims that there exist nine negative core feelings whose avoidance humans are optimizing for: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:

For example, working with people on self-image problems, I've found that there appear to be only three critical "flavors" of self-judgment that create life-long low self-esteem in some area, and associated compulsive or avoidant behaviors:
Belief that one is bad, defective, or malicious (i.e. lacking in care/altruism for friends or family)
Belief that one is foolish, incapable, incompetent, unworthy, etc. (i.e. lacking in ability to learn/improve/perform)
Belief that one is selfish, irresponsible, careless, etc. (i.e. not respecting what the family or community values or believes important)
(Notice that these are things that, if you were bad enough at them in the ancestral environment, or if people only thought you were, you would lose reproductive opportunities and/or your life due to ostracism. So it's reasonable to assume that we have wiring biased to treat these as high-priority long-term drivers of compensatory signaling behavior.)
Anyway, when somebody gets taught that some behavior (e.g. showing off, not working hard, forgetting things) equates to one of these morality-like judgments as a persistent quality of themselves, they often develop a compulsive need to prove otherwise, which makes them choose their goals, not based on the goal's actual utility to themself or others, but rather based on the goal's perceived value as a means of virtue-signalling. (Which then leads to a pattern of continually trying to achieve similar goals and either failing, or feeling as though the goal was unsatisfactory despite succeeding at it.)

So - assuming for the sake of argument that these findings are correct - one might think something like "okay, here are the things the brain is trying to avoid, we can take those as the basic human values".

But not so fast. After all, emotions are all computed in the brain, so "avoidance of these emotions" can't be the only goal any more than "optimizing happiness" can. It would only lead to wireheading.

Furthermore, it seems like one of the things that the underlying machinery also learns is the situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot-and-stick scheme, in which the system comes to predict that feeling persistently bad will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.

Also, we are not only trying to avoid negative feelings. Empirically, it doesn't look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere near linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.

So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:

Autonomy: people have a need to feel that they are the masters of their own destiny and that they have at least some control over their lives; most importantly, people have a need to feel that they are in control of their own behavior.
Competence: another need concerns our achievements, knowledge, and skills; people have a need to build their competence and develop mastery over tasks that are important to them.
Relatedness (also called Connection): people need to have a sense of belonging and connectedness with others; each of us needs other people to some degree.

So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it's unpleasant). But this too risks running into the wireheading issue.

A problem here is that while it might make intuitive sense to say "okay, if the character's values aren't the real values, let's use the player's values instead", the split isn't actually anywhere near that clean. In a sense the player's values are the real ones - but there's also a sense in which the player doesn't have anything that we could call values. It's just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like "existential risk" or "avoiding wireheading" or necessarily even "personal survival" - thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.

For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player's values seem like they could be even less defined and even more mutable. It's not clear to me that we can call them values in the first place, either - any more than it makes meaningful sense to say that a neuron in the brain "values" firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.

To the extent that there exists something that intuitively resembles what we call "human values", it feels like it exists in some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be very clearly distinguished from each other in the first place.

Or something. I'm confused.

Comments

I definitely agree that the player vs character distinction is meaningful, although I would define it a bit differently.

I would identify it with cortical vs subcortical, a.k.a. neocortex vs everything else. (...with the usual footnotes, e.g. the hippocampus counts as "cortical" :-D)

(ETA: See my later post Inner alignment on the brain for a better discussion of some of the below.)

The cortical system basically solves the following problem:

Here is (1) a bunch of sensory & other input data, in the form of spatiotemporal patterns of spikes on input neurons, (2) occasional labels about what's going on right now (e.g. "something good / bad / important is happening"), (3) a bunch of outgoing neurons. Your task is to build a predictive model of the inputs, and use that to choose signals to send into the outgoing neurons, to make more good things happen.

The result is our understanding of the world, our consciousness, imagination, memory, etc. Anything we do that requires understanding the world is done by the cortical system. This is your "character".

The subcortical system is responsible for everything else your brain does to survive, one of which is providing the "labels" mentioned above (that something good / bad / important / whatever is happening right now).

For example, take the fear-of-spiders instinct. If there is a black scuttling blob in your visual field, there's a subcortical vision system (in the superior colliculus) that pattern-matches that moving blob to a genetically-coded template, and thus activates a "Scary!!" flag. The cortical system sees the flag, sees the spider, and thus learns that spiders are scary, and it can plan intelligent actions to avoid spiders in the future.

I have a lot of thoughts on how to describe these two systems at a computational level, including what the neocortex is doing, and especially how the cortical and subcortical systems exchange information. I am hoping to write lots more posts with more details about the latter, especially about emotions.

even the reward and optimization mechanisms themselves may end up getting at least partially rewritten.

Well, there is such a thing as subcortical learning, particularly for things like fine-tuning motor control programs in the midbrain and cerebellum, but I think most or all of the "interesting" learning happens in the cortical system, not subcortical.

In particular, I'm not really expecting the core emotion-control algorithms to be editable by learning or thinking (if we draw an appropriately tight boundary around them).

More specifically: somewhere in the brain is an algorithm that takes a bunch of inputs and calculates "How guilty / angry / happy / smug / etc. should I feel right now?" The inputs to this algorithm come from various places, including from the body (e.g. pain, hunger, hormone levels), and from the cortex (what emotions am I expecting or imagining or remembering?), and from other emotion circuits (e.g. some emotions inhibit or reinforce each other). The inputs to the emotion calculation can certainly change, but I don't expect that the emotion calculation itself changes over time.

It feels like emotion-control calculations can change, because the cortex can be a really dominant input to those calculations, and the cortex really can change, including by conscious effort. Why is the cortex such a dominant input? Think about it: the emotion-calculation circuits don't know whether I'm likely to eat tomorrow, or whether I'm in debt, or whether Alice stole my cookie, or whether I just got promoted. That information is all in the cortex! The emotion circuits get only tiny glimpses of what's going on in the world, particularly through the cortex predicting & imagining emotions, including in empathetic simulation of others' emotions. If the cortex is predicting fear, well, the amygdala obliges by creating actual fear, and then the cortex sees that and concludes that its prediction was right all along! There's very little "ground truth" that the emotion circuits have to go on. Thus, there's a wide space of self-reinforcing habits of thought. It's a terrible system! Totally under-determined. Thus we get self-destructive habits of thought that linger on for decades.
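The self-reinforcing loop described here can be caricatured numerically. A minimal sketch, where the weighting and all numbers are pure assumptions of mine, chosen only to exhibit the dynamic:

```python
def emotion_circuit(cortex_prediction, ground_truth, w_cortex=0.95):
    """Fear actually generated: dominated by the cortical prediction,
    with only a small ground-truth ("real threat") term."""
    return w_cortex * cortex_prediction + (1 - w_cortex) * ground_truth

def run_loop(initial_prediction, ground_truth, steps):
    prediction = initial_prediction
    for _ in range(steps):
        felt = emotion_circuit(prediction, ground_truth)
        prediction = felt  # cortex: "my prediction was right all along"
    return prediction

# With zero real threat, a high initial fear prediction decays only very
# slowly: the habit of thought is nearly self-sustaining.
print(round(run_loop(initial_prediction=1.0, ground_truth=0.0, steps=10), 3))  # → 0.599
```

The closer the cortex-weight is to 1 (i.e. the less ground truth the emotion circuit gets), the more nearly the loop becomes a pure fixed point, which is one way to read "totally under-determined" above.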

Anyway, I have this long-term vision of writing down the exact algorithm that each of the emotion-control circuits is implementing. I think AGI programmers might find those algorithms helpful, and so might people trying to pin down "human values". I have a long way to go in that quest :-D

there's also a sense in which the player doesn't have anything that we could call values ...

I basically agree; I would describe it by saying that the subcortical systems are kinda dumb. Sure, the superior colliculus can recognize scuttling spiders, and the emotion circuits can "dislike" pain. But any sophisticated concept like "flourishing", "fairness", "virtue", etc. can only be represented in the form of something like "Neocortex World Model Entity ID #30962758", and these things cannot have any built-in relationship to subcortical circuits.

So the player's "values" are going to be (1) simple things like "less pain is good", and (2) things that don't have an obvious relation to the outside world, like complicated "preferences" over the emotions inside our empathetic simulations of other people.

If a "purely character-level" model of human values is wrong, how do we incorporate the player level?

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

But either way, I'm all for trying to get a better understanding of how I (the character / cortical system) am "built" by the player / subcortical system. :-)

Great comment, thanks!

Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P

Well, to make up a silly example, let's suppose that you have a conscious belief that you want there to be as much cheesecake as possible. This is because you are feeling generally unsafe, and a part of your brain has associated cheesecakes with a feeling of safety, so it has formed the unconscious prediction that if only there was enough cheesecake, then you would finally feel good and safe.

So you program the AI to extract your character-level values, it correctly notices that you want to have lots of cheesecake, and goes on to fill the world with cheesecake... only for you to realize that now that you have your world full of cheesecake, you still don't feel as happy as you were on some level expecting to feel, and all of your elaborate rational theories of how cheesecake is the optimal use of atoms start feeling somehow hollow.

There is a mismatch in saying cortex=character and subcortex=player.

If I understand the player-character model right, then unconscious coping strategies would be player-level tactics. But these are learned behaviours, and would therefore be part of the cortex.

In Kaj's example, the idea that cheesecake will make the bad go away exists in the cortex's world model.

According to Steven's model of how the brain works (which I think is probably true), the subcortex is part of the game the player is playing. Specifically, the subcortex provides the reward signal, and some other important game stats (stamina level, hit points, etc.). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evolution) thinks might be useful, plus occasional cut scenes (acting out pre-programmed behaviour).

ML comparison:
* The character is the pretrained neural net
* The player is the backprop
* The cortex is the neural net plus backprop
* The subcortex is the reward signal and sometimes a supervisory signal
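A hedged toy rendering of this comparison (every function name and number below is illustrative, not from either model): the "character" is the current parameters, the "player" is the update rule, and the "subcortex" only supplies an error signal without ever seeing or editing the learned model directly.

```python
def subcortex_reward_signal(action, target=3.0):
    """Dumb hardwired critic: reports negative squared error against a
    fixed innate target, knowing nothing about the cortex's internals."""
    return -(action - target) ** 2

def player_update(weight, obs, lr=0.1):
    """The 'player' (backprop stand-in): nudge the parameter in the
    direction that increases subcortex_reward_signal.  The expression is
    the analytic gradient of that reward for action = weight * obs."""
    grad = -2 * (weight * obs - 3.0) * obs
    return weight + lr * grad

# Character: a one-parameter "policy", action = weight * observation.
weight = 0.0
obs = 1.0
for _ in range(100):
    weight = player_update(weight, obs)

# The character ends up "valuing" actions near the subcortex's innate
# target, though the reward circuit never represented the policy itself.
print(round(weight, 2))  # → 3.0
```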

Also, I don't like the player-character model much. Like all models it is at best a simplification, and it does catch some of what is going on, but I think it is more wrong than right; something like a multi-agent model seems much better to me. I.e. there are coping mechanisms and other less conscious strategies living in your brain side by side with who you think you are. But I don't think these are completely invisible the way the player is invisible to the character. They are predictive models (e.g. "cheesecake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.