Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed!
This is importantly technically false in a way that should not be forgotten on pain of planetary extinction:
The outer loss function training the rat genome was strictly inclusive genetic fitness. The rats ended up with zero internal concept of inclusive genetic fitness, and indeed, no coherent utility function; and instead ended up with complicated internal machinery running off of millions of humanly illegible neural activations; whose properties included attaching positive motivational valence to imagined states that the rat had never experienced before, but which shared a regularity with states experienced by past rats during the "training" phase.
A human, who works quite similarly to the rat due to common ancestry, may find it natural to think of this as a very simple 'goal'; because things similar to us appear to have falsely low algorithmic complexity when we model them by empathy; because the empathy can model them using short codes. A human may imagine that natural selection successfully created rats with a simple salt-balance term in their simple generalization of a utility function, simply by natural-selection-training them on environmental scenarios with salt deficits and simple loss-function penalties for not balancing the salt deficits, which were then straightforwardly encoded into equally simple concepts in the rat.
This isn't what actually happened. Natural selection applied a very simple loss function of 'inclusive genetic fitness'. It ended up as much more complicated internal machinery in the rat that made zero mention of the far more compact concept behind the original loss function. You share the complicated machinery so it looks simpler to you than it should be, and you find the results sympathetic so they seem like natural outcomes to you. But from the standpoint of natural-selection-the-programmer the results were bizarre, and involved huge inner divergences and huge inner novel complexity relative to the outer optimization pressures.
Thanks for your comment! I think that you're implicitly relying on a different flavor of "inner alignment" than the one I have in mind.
(And confusingly, the brain can be described using either version of "inner alignment"! With different resulting mental pictures in the two cases!!)
See my post Mesa-Optimizers vs "Steered Optimizers" for details on those two flavors of inner alignment.
I'll summarize here for convenience.
I think you're imagining that the AGI programmer will set up SGD (or equivalent) and the thing SGD does is analogous to evolution acting on the entire brain. In that case I would agree with your perspective.
I'm imagining something different:
The first background step is: I argue (for example here) that one part of the brain (I'm calling it the "neocortex subsystem") is effectively implementing a relatively simple, quasi-general-purpose learning-and-acting algorithm. This algorithm is capable of foresighted goal seeking, predictive-world-model-building, inventing new concepts, etc. etc. I don't think this algorithm looks much like a deep neural net trained by SGD; I think it looks like a learning algorithm that no human has invented yet, one which is more closely related to learning probabilistic graphical models than to deep neural nets. So that's one part of the brain, comprising maybe 80% of the weight of a human brain. Then there are other parts of the brain (brainstem, amygdala, etc.) that are not part of this subsystem. Instead, one thing they do is run calculations and interact with the "neocortex subsystem" in a way that tends to "steer" that "neocortex subsystem" towards behaving in ways that are evolutionarily adaptive. I think there are many different "steering" brain circuits, and they are designed to steer the neocortex subsystem towards seeking a variety of goals using a variety of mechanisms.
So that's the background picture in my head—and if you don't buy into that, nothing else I say will make sense.
Then the second step is: I'm imagining that the AGI programmers will build a learning-and-acting algorithm that resembles the "neocortex subsystem"'s learning-and-acting algorithm—not by some blind search over algorithm space, but by directly programming it, just as people have directly programmed AlphaGo and many other learning algorithms in the past. (They will do this either by studying how the neocortex works, or by reinventing the same ideas.) Once those programmers succeed, then OK, now these programmers will have in their hands a powerful quasi-general-purpose learning-and-acting (and foresighted goal-seeking, concept-inventing, etc.) algorithm. And then the programmer will be in a position analogous to the position of the genes wiring up those other brain modules (brainstem, etc.): the programmers will be writing code that tries to get this neocortex-like algorithm to do the things they want it to do. Let's say the code they write is a "steering subsystem".
The simplest possible "steering subsystem" is ridiculously simple and obvious: just a reward calculator that sends rewards for the exact thing that the programmer wants it to do. (Note: the "neocortex subsystem" algorithm has an input for reward signals, and these signals are involved in creating and modifying its internal goals.) And if the programmer unthinkingly built that kind of simple "steering subsystem", it would kinda work, but not reliably, for the usual reasons like ambiguity in extrapolating out-of-distribution, the neocortex-like algorithm sabotaging the steering subsystem, etc. But, we can hope that there are more complicated possible "steering subsystems" that would work better.
So then this article is part of a research program of trying to understand the space of possibilities for the "steering subsystem", and figuring out which of them (if any!!) would work well enough to keep arbitrarily powerful AGIs (of this basic architecture) doing what we want them to do.
Finally, if you can load that whole perspective into your head, I think from that perspective it's appropriate to say "the genome is “trying” to install that goal in the rat’s brain", just as you can say that a particular gene is "trying" to do a certain step of assembling a white blood cell or whatever. (The "trying" is metaphorical but sometimes helpful.) I suppose I should have said "rat's neocortex subsystem" instead of "rat's brain". Sorry about that.
Does that help? Sorry if I'm misunderstanding you. :-)
I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:
- Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.
- Planner concludes that any sequence of actions involving pressing lever will result in salt water spray. In a non salt-deprived state this has negative reward, so the agent avoids it.
- Once the agent becomes salt deprived, the planner will conclude this has positive reward, and so take that action.
I do agree that a typical model-free RL algorithm is not capable of doing this directly (it could perhaps meta-learn a policy with memory that can solve this).
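The toy task above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `learned_dynamics`, `true_reward`, and `plan` are hypothetical, the reward values are made up, and the planner is assumed to be able to call the ground-truth reward function on imagined states, including a salt-deprived state it has never visited.

```python
from itertools import product

# Hypothetical toy setup; every name and number here is an illustrative assumption.
ACTIONS = ("noop", "press_lever")

def learned_dynamics(action):
    """Transition model learned by exploring: pressing the lever => salt spray."""
    return action == "press_lever"  # True means salt water is sprayed

def true_reward(salt_deprived, salt_water):
    """Ground-truth reward, ASSUMED to be queryable in arbitrary imagined states."""
    if not salt_water:
        return 0.0
    return 1.0 if salt_deprived else -1.0

def plan(salt_deprived, horizon=3):
    """Score every action sequence in the model and return the best first action."""
    best_sequence = max(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: sum(
            true_reward(salt_deprived, learned_dynamics(a)) for a in seq
        ),
    )
    return best_sequence[0]

print(plan(salt_deprived=False))
print(plan(salt_deprived=True))
```

The behavior flips as soon as `salt_deprived` changes, but only because `plan` is allowed to evaluate `true_reward` directly rather than estimating it from past experience.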
Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see.
In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it does that, all it has to go on is the observations of what the reward function has output in the past. (Related discussion.)
So yeah, in the salt-deprived state, the reward function has changed. But how does the planner know that? It hasn't seen the salt-deprived state before. Presumably if you built such a planner, it would go in with a default assumption of "the salt-deprivation state is different now than I've ever seen before, so I'll just assume that that doesn't affect the reward function!" Or at best, its default assumption would be "the salt-deprivation state is different now than I've ever seen before, and I don't know how and whether that impacts the reward function. I should increase my uncertainty. Maybe explore more." In this experiment the rats did neither of those; instead they acted like "the salt-deprivation state is different than I've ever seen, and I specifically know that, in this new state, very salty things are now very rewarding". They were not behaving as if they were newly uncertain about the reward consequences of the lever; they were absolutely gung-ho about pressing it.
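To make the contrast concrete, here is a minimal sketch of a planner that only has a reward model fitted to past observations. Everything in it is an assumption for illustration: the class `LearnedRewardModel`, the fallback rule (treat an unseen deprivation state as if it did not matter), and the reward values.

```python
class LearnedRewardModel:
    """Tabular reward estimate built only from rewards observed in the past."""

    def __init__(self):
        self._sums = {}
        self._counts = {}

    def observe(self, salt_deprived, salt_water, reward):
        key = (salt_deprived, salt_water)
        self._sums[key] = self._sums.get(key, 0.0) + reward
        self._counts[key] = self._counts.get(key, 0) + 1

    def _mean(self, key):
        n = self._counts.get(key, 0)
        return self._sums[key] / n if n else None

    def predict(self, salt_deprived, salt_water):
        seen = self._mean((salt_deprived, salt_water))
        if seen is not None:
            return seen
        # ASSUMED default: an unseen deprivation state does not change the reward.
        fallback = self._mean((not salt_deprived, salt_water))
        return fallback if fallback is not None else 0.0

# Training experience: the agent was never salt-deprived.
model = LearnedRewardModel()
for _ in range(5):
    model.observe(salt_deprived=False, salt_water=True, reward=-1.0)
    model.observe(salt_deprived=False, salt_water=False, reward=0.0)

def choose(salt_deprived):
    press = model.predict(salt_deprived, salt_water=True)
    noop = model.predict(salt_deprived, salt_water=False)
    return "press_lever" if press > noop else "noop"

print(choose(salt_deprived=True))
```

With this default, the newly salt-deprived agent still avoids the lever, which is the opposite of what the rats did.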
Sorry if I'm misunderstanding :-)
Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.
Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control): sample random actions, "rollout" the trajectories in the model, pick the trajectory that had the highest return and then take the first action from that trajectory, then replan. That said, assumptions vary. [iLQR](https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator) makes the stronger assumption that reward is quadratic and differentiable.
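The random-shooting recipe described above (sample actions, roll out in the model, keep the best trajectory, take its first action, replan) might look roughly like this. It is a sketch, not any particular library's implementation; `mpc_action`, `step`, and `closeness` are hypothetical names, and the 1-D toy task is made up. Note that it assumes `reward` can be called on imagined states.

```python
import random

def mpc_action(state, dynamics, reward, actions, horizon=5, n_samples=200, rng=None):
    """Random-shooting MPC: return the first action of the best sampled rollout."""
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_samples):
        sequence = [rng.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in sequence:
            s = dynamics(s, a)   # roll the model forward
            total += reward(s)   # reward is evaluated on imagined states
        if total > best_return:
            best_return, best_first_action = total, sequence[0]
    return best_first_action

# Hypothetical 1-D task: move along a line toward position 3.
def step(s, a):
    return s + a

def closeness(s):
    return -abs(s - 3)

print(mpc_action(0, step, closeness, actions=(-1, 0, 1)))
```

In a control loop you would call `mpc_action` again after every real step, which is the "replan" part.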
I think methods based on [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) might exhibit something like the problem you discuss. Since they sample actions from a policy trained to maximize reward, they might end up not exploring enough in this novel state if the policy is very confident it should not drink the salt water. That said, they typically include explicit methods for exploration like [UCB](https://en.wikipedia.org/wiki/Thompson_sampling#Upper-Confidence-Bound_(UCB)_algorithms) which should mitigate this.
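For reference, a UCB1-style exploration bonus can be sketched as below. This is the plain bandit rule, not the exact variant any specific tree-search system uses; `ucb1_choice` and the numbers in the example are assumptions for illustration.

```python
import math

def ucb1_choice(counts, mean_rewards, c=math.sqrt(2)):
    """Pick the arm with the highest mean reward plus exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm at least once
    total = sum(counts)
    scores = [
        mean + c * math.sqrt(math.log(total) / n)
        for mean, n in zip(mean_rewards, counts)
    ]
    return max(range(len(counts)), key=scores.__getitem__)

# Arm 0 = "noop" (tried often, mean 0); arm 1 = "drink salt water" (tried once, mean -1).
print(ucb1_choice(counts=[100, 1], mean_rewards=[0.0, -1.0]))
```

The bonus shrinks as an arm is tried more often, so a rarely tried action gets revisited even when its estimated reward is poor. That helps with under-exploration, but it is still exploration rather than knowing in advance that the action has become rewarding.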
Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states.
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can't—it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I'm not intimately familiar with the deep RL literature, I wouldn't know what's typical and I'll take your word for it, but it does seem that both possibilities are out there.
Anyway, I don't think the neocortex can evaluate the true reward function in arbitrary states, because it's not a neat mathematical function, it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards—and that's the whole thing with TD learning and dopamine.
I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.
I don't see how you solve this problem in general in a sample-efficient manner otherwise.
Well, the rats are trying to do the rewarding thing after zero samples, so I don't think "sample-efficiency" is quite the right framing.
In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here.
Well, I guess you could say that this is still a "normal MDP", but where "having thoughts" and "having ideas" etc. are part of the state / action space. But anyway, I think that's a bit different than how most ML people would normally think about things.
I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You're right that in this setup, as the actions are being chosen by the (model-free RL) policy, you don't get any zero-shot generalization.
I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.
Thanks for updating the post to clarify this point -- I agree with you with the new wording.
In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here.
Yes indeed, your proposal is quite different from RL. The closest I can think of to rewards over "thoughts" in ML would be regularization terms that take into account weights or, occasionally, activations -- but that's very crude compared to what you're proposing.
This might just be me not grokking predictive processing, but...
I feel like I do a version of the rat's task all the time to decide what to have for dinner—I imagine different food options, feel which one seems most appetizing, and then push the button (on Seamless) that will make that food appear.
Introspectively, this feels to me like there's such a thing as 'hypothetical reward'. When I imagine a particular food, I feel like I get a signal from... somewhere... that tells me whether I would feel reward if I ate that food, but does not itself constitute reward. I don't generally feel any desire to spend time fantasizing about the food I'm waiting for.
To turn this into a brain model, this seems like the neocortex calling an API the subcortex exposes. Roughly, the neocortex can give the subcortex hypothetical sensory data and get a hypothetical reward in exchange. I suppose this is basically hypothesis two with a modification to avoid the pitfall you identify, although that's not how I arrived at the idea.
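As a rough picture of that "API" idea, here is a sketch in which the neocortex passes imagined sensory data to the subcortex and gets back a hypothetical reward on a separate channel. The class `Subcortex`, the method `hypothetical_reward`, and the reward values are all invented for illustration; nothing here is a claim about real neural circuitry.

```python
from dataclasses import dataclass

@dataclass
class Subcortex:
    salt_deprived: bool = False

    def hypothetical_reward(self, imagined_taste: str) -> float:
        """Second channel: 'how rewarding would this be right now?' (not reward itself)."""
        if imagined_taste == "salt":
            return 1.0 if self.salt_deprived else -1.0
        return 0.0

def neocortex_choose(subcortex, plans):
    """Imagine the outcome of each plan, query the subcortex, pick the most appetizing."""
    return max(plans, key=lambda plan: subcortex.hypothetical_reward(plans[plan]))

# Each plan maps to the taste the neocortex imagines it leading to.
plans = {"do_nothing": "nothing", "press_lever": "salt"}

print(neocortex_choose(Subcortex(salt_deprived=False), plans))
print(neocortex_choose(Subcortex(salt_deprived=True), plans))
```

The query depends on the current physiological state held by the subcortex, so the answer can change the first time the animal is salt-deprived, with no new learning in the neocortex.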
This does require a second dimension of subcortex-to-neocortex signal alongside the reward. Is there a reason to think there isn't one?
I don't generally feel any desire to spend time fantasizing about the food I'm waiting for.
Haha, yeah, there's a song about that.
So anyway, I think you're onto something, and I think that something is that "reward" and "reward prediction" are two distinct concepts, but they're all jumbled up in my mind, and therefore presumably also jumbled up in my writings. I've been vaguely aware of this for a while, but thanks for calling me out on it, I should clean up my act. So I'm thinking out loud here, bear with me, and I'm happy for any help. :-)
The TD learning algorithm is:
$$\text{RPE} = r + V(s_\text{new}) - V(s_\text{prev}), \qquad V(s_\text{prev}) \leftarrow V(s_\text{prev}) + (\text{learning rate}) \cdot \text{RPE}$$
where $s_\text{prev}$ is the previous state, $s_\text{new}$ is the new state, V is the value function a.k.a. reward prediction, and r is the reward from this step.
(I'm ignoring discounting. BTW, I don't think the brain literally does TD learning in the exact form that computer scientists do, but I think it's close enough to get the right idea.)
So let's go through two scenarios.
Scenario A: I'm going to eat candy, anticipating a large reward (high $V(s_\text{prev})$). I eat the candy (high r) then don't anticipate any reward after that (low $V(s_\text{new})$). RPE=0 here. That's just what I expected. V went down, but it went down in lock-step with the arrival of the reward r.
Scenario B: I'm going to eat candy, anticipating a large reward (high $V(s_\text{prev})$). Then I see that we're out of candy! So I get no reward and have nothing to look forward to ($r = 0$, low $V(s_\text{new})$). Now this is a negative (bad) RPE! Subjectively, this feels like crushing disappointment. The TD learning rule kicks in here, so that next time when I go to eat candy, I won't be expecting as much reward as I did this time (lower $V(s_\text{prev})$ than before), because I will be preemptively braced for the possibility that we'll be out of candy.
OK, makes sense so far.
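The two scenarios can be checked numerically with an undiscounted TD rule, writing the reward-prediction error as r plus the new state's value minus the previous state's value. The function names, the learning rate of 0.5, and the specific values are assumptions chosen only to make the arithmetic easy to follow.

```python
def rpe(r, v_prev, v_new):
    """Undiscounted reward-prediction error: r + V(new state) - V(previous state)."""
    return r + v_new - v_prev

def td_update(v_prev, error, learning_rate=0.5):
    """Nudge the previous state's value toward what actually happened."""
    return v_prev + learning_rate * error

# Scenario A: expect candy (V=1), eat it (r=1), nothing more expected afterwards (V=0).
error_a = rpe(r=1.0, v_prev=1.0, v_new=0.0)

# Scenario B: expect candy (V=1), but the candy is gone (r=0, V=0 afterwards).
error_b = rpe(r=0.0, v_prev=1.0, v_new=0.0)
v_next_time = td_update(v_prev=1.0, error=error_b)

print(error_a, error_b, v_next_time)
```

In Scenario A the error is zero and nothing is learned; in Scenario B the error is negative and the value attached to "about to eat candy" is reduced for next time.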
Interestingly, the reward r, as such, barely matters here! It's not decision-relevant, right? Good actions can be determined entirely by the following rule:
Each step, do whatever maximizes RPE.
(right?)
Or subjectively, thoughts and sensory inputs with positive RPE are attractive, while thoughts and sensory inputs with negative RPE are aversive.
OK, so when the rat first considers the possibility that it's going to eat salt, it gets a big injection of positive RPE. It now (implicitly) expects a large upcoming reward. Let's say for the sake of argument that it decides to not eat the salt, and go do something else. Well now we're not expecting to eat the salt, whereas previously we were, so that's a big injection of negative RPE. So basically, once it gets the idea that it can eat salt, it's very aversive (negative RPE) to drop that idea, without actually consummating it (by eating the salt and getting the anticipated reward r).
Back to your food example, you go in with some baseline expectation for what dinner's going to be like. Then you invoke the idea "I'm going to eat yam". You get a negative RPE in response. OK, go back to the baseline plan then. You get a compensatory positive RPE. Then you invoke the idea "I'm going to eat beans". You get a positive RPE. Alright! You think about it some more. Oh, I can't have beans tonight, I don't have any. You drop the idea and suffer a negative RPE. That's aversive, but you're stuck. Then you invoke another idea "I'm going to eat porridge". Positive RPE! As you flesh out the plan, it becomes more confident, which activates the model more strongly; the idea in your head of having porridge becomes more vivid, so to speak. Each increment of increasing confidence that you're going to eat porridge is rewarded by a corresponding spurt of RPE. Then you eat the porridge. Back to a low reward prediction, but there's a reward at the same time, so that's fine, there's no RPE.
Let's go to fantasizing in general. Let's say you get the idea that a wad of cash has magically appeared in your wallet. That idea is attractive (positive RPE). But sooner or later you're going to actually look in the wallet and find that there's no wad of cash (negative RPE). The negative RPE triggers the TD learning rule such that next time "the idea that a wad of cash has magically appeared in your wallet" will not be such an attractive idea; it will be tinged with a negative memory of it failing to happen. Of course, you could go the other way and try to avoid the negative RPE by clinging to the original story: don't look in your wallet, or if you see that the cash isn't there, think "guess I must have deposited it in the bank already", etc. This is unhealthy but certainly a known human foible. For example, as of this writing, in the USA, each of the two major presidential candidates has millions of followers who believe that their preferred candidate will be president for the next four years. It's painful to let go of an idea that something good is going to happen, so you resist if at all possible. Luckily the brain has some defense systems against wishful thinking. For example, you can't avoid expecting something to happen if you've directly experienced it multiple times. See here. Another is: if you do eventually come back to earth, and the negative RPE finally does happen, then TD learning kicks in, and all the ideas and strategies that contributed to your resisting the truth until now get tarred with a reduction in associated RPE, which makes them less likely to be used next time.
Hmm, so maybe I had it right in the diagram here: I had the neocortex sending reward predictions to the subcortex, and the subcortex sending back RPEs to the neocortex. So if the neocortex sends a high reward prediction, then a low reward prediction, that might or might not be a RPE, depending on whether you just ate candy in between. Here, the subcortex sends a positive RPE when the neocortex starts imagining tasting salt, and sends a negative RPE when it stops imagining salt (unless it actually ate the salt at that moment). And if the salt imagination / expectation signal gets suddenly stronger, it sends a positive RPE for the difference, and so on.
(I could make a better diagram by pulling a "basal ganglia" box out of the neocortex subsystem into a separate box in the diagram. My understanding, definitely oversimplified, is that the basal ganglia has a dense web of connections across the (frontal lobe of the) neocortex, and just memorizes reward predictions associated with different arbitrary neocortical patterns. And it also suppresses patterns that lead to lower reward predictions and amplifies patterns that lead to higher reward predictions. So in the diagram, the neocortex would send "information" to the basal ganglia, the basal ganglia calculates a reward prediction and sends it to the subcortex, and the subcortex sends the RPE to the basal ganglia (to alter the reward predictions) and to the neocortex (to reinforce or weaken the associated patterns). Something like that...).
Does that make sense? Sorry this is so long. Happy for any thoughts if you've read this far.
Another update: Actually maybe it's simpler (and equivalent) to say the subcortex gives a reward proportional to the time-derivative of how strongly the salt-expectation signal is activated.
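A discrete-time version of that "time-derivative" idea is easy to write down. This sketch is only illustrative: `subcortex_signals`, the gain, and the sign flip for the non-deprived case (salt expectation treated as aversive) are assumptions, and it leaves out the case where the salt is actually eaten.

```python
def subcortex_signals(expectation_trace, salt_deprived, gain=1.0):
    """Signal at each step is proportional to the change in the salt-expectation signal."""
    # ASSUMPTION: expecting salt is good when deprived and bad otherwise.
    sign = 1.0 if salt_deprived else -1.0
    signals = []
    previous = expectation_trace[0]
    for current in expectation_trace[1:]:
        signals.append(sign * gain * (current - previous))
        previous = current
    return signals

# The rat starts imagining salt, grows more confident, then drops the idea.
trace = [0.0, 0.5, 1.0, 0.0]
print(subcortex_signals(trace, salt_deprived=True))
print(subcortex_signals(trace, salt_deprived=False))
```

In the deprived case, each increase in expectation yields a positive signal and abandoning the idea yields a matching negative one, which lines up with the account of why dropping the plan is aversive.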
I really liked this post. Not used to thinking about brain algorithms, but I believe I followed most of your points.
That being said, I'm not sure I get how your hypotheses explain the actual behavior of the rats. Just looking at hypothesis 3, you posit that thinking about salt gets an improved reward, and so do actions that make the rat expect salt-tasting. But that doesn't remove the need for exploration! The neocortex still needs to choose a course of action before getting a reward. Actually, if thinking about salt is rewarded anyway, this might reinforce any behavior decided after thinking about salt. And if the interpretability is better and only rewards actions that are expected to result in tasting salt, there is still a need for exploring to find such a plan and having it reinforced.
Am I getting something wrong?
You're right. "Thinking about salt is rewarded anyway" doesn't make sense and isn't right. You're one of two people to call me out on it, and I just posted a long comment replying to the other here. Thank you!! I just added a correction to the article:
(UPDATE: Commenters point out that this description isn't quite right—it doesn't make sense to say that the idea of tasting salt is rewarding per se. Rather, when the rat starts expecting to taste salt, the subcortex sends a positive reward-prediction-error signal, and conversely if the rat stops expecting to taste salt, the subcortex sends a negative reward-prediction-error signal. Something like that. Sorry for the mistake / confusion. Thanks commenters!)
Does that answer your question?
It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated "interpretability" module that tries to make sense of the AGI's latent variables by correlating them with some other labeled properties of the AGI's inputs, and then rewarding the AGI for "thinking about the right things" (according to the interpretability module's output), which in turn helps turn those thoughts into the AGI's goals.
(Is this a good design idea that AGI programmers should adopt? I don't know, but I find it interesting, and at least worthy of further thought. I don't recall coming across this idea before in the context of inner alignment.)
Fwiw, I think this is basically a form of relaxed adversarial training, which is my favored solution for inner alignment.
FYI: In the original version I wrote that the saltwater spray happens when the rats press the lever. It turns out I misunderstood. During training, the saltwater spray happens immediately after the lever appears, whether or not the rats touch the lever. Still, the rats act as if the lever deserves the credit / blame for causing the spray, by nibbling the lever when they like the spray (e.g. with the sugar-water control lever), or staying away from the lever if they don't. Thanks Kent Berridge for the correction. I've made a few other corrections since posting this but I think I marked all the other ones in the text.
Do you posit that it learns over the course of its life that salt taste cures salt deficiency, or do you allow this information to be encoded in the genome?
I think the circuitry which monitors salt homeostasis, and which sends a reward signal when both salt is deficient and the neocortex starts imagining the taste of salt ... I think that circuitry is innate and in the genome. I don't think it's learned.
That's just a guess from first principles: there's no reason it couldn't be innate, it's not that complicated, and it's very important to get the behavior right. (Trial-and-error learning of salt homeostasis would, I imagine, be often fatal.)
I do think there are some non-neocortex aspects of food-related behavior that are learned—I'm thinking especially of how, if you eat food X, then get sick a few hours later, you develop a revulsion to the taste of food X. That's obviously learned, and I really don't think that this learning is happening in the neocortex. It's too specific. It's only one specific type of association, it has to occur in a specific time-window, etc.
But I suspect that the subcortical systems governing salt homeostasis in particular are entirely innate (or at least mostly innate), i.e. not involving learning.
Does that answer your question? Sorry if I'm misunderstanding.
This all sounds reasonable. I just saw that you were arguing for more being learned at runtime (as some sort of Steven Reknip), and I thought that surely not all the salt machinery can be learnt, and I wanted to see which of those expectations would win.
(Oh, I get it, Reknip is Pinker backwards.) If you're interested in my take more generally see My Computational Framework for the Brain. :-)
Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! And remarkably, that goal was installed even before that situation was ever encountered!
I don't think this is remarkable. Plenty of human activities work this way, where some goal has been encoded through evolution. For example, heterosexual teenage boys often find teenage girls to be attractive and want to get them naked, even before they have ever managed to do it successfully, without a true conscious understanding of their eventual goals. Or babies know to seek out nipple-shaped objects, before they have ever interacted with a nipple.
Well, the brain does a lot of impressive things :-) We shouldn't be less impressed by any one impressive thing just because there are many other impressive things too.
Anyway I wrote this blog post last year where I went through a list of universal human behaviors and tried to think about how they could work. I've learned more since writing that, and I think I got some of the explanations wrong, but it's still a good starting point.
What about sexual attraction?
Without getting into too much detail, I would say that sexual attraction involves the same "supervised learning" mechanism I talked about here, but with one extra complication: For salt, it's trivial to get ground truth about whether you are tasting salt—you have salt taste buds sending their signals straight into the brain, it's crystal clear. But for sexual attraction, you need an extra computational step to get (approximate) ground truth about whether or not you are interacting with a sexually-attractive (to you) person. (Then that approximate ground truth can be the supervisory signal of the supervised learning algorithm.)
So, where does the "ground truth" for "I am interacting with a sexually-attractive (to me) person" come from? First, I think there are hardwired sight cues, sound cues, smell cues, touch cues, etc. See here for details, particularly my claim that these cues are detected by circuitry in the subcortical sensory processing systems (especially the tectum), not in the neocortex. Second, I'm big into empathetic simulation and think it's central to all social emotions, and I think that it plays a role in all aspects of sexual attraction too, both physical and emotional. That's a bit of a long story, I think.
I think newborns finding their mother's nipple are just going by hardwired smell and touch cues, and hardwired movement routines, I presume in the brainstem. Hardwired stimulus-response, nothing complicated! Well, I don't really know, I'm just guessing.
(See comment here for some corrections and retractions. —Steve, 2022)
Introduction: The Dead Sea Salt Experiment
In this 2014 paper by Mike Robinson and Kent Berridge at University of Michigan (see also this more theoretical follow-up discussion by Berridge and Peter Dayan), rats were raised in an environment where they were well-nourished, and in particular, where they were never salt-deprived—not once in their life. The rats were sometimes put into a test cage with a lever which, when it appeared, was immediately followed by a device spraying ridiculously salty water directly into their mouth. The rats were disgusted and repulsed by the extreme salt taste, and quickly learned to hate the lever—which from their perspective would seem to be somehow causing the saltwater spray. One of the rats went so far as to stay tight against the opposite wall—as far from the lever as possible!
Then the experimenters made the rats feel severely salt-deprived, by depriving them of salt. Haha, just kidding! They made the rats feel severely salt-deprived by injecting the rats with a pair of chemicals that are known to induce the sensation of severe salt-deprivation. Ah, the wonders of modern science!
...And wouldn't you know it, almost instantly upon injection, the rats changed their behavior! When shown the lever (this time without the salt-water spray), they now went right over to that lever and jumped on it and gnawed at it, obviously desperate for that super-salty water.
The end.
Aren't you impressed? Aren’t you floored? You should be!!! I don’t think any standard ML algorithm would be able to do what these rats just did!
Think about it:
So what’s the algorithm here? How did their brains know that this was a good plan? That’s the subject of this post.
What does this have to do with inner alignment? What is inner alignment anyway? Why should we care about any of this?
With apologies to the regulars on this forum who already know all this, the so-called “inner alignment problem” occurs when you, a programmer, build an intelligent, foresighted, goal-seeking agent. You want it to be trying to achieve a certain goal, like maybe “do whatever I, the programmer, want you to do” or something. The inner alignment problem is: how do you ensure that the agent you programmed is actually trying to pursue that goal? (Meanwhile, the “outer alignment problem” is about choosing a good goal in the first place.) The inner alignment problem is obviously an important safety issue, and will become increasingly important as our AI systems get more powerful in the future.
(See my earlier post mesa-optimizers vs “steered optimizers” for specifics about how I frame the inner alignment problem in the context of brain-like algorithms.)
Now, for the rats, there’s an evolutionarily-adaptive goal of "when in a salt-deprived state, try to eat salt". The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! And remarkably, that goal was installed even before that situation was ever encountered! So it’s worth studying this example—perhaps we can learn from it!
Before we get going on that, one more boring but necessary thing:
Aside: Obligatory post-replication-crisis discussion
The Dead Sea salt experiment strikes me as trustworthy. Pretty much all the rats (and for key aspects literally every tested rat) displayed an obvious qualitative behavioral change almost instantaneously upon injection. There were sensible tests with control levers and with control rats. The authors seem to have tested exactly one hypothesis, and it's a hypothesis that was a priori plausible and interesting. And so on. I can't assess every aspect of the experiment, but from what I see, I believe this experiment, and I'm taking its results at face value. Please do comment if you see anything questionable.
Outline of the rest of the post
Next I'll go through my hypothesis for how the rat brain works its magic here. Actually, I've come up with three variants of this hypothesis over the past year or so, and I’ll talk through all of them, in chronological order. Then I’ll speculate briefly on other possible explanations.
My hypothesis for how the rat brain did what it did
The overall story
As I discussed in My Computational Framework for the Brain, my starting-point assumption is that the rat brain has a “neocortex subsystem” (really the neocortex, hippocampus, parts of thalamus and basal ganglia, maybe other things too). The neocortex subsystem takes sensory inputs and reward inputs, builds a predictive model from scratch, and then chooses thoughts and actions that maximize reward. The reward, in turn, is issued by a different subsystem of the brain that I’ll call “subcortex”.
To grossly oversimplify the “neocortex builds a predictive model” part of that, let’s just say for present purposes that the neocortex subsystem memorizes patterns in the inputs, and then patterns in the patterns, and so on.
To grossly oversimplify the “neocortex chooses thoughts and actions that maximize reward” part, let’s just say for present purposes that different parts of the predictive model are associated with different reward predictions, the reward predictions are updated by a TD learning system that has something to do with dopamine and the basal ganglia, and parts of the model that predict higher reward are favored while parts of the model that predict lower reward are pushed out of mind.
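To make the "reward predictions updated by TD learning" part a bit more concrete, here is a minimal tabular TD(0) sketch. It is a toy illustration, not a claim about how the basal ganglia actually implement this: the state names, the learning rate `alpha`, the discount `gamma`, and the helper `favored_thought` are all assumptions made up for the example.

```python
def td0_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """One TD(0) step: move values[state] toward reward + gamma * values[next_state]."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


def favored_thought(values, candidates):
    """Pick the candidate with the highest reward prediction (lower ones are 'pushed out of mind')."""
    return max(candidates, key=lambda name: values[name])


# Hypothetical episode: see the lever, approach it, then taste something rewarding.
# "taste" is treated as terminal, so its own value is never updated and stays 0.
values = {"see_lever": 0.0, "approach": 0.0, "taste": 0.0}
episode = [("see_lever", "approach", 0.0), ("approach", "taste", 1.0)]

for _ in range(200):
    for state, next_state, reward in episode:
        td0_update(values, state, next_state, reward)
```

After enough repetitions, the reward prediction attached to "approach" ends up close to the reward that follows it, and the prediction attached to "see_lever" ends up close to `gamma` times that. Note that this kind of learning needs the rewarding experience to have actually happened, which is exactly what the salt-deprived rats never had.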
Since the “predictive model” part is invoked for the “reward-maximization” part, we can say that the neocortex does model-based RL.
(Aside: It's sometimes claimed in the literature that brains do both model-based and model-free RL. I disagree that this is a fundamental distinction; I think "model-free" = "model-based with a dead-simple model". See my old comment here.)
Why is this important? Because that brings us to imagination! The neocortex can activate parts of the predictive model not just to anticipate what is about to happen, but also to imagine what may happen, and (relatedly) to remember what has happened.
Now we get a crucial ingredient: I hypothesize that the subcortex somehow knows when the neocortex is imagining the taste of salt. How? This is the part where I have three versions of the story, which I’ll go through shortly. For now, let’s just assume that there is a wire going into the subcortex, and when it’s firing, that means the neocortex is activating the parts of the predictive model that correspond (semantically) to tasting salt.
And once we have that, the last ingredient is simple: The subcortex has an innate, hardwired circuit that says “If the neocortex is imagining tasting salt, and I am currently salt-deprived, then send a reward to the neocortex.”
OK! So now the experiment begins. The rat is salt-deprived, and it sees the lever appear. That naturally evokes its previous memory of tasting salt, and that thought is rewarded! When the rat imagines walking over and nibbling the lever, it finds that to be a very pleasing (high-reward-prediction) thought indeed! So it goes and does it!
(UPDATE: Commenters point out that this description isn't quite right—it doesn't make sense to say that the idea of tasting salt is rewarding per se. Rather, I propose that the subcortex sends a reward related to the time-derivative of how strongly the neocortex is imagining / expecting to taste salt. So the neocortex gets a reward for first entertaining the idea of tasting salt, and another incremental reward for growing that idea into a definite plan. But then it would get a negative reward for dropping that idea. Sorry for the mistake / confusion. Thanks commenters!)
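Here is a minimal sketch of the updated version of that hardwired circuit, where the reward tracks the change in the "imagining salt" signal and not its level. The function names, the `gain` parameter, and the choice to return zero reward when the rat is not salt-deprived are my simplifying assumptions for illustration; the real circuit, if it exists, is surely messier.

```python
def subcortex_reward(prev_signal, signal, salt_deprived, gain=1.0):
    """Reward proportional to the change in the 'imagining salt' signal, gated by salt deprivation."""
    if not salt_deprived:
        # Simplification: this sketch ignores whatever happens when not salt-deprived.
        return 0.0
    return gain * (signal - prev_signal)


def rewards_over_time(signals, salt_deprived):
    """Apply the circuit to a sequence of signal strengths, starting from a baseline of 0."""
    rewards = []
    prev = 0.0
    for signal in signals:
        rewards.append(subcortex_reward(prev, signal, salt_deprived))
        prev = signal
    return rewards


# Hypothetical trace: idle, entertain the idea, commit to a plan, hold the plan, drop it.
trace = [0.0, 0.5, 1.0, 1.0, 0.0]
print(rewards_over_time(trace, salt_deprived=True))   # [0.0, 0.5, 0.5, 0.0, -1.0]
print(rewards_over_time(trace, salt_deprived=False))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The positive rewards arrive when the idea is first entertained and again when it firms up into a plan, and the negative reward arrives when the idea is dropped, matching the description above.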
Now let's fill in that missing ingredient: How does the subcortex get its hands on a signal flagging that the neocortex is imagining the taste of salt? I have three hypotheses.
Hypothesis 1 for the “imagining taste of salt” signal: The neocortex API enables outputting a prediction for any given input channel
This was my first theory, I guess from last year. As argued by the “predictive coding” people, Jeff Hawkins, Yann LeCun, and many others, the neocortex is constantly predicting what input signals it will receive next, and updating its models when the predictions are wrong. This suggests that it should be possible to stick an arbitrary input line into the neocortex, and then pull out a signal carrying the neocortex’s predictions for that input line. (It would look like a slightly-earlier copy of the input line, with sporadic errors for when the neocortex is surprised.) I can imagine, for example, that if you put an input signal into cortical mini-column #592843 layer 4, then you look at a certain neuron in the same mini-column, you find those predictions.
If this is the case, then the rest is pretty straightforward. The genome wires the salt taste bud signal to wherever in the neocortex, pulls out the corresponding prediction, and we're done! For the reason described above, that line will also fire when merely imagining salt taste.
Commentary on hypothesis 1: I have mixed feelings.
On the one hand, I haven’t really come across any independent evidence that this mechanism exists. And, having learned more about the nitty-gritty of neocortex algorithms (the outputs come from layer 5, blah blah blah), I don’t think the neocortex outputs carry this type of data.
On the other hand, I have a strong prior belief that if there are ten ways for the brain to do a certain calculation, and each is biologically and computationally plausible without dramatic architectural change, the brain will do all ten! (Probably in ten different areas of the brain.) After all, evolution doesn't care much about keeping things elegant and simple. I mean, there is a predictive signal for each input—it has to be there somewhere! And I don’t currently see any reason that this signal couldn’t be extracted from the neocortex. So I feel sorta obligated to believe that this mechanism probably exists.
So anyway, all things considered, I don’t put much weight on this hypothesis, but I also won’t strongly reject it.
With that, let’s move on to the later ideas that I like better.
Hypothesis 2 for the “neocortex is imagining the taste of salt” signal: The neocortex is rewarded for “communicating its thoughts”
This was my second guess, I guess dating to several months ago.
The neocortex subsystem has a bunch of output lines for motor control and whatever else, and it has a special output line S (S for salt).
Meanwhile, the subcortex sends rewards under various circumstances, and one of those things is that the neocortex is rewarded for sending a signal into S whenever salt is tasted. (The subcortex knows when salt is tasted, because it gets a copy of that same input.)
So now, as the rat lives its life, it stumbles upon the behavior of outputting a signal into S when eating a bite of saltier-than-usual food. This is reinforced, and gradually becomes routine.
The rest is as before: when the rat imagines a salty taste, it reuses the same model. We did it!
Commentary on hypothesis 2: A minor problem (from the point-of-view of evolution) is that it would take a while for the neocortex to learn to send a signal into S when eating salt. Maybe that’s OK.
A much bigger potential problem is that the neocortex could learn a pattern where it sends a signal into S when tasting salt, and also learns a different pattern where it sends a signal into S whenever salt-deprived, whether thinking about salt or not. This pattern would, after all, be rewarded, and I can’t immediately see how to stop it from developing.
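The worry can be spelled out with a toy reward tally. Everything here is an illustrative assumption: the two reward terms of size 1.0, the made-up list of situations, and the use of the simple level-based version of the hardwired circuit (not the time-derivative version from the update above).

```python
def step_reward(s_fires, tasting_salt, salt_deprived):
    """Total reward for one moment under hypothesis 2 plus the hardwired salt circuit."""
    reward = 0.0
    if s_fires and tasting_salt:
        reward += 1.0  # hypothesis 2: rewarded for signaling on S while salt is tasted
    if s_fires and salt_deprived:
        reward += 1.0  # hardwired circuit: S active while salt-deprived
    return reward


def honest_policy(tasting_salt, imagining_salt, salt_deprived):
    """Fire S only when tasting or imagining salt."""
    return tasting_salt or imagining_salt


def exploit_policy(tasting_salt, imagining_salt, salt_deprived):
    """Also fire S whenever salt-deprived, whether thinking about salt or not."""
    return tasting_salt or imagining_salt or salt_deprived


def total_reward(policy, situations):
    return sum(
        step_reward(policy(tasting, imagining, deprived), tasting, deprived)
        for tasting, imagining, deprived in situations
    )


# Hypothetical moments: (tasting_salt, imagining_salt, salt_deprived)
situations = [
    (True, False, False),   # eating salty food
    (False, False, False),  # nothing salt-related
    (False, True, True),    # deprived and imagining salt
    (False, False, True),   # deprived, thinking about something else
    (False, False, True),   # deprived, thinking about something else
]

print(total_reward(honest_policy, situations))   # 2.0
print(total_reward(exploit_policy, situations))  # 4.0
```

In this toy setup the exploit policy collects strictly more reward, so nothing in the reward structure itself pushes the neocortex toward the honest pattern.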
So I’m pretty skeptical about this hypothesis now.
Hypothesis 3 for the “neocortex is imagining the taste of salt” signal (my favorite!): Sorta an “interpretability” approach, probably involving the amygdala
This one comes out of my last post, Supervised Learning of Outputs in the Brain. Now we have a separate brain module that I labeled “supervised learning algorithm”, and which I suspect is primarily located in the amygdala. This module does supervised learning: the salt signal (from the taste buds) functions as the supervisory signal, and a random assortment of neurons in the neocortex subsystem (describing latent variables in the neocortex’s predictive model) function as the inputs to the learned model. Then the supervised learning module learns which patterns in those latent variables tend to reliably predict that salt is about to be tasted. Having done that, when it sees those patterns recur, that’s our signal that the neocortex is probably expecting the taste of salt … and as described above, it will also see those same patterns when the neocortex is merely imagining or remembering the taste of salt. So we have our signal!
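As a rough sketch of what such a module might compute, here is a tiny logistic-regression learner in plain Python. The taste signal is the supervisory label and a handful of binary "latent variables" are the inputs. The three latent features, the training examples, and the hyperparameters are all invented for illustration, and I am not claiming the amygdala does logistic regression.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, latents):
    """Estimated probability that salt is about to be tasted, given the latent activations."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, latents)))


def train(examples, n_features, epochs=500, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for latents, salt_tasted in examples:
            error = predict(weights, bias, latents) - salt_tasted
            for i, x in enumerate(latents):
                weights[i] -= lr * error * x
            bias -= lr * error
    return weights, bias


# Hypothetical latent variables: [salty-taste concept active, lever in view, grooming]
# Label: 1 if the taste buds reported salt at that moment, else 0.
experiences = [
    ([1, 0, 0], 1),
    ([1, 1, 0], 1),
    ([1, 0, 1], 1),
    ([0, 1, 0], 0),
    ([0, 0, 1], 0),
    ([0, 1, 1], 0),
    ([0, 0, 0], 0),
]

weights, bias = train(experiences, n_features=3)

# Later, the rat merely imagines salt: the same latent pattern is active with no actual taste.
imagining_salt = [1, 0, 0]
thinking_of_lever_only = [0, 1, 0]
salt_signal = predict(weights, bias, imagining_salt)
no_salt_signal = predict(weights, bias, thinking_of_lever_only)
```

The learned predictor only looks at the latent variables, so it responds the same way whether the salty-taste pattern was evoked by actual food, by expectation, or by imagination. That is what lets its output serve as the "neocortex is imagining the taste of salt" wire.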
Commentary on Hypothesis 3: There’s a lot I really like about this. It seems to at-least-vaguely match various things I’ve seen in the literature about the functionality and connectivity of the amygdala. It makes a lot of sense from a design perspective—the patterns would be learned quickly and reliably, etc., as far as I can tell. I find it satisfyingly obvious and natural (in retrospect). So I would put this forward as my favorite hypothesis by far.
It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated "interpretability" module that tries to make sense of the AGI's latent variables by correlating them with some other labeled properties of the AGI's inputs, and then rewarding the AGI for "thinking about the right things" (according to the interpretability module's output), which in turn helps turn those thoughts into the AGI's goals, using the time-derivative reward-shaping trick as described above.
(Is this a good design idea that AGI programmers should adopt? I don't know, but I find it interesting, and at least worthy of further thought. I don't recall coming across this idea before in the context of the inner alignment problem.)
(Update 6 months later: I'm now more confident that this hypothesis is basically right, except maybe I should have said "medial prefrontal cortex and ventral striatum" where I said "amygdala". Or maybe it's all of the above. Anyway, see my later post Big Picture Of Phasic Dopamine.)
What would other possible explanations for the rat experiment look like?
The theoretical follow-up by Dayan & Berridge is worth reading, but I don’t think they propose any real answers, just lots of literature and interesting ideas at a somewhat-more-vague level.
(Update to add this paragraph) Next: At the top I mentioned "a version of model-based RL where the agent can submit arbitrary hypothetical queries to the true reward function" (this category includes AlphaZero). If the neocortex had a black-box ground-truth reward calculator (not a learned-from-observations model of the reward) and a way to query it, that would seem to resolve the mystery of how the rats knew to get the salt. But I can't see how this would work. First, the ground-truth reward is super complicated. There are thousands of pain receptors, there are hormones sloshing around, there are multiple subcortical brain regions doing huge complicated calculations involving millions of neurons that provide input to the reward calculation (I believe), and so on. You can learn to model this reward-calculating system by observing it, of course, but actually running this system (or a copy of it) on hypotheticals seems unrealistic to me. Second, how exactly would you query the ground-truth reward calculator? Third, there seems to be good evidence that the neocortex subsystem chooses thoughts and actions based on reward predictions that are updated by TD learning, and I can't immediately see how you can simultaneously have that system and a mechanism that chooses thoughts and actions by querying a ground-truth reward calculator. I think my preferred mechanism, "reward depends in part on what you're thinking" (which we know is true anyway), is more plausible and flexible than "your imagination has special access to the reward function".
Next: What would Steven Pinker say? He is my representative advocate of a certain branch of cognitive neuroscience, a branch to which I do not subscribe. Of course I don’t know what he would say, but maybe it’s a worthwhile exercise for me to at least try. Well, first, I think he would reject the idea that there's a “neocortex subsystem”. And I think he would more generally reject the idea that there is any interesting question along the lines of "how does the reward system know that the rat is thinking about salt?". Of course I want to pose that question, because I come from a perspective of “things like this need to be learned from scratch” (again see My Computational Framework for the Brain). But Pinker would not be coming from that perspective. I think he wants to assume that a comparatively elaborate world-modeling infrastructure is already in place, having been hardcoded by the genome. So maybe he would say there's a built-in “diet module” which can model and understand food, taste, satiety, etc., and he would say there's a built-in “navigation module” which can plan a route to walk over to the lever, and he would say there's a built-in “3D modeling module” which can make sense of the room and lever, etc. etc.
OK, now that possibly-strawman-Steven-Pinker has had his say in the previous paragraph, I can respond. I don't think this is so far off as a description of the calculations done by an adult brain. In ML we talk about “how the learning algorithm works” (SGD, BatchNorm, etc.), and separately (and much less frequently!) we talk about “how the trained model works” (OpenAI Microscope, etc.). I want to put all that infrastructure in the previous paragraph at the "trained model" level, not the "learning algorithm" level. Why? First, because I think there’s pretty good evidence for cortical uniformity. Second—and I know this sounds stupid—because I personally am unable to imagine how this setup would work in detail. How exactly do you insert learned content into the innate framework? How exactly do you interface the different modules with each other? And so on. Obviously, yes I know, it’s possible that answers exist, even if I can’t figure them out. But that’s where I’m at right now.