Big picture of phasic dopamine

[-]Dalcy2y50

What are the errors in this essay? As I'm reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)

I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine rather than the reference being the sequence posts (which is heavily dependent on a bunch of previous posts; also Post 5 and 6 feels more high-level than this one?)

[-]Steven Byrnes2y20

I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity).

For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward prediction error” section has a very strange proposal for how RPE works; I don’t even remember why I ever believed that.

The main strengths of the post are the “normative” discussions: why might supervised learning be useful? why might more than one reward signal be useful? etc. I mostly stand by those. I also stand by “learning from scratch” being a very useful concept, and elaborated on it much more later.

[-]Michaël Trazzi5y30

Awesome post! I happen to also have tried to distill links between RPE and phasic dopamine in the "Prefrontal Cortex as a Meta-RL System" of this blog.

In particular I reference this paper on DL in the brain & this other one for RL in the brain. Also, I feel like the part 3 about links between RL and neuro of the RL book is a great resource for this.

[-]Steven Byrnes5y10

Thanks!

If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016.

I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and more importantly I just don't find that thinking in those terms leads me to a better understanding of anything, but whatever, YMMV.

I don't agree with everything in the RL book chapter but it's still interesting, thanks for the link.

[-]Michaël Trazzi5y10

Right I just googled Marblestone and so you're approaching it with the dopamine side and not the acetylcholine. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine based. I haven't read in detail this post & the one linked, I'll comment again when I do, thanks!

[-]buckithed3y20

I haven't read the entire thing yet, so maybe I am missing something, but isn't the globus pallidus inhibitory? In this you stated that it amplifies signals. there should be a path from the cortex to the subthalamic nucleus that where the globus pallidus shuts down the idea fed to striatum. I like to think of the striatum as the dad, and the globus pallidus as the mom. Then the cortex is the idea that the kids(thalamus) come up with. The dad and mom both see the idea, and the mom always says no, unless the dad convinces her to do the thing. The globus pallidus internal can also send the thalamus to time out to suppress new ideas.

[-]Steven Byrnes3y20

My main answer is that the “toy loop model”s here are pretty bad and shouldn’t be taken literally. I have an updated discussion here (posts 5-6 mostly), but even that has some issues; I made more progress in the last six months that I haven’t written up yet.

I’m more confident in the “‘Context’ in the striatum value function” section here. The convergence of many striatal neurons onto few “final answer” neurons (in both pallidum and SNr) seems pretty central to me. Kinda vaguely like the striatum is the final hidden layer, and pallidum / SNr / whatever neurons are “heads”, in a loose ML analogy.

To answer your question slightly, I’m working at a pretty high level (Marr’s “algorithm level”, I suppose) here. It’s possible to have a signal which is best thought of as exciting something, but is actually implemented by an inhibitory connection. For example, it could be “disinhibitory” (inhibiting an inhibitor). Swanson 2000 does indeed claim that pallidum-to-brainstem signals are disinhibitory, specifically by inhibiting the inhibitory striatum-to-brainstem signals.

But anyway, yeah, I would read this post as kinda “early attempt” rather than correct. A lot of the details are very much wrong. I’ll make the top-note more prominent.

[-]Charlie Steiner5y20

One thing that strikes me as odd about this model is that it doesn't have the blessing of dimensionality - each plan is one loop, and evaluating feedback to a winning plan just involves feedback to one loop. When it's general reward we can simplify this with just rewarding recent winning plans, but in some places it seems like you do imply highly specific feedback, for which you need N feedback channels to give feedback on ~N possible plans. The "blessing of dimensionality" kicks in when you can use more diverse combinations of a smaller number of feedback channels to encode more specific feedback.

Maybe what seems to be specific feedback is actually a smaller number of general types? Like rather than specific feedback to snake-fleeing plans or whatever, a broad signal (like how Success-In-Life Reward is a general signal rewarding whatever just got planned) could be sent out that means "whatever the amygdala just did to make the snake go away, good job" (or something). Note that I have no idea what I'm talking about.

[-]Steven Byrnes5y10

Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex.

Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional space of possible thoughts to think, just like RL. In this case, it's not "each plan is one loop", because there's a combinatorial explosion of possible thoughts you can think, and there are not enough loops for that. (It also wouldn't work because for pretty much every thought you think, you've never thought that exact thought before—like you've never put on this particular jacket while humming this particular song and musing about this particular upcoming party...) Instead I think compositionality is involved, such that one plan / thought can involve many simultaneous loops.

[-]Charlie Steiner5y10

How does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot.

SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain outputs to correspond to certain behaviors in the first place, all with only one wire to each location! Maybe you could do this with a single signal that means both "imitate the current behavior" and also "learn to do your behavior in this context"? Alternatively we might imagine some separate mechanism for of priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.

[-]Steven Byrnes5y20

I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.

I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?

Well, I'm very much not an expert on how the brain wires itself up. But I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess?

Alternatively we might imagine some separate mechanism for of priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.

Hmm, one reasonable (to me) possibility along these lines would be something like: "VTA has 20 dopamine output signals, and they're guided to wind up spread out across the amygdala, but not with surgical precision. Meanwhile the corresponding amygdala loops terminate in an "input zone" of the lateral hypothalamus, but not to any particular spot, instead they float around unsure of exactly what hypothalamus "entry point" to connect to. And there are 20 of these intended "entry points" (collections of neurons for flinching, scowling, etc.). OK, then during embryonic development, the entry-point neurons are firing randomly, and that signal goes around the loop—within the hypothalamus and to VTA, then up to the amygdala, then back down to that floating neuron. Then Hebbian learning—i.e. matching the random code—helps the right loop neuron find its way to the matching hypothalamus entry point."

I'm not sure if that's exactly what you're proposing, but that seems like a perfectly plausible way for the brain to orchestrate these connections during embryonic development. I do have a hunch that this isn't what happens, that the real mechanism is "molecular signaling" instead. But like I said, I'm not an expert, and I certainly wouldn't be shocked to learn that the brain embryonic wiring mechanism involves this kind of thing where it closes a loop by sending a random code around the loop and Hebbian-learning the final connection.

[-][anonymous]5y20

This was an amazing article, thank you for posting it!

Side tangent: There’s an annoying paradox that: (1) In RL, there’s no “zero of reward”, you can uniformly add 99999999 to every reward signal and it makes no difference whatsoever; (2) In life, we have a strong intuition that experiences can be good, bad, or neutral; (3) ...Yet presumably what our brain is doing has something to do with RL! That “evolutionary prior” I just mentioned is maybe relevant to that? Not sure … food for thought ...

The above isn't quite true in all senses in all RL algorithms. For example, in policy gradient algorithms (http://www.scholarpedia.org/article/Policy_gradient_methods for a good but fairly technical introduction) it is quite important in practice to subtract a baseline value from the reward that is fed into the policy gradient update. (Note that the baseline can be and most profitably is chosen to be dynamic - it's a function of the state the agent is in. I think it's usually just chosen to be V(s) = max Q(s,a).) The algorithm will in theory converge to the right value without the baseline, but subtracting the baseline speeds convergence up significantly. If one guesses that the brain is using a policy-gradients-like algorithm, a similar principle would presumably apply. This actually dovetails quite nicely with observed human psychology - good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in. For example, many people get shitty when it turns out they aren't going to end up having sex that they thought they were going to have - so here the theory would be that the baseline value was actually quite high (they were anticipating a peak experience) and the policy gradients update will essentially treat this as an aversive stimulus, which makes no sense without the existence of the baseline.

It's closer to being true of Q-learning algorithms, but here too there is a catch - whatever value you assign to never-before-seen states can have a pretty dramatic effect on exploration dynamics, at least in tabular environments (i.e. environments with negligible generalization). So here too one would expect that there is a evolutionarily appropriate level of optimism to apply to genuinely novel situations about which it is difficult to form an a priori judgment, and the difference between this and the value you assign to known situations is at least probably known-to-evolution.

[-]Steven Byrnes5y10

That's interesting, thanks!

good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in.

I agree that this is a very important dynamic. But I also feel like, if someone says to me, "I keep a kitten in my basement and torture him every second of every day, but it's no big deal, he must have gotten used to it by now", I mean, I don't think that reasoning is correct, even if I can't quite prove it or put my finger on what's wrong. I guess that's what I was trying to get at with that "evolutionary prior" comment: maybe there's a hardcoded absolute threshold such that you just can't "get used to" being tortured, and set that as your new baseline, and stop actively disliking it? But I don't know, I need to think about it more, there's also a book I want to read on the neuroscience of pleasure and pain, and I've also been meaning to look up what endorphins do to the brain. (And I'm happy to keep chatting here!)

I don't have a full explanation of comparing-to-baseline. At first I was gonna say "it's just the reward-prediction-error thing I described: if you expect candy based on your beliefs at 5:05:38, and then you no longer expect candy based on your beliefs at 5:05:39, then that's a big negative reward prediction error. (Because the reward-predictor makes its prediction based on slightly-stale brain status information.) But that doesn't explain why maybe we still feel raw about it 3 minutes later. Maybe it's like, you had this active piece-of-a-thought "I'm gonna get candy", but it's contradicted by the other piece-of-a-thought "no I'm not", but that appealing piece-of-a-thought "I'm gonna get candy" keeps popping back up for a while, and then keeps getting crushed by reality, and the net result is a bad feeling. Or something? I dunno.

Oh, I think there's also a thing where the brainstem can force the high-level planner to think about a certain thing; like if you get poked on the shoulder it's kinda impossible to ignore. I think I have an idea of what mechanism is involved here … involving acetylcholine and how specific and confident the top-down predictions are, I'm hoping to write this up soon … That might be relevant too. Like if you're being tortured then you can't think about anything else, because of this mechanism. Then that would be like an objective sense in which you can't get used to a baseline of torture the way you can get used to other things.

[-]Vaniver4y10

But in experiments, they’re not synchronized; the former happens faster than the latter.

This has the effect of incentivizing learning, right? (A system that you don't yet understand is, in total, more rewarding than an equally yummy system that you do understand.) So it reminds me of exploration in bandit algorithms, which makes sense given the connection to motivation.

[-]Steven Byrnes4y10

Hmm, I guess I mostly disagree because:

I see this as sorta an unavoidable aspect of how the system works, so it doesn't really need an explanation;
You're jumping to "the system will maximize sum of future rewards" but I think RL in the brain is based on "maximize rewards for this step right now" (…and by the way "rewards for this step right now" implicitly involves an approximate assessment of future prospects.) See my comment "Humans are absolute rubbish at calculating a time-integral of reward".
I'm all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.

[-]Vaniver4y10

I guess my sense is that most biological systems are going to be 'package deals' instead of 'cleanly separable' as much as possible--if you already have a system that's doing learning, and you can tweak that system in order to get something that gets you some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.

[-]Steven Byrnes4y20

I agree about the general principle, even if I don't think this particular thing is an example because of the "not maximizing sum of future rewards" thing.

[-]Hoagy5y10

Cheers for the post, I find the whole series fascinating.

One thing I was particularly curious about is how these 'proposals' are made. Do you have a picture of what kind of embedding is used to present a potential action?

For example, is a proposal encoded in the activations of set of neurons that are isomorphic to the motor neurons and it could then propose tightening a set of finger muscles through specific neurons? Or is the embedding jointly learned between the two in some large unstructured connection, or smaller latent space, or something completely different?

[-]Steven Byrnes5y30

The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content in different in different species), and a "proposal" is just "which of the discrete things should be activated right now".

A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary motor cortex, supplementary motor cortex, etc. etc., are all doing the same kind of thing and should be lumped together.) The exact details of the data structures that the brain uses to store these sequences of motor commands are controversial and I don't want to get into it here…

Then the hardest case is the areas that "think thoughts", spawn new ideas, etc., all the cool stuff that leads to human intelligence. (e.g. dorsolateral prefrontal cortex I think.) Things like "I'm going to go to the store" or "what if I differentiate both sides of the equation?". Those things are clearly not isomorphic to a sequence of motor commands. It's higher-level than that. Again, the exact data structures and algorithms involved in representing and searching for these "thoughts" is a very big and controversial topic that I don't want to get into here…

Cortex-like part of the loops	Hippo- campus	Amygdala [basolateral part]	Piriform cortex	Ventromedial prefrontal cortex	Motor & “planning” cortex
Striatum-like part of the loops	Lateral septum	Amygdala [central part]	Olfactory tubercle	Ventral striatum	Dorsal striatum
Pallidum-like part of the loops	Medial septum	BNST	Substantia innominata	Ventral pallidum	Globus pallidus

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

23

23

Summary / Table of Contents

Backstory: Dopamine and TD learning on computers and in animals

Reward Prediction Errors (RPEs), and Temporal Difference (TD) learning

The Schultz dopamine experiments

Some wrinkles in the story

The cortico-basal ganglia-thalamo-cortical loop

The loop

The telencephalon: simpler than it looks, and full of loops

Telencephalic “learning-from-scratch-ism”

Telencephalic uniformity? Nah, let’s call it “family resemblance”

Toy loop model

“Context” in the striatum value function

Three categories of dopamine signals

Overview: why heterogeneous dopamine?

Dopamine category #1: RPE for “Success-In-Life Reward”

Dopamine category #2: RPE for “local” sub-circuit rewards

Example 2A: Birds teaching themselves to sing by RL

Example 2B: Animals teaching themselves to move smoothly and efficiently by RL

Example 2C: Visual attention

Dopamine category #3: Supervised Learning (SL) error signals

Finally, the “prediction” part of reward prediction error

Zooming back out: lessons for Artificial General Intelligence (AGI) safety