All of Ben Smith's Comments + Replies

I guess this falls into the category of "Well, we'll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined, or even just highly uncertain, one can often defer to other rules: rather than maximizing an uncertain preference, default to maximizing the human's agency, even if this predictably leads to less-than-optimal preference satisfaction.
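
To make the fallback concrete, here's a toy sketch (entirely illustrative: the threshold, the scores, and the action names are all made up, and "agency" is just a placeholder number per action):

```python
UNCERTAINTY_THRESHOLD = 0.3  # hypothetical cutoff for "preference is unclear"

preference_uncertainty = 0.6  # hypothetical: our estimate of the human's preference is shaky

# Hypothetical per-action estimates of preference satisfaction and of how much
# agency the action leaves the human.
actions = {
    "do_it_for_them":  {"pref": 0.9, "agency": 0.2},
    "ask_then_assist": {"pref": 0.7, "agency": 0.8},
    "do_nothing":      {"pref": 0.4, "agency": 1.0},
}

def choose(actions, preference_uncertainty):
    if preference_uncertainty > UNCERTAINTY_THRESHOLD:
        # Preference unclear: default to preserving the human's agency,
        # accepting predictably lower expected preference satisfaction.
        return max(actions, key=lambda a: actions[a]["agency"])
    return max(actions, key=lambda a: actions[a]["pref"])

print(choose(actions, preference_uncertainty))  # -> "do_nothing" under these made-up numbers
```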

Still working my way through this series--it is the best thing I have read in quite a while, and I'm very grateful you wrote it!

I feel like I agree with your take on "little glimpses of empathy" 100%.

I think fear of strangers could maybe be implemented without a Steering Subsystem circuit? (I should say up front that I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway.) Put aside whether there's another, more basic Steering Subsystem circuit for agency detection; we know that pretty early on, through some combi... (read more)

Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see--it's impressive how many different topics you have synthesized.

I have one question on this post--maybe doesn't rise above the level of 'nitpick', I'm not sure. You mention a "curiosity drive" and other Category A things that the "Steering Subsystem needs to do in order to get general intelligence". You've also identified the human Steering Subsystem as the hypothalamus and brain stem.

Is it possible things like a ... (read more)

Steve Byrnes
Thanks! First of all, to make sure we're on the same page, there's a difference between "self-supervised learning" and "motivation to reduce prediction error", right? The former involves weight update, the latter involves decisions and rewards. The former is definitely a thing in the neocortex—I don't think that's controversial.

As for the latter, well I don't know the full suite of human motivations, but novelty-seeking is definitely a thing, and spending all day in a dark room is not much of a thing, and both of those would go against a motivation to reduce prediction error. On the other hand, people sometimes dislike being confused, which would be consistent with a motivation to reduce prediction error. So I figure, maybe there's a general motivation to reduce prediction error (but there are also other motivations that sometimes outweigh it), or maybe there isn't such a motivation at all (but other motivations can sometimes coincidentally point in that direction). Hard to say. ¯\_(ツ)_/¯

I absolutely believe that there are signals from the telencephalon, communicating telencephalon activity / outputs, which are used as inputs to the calculations leading up to the final reward prediction error (RPE) signal in the brainstem. Then there has to be some circuitry somewhere setting things up such that some particular type of telencephalon activity / outputs have some particular effect on RPE. Where is this circuitry? Telencephalon or brainstem?

Well, I guess you can say that if a connection from Telencephalon Point A to Brainstem Point B is doing something specific and important, then it's a little bit arbitrary whether we call this "telencephalon circuitry" versus "brainstem circuitry". In all the examples I've seen, it's tended to make more sense to lump it in with the brainstem / hypothalamus. But it's hard for me to argue that without a better understanding of what you have in mind here.
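
To make that first distinction concrete, a toy sketch (purely illustrative, not a claim about how the brain implements either):

```python
w = 0.5  # single weight of a toy predictive model: prediction = w * x

def self_supervised_update(x, observed, lr=0.1):
    # (1) Self-supervised learning: the prediction error adjusts the model's
    # weights; no reward is involved.
    global w
    error = observed - w * x
    w += lr * error * x

def prediction_error_as_reward(x, observed):
    # (2) "Motivation to reduce prediction error": the prediction error is
    # reported as a (negative) reward, which a separate decision-making process
    # could try to maximize, e.g. by steering toward easy-to-predict situations.
    return -abs(observed - w * x)
```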

Very late to the party here. I don't know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:

 - since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning, maybe centered around . Maybe a good place to start there is https://link.springer.com/article/10.1007/s10458-022-09552-y

 - The shard theory folks have done a fairly good job sketching out broad principles but it seems... (read more)

That's right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning.
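
To make that concrete, here's a minimal sketch of the kind of arrangement I have in mind (an illustration only, not our actual implementation; the state/action sizes, the scalarization function, and the update rule are all placeholders):

```python
import numpy as np

n_states, n_actions, n_objectives = 10, 4, 2
Q = np.zeros((n_objectives, n_states, n_actions))  # one Q-table per objective
alpha, gamma = 0.1, 0.95

def scalarize(q_vec):
    # Hypothetical non-linear combination, e.g. favor the worst-off objective
    # rather than taking a plain weighted sum.
    return q_vec.min() + 0.1 * q_vec.sum()

def select_action(state, epsilon=0.1):
    # Objectives are combined *here*, at action selection...
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    scores = [scalarize(Q[:, state, a]) for a in range(n_actions)]
    return int(np.argmax(scores))

def update(state, action, rewards, next_state):
    # ...while learning stays per-objective: each Q-table gets its own reward
    # component, bootstrapping on the action the scalarized policy would pick.
    next_a = select_action(next_state, epsilon=0.0)
    for i in range(n_objectives):
        td_target = rewards[i] + gamma * Q[i, next_state, next_a]
        Q[i, state, action] += alpha * (td_target - Q[i, state, action])
```

Because the scalarization only ever appears in action selection, you can swap in a different combination rule without relearning any of the per-objective values.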

I guess whether one calls this "multi-objective RL" is semantic. Because objectives are combined during action selection, not during learning itself, I would not call it "single objective RL with a complicated objective". If you combined objectives during reward, then I could call... (read more)

Interesting. Is it fair to say that Mollick's system is relatively more "serial", with less parallelism at the subcortical level, whereas you're proposing a system that's much more "parallel", because there are separate systems doing analogous things at each level? I think that parallel arrangement is probably the thing I've personally learned most from reading your work. Maybe I just hadn't thought about it because I focus too much on valuation and PFC decision-making stuff and don't look broadly enough at movement and other systems.

Apropos of nothing, is... (read more)

Steve Byrnes
Hmm, I guess I'm not really sure what you're referring to. If I recall, V1 isn't involved in basal ganglia loops, and some higher-level visual areas might project to striatum as "context" but not as part of basal ganglia loops. (I'm not 100% clear on the anatomy here though; I think the literature is confusing to me partly because it took me a while to realize that rat visual cortex is a lot simpler than primate, I've heard it's kinda like "just V1".) So that's the message of "Is RL Involved in Sensory Processing?": there's no RL in the visual cortex AFAICT. Instead I think there's predictive learning, see for example Randall O'Reilly's model.

I talk in the main article about "proposal selection". I think the cortex is just full of little models that make predictions about other little models, and/or predictions about sensory inputs, and/or (self-fulfilling) "predictions" about motor outputs. And if a model is making wrong predictions, it gets thrown out, and over time it gets outright deleted from the system. (The proposals are models too.) So if you're staring at a dog, you just can't seriously entertain the proposal "I'm going to milk this cow". That model involves a prediction that the thing you're looking at is a cow, and that model in turn is making lower-level predictions about the sensory inputs, and those predictions are being falsified by the actual sensory input, which is a dog not a cow. So the model gets thrown out. It doesn't matter how high reward you would get for milking a cow, it's not on the table as a possible proposal.

I believe I noted that the within-cortex proposal-selection / predictive learning algorithms are important things, but declared them out of scope for this particular post. The last time I wrote anything about the within-cortex algorithm was I guess last year here. These days I'm more excited by the question of "how might we control neocortex-like algorithms?" rather than "how exactly would a neocortex-like algorithm work?"
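
Here's a toy sketch of that proposal-selection flavor (purely illustrative; the dog/cow entries and reward numbers are made up): each proposal carries predictions about the current sensory input, falsified proposals are discarded before reward is ever consulted, and reward only ranks the survivors.

```python
sensory_input = {"animal": "dog"}

proposals = [
    {"action": "milk the cow", "predicts": {"animal": "cow"}, "reward": 10.0},
    {"action": "pet the dog",  "predicts": {"animal": "dog"}, "reward": 2.0},
    {"action": "walk the dog", "predicts": {"animal": "dog"}, "reward": 1.5},
]

def consistent(proposal, observation):
    # A proposal survives only if none of its predictions are falsified.
    return all(observation.get(k) == v for k, v in proposal["predicts"].items())

def select(proposals, observation):
    # "Milk the cow" is thrown out regardless of how rewarding it would be,
    # because the thing you're looking at is a dog; reward only compares the
    # proposals that are still on the table.
    viable = [p for p in proposals if consistent(p, observation)]
    return max(viable, key=lambda p: p["reward"])

print(select(proposals, sensory_input)["action"])  # -> "pet the dog"
```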