All of Dalcy's Comments + Replies

a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn't matter if you actually do know which way the agent chooses to go.

The usual definition of Markov blankets is in terms of the model without any knowledge of the specific values, as you say, but I think in Critch's formalism this isn't the case. Specifically, he defines the 'Markov Boundary' of $W_t$ (being the non-abstracted physics-ish model) as a function of the random variable $W_t$ (where he wri... (read more)
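For reference, the usual fixed-variable notion being contrasted with Critch's here is the standard one: for a variable $X$ in a Bayes net over a variable set $V$, a subset $B \subseteq V \setminus \{X\}$ is a Markov blanket of $X$ iff

$$X \perp\!\!\!\perp \; V \setminus (\{X\} \cup B) \;\mid\; B,$$

and the Markov boundary is the minimal such $B$. Crucially, $B$ here is a fixed choice of variables made at the level of the model, not a function of any realized values, which is exactly the point of contrast being raised.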

Abram Demski
Critch's formalism isn't a Markov blanket anyway, as far as I understand it, since he cares about approximate information boundaries rather than perfect Markov properties. Possibly he should not have called his thing "directed Markov blankets", although I could be missing something. If I take your point in isolation, and try to imagine a Markov blanket where the variables of the boundary $B_t$ can depend on the value of $W_t$, then I have questions about how you define conditional independence, to generalize the usual definition of Markov blankets. My initial thought is that your point will end up equivalent to John's comment. IE we can construct random variables which allow us to define Markov blankets in the usual fixed way, while still respecting the intuition of "changing our selection of random variables depending on the world state".
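A sketch of one way to cash out the "construct random variables" move (this is one possible reading, not necessarily John's or Critch's construction): if $b$ is the (measurable) boundary-selection function, define a new random variable

$$B_t := b(W_t),$$

which is an ordinary fixed random variable on the same probability space even though its value tracks the world state. The usual blanket condition can then be stated with respect to $B_t$, e.g. requiring $I_t \perp\!\!\!\perp O_t \mid B_t$ for whatever "inside"/"outside" variables $I_t, O_t$ the formalism designates (placeholder names here, not Critch's notation).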

What are the errors in this essay? As I'm reading through the Brain-like AGI sequence, I keep seeing this post being referenced (but this post says I should instead read the sequence!)

I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine, rather than the reference being the sequence posts (which are heavily dependent on a bunch of previous posts; also, Posts 5 and 6 feel more high-level than this one?)

Steve Byrnes
I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity). For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward prediction error” section has a very strange proposal for how RPE works; I don’t even remember why I ever believed that. The main strengths of the post are the “normative” discussions: why might supervised learning be useful? why might more than one reward signal be useful? etc. I mostly stand by those. I also stand by “learning from scratch” being a very useful concept, and elaborated on it much more later.
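For context, the standard temporal-difference formulation of reward prediction error (background here, not necessarily the proposal that the post or the later series endorses) is

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

and the "more than one reward signal" idea can be read as maintaining a separate error $\delta^{(k)}_t = r^{(k)}_t + \gamma V_k(s_{t+1}) - V_k(s_t)$ for each local sub-circuit reward $r^{(k)}_t$ (notation mine, not the post's).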

Especially because we’re working with toy models that ostensibly fit the description of an optimizer, we may end up with a model that mechanistically doesn’t have an explicit notion of objective.

I think this is very likely to be the default for most toy models one trains with RL. In my model of agent value formation (which looks very much like this post), explicit representation of objectives is useful inasmuch as the model already has some sort of internal "optimizer" or search process. And before that, simple "heuristics" (or shards) should suffice, especially in small training regimes.
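As a concrete illustration of the "simple heuristics suffice" point, here is a minimal sketch (a made-up 1-D gridworld, not anything from the post): a tabular Q-learner ends up with nothing more than a lookup table of state-action values, with no internal search process and no explicit representation of an objective at inference time.

```python
import random

# Made-up environment: a 1-D gridworld where the agent starts at cell 0 and gets
# reward 1 for reaching cell N-1. Tabular Q-learning ends up with a lookup table
# Q[state][action]; the resulting policy is a pure reflex table.

N = 10                      # number of cells
ACTIONS = [-1, +1]          # move left / move right
ALPHA, GAMMA = 0.1, 0.95

Q = [[0.0, 0.0] for _ in range(N)]

def step(state, action_idx):
    next_state = max(0, min(N - 1, state + ACTIONS[action_idx]))
    reward = 1.0 if next_state == N - 1 else 0.0
    return next_state, reward, next_state == N - 1

for episode in range(300):
    eps = max(0.05, 1.0 - episode / 200)   # decay exploration over time
    s = 0
    for _ in range(500):                    # step cap so every episode terminates
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: Q[s][i])
        s2, r, done = step(s, a)
        # TD update: nudge Q[s][a] toward reward plus discounted best next value
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2
        if done:
            break

# The learned "policy" is just a lookup: state -> locally preferred action.
policy = [max(range(2), key=lambda i: Q[s][i]) for s in range(N)]
print(policy)  # expected: 1s (move right) for the non-terminal cells
```

The point of the sketch is that nothing in the learned artifact encodes "reach the last cell"; explicitly representing the objective would only start paying off once there is internal machinery (planning or search) that could consume it.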

Arun Jose
Yeah, this is definitely something I consider plausible. But I don't have a strong stance, because RL mechanics could lead to an internal search process even in toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and would plausibly alleviate this, and it's one of the things I'm planning to work on.

I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.

Wait, that doesn't really follow. RLHF can elicit more capabilities than prompt engineering, yes, but how is that a reason for RLHF being safer than prompt engineering?

davidad (David A. Dalrymple)
Here's my steelman of this argument:

1. There is some quantity called a "level of performance".
2. A certain level of performance, $P_1$, is necessary to assist humans in ending the acute risk period.
3. A higher level of performance, $P_2$, is necessary for a treacherous turn.
4. Any given alignment strategy $A$ is associated with a factor $\lambda_A \in [0,1]$, such that it can convert an unaligned model with performance $P$ into an aligned model with performance $\lambda_A P$.
5. The maximum achievable performance of unaligned models increases somewhat gradually as a function of time, $P(t)$.
6. Given two alignment strategies $A$ and $B$ such that $\lambda_A > \lambda_B$, it is more likely that $\exists t.\ \lambda_A P(t) \geq P_1 \wedge P(t) < P_2$ than that $\exists t.\ \lambda_B P(t) \geq P_1 \wedge P(t) < P_2$.
7. Therefore, a treacherous turn is less likely in a world with alignment strategy $A$ than in a world with only alignment strategy $B$.

I'm pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.
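To see what premise 6 is claiming, here is a toy numerical check with entirely made-up numbers (the thresholds, the $\lambda$ values, and the capability curve $P(t)$ are all hypothetical, chosen only to make the window visible):

```python
# Toy illustration of premise 6: with a higher lambda, there is more likely to be
# some time t where the aligned model clears the "end the acute risk period" bar P1
# while the unaligned base model is still below the "treacherous turn" bar P2.

P1, P2 = 100.0, 150.0          # hypothetical performance thresholds
lambda_A, lambda_B = 0.9, 0.5  # hypothetical performance-retention factors for A and B

def P(t):
    """Hypothetical gradually-increasing unaligned capability curve."""
    return 10.0 * t

def safe_window(lam):
    # times t (on a coarse grid) where lam*P(t) >= P1 and P(t) < P2
    return [t for t in range(31) if lam * P(t) >= P1 and P(t) < P2]

print("strategy A window:", safe_window(lambda_A))  # non-empty: t in {12, 13, 14}
print("strategy B window:", safe_window(lambda_B))  # empty: by the time 0.5*P(t) >= P1, P(t) >= P2
```

The argument's force then depends entirely on how plausible it is that such a window exists at all, which is what the skepticism about premises 3 through 5 is pointing at.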
Here's my steelman of this argument: 1. There is some quantity called a "level of performance". 2. A certain level of performance, P1, is necessary to assist humans in ending the acute risk period. 3. A higher level of performance, P2, is necessary for a treacherous turn. 4. Any given alignment strategy A is associated with a factor λA∈[0,1], such that it can convert an unaligned model with performance P into an aligned model with performance λAP. 5. The maximum achievable performance of unaligned models increases somewhat gradually as a function of time P(t). 6. Given two alignment strategies A and B such that λA>λB, it is more likely that ∃t.λAP(t)≥P1∧P(t)<P2 than that ∃t.λBP(t)≥P1∧P(t)<P2. 7. Therefore, a treacherous turn is less likely in a world with alignment strategy A than in a world with only alignment strategy B. I'm pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.