Yeah, this is a pretty common technique at CHAI (relevant search terms: pragmatics, pedagogy, Gricean semantics). Some related work:
I agree that it should be possible to do this over behavior instead of rewards, but behavior-space is much larger or more complex than reward-space and so it would require significantly more data in order to work as well.
I don't think it can be significantly harder for behavior-space than reward-space. If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing". I don't think that's actually the right way to do things, but it should at least give a reduction of the problem.
Anyway, I'd say the most important difference between this and various existing strategies is that we can learn "at the outermost level". We can treat the code as a message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it has and I just didn't know about it), though it's tricky to really make everything subject to override later on.
Of course, the trade-off is that if everything is subject to override then we really need to start in the basin of attraction - there are no hardcoded assumptions to fall back on if things go off the rails. Hence the robustness tradeoff.
If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing".
Yeah, I agree that if we had a space of messages that was expressive enough to encode this, then it would be fine to work in behavior space.
Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.
Roughly, yeah, though there are some differences - e.g. here the AI has no prior "directly about" values, it's all mediated by the "messages", which are themselves informing intended AI behavior directly. So e.g. we don't need to assume that "human values" live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are indeed still hard here; it doesn't really solve anything in itself.
One way to interpret it: this approach uses a similar game to CIRL, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent we're modelling the human as an optimizer, it's just an approximation to kick off communication, and can be discarded later on.
I just want to mention that my recent critique of the definition of communication used here does not imply that this is any more inadequate for alignment than your remarks here suggest; in order to "do what I mean, not what I say," we actually want to include connotations and implicature rather than only the literal meaning.
That being said, a theory of meaning which addressed the critique might potentially open the path for a definition much better than the one here. In particular, it might help address the question of what ontology the beliefs should even be in (in order to represent human values etc).
Nice post! It was clear, and I agree that knowing more about the basin of attraction is useful. I also like that you caveat the usefulness of this idea yourself.
Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button - the AI would try to figure out what we mean, rather than just doing what we say.
This makes me think about Inverse Reward Design, where the reward signal given is interpreted as an intention in the context of the specific training environments.
More generally: each player’s optimal choices depend heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of…. Our $P_k[X \mid \text{“M”}]$ sequence shows what that tower looks like for one particular model of Alice/Bob.
Makes me think of Common Knowledge, as defined for distributed computing: a fact is common knowledge iff everyone knows that everyone knows that ... that everyone knows it. That probably only applies to the idealized case, but it might be another way to look at it.
Alice has one of three objects:
She wants Bob to learn which object she has. However, Alice may only send one of three messages:
The rules of the game (i.e. the available messages) are common knowledge before the game starts. What message should Alice send for each object, and what object should Bob deduce from each message?
Let’s think it through from Bob’s standpoint. A clever human might reason like this:
If you’ve played the game CodeNames, then this sort of reasoning might look familiar: "well, 'blue' seems like a good hint for both sky+sapphire and sky+water, but if it were sky+water they would have said 'weather' or something like that instead, so it's probably sky+sapphire...".
Intuitively, this sort of reasoning follows from a communication prior - a prior that someone is choosing their words in order to communicate. In everyday life, this comes up in e.g. the difference between connotation and denotation: when someone uses a connotation-heavy word, the fact that they used that word rather than some more neutral synonym is itself important information. More generally: the implication of words is not the same as their literal content. A communication prior contains a model of how-and-why-the-words-were-chosen, so we can update on the words to figure out their implications, not just their literal meanings.
Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button - the AI would try to figure out what we mean, rather than just doing what we say. Indeed, we could even have the AI treat its own source code as a signal about what I mean, rather than as instructions to be taken literally - potentially recognizing when the program we wrote is not quite the program we intended, and doing what we intended instead. (Obviously the program itself would need some open-ended introspection/self-modification capabilities to allow this.) As long as the initial code and initial model of me are “close enough”, the AI could figure out what I meant, and we’d have a “basin of convergence” - any close-enough code/model would converge to what we actually intended.
Of course, that all requires formalizing communication priors. This post sketches out a relatively simple version based on the Alice/Bob example above, then talks about the more complicated version needed for alignment purposes, and about what the approach does and does not do.
Formalizing a Communication Prior
We’ll continue to use the Alice/Bob example with the colored shapes from earlier, though we’ll use more general formulas. We’ll call the message M and the intended meaning (i.e. object) X.
Our receiver (i.e. Bob) starts with some naive guess at the meaning X, just based on the literal content of the message - i.e. “My object is red” would, taken literally, imply that it’s either the triangle or the circle. We’ll write this naive guess as
$$P_0[X \mid \text{“M”}] = \frac{1}{P[M]} \, P[M \mid X] \, P[X]$$
This is basically just a Bayesian update. The only subtlety is the quotes around “M” - this makes a distinction between the message “M” (i.e. the letters “My object is red” on a screen) and the literal meaning M of the message (the fact that the object is red). The formula says that the naive guess at the intended meaning given the message (i.e. $P_0[X \mid \text{“M”}]$) is just a Bayesian update on the literal meaning of the message.
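As a minimal sketch of this literal update: the object and message lists are not reproduced in the text here, so the three objects and three messages below are hypothetical stand-ins, chosen only to agree with the statement that “My object is red” is literally true of the triangle and the circle. The names `OBJECTS`, `LITERAL` and `naive_posterior` are likewise made up for illustration.

```python
from fractions import Fraction

# Hypothetical stand-in game (NOT the post's original lists).
OBJECTS = ["red triangle", "red circle", "blue square"]

# Literal meaning M of each message "M": the set of objects it is true of.
LITERAL = {
    "My object is red": {"red triangle", "red circle"},
    "My object is a triangle": {"red triangle"},
    "My object is blue": {"blue square"},
}


def naive_posterior(message, prior):
    """P_0[X | "M"] = P[M | X] P[X] / P[M], with P[M | X] = 1 if M is true of X, else 0."""
    unnormalized = {
        x: (prior[x] if x in LITERAL[message] else Fraction(0)) for x in OBJECTS
    }
    p_m = sum(unnormalized.values())  # P[M], the normalizer
    return {x: p / p_m for x, p in unnormalized.items()}


uniform = {x: Fraction(1, 3) for x in OBJECTS}
p0_red = naive_posterior("My object is red", uniform)
for obj, prob in p0_red.items():
    print(obj, prob)
```

With a uniform prior, the literal reading of “My object is red” splits the probability evenly between the two red objects and rules out the third.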
At this stage, assuming a uniform prior on the three objects, Bob would say that:
But at this point, Bob hasn’t accounted for all his information. He also knows that Alice chose the message to maximize the chance that Bob would guess the right object. So, let’s do another Bayesian update on the assumption that Alice chose the message to maximize the probability assigned to X under $P_0$.
$$P_1[X \mid \text{“M”}] = \frac{1}{Z} \, P[(\text{“M” maximizes } P_0[X \mid \text{“M”}]) \mid X] \, P_0[X \mid \text{“M”}]$$
(Side note: Z here is a generic symbol for the normalizer in the update, which would normally be $P[\text{“M”}]$. I’ll continue to use it going forward, since the exact things we’re implicitly conditioning $P[\text{“M”}]$ on can be a bit confusing in a way which doesn’t add anything.) This is another Bayesian update, but this time starting from $P_0$ rather than the original prior. At this stage, Bob would say that:
Let’s do one more step, just to illustrate. Bob still hasn’t used all his information - it’s not just that Alice chose the message to maximize the probability assigned to X under $P_0$, she also chose it to maximize the probability assigned to X under $P_1$. How did she choose the message to maximize both of these simultaneously? Well, given our formulas above, if “M” maximizes $P_1[X \mid \text{“M”}]$, then that implies that “M” maximizes $P_0[X \mid \text{“M”}]$ as well. However, the implication does not go back the other way in general; the fact that “M” maximizes $P_1[X \mid \text{“M”}]$ is stronger.
Intuitively, we’re “ruling out” messages for each X at each stage. Any message not ruled out at stage 1 was also not ruled out at stage 0 - the messages “not ruled out” for X are precisely those which assign maximal probability to X at all earlier stages.
Upshot: by choosing “M” to maximize $P_1[X \mid \text{“M”}]$, Alice also implicitly chose “M” to maximize $P_0[X \mid \text{“M”}]$.
Anyway, next step: we form $P_2[X \mid \text{“M”}]$ by updating on the fact that “M” maximizes the probability assigned to X under $P_1$.
$$P_2[X \mid \text{“M”}] = \frac{1}{Z} \, P[(\text{“M” maximizes } P_1[X \mid \text{“M”}]) \mid X] \, P_0[X \mid \text{“M”}]$$
Note that we’re still using $P_0$ as our prior in this update; that’s to avoid double-counting the fact that Alice is maximizing $P_0$, while still accounting for the literal content M. If we continue the chain, each subsequent step will look like
$$P_{k+1}[X \mid \text{“M”}] = \frac{1}{Z} \, P[(\text{“M” maximizes } P_k[X \mid \text{“M”}]) \mid X] \, P_0[X \mid \text{“M”}]$$
In this case, we find that $P_2$ is exactly the same as $P_1$ - the calculation has converged in finite time. More generally, we can say that Bob’s final probabilities should be
$$P[X \mid \text{“M”}] = P_\infty[X \mid \text{“M”}] = \lim_{k \to \infty} P_k[X \mid \text{“M”}]$$
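The chain of updates can be sketched as a short loop. This is only an illustration under stated assumptions: the game is a hypothetical stand-in (the original object and message lists are not reproduced here), the prior over objects is uniform, exact ties between messages are split evenly, and a message that no object would pick falls back to its literal posterior. All function names are made up.

```python
from fractions import Fraction

# Hypothetical stand-in game (NOT the post's original lists).
OBJECTS = ["red triangle", "red circle", "blue square"]
LITERAL = {  # message -> set of objects it is literally true of
    "My object is red": {"red triangle", "red circle"},
    "My object is a triangle": {"red triangle"},
    "My object is blue": {"blue square"},
}
MESSAGES = list(LITERAL)


def literal_posteriors():
    """P_0[X | "M"] for every message, assuming a uniform prior over objects."""
    return {
        m: {x: (Fraction(1, len(LITERAL[m])) if x in LITERAL[m] else Fraction(0))
            for x in OBJECTS}
        for m in MESSAGES
    }


def alice_picks(p_k, m, x):
    """P[("M" maximizes P_k[X | "M"]) | X]; exact ties are split evenly (an assumption)."""
    best = max(p_k[other][x] for other in MESSAGES)
    winners = [other for other in MESSAGES if p_k[other][x] == best]
    return Fraction(1, len(winners)) if m in winners else Fraction(0)


def update(p_k, p_0):
    """P_{k+1}[X | "M"] = (1/Z) * P[("M" maximizes P_k) | X] * P_0[X | "M"]."""
    p_next = {}
    for m in MESSAGES:
        weights = {x: alice_picks(p_k, m, x) * p_0[m][x] for x in OBJECTS}
        z = sum(weights.values())
        # If no object would ever pick this message, keep the literal reading (an assumption).
        p_next[m] = {x: w / z for x, w in weights.items()} if z else dict(p_0[m])
    return p_next


def iterate_to_convergence(max_steps=50):
    """Return (P_k, k) for the first k with P_{k+1} == P_k."""
    p_0 = literal_posteriors()
    p_k = p_0
    for k in range(max_steps):
        p_next = update(p_k, p_0)
        if p_next == p_k:
            return p_k, k
        p_k = p_next
    raise RuntimeError("no convergence within max_steps")


p_inf, k = iterate_to_convergence()
for m in MESSAGES:
    print(m, "->", max(p_inf[m], key=p_inf[m].get))
```

In this stand-in game the loop stops once $P_2$ equals $P_1$, and “My object is red” ends up pointing at the circle: if Alice had the triangle, she would have said so.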
As a Fixed Point
The argument above is very meta, and hard to follow. We can simplify it by using a fixed point argument instead.
Instead of the whole sequence of updates, we’ll just start from $P_0$ (i.e. the literal content of the message), and update in a single step on the fact that Alice is optimizing the message: Alice chooses the message “M” to maximize the final probability $P[X \mid \text{“M”}]$.
$$P[X \mid \text{“M”}] = \frac{1}{Z} \, P[(\text{“M” maximizes } P[X \mid \text{“M”}]) \mid X] \, P_0[X \mid \text{“M”}]$$
This is a fixed-point formula for $P[X \mid \text{“M”}]$. Formally, the “communication prior” itself is $(\text{“M” maximizes } P[X \mid \text{“M”}])$.
This is intuitively simple, but unfortunately $P[X \mid \text{“M”}]$ is extremely underdetermined by the fixed-point formula; there are many possible $P[X \mid \text{“M”}]$ we could choose, and $\lim_{k \to \infty} P_k[X \mid \text{“M”}]$ is just one of them. Intuitively: we could map messages to objects any way we want, as long as we respect the literal content of the message. As long as Alice and Bob both know the mapping, we choose $P[X \mid \text{“M”}]$ according to the mapping, and everything works out.
The fixed point formula is a criterion which any winning strategy must satisfy, but there are still many winning strategies.
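To make the underdetermination concrete, here is a small check on a deliberately degenerate game. Everything in it is a hypothetical illustration rather than the Alice/Bob example from earlier: two objects, two messages which are both literally true of both objects, a uniform prior, and evenly split ties. The sketch evaluates the right-hand side of the fixed-point formula at several candidate $P$ and tests whether each candidate comes back unchanged.

```python
from fractions import Fraction

# Hypothetical game: both messages are literally true of both objects,
# so the literal posterior P_0 is uniform for each message.
OBJECTS = ["red triangle", "red circle"]
MESSAGES = ["My object is red", "My object is not blue"]
P0 = {m: {x: Fraction(1, 2) for x in OBJECTS} for m in MESSAGES}


def alice_picks(p, m, x):
    """P[("M" maximizes P[X | "M"]) | X]; exact ties are split evenly (an assumption)."""
    best = max(p[other][x] for other in MESSAGES)
    winners = [other for other in MESSAGES if p[other][x] == best]
    return Fraction(1, len(winners)) if m in winners else Fraction(0)


def fixed_point_rhs(p):
    """Right-hand side of the fixed-point formula, evaluated at candidate p."""
    out = {}
    for m in MESSAGES:
        weights = {x: alice_picks(p, m, x) * P0[m][x] for x in OBJECTS}
        z = sum(weights.values())
        out[m] = {x: w / z for x, w in weights.items()} if z else dict(P0[m])
    return out


def is_fixed_point(p):
    return fixed_point_rhs(p) == p


def convention(mapping):
    """Deterministic decoding: each message points at exactly one object."""
    return {m: {x: Fraction(int(mapping[m] == x)) for x in OBJECTS} for m in MESSAGES}


convention_a = convention({"My object is red": "red triangle",
                           "My object is not blue": "red circle"})
convention_b = convention({"My object is red": "red circle",
                           "My object is not blue": "red triangle"})
# Both messages decoded as the same object: Alice cannot signal the circle at all.
degenerate = convention({"My object is red": "red triangle",
                         "My object is not blue": "red triangle"})

for name, p in [("a", convention_a), ("b", convention_b),
                ("uniform", P0), ("degenerate", degenerate)]:
    print(name, is_fixed_point(p))
```

The two opposite conventions both satisfy the formula, as does the uninformative uniform table that the iteration from $P_0$ would produce in this game; only the decoding that wastes a message fails. Which fixed point gets used is a matter of what Alice and Bob both know about each other, not of the formula.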
Our particular choice of $P[X \mid \text{“M”}] = \lim_{k \to \infty} P_k[X \mid \text{“M”}]$ comes from iteratively expanding the fixed-point formula, with initial point $P_0$. If either Alice or Bob decides to use this model, and the other knows that they’re using it, then it’s locked in.
More generally: each player’s optimal choices depend heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of…. Our $P_k[X \mid \text{“M”}]$ sequence shows what that tower looks like for one particular model of Alice/Bob.
Beyond Idealized Agents
The $(\text{“M” maximizes } P[X \mid \text{“M”}])$ communication prior is where Alice and Bob’s models of each other enter. In this case, we’re effectively assuming that Alice is a perfect agent - i.e. she picks her message to perfectly optimize Bob’s posterior. This is an idealized communication prior for idealized agents.
For alignment, we instead want a model of how humans communicate - as people who’ve played CodeNames can confirm, humans do not reliably think through many levels of implications of their word-choices! We really want to update on something like (<rough-model-of-human> thinks “M” results in high $P[X \mid \text{“M”}]$). The better the model of how the human chose “M” based on what they want, the better the AI will be able to guess what we want (i.e. X) from our “messages”.
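One way to sketch a non-idealized sender is to swap the hard “maximizes” for a soft preference. To be clear, this is a substituted technique and not something specified above: the human is assumed to pick “M” with probability proportional to $P_0[X \mid \text{“M”}]^\alpha$, where the hypothetical parameter `alpha` (assumed positive) controls how carefully they optimize. The game is again a made-up stand-in with a uniform prior.

```python
# Hypothetical stand-in game (NOT the post's original lists).
OBJECTS = ["red triangle", "red circle", "blue square"]
LITERAL = {  # message -> set of objects it is literally true of
    "My object is red": {"red triangle", "red circle"},
    "My object is a triangle": {"red triangle"},
    "My object is blue": {"blue square"},
}
MESSAGES = list(LITERAL)


def literal_listener():
    """P_0[X | "M"] for every message, assuming a uniform prior over objects."""
    return {
        m: {x: (1.0 / len(LITERAL[m]) if x in LITERAL[m] else 0.0) for x in OBJECTS}
        for m in MESSAGES
    }


def noisy_human(listener, alpha):
    """P[human says "M" | X], proportional to listener[M][X] ** alpha.

    Assumed rough human model: alpha > 0, and large alpha approaches the
    idealized sender who always picks the best message.
    """
    speaker = {}
    for x in OBJECTS:
        weights = {m: listener[m][x] ** alpha for m in MESSAGES}
        z = sum(weights.values())
        speaker[x] = {m: w / z for m, w in weights.items()}
    return speaker


def pragmatic_listener(alpha):
    """One update from P_0 on the noisy human's choice of message."""
    p_0 = literal_listener()
    speaker = noisy_human(p_0, alpha)
    p_1 = {}
    for m in MESSAGES:
        weights = {x: speaker[x][m] * p_0[m][x] for x in OBJECTS}
        z = sum(weights.values())
        p_1[m] = {x: w / z for x, w in weights.items()}
    return p_1


for alpha in (1.0, 4.0, 50.0):
    p_1 = pragmatic_listener(alpha)
    print(alpha, p_1["My object is red"]["red circle"])
```

With a sloppy sender (`alpha = 1.0`), hearing “My object is red” shifts Bob toward the circle without making him certain; as `alpha` grows, the result approaches the idealized update.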
To the extent that the AI is modelling the human modelling the AI, we still get the meta-tower and possibly a fixed point formula (depending on how good the model of the AI in the human’s head is). The AI can treat both its own code and the human-model as “messages”, and so potentially correct sufficiently-small errors in them.
What This Does And Does Not Do
In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values. We’d still like an AI architecture with goals stable under successor-construction. For maximum safety, we’d still ideally like some good-enough scaled-down tests and/or proofs that some subcomponents actually work the way we intuitively expect. Etc.
What this does buy us is a basin of convergence. On all of the key pieces, we just need to be “close enough” for the whole thing to work. Potentially being able to recover even from small bugs in the source code is a pretty nice selling point. Of course, there are probably basins of convergence for many approaches, but this one offers at least the possibility of being able to explicitly model the basin. How sensitive is the end result to errors along different dimensions of the human-model? That’s the sort of question which could be addressed (either theoretically or empirically) in toy models along these lines, and potentially lead to generalizable insights about which pieces matter more or less. In other words: we could potentially say things about how big the basin of convergence is, and along which directions it’s wide/narrow.
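A toy version of that sensitivity question might look like the following. Everything here is an assumption made for illustration: the stand-in game, the uniform prior, the soft-preference human model with a single parameter `alpha`, and the choice of expected log-probability on the true object as the measure of how well the listener does. The sketch fixes the human's actual `alpha` and varies the `alpha` the listener assumes.

```python
import math

# Hypothetical stand-in game (NOT the post's original lists).
OBJECTS = ["red triangle", "red circle", "blue square"]
LITERAL = {  # message -> set of objects it is literally true of
    "My object is red": {"red triangle", "red circle"},
    "My object is a triangle": {"red triangle"},
    "My object is blue": {"blue square"},
}
MESSAGES = list(LITERAL)


def literal_listener():
    """P_0[X | "M"] for every message, assuming a uniform prior over objects."""
    return {
        m: {x: (1.0 / len(LITERAL[m]) if x in LITERAL[m] else 0.0) for x in OBJECTS}
        for m in MESSAGES
    }


def noisy_human(listener, alpha):
    """Assumed human model: P["M" | X] proportional to listener[M][X] ** alpha, alpha > 0."""
    speaker = {}
    for x in OBJECTS:
        weights = {m: listener[m][x] ** alpha for m in MESSAGES}
        z = sum(weights.values())
        speaker[x] = {m: w / z for m, w in weights.items()}
    return speaker


def pragmatic_listener(alpha_assumed):
    """Listener's posterior after one update on the human model it assumes."""
    p_0 = literal_listener()
    speaker = noisy_human(p_0, alpha_assumed)
    p_1 = {}
    for m in MESSAGES:
        weights = {x: speaker[x][m] * p_0[m][x] for x in OBJECTS}
        z = sum(weights.values())
        p_1[m] = {x: w / z for x, w in weights.items()}
    return p_1


def expected_log_score(alpha_true, alpha_assumed):
    """Average log-probability the listener puts on the true object
    when the human actually behaves according to alpha_true."""
    true_human = noisy_human(literal_listener(), alpha_true)
    listener = pragmatic_listener(alpha_assumed)
    score = 0.0
    for x in OBJECTS:
        for m in MESSAGES:
            joint = true_human[x][m] / len(OBJECTS)  # uniform prior over objects
            if joint > 0.0:
                p = listener[m][x]
                score += joint * (math.log(p) if p > 0.0 else float("-inf"))
    return score


scores = {a: expected_log_score(1.0, a) for a in (0.25, 1.0, 4.0, 16.0)}
for assumed, score in scores.items():
    print(assumed, round(score, 4))
```

In this toy setup the score is best when the assumed `alpha` matches the true one, and how quickly it falls off on either side gives a crude picture of how wide the basin is along that one dimension of the human-model.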
That said, I still think the biggest blocker - both for this approach and many others - is figuring out pointers to human values, and how pointers to real-world abstract objects/concepts work more generally. Right now, we don’t even understand the type-signature of a “pointer” in this sense, so it’s rather difficult to talk about a basin-of-convergence for human-value-pointers.