Planned summary for the Alignment Newsletter:
We could try creating AI systems that take the "artificial intentional stance" towards humans: that is, they model humans as agents that are trying to achieve some goals, and then we get the AI system to optimize for those inferred goals. We could do this by training an agent that jointly models the world and understands natural language, in order to ground the language into actual states of the world. The hope is that with this scheme, as the agent gets more capable, its understanding of what we want improves as well, so that it is robust to scaling up. However, the scheme has no protection against Goodharting, and doesn't obviously care about metaethics.
Planned opinion:
I agree with the general spirit of "get the AI system to understand common sense; then give it instructions that it interprets correctly". I usually expect future ML research to figure out the common sense part, so I don't look for particular implementations (in this case, simultaneous training on vision and natural language), but just assume we'll have that capability somehow. The hard part is then how to leverage that capability to provide _correctly interpreted_ instructions. It may be as simple as providing instructions in natural language, as this post suggests. I'm much less worried about instrumental subgoals in such a scenario, since part of "understanding what we mean" includes "and don't pursue this instruction literally to extremes". But we still need to figure out how to translate natural language instructions into actions.
I don't have a complete picture of the scheme. Is it: "From a trajectory of actions and observations, an English text sample is presented with each observation, and the agent has to predict this text alongside the observations, and then it acts according to some reward function like (and this is simplified) 1 if it sees the text 'you did what we wanted' and 0 otherwise"? If the scheme you're proposing is different from that, my guess is that you're imagining a recurrent neural network architecture where most of the weights are trained only to predict the observations, and other weights are trained to predict the text samples. Am I in the right ballpark here?
Sorta the right ballpark. Lack of specificity is definitely my fault - I have more sympathy now for those academics who have a dozen publications that are restatements of the same thing.
I'm a bit more specific in my reply to steve2152 above. I'm thinking about this scheme as a couple of encoder-decoders stitched together at the point of maximal compression, which can do several different encoding/decoding tasks and therefore can be (and for practical purposes should be) trained on several different kinds of data.
For example, it can encode sensory information into an abstract representation, and then decode it back, so you can train that task. It can encode descriptive sentences into the same representation, and then decode them back, so you can train that task. This should reduce the amount of actual annotated text-sensorium pairs you need.
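To make that concrete, here's a minimal sketch of the kind of thing I'm picturing - everything specific (layer sizes, treating text as a bag of tokens, the particular losses and how they're weighted) is a placeholder for illustration, not part of the proposal:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 64       # the shared "point of maximal compression"
SENSE_DIM = 784   # stand-in for a flattened sensory observation
VOCAB = 1000      # stand-in vocabulary, with text as a multi-hot bag of tokens

class SharedLatentModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Sensory encoder/decoder pair.
        self.sense_enc = nn.Sequential(nn.Linear(SENSE_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT))
        self.sense_dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, SENSE_DIM))
        # Text encoder/decoder pair, stitched onto the same latent space.
        self.text_enc = nn.Sequential(nn.Linear(VOCAB, 256), nn.ReLU(), nn.Linear(256, LATENT))
        self.text_dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, VOCAB))

    def losses(self, sense, text, paired):
        z_sense = self.sense_enc(sense)
        z_text = self.text_enc(text)
        # Unsupervised tasks: reconstruct each modality from the shared latent.
        loss = F.mse_loss(self.sense_dec(z_sense), sense)
        loss = loss + F.binary_cross_entropy_with_logits(self.text_dec(z_text), text)
        # Supervised task: for the (rarer) annotated text-sensorium pairs,
        # pull the two latents together so language grounds out in world states.
        if paired:
            loss = loss + F.mse_loss(z_sense, z_text)
        return loss
```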
As for what to tell it to pattern-match for as a good state, I was thinking with a little subtlety, but not much. "You did what we wanted" is too bare bones; it will try to change what we want. But I think we might get it to do metaethics for us by talking about "human values" in the abstract, or maybe "human values as of 2020." And I don't think it can do much harm to further specify things like enjoyment, interesting lives, friendship, love, learning, sensory experience, etc etc.
This "wish" picks out a vector in the abstract representation space for the AI to treat as the axis of goodness. And the entire dream is that this abstract space encodes enough of common sense that small perturbations of the vector won't screw up the future. Which now that I say it like that, sounds like the sort of thing that should imply some statistical properties we could test for.
In the scheme I described, the behavior can be described as "the agent tries to get the text 'you did what we wanted' to be sent to it." A great way to do this would be to intervene in the provision of text. So the scheme I described doesn't make any progress in avoiding the classic wireheading scenario. The second possibility I described, where there are some games played regarding how different parameters are trained (the RNN is only trained to predict observations, and then another neural network originates from a narrow hidden layer in the RNN and produces text predictions as output) has the exact same wireheading pathology too.
Changing the nature of the goal as a function of what text it sees also doesn't stop "take over world, and in particular, the provision of text" from being an optimal solution.
I still am uncertain if I'm missing some key detail in your proposal, but right now my impression is that it falls prey to the same sort of wireheading incentive that a standard reinforcement learner does.
Ah, I see what you mean. Yes, this is a serious problem, but (I think) this scheme does have forces that act against it - which makes more sense if you imagine what supervised vs unsupervised learning does to our encoder/decoder. (As opposed to lumping everything together into a semi-supervised training process.)
Supervised learning is the root of the problem, because the most accurate way to predict the supervised text from the world state is to realize that it's the output of a specific physical process (the keyboard). If we only had supervised learning, we'd have to make the training optimum different from the most accurate prediction, by adding a regularization term and then crossing our fingers that we'd correctly set its arbitrary parameters.
But the other thing going on in the scheme is that the AI is trying to compress text and sensory experience to the same representation using unsupervised learning. This is going to help to the extent that language shares important patterns with the world.
For example, if the AI hacks its text channel so that it's just a buffer full of "Human values are highly satisfied," this might (in the limit of lots of data and compute) make supervised learning happy. But unsupervised learning just cares about the patterns it discovered that language and the world share.
(Though now that I think about it, in the limit of infinite compute, unsupervised learning also discovers the relationship between the text and the physical channel. But it still also cares about the usual correspondence between description and reality, and it seems like it should accurately distinguish the level of the text from the level of reality, so I need to think about whether this matters.)
To the unsupervised learning, hacking the text channel looks (to the extent that you can do translation by compressing to a shared representation) like the sort of thing that might be described by sentences like "The AI is just sitting there" or "A swarm of nanomachines has been released to protect the text channel," not "Human values are highly satisfied."
So why consider supervised text/history pairs at all? Well, I guess just because supervised learning is way more efficient at picking out something that's at least sort of like the correspondence that we mean. Not just as a practical benefit - there might be multiple optima that unsupervised learning could end up in, and I think we want something close-ish to the supervised case.
I promised in a previous post that I would give a post-mortem for a scheme for learning the intentional stance from natural language. This is that post. But first, I should explain why such an idea might seem good in the first place.
Some people think of AI as a genie. The goal of AI research, in this picture, is to "tell the AI what to do," sometimes explicitly in natural language. And then since the AI is smart, it will understand what we mean and do that, because to do something else would be stupid.
This is, in a sense, very naive. Making an AI that does what we want is not at all like instructing a human - see the relevant Eliezer post - the methods, dangers, and goals are all different. But... if the AI understood what we meant, maybe we could just tell it what to do.
Of course, "understood what we meant" captures more or less the whole problem, because meaning isn't like the charge of the electron, it's nowhere in the words themselves. When you understand moral language, you're implicitly using your morals. But what if we trained an AI so that it functionally understood moral language - would that be implicitly using your morals too, and isn't that exactly what we want?
1 - The Artificial Intentional Stance
I like to think of myself as having preferences, but at the same time I am made of atoms, and my preferences are not-like-the-charge-of-the-electron, they're nowhere in the atoms. Instead, my preferences are an abstraction that I (and others) use when thinking about me.
So part of this artificial intentional stance stuff can be summed up as: get the AI to think about humans like humans think about humans. (Another part is that abstractions are contagious. If I want to go to the gym, to handle this correctly you need abstractions not just for me but also for the gym.)
We often put too much magic into the word "understand." If the AI can hold a good conversation and extract real-world information from human speech, it's reasonable to say it understands what we're saying. And then once it understands us, you might think "communicating our goals to it is a lot like communicating with a human."
But it's easy to hold a decent conversation without taking the intentional stance towards humans, and easier still to extract real-world information from human speech without the intentional stance. This leads to problems that become clear if you try to take an AI that does a good job at modeling language, and follow step by step how to get it to choose actions that are good for humans.
The dream is to learn the intentional stance by using the information implicit in our use of language. The intentional stance requires picking out good levels of abstraction to model humans on, and using language implies that the good levels of abstraction are the ones humans implicitly use in language. Is this what we want? I don't know, it might be?
It certainly isn't the only option - we might imagine other schemes involving trying to amplify emulations of humans, semi-supervised learning from examples of good and bad behavior, or multi-stage chains of making increasingly trustworthy AIs. But the question is whether it's an option.
3 - Process
It's not hard to hook up a video camera to an image captioner to a deep reinforcement learner and say you can input goals with natural language, because when you set the goal to "cat," your camera will look for cats. It's a lot harder to get that camera to look for what's best in life.
This is the bind I got myself into, writing this post. Value learning schemes that are simple are wrong (issues with the cat camera above are left as an exercise to the reader), and value learning schemes that seem promising have been selected for incomprehensibility and poor epistemic luck. So I tried to split the difference, while favoring interestingness over simplicity.
Here are some of the rules of thumb I used when thinking of ways to apply natural language processing to value learning:
First, I wanted to avoid the scheme having glaringly unspecified parts. It's very easy to be lazy and not specify something well enough that it actually chooses actions, or to feel like I've made progress but not be able to apply it. Usually either of these meant I was sweeping problems under the rug - the right level of specificity involves sweeping out some of those cobwebs.
Second, I needed to encourage myself to be specific about the intended purpose of natural language processing in each particular scheme. Yes, the dream is that it "includes common sense" or something like that, but that's not specific enough mental technology to tell whether you're solving the intended problem without unnecessary side effects, or explain why different methods get different results.
It was profitable to think of natural language processing as being targeted at the problem of alien concepts: when the AI can match your training examples but still fail to generalize how you want because it's representing your examples in an alien way. For example, an image classifier might learn to distinguish dogs by the texture of their fur, but we're not going to be happy with how that generalizes to fur-less dogs or dog-less fur. Now replace "dogs" with "human values" and "fur" with "superficial features that work well on the training set."
An even more specific purpose of natural language would be "greedy reification" - actively trying to form concepts that correspond to linguistic tokens. So if we have a word "dog," we want to incentivize the AI to form a concept that picks out dogs in the world-model, and then the hope is that this also works on "human values."
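As a toy illustration of what a greedy-reification objective might look like - this is just one guess at a loss term, with made-up argument names, not something worked out in this post - we could reward the model whenever the latent vector picked out by a word linearly predicts whether the corresponding thing is present in a scene:

```python
import torch
import torch.nn.functional as F

def reification_loss(word_latent, scene_latents, present):
    """Encourage the latent picked out by a word (e.g. "dog") to predict its presence.

    word_latent: (LATENT,) vector associated with the token.
    scene_latents: (N, LATENT) world-model states for N scenes.
    present: (N,) tensor, 1.0 if the thing is in the scene, else 0.0.
    """
    logits = scene_latents @ word_latent  # how strongly each scene expresses the concept
    return F.binary_cross_entropy_with_logits(logits, present)
```

The hope, as above, is that a term like this pushes "dog" toward a usable dog-concept and then transfers to much fuzzier tokens like "human values."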
4 - Mort
So here's a value learning scheme: try to squish the world and natural language into the same latent space, just with different input/output functions.
Training this simultaneous model might just be separately trying to do encoding-decoding or prediction tasks with sensory data and text, but more plausibly it should involve translation tasks where we can associate words with sensory environments. The model required is somewhat subtle, because we don't want words associated with raw sense data; we want words associated with the state of the AI's model of the world. This means that, to the world-model, the shared latent space should look like the persistent state used for sequence prediction or encoding-decoding of sequences of sense data, with the transition dynamics partially included in the shared information. It also means the language model should look like sequence prediction or encoding, with some local state consisting of those high-level features.
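Here's a rough sketch of one way the pieces could fit together - a recurrent world-model whose persistent hidden state is the shared latent, plus a language encoder/decoder that reads and writes that same space. The architecture choices (GRUs, bag-of-tokens text output, the specific dimensions) are illustrative assumptions, not requirements of the scheme:

```python
import torch
import torch.nn as nn

LATENT, OBS_DIM, ACT_DIM, VOCAB = 64, 784, 8, 1000  # illustrative sizes

class WorldAndLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Sequence model over sense data; its persistent hidden state is the shared latent.
        self.world_rnn = nn.GRUCell(OBS_DIM + ACT_DIM, LATENT)
        self.obs_head = nn.Linear(LATENT, OBS_DIM)    # predict the next observation
        # Language side reads and writes the same latent space.
        self.text_rnn = nn.GRU(VOCAB, LATENT, batch_first=True)
        self.text_head = nn.Linear(LATENT, VOCAB)     # describe the current state (bag of tokens)

    def step(self, h, obs, act):
        """Advance the world-model state given an observation and an action."""
        h = self.world_rnn(torch.cat([obs, act], dim=-1), h)
        return h, self.obs_head(h)

    def encode_text(self, tokens):
        """Map a (one-hot) token sequence into the same latent space as the world-model state."""
        _, h = self.text_rnn(tokens)
        return h.squeeze(0)

    def describe(self, h):
        """Decode a rough description of the current latent state."""
        return self.text_head(h)
```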
If I haven't said anything impossible so far, we could use sufficiently advanced technology to train this simultaneous model so that it's good at understanding both the world and language, and competent at turning one into the other when it comes to simple training examples. Can you now solve value learning by giving it a bunch of English descriptions of what we want ("human values satisfied," "do the right thing," "a fulfilling and cosmopolitan future for the galaxy," etc.), and coding it to choose actions that make the state of the world like that?
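To spell out the "choose actions" part, the simplest (and surely too simple) version is a greedy lookahead: encode the English wish into the shared latent, then pick whichever candidate action the world-model predicts will bring its state closest to that target. The one-step horizon, cosine similarity, and discrete candidate actions are all simplifying assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def choose_action(model, h, obs, candidate_actions, wish_tokens):
    """Greedy one-step lookahead toward the latent picked out by the English wish."""
    z_goal = model.encode_text(wish_tokens)          # the "axis of goodness" vector
    best_act, best_score = None, float("-inf")
    for act in candidate_actions:
        h_next, _ = model.step(h, obs, act)          # predicted next world-model state
        score = F.cosine_similarity(h_next, z_goal, dim=-1).mean().item()
        if score > best_score:
            best_act, best_score = act, score
    return best_act
```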
Looking on the bright side first, what advantages might this have?
What are some big issues with this? Take a second if you like.
No, really, I'd be interested in what people come up with on their own. I don't understand this family of schemes as well as I'd like.
Ready? Okay:
5 - Wrap-up
Going back to the artificial intentional stance and the problem of alien concepts, it seems like this helps in some ways but not in others.
It seems to help with the intentional stance at the object level - the everyday work of translating "Charlie wants to go to the gym" into reasonable actions - but not at the meta-level. It's doubtful that it's modeling humans the way they want to be modeled. Maybe this indicates that it would be profitable to break down this concept further. It also might spark your imagination about how to take the same information about humans and end up with something that models humans in a variable way.
A different thing is going on in the department of alien concepts, where we've run into the stress that Goodhart's law places on concepts. Instead of thinking about human-modeling, this makes me want to focus on the decision procedure and training. Can we find a decision procedure that leverages prediction and translation tasks in a way that puts less stress on the concepts? Can we use a training procedure that reduces the context shift when trying to use the model to choose actions?
Overall I think this avenue is pretty interesting to think about. Maybe this also serves as a concrete example of what I mean by trying to create the artificial intentional stance, which can be generalized from language to other options for learning about humans.