I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry. Rather, they are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. humans learn our values as some combination of bottom-up learning (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours do. That means using the AGI's 'linguistic cortex', which already has encoded verbal knowledge about human morality and values, both to evaluate potential courses of action and as a reward signal which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us 'truly want' specific outcomes (if humans even do) as opposed to reward or their correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, in which case our AGIs would look highly anthropomorphic.
Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model -- is necessary in situations where a) you can't behaviourally clone existing behaviour and b) you can't self-play too much with a model-free RL algorithm and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
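As a rough illustration of what "planning over a world model against a learnt reward function" can mean, here is a minimal sketch. Everything in it is a hypothetical stand-in: `world_model` plays the role of learnt dynamics, `learned_reward` plays the role of the implicit learnt reward function, and the exhaustive search over short action sequences is just the simplest possible planner, not a claim about how any real system does it.

```python
from itertools import product


def world_model(state, action):
    """Hypothetical learnt dynamics: the state is a position on a line."""
    return state + action


def learned_reward(state):
    """Hypothetical learnt reward estimate: prefers states close to 3."""
    return -abs(state - 3)


def plan(start, actions=(-1, 0, 1), horizon=3):
    """Score every action sequence by rolling it out in the world model."""
    best_seq, best_value = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        state, value = start, 0.0
        for action in seq:
            state = world_model(state, action)
            value += learned_reward(state)  # no ground-truth reward is consulted
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq, best_value


best_seq, best_value = plan(start=0)
print(best_seq, best_value)
```

Note that the planner never sees a ground-truth reward: it only optimises the learnt estimate, which is exactly where the overfitting worry discussed next comes in.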
I am also not sure that an agent which explicitly optimises this is hard to align, or that the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have a ground-truth exact reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation. In particular, if the AI maintains a well-calibrated sense of reward function uncertainty, then in theory we can derive quantitative bounds on its divergence from the true reward function.
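One simple way to picture "regularisation via reward uncertainty" is to penalise plans on which an ensemble of reward estimates disagrees. The sketch below is only an assumption-laden toy: the plan names, the numbers, and the mean-minus-spread rule are all made up for illustration, and it does not derive any formal bound on divergence from the true reward.

```python
import statistics


def pessimistic_score(reward_estimates, penalty=1.0):
    """Mean predicted reward minus a penalty on ensemble disagreement."""
    mean = statistics.fmean(reward_estimates)
    spread = statistics.pstdev(reward_estimates)
    return mean - penalty * spread


def choose_plan(plans, penalty=1.0):
    """Return the name of the plan with the highest pessimistic score.

    `plans` maps a plan name to the rewards predicted for it by each
    member of a (hypothetical) ensemble of learnt reward models.
    """
    return max(plans, key=lambda name: pessimistic_score(plans[name], penalty))


# Hypothetical numbers: one ensemble member rates "exploit" very highly,
# but the others disagree, which is the signature of a possible Goodhart case.
plans = {
    "familiar": [1.0, 1.1, 0.9],
    "exploit": [3.0, 0.2, 0.1],
}

naive_choice = choose_plan(plans, penalty=0.0)     # ignores disagreement
cautious_choice = choose_plan(plans, penalty=1.0)  # penalises disagreement
print(naive_choice, cautious_choice)
```

With no penalty the planner picks the plan with the higher mean estimate, while with the penalty it falls back to the plan the ensemble agrees on. How well this works in practice depends entirely on how well calibrated the uncertainty is.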
Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model
FWIW I don't consider myself to be arguing against planning over a world model.
the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry. Rather, they are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.
Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e. humans learn our values as some combination of bottom-up learning (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
Can you give me some examples here? I don't know that I follow what you're pointing at.
(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups.
I expect that it has also happened to an extent with animals as well. I wonder if anyone has ever looked into this.
This was an appendix of Inner and outer alignment decompose one hard problem into two extremely hard problems. However, I think the material is self-contained and worth sharing separately, especially since AGI Ruin: A List of Lethalities has become so influential.
(I agree with most of the points made in AGI Ruin, but I'm going to focus on disagreements in this essay.) (Stricken on 1/9/24) Here are some quotes with which I disagree, in light of points I made in Inner and outer alignment decompose one hard problem into two extremely hard problems (consult its TL;DR and detailed summary for a refresher, if need be).
List of Lethalities
(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups.
Strictly separately, it seems to me that people draw rather strong inferences from a rather loose analogy with evolution. I think that (evolution) → (human values) is far less informative for alignment than (human reward circuitry) → (human values). I don’t agree with a strong focus on the former, given the latter is available as a source of information.
We want to draw inferences about the mapping from (AI reward circuitry) → (AI values), which is an iterative training process using reinforcement learning and self-supervised learning. Therefore, we should consider existing evidence about the (human reward circuitry) → (human values) setup, which (AFAICT) also takes place using an iterative, local update process using reinforcement learning and self-supervised learning.
Brain architecture and training is not AI architecture and training, so the evidence is going to be weakened. But for nearly every way in which (human reward circuitry) → (human values) is disanalogous to (AI reward circuitry) → (AI values), (evolution) → (human values) is even more disanalogous! For more on this, see Quintin's post.
My summary: Sensory reward signals are not ground truth on the agent’s alignment to our goals. Even if you solve inner alignment, you’re still dead.
My response: We don’t want to end up with an AI which primarily values its own reward, because then it wouldn’t value humans. Beyond that, this item is not a “central” lethality (and a bunch of these central-to-EY lethalities are in fact about outer/inner). We don’t need a function of sensory input which is safe to maximize, that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily—nor do we want it to be—a ground-truth signal about alignment.
My summary: The theory in the current paradigm only tells you how to, at best, align an agent to direct functions of sensory observables. Even if we achieve this kind of alignment, we die. It’s just a fact that sensory observables can’t discriminate between good and bad latent world-trajectories.
My response: I understand “the on-paper design properties” and “insofar as the current paradigm works at all” to represent Eliezer’s understanding of the properties and the paradigm (he did describe these points as “central difficulties of outer and inner alignment”[1]). But on my view, this lethality does not seem very relevant or central to alignment. Use reward to supply good cognitive updates to the agent. I don't find myself thinking about reward as that which gets maximized, or which should get maximized.
Also, if you ignore the oft-repeated wrong/under-hedged claim that “RL agents maximize reward” or whatever, the on-paper design properties suggest that reward aligns agents to objectives in reality according to the computations which reward reinforces. I think that machine learning does not, in general, align agents to sense data and reward functions. I think that focusing on the sensory-alignment question can be misleading as to the nature of the reward-chiseling challenge which we confront.
It's true that we don't know how to reliably make superintelligent agents learn human-compatible values. However, by the same token (e.g. by the arguments in reward is not the optimization target), I can just as well ask "how do I get agents to care about sensory observables and reward data?". It's not like we know how to ensure deep learning-trained agents care about their sensory observables and reward data.
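To make the "reward supplies cognitive updates" reading concrete, here is a minimal sketch of a single REINFORCE-style step for a softmax policy over two actions. The function names, learning rate, and numbers are hypothetical and chosen only for illustration. The thing to notice is where reward shows up: it scales the update to the logits of the action that was actually taken, and it is not stored anywhere in the policy as a quantity to be pursued.

```python
import math


def softmax(logits):
    """Convert logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.5):
    """One REINFORCE step: logits += lr * reward * grad log pi(action).

    For a softmax policy, the gradient of log pi(action) with respect to
    logit i is (1 if i == action else 0) - pi(i).
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0]  # the policy starts indifferent between the two actions
updated = reinforce_update(logits, action=1, reward=1.0)
unchanged = reinforce_update(logits, action=1, reward=0.0)
print(updated, softmax(updated))
```

A positive reward makes the computation that produced action 1 more likely to fire again, and a zero reward leaves the policy untouched. Whether the resulting policy ends up "caring about" reward itself is a separate empirical question that this update rule does not settle.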
My summary: Perceived alignment on the training distribution is all we know how to run gradients over, but historically, alignment on training does not generalize to alignment on deployment. Furthermore, when the agent becomes highly capable, it will gain a flood of abilities and opportunities to competently optimize whatever vaguely good-seeming internal proxy objectives we entrained into its cognition. When this happens, the AI's capabilities will keep growing, but its alignment will not.
My response: This perceived disagreement might be important, or maybe I just use words differently than Eliezer.
When I’m not thinking in terms of inner/outer, but “what cognition got chiseled into the AI?”, there isn’t any separate “tendency to fail to generalize alignment” in a deceptive misalignment scenario. The AI just didn’t have the cognition you thought or wanted.
For simplicity, suppose you want the future to contain lots of bananas. Suppose you think your AI cares about bananas but actually it primarily cares about fruit in general and only pretended to primarily care about bananas, for instrumental reasons. Then it kills everyone and makes a ton of fruit (only some of which are bananas). In that scenario, we should have chiseled different cognition into the AI so that it would have valued bananas more strongly. (Similarly for "the AI cared about granite spheres and paperclips and...")
While this scenario involves misgeneralization, there’s no separate tendency of “alignment shalt not generalize.”
But suppose you do get the AI to primarily care about bananas early in training, and it retains that banana value shard/decision-influencing-factor into mid-training. At this point, I think the banana-shard will convergently be motivated to steer the AI’s future training so that the AI keeps making bananas. So, if you get some of the early-/mid-training values to care about making bananas, then those early-/mid-values will, by instrumental convergence, reliably steer training to keep generalizing appropriately. If they did not, that would lead to fewer bananas, and the banana-shard would bid for a different path of capability gain!
(This is not an airtight safety argument, but I think it's a reasonably strong a priori case.)
The main difficulty here still seems to be my already-central expected difficulty of “loss signals might chisel undesired values into the AI.”
Eliezer is mockingly imitating a naive AI alignment researcher. My current read, however, is that the bolded part represents his real view. Given that: A loss function is not a “wish” or an expression of your desires. A loss function is a source of gradient updates; it is a chisel with which to shape the agent’s cognition.
To me, this statement seems weird and sideways of central alignment problems. I perceive Eliezer to be arguing "If only the loss function represented what we wanted, that'd be better." If he meant to connote "loss functions simply won't represent what you want, get over it, that's not how alignment works", we're more likely on the same page.
My response:
First, I want to say: type error: loss function not of type goal.[2] I imagine Eliezer understands this, at least on the more obvious level of the statement. But I'm going to explain my worldview here so as to better triangulate my meaning.
I think there's potential for deep confusion here. Loss functions provide gradients to the way the AI thinks (i.e. computes forward passes). Trying to cast human values[3] into a loss function is a highly unnatural type conversion to attempt. Attempting to force the conversion anyways may well damage your view of the alignment problem.
From Four usages of "loss" in AI:
Second, we want to train a network which ends up doing what we want. There are several strategies to achieve this.
It might shake out that, as an empirical fact, the best way to spend an additional increment of alignment research is to make the loss function "represent what you want" in some way. For example, you might more accurately spot flaws in AI-generated alignment proposals, and train the AI on that more accurate signal.
But "make the objective better 'represent' our goals" would be an empirical contingency, not pinned down by the mechanistic function of a loss function. This contingency may be sensitive to the means by which feedback translates into gradient updates. For example, changing the loss function will probably differently affect the gradients provided by:
Because loss is not the optimization target, there's some level of "goal representation" where I should stop thinking about how "good" the loss function is, and start thinking about e.g. the abstractions learned by self-supervised pre-training. E.g. if I populate the corpus with more instances of people helping each other, that might change the inductive biases on SGD dynamics to increase the probability of helping-concepts getting hooked in to value shard formation.
I think it's possible that after more deliberation, I'll conclude "we should just consider some intuitive notion of 'goal representation fidelity' when reasoning about P(alignment | loss function)." I just don't know where or whether this deliberation is supposed to have occurred. So we probably need more of it.
Because loss functions don't natively represent goals, and because of these empirical contingencies, I'm weirded out by statements like "the loss function doesn't capture what you really want."[4]
Other disagreements with alignment thinkers
Evan Hubinger
Sometimes, inner/outer alignment ideas can be appropriate (e.g. chess). For aligning real-world agents in partially observable environments, I think they’re not that appropriate. (See here for a more detailed discussion of what I eventually realized Evan means here, though.)
Paul Christiano
I read this and think “this all feels like a red herring.” I think this is not necessary because robust grading is not necessary for alignment. However, because reward provides cognitive updates, it’s important to think carefully about what cognitive updates will be provided by the reward given when e.g. a large language model submits an alignment proposal. Those reward events will shape the network’s decision-making and generalization properties, which is what we’re really interested in.
Why do we need new learning algorithms? The point of reward, on a mechanistic basis, is to update the agent’s cognition. Shaping reward seems fine to me, and I am uncomfortable with this apparent-to-me emphasis on reward as “embodying” the agent’s goals.
Nick Bostrom
Historical reasoning about RL seems quite bad. This is a prime example. In one fell swoop, in several pages of mistaken exposition, Superintelligence rules out the single known method for producing human-compatible values. We should forewarn new alignment researchers of these deep confusions before recommending this book.
Thanks to Drake Thomas, ChatGPT, Ulisse Mini, and Peli Grietzer for feedback on this post.
List of Lethalities was, AFAICT, intended to convey the most important dangers, in the right language. Rob Bensinger (who works at MIRI but was expressing his own views) also commented:
So if Eliezer's talking about "how do we get agents to care about non-sensory observables", this indicates to me that I disagree with him about what the central subproblems of alignment are.
From Inner and outer alignment decompose one hard problem into two extremely hard problems:
I think this holds for basically any values in a rich, partially observable domain, including paperclip optimization or picking three flowers.
ChatGPT wrote this hammer analogy, given the prompt of a post draft (but the draft didn't include any of my reward-as-chisel analogies).