Linch

I think I'm relatively optimistic that the difference between a system that "can (and will) do a very good job with human values when restricted to the text domain" and a "system that can do a very good job, unrestricted" isn't that large. This is because I'm personally fairly skeptical of arguments along the lines of "words aren't human thinking, words are mere shadows of human thinking" that people put out, at least when it comes to human values.

(It's definitely possible to come up with examples that illustrate the differences between all of human thinking and human-thinking-put-into-words; I agree that such examples exist, I disagree about their importance.)

Linch

I've read this a few times, but I still don't think I fully understand your point. I'm going to try to rephrase what I believe you're saying in my own words:

  • Our correct epistemic state in 2000 or 2010 should have been one of considerable uncertainty about the complexity and fragility of human values. Perhaps they are very complex, but perhaps people were just not approaching them correctly.
  • At the limit, the level of complexity can approach "simulate a number of human beings in constant conversation and moral deliberation with each other, embedded in the existing broader environment, and where a small mistake in the simulation renders the entire thing broken in the sense of losing almost all moral value in the universe if that's what you point at"
  • At the other extreme, you can imagine a fairly simple mathematical statement that's practically robust to almost any OOD environment or small perturbation.
  • In worlds where human values aren't very complex, alignment isn't solved, but you should perhaps expect it to be (significantly) easier. ("Optimize for this mathematical statement" is an easier thing to point at than "optimize for the outcome of this complex deliberation, no, not the actual answers out of their mouths but the indirect more abstract thing they point at")
  • Suppose in 2000 you were told that a 100-line Python program (one that doesn't abuse any of the particular complexities embedded elsewhere in Python) could provide a perfect specification of human values. Then you should rationally conclude that human values aren't actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
  • In such a world, if inner alignment is solved, you can "just" train a superintelligent AI to "optimize for the results of that Python program" and you'd get a superintelligent AI with human values.
    • Notably, alignment isn't solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program, and not some random other thing that happens to have low predictive loss when trained on that program.
  • Well, in 2023 we have that Python program, with a few relaxations:
    • The answer isn't embedded in 100 lines of Python, but in a subset of the weights of GPT-4
      • Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
    • What we have now isn't a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th-percentile human could come up with.
  • The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
  • We should then rationally update on the complexity of human values. It's probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.
  • This is a different claim from saying that superintelligent AIs will understand human values, which everybody agrees with. Human values aren't any more mysterious from the perspective of physics than any other emergent property, like fluid dynamics or the formation of cities.
  • However, if AIs needed to be superintelligent (e.g. at the level of approximating physics simulations of Earth) before they grasped human values, that'd be too late, as they could/would destroy the world before their human creators could point a training process (or other way of making AGI) towards {this thing that we mean when we say human values}.
  • But instead, the world we live in is one where we can point future AGIs towards the outputs of GPT-N when asked questions about morality as the thing to optimize for.
    • Which, again, isn't to say the alignment problem is solved, we might still all die because future AGIs could just be like "lol nope" to the outputs of GPT-N, or try to hack it to produce adversarial results, or something. But at least one subset of the problem is either solved or a non-issue, depending on your POV.
  • Given all this, MIRI appears to have been empirically wrong when they previously talked about the complexity and fragility of human values. Human values now seem noticeably less complex than many possibilities, and empirically we already have a pretty good representation of human values in silico.
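The 100-line-program step of the summary can be caricatured in code. This is purely illustrative: `human_value_score`, `choose_action`, and the outcome numbers are all hypothetical stand-ins I'm inventing for the sketch, not a proposal for what such a specification would look like.

```python
# A caricature of the thought experiment above: *if* human values fit in
# a short, ordinary program, outer alignment reduces to "maximize this
# function's output". Every name and number here is hypothetical.

def human_value_score(outcome: dict) -> float:
    """Stand-in for the hypothetical 100-line specification."""
    return outcome.get("flourishing", 0.0) - 10.0 * outcome.get("suffering", 0.0)

def choose_action(candidate_outcomes: dict) -> str:
    """An idealized optimizer that simply points at the spec. Getting a
    trained system to actually do this (rather than chase a proxy that
    happens to have low training loss) is the separate, unsolved
    inner-alignment problem."""
    return max(candidate_outcomes,
               key=lambda a: human_value_score(candidate_outcomes[a]))

outcomes = {
    "act_a": {"flourishing": 3.0, "suffering": 0.1},  # score 2.0
    "act_b": {"flourishing": 5.0, "suffering": 1.0},  # score -5.0
}
print(choose_action(outcomes))  # act_a
```

The point of the sketch is only that "optimize the output of this program" is an easy thing to point at once the program exists; everything hard about alignment is hidden in the docstring caveat.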

Is my summary reasonably correct?

Linch

Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent exemplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference from deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.
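The CoinRun failure mode can be sketched as a toy: during training the coin always sits at the rightmost cell, so "always move right" and "get the coin" are behaviorally indistinguishable, and training can select the former. This is a hypothetical gridworld sketch, not the actual CoinRun environment or agent.

```python
# Toy goal misgeneralization: the policy learned "walk right", not "get
# the coin", because the two coincided on the training distribution.

def run_policy(coin_pos: int, length: int = 10) -> int:
    """The learned policy: competently walk to the right wall.
    coin_pos is deliberately ignored -- the policy never learned
    to care about the coin itself."""
    pos = 0
    for _ in range(length):
        pos = min(pos + 1, length - 1)
    return pos

# Training distribution: coin at the right end, so the policy looks aligned.
print(run_policy(coin_pos=9))  # 9 -- ends on the coin

# Distributional shift: coin moved to cell 2. The capability (navigation)
# generalizes; the goal (the coin) does not.
print(run_policy(coin_pos=2))  # still 9 -- walks right past the coin
```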

One thing to flag is that even if, for any given model, the probability of capabilities generalizing is very low, the total probability of doom can still be high, since there might be many tries at getting models that generalize well across distributional shifts, whereas the selection pressure toward alignment robustness is comparatively weaker. You can imagine a 2x2 quadrant of capabilities vs. alignment generalizability across distributional shift:

  • Capabilities don't generalize, alignment doesn't: irrelevant.
  • Capabilities don't generalize, alignment does: irrelevant.
  • Capabilities generalize, alignment doesn't: potentially very dangerous, especially if power-seeking. The agent (or the agent and its friends) acquires more power and may attempt a takeover.
  • Capabilities generalize, alignment does: good, but not clearly great. By default I wouldn't expect it to be power-seeking (unless you're deliberately creating a sovereign), so it only has as much power as humans allow it to have. So the AI might risk being outcompeted by its more nefarious peers.

Linch

Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.

I don't know much about ML, and I'm a bit confused about this step. How worried are we/should we be about sample efficiency here? It sounds like after pre-training you're growing the diamond shard via a real-world embodied RL agent? Naively this would be pretty performance-uncompetitive compared to agents primarily trained in simulated worlds, unless your algorithm is unusually sample-efficient (why?). If you aren't performance-competitive, then I expect your agent to be outcompeted by stronger AI systems whose trainers are less careful about diamond (or ruby, or staple, or w/e) alignment.

OTOH, if your training is primarily simulated, I'd be worried about the difficulty of creating an agent that terminally values real-world (rather than simulated) diamonds.