User Comment Replies — AI Alignment Forum

Evaluating the historical value misspecification argument

I think I'm relatively optimistic that the difference between a system that "can (and will) do a very good job with human values when restricted to the text domain" vs. a "system that can do a very good job, unrestricted" isn't that large. This is because I'm personally fairly skeptical about arguments along the lines of "words aren't human thinking, words are mere shadows of human thinking" that people put out, at least when it comes to human values.

(It's definitely possible to come up with examples that illustrate the differences between all of human thinking and human-thinking-put-into-words; I agree about their existence, I disagree about their importance.)

1David Scott Krueger3mo

OTMH, I think my concern here is less:
* "The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

and more:
* "The AI's values must be much more aligned in order to be safe outside the text domain"

I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot. This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.

Evaluating the historical value misspecification argument

Linch1y*156

I think I read this a few times but I still don't think I fully understand your point. I'm going to try to rephrase what I believe you are saying in my own words:

Our correct epistemic state in 2000 or 2010 should be to have a lot of uncertainty about the complexity and fragility of human values. Perhaps it is very complex, but perhaps people are just not approaching it correctly.
At the limit, the level of complexity can approach "simulate a number of human beings in constant conversation and moral deliberation with each other, embedded in the existing broa

...

1David Scott Krueger3mo

This comment made me reflect on what fragility of values means. To me this point was always most salient when thinking about embodied agents, which may need to reliably recognize something like "people" in their environment (in order to instantiate human values like "try not to hurt people") even as the world changes radically with the introduction of various forms of transhumanism. I guess it's not clear to me how much progress we make towards that with a system that can do a very good job with human values when restricted to the text domain. Plausibly we just translate everything into text and are good to go? It makes me wonder where we're at with adversarial robustness of vision-language models, e.g.

Disentangling inner alignment failures

Linch2y20

Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent exemplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference to deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worr

...

A shot at the diamond-alignment problem

Linch2y20

Use a very large (future) multimodal self-supervised learned (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on human in-simulation datasets and then sim2real. Since we got a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average diamond-related

...

4Alex Turner2y

Good question, which I should probably have clarified in the essay. On a similar compute budget, could e.g. an actor-critic in-sim approach reach superintelligence even more quickly? Yeah, probably. The point of this story isn't that this (i.e. SSL+IL+PG RL) is the optimal alignment configuration along (competitiveness, alignability-to-diamonds), but rather I claim that if this story goes through at all, it throws a rock through how we should be thinking about alignment; if this story goes through, one of the simplest, "dumbest", most quickly dismissed ideas (reward agent for good event) can work just fine to superhuman and beyond, in a predictable-to-us way which we can learn more about by looking at current ML.

An Update on Academia vs. Industry (one year into my faculty job)

Linch3y*86

Minor, but Dunning-Kruger neither claims to detect a Mount Stupid effect nor (probably) is the study powered enough to detect it.

1David Scott Krueger3y

Very good to know! I guess in the context of my comment it doesn't matter as much because I only talk about others' perception.

All of Linch's Comments + Replies