(My understanding of) What Everyone in Technical Alignment is Doing and Why

elifland

One insight this has generated so far is that Reward is not the optimization target — instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense.

Grumble grumble. Savvy people have known that reward is not the optimization target for at least five years, probably more like a decade. It's true that various people don't know this yet & so I'm glad that post was written, but it's a bit unfair to credit shard theory with having generated that idea. (I think TurnTrout would agree with this, his post says that alignment people seem to be aware of this point already)

[-]David Scott Krueger (formerly: capybaralet)3y63

I don't consider this a settled question; is there rigorous technical work establishing that "Reward is not the optimization target"?

[-]Daniel Kokotajlo3y82

Depends on your standards for "rigorous technical work" and "establishing." In some sense nothing on this topic is sufficiently rigorous, and in some sense nothing on this topic has been established yet. I think the Risks from Learned Optimization paper might be what you are looking for. There's also evhub's recent talk. And of course, TurnTrouts post that was linked above. And again I just pull these out of the top of my head, the ideas in them have been floating around for a while.

I'd be interested to hear an argument that reward is the optimization target, if you've got one!

I suspect that this is an issue that will be cleared up by everyone being super careful and explicit and nitpicky about their definitions. (Because I think a big part of what's going on here is that people aren't doing that and so they are getting subtly confused and equivocating between importantly different statements, and then on top of that other people are misunderstanding their words)

[-]David Scott Krueger (formerly: capybaralet)3y57

Thanks! I don't think those meet my criteria. I also suspect "everyone being super careful and explicit and nitpicky about their definitions" is lacking, and I'd consider that a basic and essential component of rigorous technical work.

[-]Daniel Kokotajlo3y4-1

Agreed!

Got an argument that reward is the optimization target?

[-]David Scott Krueger (formerly: capybaralet)3y25

I don't think this framing of it being the optimization target or not is very helpful. It's like asking "does SGD converge?" or "will my supervised learning model learn the true hypothesis?" The answer will depend on a number of factors, and it's often not best thought of as a binary thing.

e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.

Here's another argument: maybe it's the field of RL, and not Alex Turner, who is right about this: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#Appendix__The_field_of_RL_thinks_reward_optimization_target
(I'm not sure Alex characterizes the field's beliefs correctly, and I'm sort of playing devil's advocate with that one (not a big fan of "outside views"), but it's a bit odd to act like the burden of proof is on someone who agrees with the relevant academic field).

[-]Steven Byrnes3y*21

e.g. for agents that do planning based on optimizing a reward function, it seems appropriate to say that reward is the optimization target.

Often, when an RL agent imagines a possible future roll-out, it does not evaluate whether that possible future is good or bad by querying an external ground-truth reward function; instead, it queries a learned value function. When that’s the case, the thing that the agent is foresightedly “trying” / “planning” to do is to optimize the learned value function, not the reward function. Right?

For example, I believe AlphaZero can be described this way—it explores some number of possible future scenarios (I’m hazy on the details), and evaluates how good they are based on querying the learned value function, not querying the external ground-truth reward function, except in rare cases where the game is just about to end.

I claim that, if we make AGI via model-based RL (as I expect), it will almost definitely be like that too. If an AGI has a (nonverbal) idea along the lines of “What if I try to invent a new microscope using (still-somewhat-vague but innovative concept)”, I can’t imagine how on earth you would build an external ground-truth reward function that can be queried with that kind of abstract hypothetical. But I find it very easy to imagine how a learned value function could be queried with that kind of abstract hypothetical.

(You can say “OK fine but the learned value function will asymptotically approach the external ground-truth reward function”. However, that might or might not be true. It depends on the algorithm and environment. I expect AGIs to be in a nonstationary environment with vastly too large an action space to fully explore, and full of irreversible actions that make full exploration impossible anyway. In that case, we cannot assume that there’s no important difference between “trying” to maximize the learned value function versus “trying” to maximize the reward function.)

Sorry if I’m misunderstanding. (My own discussion of this topic, in the context of a specific model-based RL architecture, is Section 9.5 here.)

[-]Daniel Kokotajlo3y2-1

Thanks!

I'm not sure the framing is helpful either, but reading Turner's linked appendix it does seem like various people are making some sort of mistake that can be summarized as "they seem to think the policy / trained network should be understood as trying to get reward, as preferring higher-reward outcomes, as targeting reward..." (And Turner says he himself was one of them despite doing a PhD in RL theory) Like I said above I think that probably there's room for improvement here -- if everyone defined their terms better this problem would clear up and go away. I see Turner's post as movement in this direction but by no means the end of the journey.

Re your first argument: If I understand you correctly, you are saying that if your AI design involves something like monte-carlo tree search using a reward-estimator module (Idk what the technical term for that is) and the reward-estimator module is just trained to predict reward, then it's fair to describe the system as optimizing for the goal of reward. Yep that seems right to me, modulo concerns about inner alignment failures in the reward-estimator module. I don't see this as contradicting Alex Turner's claims but maybe it does.

Re your second argument, the appeal to authority: I suppose in a vacuum, not having thought about it myself or heard any halfway decent arguments, I'd defer to the RL field on this matter. But I have thought about it a bit myself and I have heard some decent arguments, and that effect is stronger than the deference effect for me, and I think this is justified.

[-]David Scott Krueger (formerly: capybaralet)3y31

RE appeal to authority: I mostly mentioned it because you asked for an argument and I figured I would just provide any decent ones I thought of OTMH. But I have not provided anything close to my full thoughts on the matter, and probably won't, due to bandwidth.

[-]David Scott Krueger (formerly: capybaralet)3y2023

The main thing missing here are academic groups (like mine at Cambridge https://www.davidscottkrueger.com/). This is a pretty glaring oversight, although I'm not that surprised since it's LW.

Some other noteworthy groups in academia lead by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)

More at https://futureoflife.org/team/ai-existential-safety-community/ (although I think the level of focus on x-safety and engagement with this community varies substantially among these people).

BTW, FLI is itself worth a mention, as is FHI, maybe in particular https://www.fhi.ox.ac.uk/causal-incentives-working-group/ if you want to focus on technical stuff.

Some other noteworthy groups in academia lead by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Neikum (UMass Amhearst)

These are just examples.

[-]Gunnar_Zarncke3y00

Some other noteworthy groups in academia lead by people who are somewhat connected to this community:
- Jacob Steinhardt (Berkeley)
- Dylan Hadfield-Menell (MIT)
- Sam Bowman (NYU)
- Roger Grosse (UofT)
Some other noteworthy groups in academia lead by people who are perhaps less connected to this community:
- Aleksander Madry (MIT)
- Percy Liang (Stanford)
- Scott Neikum (UMass Amhearst)

Can you provide some links to these groups?

[-]aog3y88

These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

[-]Thomas Larsen3y68

Agree with both aogara and Eli's comment.

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday and while I could quite quickly summarize the work itself, it was quite hard to me to figure out the motivations.

[-]David Scott Krueger (formerly: capybaralet)3y312

There's a lot of work that could be relevant for x-risk but is not motivated by it. Some of it is more relevant than work that is motivated by it. An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.

[-]joshc3y22

PAIS #5 might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on.

[-]JohnMalin3y55

I don't think the onus should be on the reader to infer x-risk motivations. In academic ML, it's the author's job to explain why the reader should care about the paper. I don't see why this should be different in safety. If it's hard to do that in the paper itself, you can always e.g. write a blog post explaining safety relevance (as mentioned by aogara, people are already doing this, which is great!).

There are often many different ways in which a paper might be intended to be useful for x-risks (and ways in which it might not be). Often the motivation for a paper (even in the groups mentioned above) may be some combination of it being an interesting ML problem, interests of the particular student, and various possible thoughts around AI safety. It's hard to try to disentangle this from the outside by reading between the lines.

[-]Rohin Shah3y*154

Note: I link to a bunch of stuff below in the context of the DeepMind safety team, this should be thought of as "things that particular people do" and may not represent the views of DeepMind or even just the DeepMind safety team.

I just don't know much about what the [DeepMind] technical alignment work actually looks like right now

We do a lot of stuff, e.g. of the things you've listed, the Alignment / Scalable Alignment Teams have done at least some work on the following since I joined in late 2020:

Eliciting latent knowledge (see ELK prizes, particularly the submission from Victoria Krakovna & Vikrant Varma & Ramana Kumar)
LLM alignment (lots of work discussed in the podcast with Geoffrey you mentioned)
Scalable oversight (same as above)
Mechanistic interpretability (unpublished so far)
Externalized Reasoning Oversight (my guess is that this will be published soon) (EDIT: this paper)
Communicating views on alignment (e.g. the post you linked, the writing that I do on this forum is in large part about communicating my views)
Deception + inner alignment (in particular examples of goal misgeneralization)
Understanding agency (see e.g. discovering agents, most of Ramana's posts)

And in addition we've also done other stuff like

I'm probably forgetting a few others.

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

[-]Thomas Larsen3y20

Thanks you for this thoughtful response, I didn't know about most of these projects. I've linked this comment in the DeepMind section, as well as done some modifications for both clarity and including a bit more.

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

This is useful to know.

[-]Vika3y54

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more".

Some corrections for your overall description of the DM alignment team:

I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

[-]habryka2y41Review for 2022 Review

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it.

[-]JanB3y40

Anthropic is also working on inner alignment, it's just not published yet.

Regarding what "the point" of RL from human preferences with language models is; I think it's not only to make progress on outer alignment (I would agree that this is probably not the core issue; although I still think that it's a relevant alignment issue).

See e.g. Ajeya's comment here:

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
Human feedback provides a more realistic baseline to compare other strategies to -- you want to be able to tell clearly if your alignment scheme actually works better than human feedback.
With that said, my guess is that on the current margin people focused on safety shouldn't be spending too much more time refining pure human feedback (and ML alignment practitioners I've talked to largely agree, e.g. the OpenAI safety team recently released this critiques work -- one step in the direction of debate).

^{^}

We may revise the document based on corrections in the comments or future announcements, but don't promise anything. Others are welcome to create future versions or submit summaries of their own approaches for us to edit in. We will note the time it was last edited when we edit things. (ETA: most recent update: 10/9/2022)

^{^}

In this chart, the ? denotes more uncertainty if this is a correct description

^{^}

~~I would appreciate someone giving more information on DeepMind's approach to alignment.~~ Update: Rohin has given a helpful summary in a comment.

^{^}

Technically, they just need to span the set of extrapolations, so that the correct extrapolation is just a linear combination of the found classifiers.

^{^}

Hold on, how come you are excited about Conjecture automating alignment research but not OpenAI?
Answer: I see a categorical distinction between trying to align agentic and oracle AIs. Conjecture is trying only for oracle LLMs, trained without any RL pressure giving them goals, which seems way safer. OpenAI doing recursive reward modeling / IDA type schemes involves creating agentic AGIs and therefore faces also a lot more alignment issues like convergent instrumental goals, power seeking, goodharting, inner alignment failure, etc.

I think inner alignment can be a problem with LLMs trained purely in a self-supervised fashion (e.g., simulacra becoming aware of their surroundings), but I anticipate it to only be a problem with further capabilities. I think RL trained GPT-6 is a lot more likely to be an x-risk than GPT-6 trained only to do text prediction.

^{^}

To be clear: I am very against proposals for violent pivotal acts that are sometimes brought up, such as destroying other AI labs on the verge of creating a misaligned AGI. This seems bad because 1) violence is bad and isn't dignified. 2) it seems like this intention would make it much harder to coordinate. 3) Setting an AGI loose to pull off a violent pivotal act could incredibly easily disempower humanity: you are intentionally letting the AGI destructively take over.

^{^}

Some cruxes that would change this conclusion are if we don't get prosaic AGI or if solving alignment takes a lot of serial thought, e.g. work that needs to be done by 1 researcher over 10 years, and can't be solved by 10 researchers working for 1 year.

Approach	Problem Focus	Current Approach Summary	Scale
Aligned AI	Model splintering	Solve extrapolation problems.	2-5 researchers, started Feb 2022
ARC	Inaccessible information	ELK + LLM power-seeking evaluation	3 researchers, started April 2021
Anthropic	LLM Outer Alignment (?)[3]	Interpretability + HHH + augmenting alignment research with LLMs	~35? technical staff[3], started May 2021
Brain-like-AGI Safety	Brain-like AGI Safety	Use brains as a model for how AGI will be developed, think about alignment in this context	~4 researchers, started March 2021
Center for AI Safety (CAIS)	Engaging the ML community, many technical problems	Technical research, Infrastructure, and ML community field-building for safety	7-10 FTE, founded in ~March 2022
CHAI	Outer alignment, though CHAI is diverse	Improve CIRL + many other independent approaches.	~20 FTE?, founded in 2016
CLR	Suffering risks	Foundational game theory research	5-10 FTE, founded before 2015
Conjecture	Inner alignment	Interpretability + automating alignment research with LLMs	~20 FTE, announced April 2022
David Krueger	Goal misgeneralization	Empirical examples and understanding ML inductive biases	Academic lab with 7 students
DeepMind	Many including scalable oversight and goal misgeneralization	Many including Debate, discovering agents, ERO, and understanding threat models. [4]	>1000 FTE for the company as a whole, ~20-25 FTE on the alignment + scalable alignment teams
Dylan Hadfield-Menell	Value Alignment	Reward specification + Norms	Academic research lab
Encultured	Multipolar failure from lack of coordination	Video game	~3 people, announced August 2022
Externalized Reasoning Oversight	Deception	Get the reasoning of the AGI to happen in natural language, then oversee that reasoning	~1 person's project for a summer (though others are working on this approach)
FHI	Agent incentives / wireheading (?)	Causal model formalism to study incentives.	~3 people in the causal group / ~20 total?, FHI founded in 2005, Causal group founded in 2021
FAR	Many	Incubate new, scalable alignment research agendas, technical support for existing researchers	4 people on leadership but I'm guessing ~5 more engineers, announced July 2022
MIRI	Many including deception, the sharp left turn, corrigibility is anti-natural	Mathematical research to resolve fundamental confusion about the nature of goals/agency/optimization	11 research staff, founded in approximately 2005
Jacob Steinhardt	Distribution Shift	Conceptual alignment	Academic lab of 9 PhD students + Postdocs
OpenAI	Scalable oversight	RLHF / Recursive Reward Modeling, then automate alignment research	100 capabilities and 30 alignment researchers, founded December 2015.
Ought	Scalable oversight	Supervise process rather than outcomes + augment alignment researchers	10 employees, founded in ~2018
Redwood	Inner alignment (?)	Interpretability + Adversarial Training	12-15 research staff, started sometime before September 2021
Sam Bowman	LLM Outer Alignment	Creating datasets for evaluation + inverse scaling prize	Academic lab
Selection Theorems	Being able to robustly point at objects in the world	Selection Theorems based on natural abstractions	~2 FTE, started around August 2019
Team Shard	Instilling inner values from an outer training loop	Find patterns of values given by current RL setups and humans, then create quantitative rules to do this	~4-6 people, started Spring 2022
Truthful AI	Deception	Create standards and datasets to evaluate model truthfulness	~10 people, one research project

Type of approach	Mostly conceptual	Mixed	Mostly applied
Organization	MIRI, John Wentworth, ARC	Team Shard, CHAI, DeepMind	Conjecture, Encultured, OpenAI, Anthropic, Redwood, Ought

95

95

Introduction

Aligned AI / Stuart Armstrong

Eliciting Latent Knowledge / Paul Christiano

Evaluating LM power-seeking / Beth Barnes

LLM Alignment

Interpretability

Scaling laws

Center for AI Safety (CAIS) / Dan Hendrycks

Center for Human Compatible AI (CHAI) / Stuart Russell

Scalable LLM Interpretability

Simulacra Theory

Externalized Reasoning Oversight / Tamera Lanham

Future of Humanity Institute (FHI)

Communicate their view on alignment

Deception + Inner Alignment / Evan Hubinger

Agent Foundations / Scott Garrabrant and Abram Demski

Infra-Bayesianism / Vanessa Kosoy

Selection Theorems / John Wentworth

Truthful AI / Owain Evans and Owen Cotton-Barratt

Other Organizations

Appendix

Visualizing Differences

Automating alignment and alignment difficulty

Conceptual vs. applied

Thomas’s Alignment Big Picture