Defining alignment research

[-]Daniel Kokotajlo1y107

I like my own definition of alignment vs. capabilities research better:

"Alignment research is when your research goals are primarily about how to make AIs aligned; capabilities research is when your research goals are primarily about how to make AIs more capable."

I think it's very important that lots of people currently doing capabilities research switch to doing alignment research. That is, I think it's very important that lots of people who are currently waking up every day thinking 'how can I design a training run that will result in AGI?' switch to waking up every day thinking 'Suppose my colleagues do in fact get to AGI in something like the current paradigm, and they apply standard alignment techniques -- what would happen? Would it be aligned? How can I improve the odds that it would be aligned?'

Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

[-]Richard_Ngo1y31

What if your research goal is "I'd like to understand how neural networks work"? This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.

(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)

Conversely, what if your research goal is "I'm going to design a training run that will produce a frontier model, so that we can study it to advance alignment research"? Seems odd, but I'd bet that (e.g.) a chunk of Anthropic's scaling team thinks this way. Counts as alignment under your definition, since that's the primary goal of the research.

More generally, I think it's actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to "alignment research is whatever research is done by the people with the right motivations" makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).

> Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

I agree. I'll add a note in the post saying that the point where you end up on the alignment spectrum should also account for the feasibility of the research direction.

Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.

[-]Daniel Kokotajlo1y53

Answering your first question. If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to 'I want to understand how neural networks work... so that I can understand how a prosaic AGI system being RLHF'd might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do') but I don't feel so strongly about it as I would if you were doing capabilities research.

re: judging the research itself rather than the motivations: idk I think it's actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also I'm not primarily trying to judge people, I'm trying to exhort people -- I'm giving people advice about what they should do to make the world a better place, I'm not e.g. talking about which research should be published and which should be restricted (I'm generally in favor of publishing research with maybe a few exceptions, but I think corporations on the margin should publish more). Moreover, it's much easier for researchers to judge their own motivations than to judge the long-term impact of their work or to fit it into your diagram.

[-]Richard_Ngo1y81

> If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.

Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher". On your definition, when I read a paper by a researcher I haven't heard of, I don't know anything about whether it's alignment research or not until I stalk them on Facebook and find out how socially proximal they are to the AI safety community. That doesn't seem great.

Back to Chris. Because I've talked to Chris and read other stuff by him, I'm confident that he does care about alignment. But I still don't know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It's probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don't think that's a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.

In other words: I think your model of what produces more useful research from an alignment perspective overprioritizes first-order effects (if people care more they'll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people's beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking.)

> I'm not primarily trying to judge people, I'm trying to exhort people

Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.

I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting). But any step in the pipeline that prioritizes "alignment researchers" (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they're using your definition.

[-]Buck1y75

> I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).

I disagree; I'd rather they hired someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers by trying to do stuff that's in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)

[-]habryka1y5-9

(FWIW I think Chris Olah's work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)

[-]Richard_Ngo1y30

Whose work is relevant, according to you?

[-]habryka1y*30

Lots of people's work:

Paul's work (ELK more than RLHF, though it was useful to see what happens when you throw RL at LLMs, in a way that's kind of similar to how I do get some value out of Chris's work)
Eliezer's work
Nate's work
Holden's writing on cold takes
Ajeya's work
Wentworth's work
The debate stuff
Redwood's work
Bostrom's work
Evan's work
Scott and Abram's work

There is of course still huge variance in how relevant these different people's work is and how much it goes for the throat, but all of these seem more relevant to AI Alignment/AI-not-kill-everyonism than Chris's work (which again, I found interesting, but not like super interesting).

[-]Daniel Kokotajlo1y20

I feel like we are misunderstanding each other, and I think it's at least in large part my fault.

I definitely agree that we don't want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you've said above, except for when you start attributing stuff to me.

I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though, right? You are proposing something else?

[-]habryka1y715

> To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”

I am curious what you are thinking about. My sense is the trend of calling models "aligned" started with OpenAI and ChatGPT, and wasn't really driven by anyone in the AI Alignment community (and is something that I complained a lot about at the time, because it did indeed seem like a thing that predictably would lead to confusion).

It's plausible to me that Paul also contributed to this, but my sense is most senior alignment people have been very hesitant to use "alignment" as a generalized term to describe the behavior of present-day models.

[-]Richard_Ngo1y40

Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).

I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that but will discuss in a different post.

[-]LawrenceC1y40

Pretty sure Anthropic's early assistant stuff used the word this way too: see e.g. Bai et al., https://arxiv.org/abs/2204.05862

But yes, people complained about it a lot at the time

[-]habryka1y40

Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it's messy).

[-]LawrenceC1y30

(The Anthropic paper I cited predates ChatGPT by 7 months)

[-]habryka1y21

Huh, interesting. Maybe the OpenAI statements about their models being "more aligned" came earlier than that in the context of InstructGPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.

[-]leogao1y62

This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.

https://arxiv.org/abs/2107.03374 (section 7.2)

[-]LawrenceC1y40

Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/

	Average-case	Pessimistic-case	Worst-case
Engineering	Scaling	RLHF	Adversarial robustness
Behaviorist science	Optimization science	Scalable oversight	AI control
Cognitivist science	Concept-based interpretability	Mechanistic interpretability	Agent foundations

AI ALIGNMENT FORUM

“Alignment” and “capabilities” are primarily properties of AIs not of AI research

What types of research are valuable for preventing misalignment?

Valuable property 1: worst-case focus

Valuable property 2: scientific approach

A better definition of alignment research