Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton

This post is heavily informed by prior work, most notably that of Owain Evans, Owen Cotton-Barratt and others (Truthful AI), Beth Barnes (Risks from AI persuasion), Paul Christiano (unpublished) and Dario Amodei (unpublished), but was written by me and is not necessarily endorsed by those people. I am also very grateful to Paul Christiano, Leo Gao, Beth Barnes, William Saunders, Owain Evans, Owen Cotton-Barratt, Holly Mandel and Daniel Ziegler for invaluable feedback.

In this post I propose to work on building competitive, truthful language models, or truthful LMs for short. These are AI systems that are:

Useful for a wide range of language-based tasks
Competitive with the best contemporaneous systems at those tasks
Truthful in the sense of rarely stating negligent falsehoods in deployment

Such systems will likely be fine-tuned from large language models such as GPT-3, hence the name.

WebGPT is an early attempt in this direction. The purpose of this post is to explain some of the motivation for building WebGPT, and to seek feedback on this direction.

Truthful LMs are intended as a warm-up for aligned AGI. The term "warm-up" is used in a specific way in this post to refer to an empirical ML research direction with the following properties:

Practical. The goal of the direction is plausibly achievable over the timescale of a few years.
Valuable. The direction naturally leads to research projects that look helpful for AGI alignment.
Mirrors aligned AGI. The goal is structurally similar to aligned AGI on a wide variety of axes.

The remainder of the post discusses:

The motivation for warm-ups
Why truthful LMs serve as a good warm-up
The motivation for focusing on negligent falsehoods specifically
A medium-term vision for truthful LMs
How working on truthful LMs compares to similar alternatives
Common objections to working on truthful LMs

Warm-ups for aligned AGI

There are currently a number of different empirical ML research projects aimed at helping with AGI alignment. A common strategy for selecting such projects is to select a research goal that naturally leads to helpful progress, such as summarizing books or rarely describing injury in fiction. Often, work on the project is output-driven, taking a no-holds-barred approach to achieving the selected goal, which has a number of advantages that aren't discussed here. On the other hand, goal selection is usually method-driven, tailored to test a particular method, such as recursive decomposition or adversarial training.

The idea of a warm-up for aligned AGI, as defined above, is to take the output-driven approach one step further. Instead of selecting projects individually, we attempt to choose a more ambitious research goal that naturally leads to helpful projects. Because it is harder to predict the course of research over multiple projects, we also try to make the goal structurally similar to aligned AGI, to make it more likely that unforeseen and auxiliary projects will also be valuable.

Whether this output-driven approach to project selection is preferable to the method-driven approach depends on more specific details that will be discussed later. But it is worth discussing first the advantages of each approach in broad strokes:

Momentum versus focus. The output-driven approach involves having a consistent high-level goal, which allows different projects to more directly build upon and learn from one another. On the other hand, the method-driven approach involves more frequent re-evaluation of goals, posing less of a risk of being distracted from the even higher-level goal of aligned AGI.
Testing assumptions versus testing methods. The output-driven approach makes it easier to evaluate long-term progress and hold projects to account, making it better for testing underlying assumptions and discovering new methods. On the other hand, the method-driven approach offers the most direct feedback on method design, making it better for improving methods incrementally.
Practical progress versus theoretical progress. The output-driven approach involves a more realistic goal that mirrors aligned AGI, which is more likely to generate practical progress such as infrastructure and know-how. On the other hand, the method-driven approach is more likely to answer questions that directly inform theoretical work.
Broader benefits versus replaceability. The output-driven approach more naturally gives rise to a wide variety of valuable projects such as policy and deployment work, is more likely to motivate related work by others, and is more likely to have direct benefits to society. On the other hand, the method-driven approach is less likely to select projects that would have happened anyway, and is less likely to be pulled in unwanted directions.

Truthful LMs as a good warm-up

In this section I will argue that truthful LMs serve as a particularly good warm-up for aligned AGI, in the sense defined above.

To begin with, truthful LMs are structurally similar to aligned AGI on a wide variety of axes:

Alignment focus. Negligent falsehoods are a central example of an alignment failure. The reasons for focusing on negligent falsehoods specifically are discussed below.
General-purpose domain. Language models appear to be the closest existing AI systems to AGI in terms of the breadth of their capabilities and the sophistication of their real-world understanding.
Competitiveness requirement. The competitiveness condition is included in the definition of truthful LMs in order to mirror the need for aligned AGI to be competitive.
Mitigation of risks to society. Untruthful LMs pose various risks, as discussed in Truthful AI and Risks from AI persuasion. Of course, the risks posed by unaligned AGI are more serious.
Importance of rare failures. Truthful LMs are required to rarely state negligent falsehoods. The requirements for aligned AGI are similar but even stricter, in the sense that it is important to avoid even a single sufficiently bad alignment failure.
Decomposition into outer alignment and distributional robustness. For both truthful LMs and aligned AGI, it may be helpful to decompose the problem into outer alignment (constructing an objective such that the model almost never fails on the training distribution) and distributional robustness (ensuring that the behavior of the model does not degrade too much when moving from training to deployment).
Failure of naive human supervision. Naive objectives such as imitating humans and optimizing human approval are generally considered insufficient for competitive and aligned AGI. In the short term, optimizing human judgments will probably go pretty far towards making LMs more truthful, but there are signs of this objective breaking down with WebGPT, and more sophisticated techniques are starting to look attractive in practice.
Broader challenges. Actually achieving good outcomes from aligned AGI likely involves a number of policy and deployment challenges that aren't automatically addressed by technical solutions. Similar challenges may be involved in achieving good short-term outcomes from truthful LMs. For example, it is unclear exactly which criteria should be used to judge LM behavior, such as how LMs should respond to controversial questions, how truthfulness should be balanced against other criteria such as helpfulness, and so on. Some suggestions for these criteria are made in Truthful AI, but these are not yet precise enough to be turned into training objectives.

Because of these similarities, working on truthful LMs offers numerous benefits. Perhaps most importantly, it naturally leads in several directions that are also attractive from a method-driven perspective:

Methods for learning from human feedback. Something like reinforcement learning from human feedback (RLHF) seems necessary for getting ML systems to follow objectives of our choosing, and our current methodology for this (PPO against a static reward model etc.) can likely be improved upon.
Going beyond naive human supervision. Working on truthful LMs will likely involve improving human supervision using AI assistance at some point. For example, this could involve using model-generated critiques, or more advanced techniques such as debate.
Robustness. LMs could be made robustly truthful in several related senses: having a low failure rate on-distribution, being robust to distributional shift, and being adversarially robust. Achieving this will likely require techniques such as adversarial training.
Empirical evidence about alignment difficulty. Having truthful LMs makes it easier to study questions like how well honesty generalizes in the context of capable models that have been trained using different objectives. This helps inform us about the scale at which different training objectives are likely to break down and lead to misalignment.
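
To make the "reward model" in the first item above concrete: reward models are commonly trained on pairwise human comparisons, using a loss that pushes the reward of the preferred response above the reward of the rejected one. The following is a minimal sketch of that loss for a single comparison. It is an illustration only, not the WebGPT implementation; the function name and the use of plain scalar rewards are assumptions made for the example.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one human comparison under a Bradley-Terry style model.

    Computes -log(sigmoid(reward_preferred - reward_rejected)).
    Both arguments are hypothetical scalar outputs of a reward model.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the model agrees with the human label
# and large when it disagrees.
agree = pairwise_preference_loss(2.0, 0.0)
disagree = pairwise_preference_loss(0.0, 2.0)
print(f"agree: {agree:.4f}, disagree: {disagree:.4f}")
```

A policy optimized against a reward model trained this way is only as good as the comparisons behind it, which is one reason the later items (AI-assisted supervision and robustness) matter.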

In addition, there are a number of broader benefits to working on truthful LMs:

Infrastructure and know-how. For all of the above directions, it would be valuable to develop not only algorithmic advances, but also the capability of individuals and organizations to employ the required methods effectively.
Policy and deployment work. I haven't thought as carefully about these areas, but given all of the above similarities, it seems possible that building and safely deploying truthful LMs would require work with long-term value in these and other areas, even if only via developing individual and organizational expertise.
Direct benefits to society. Deploying truthful LMs could broadly improve how society functions, and pushing for this relatively early could have compounding effects via norm-setting. These benefits are discussed in much greater depth in Truthful AI and Risks from AI persuasion. I still consider these benefits to be somewhat speculative given the complexities of how society functions.

Overall, working on truthful LMs seems practical and valuable, and mirrors aligned AGI in enough ways to make it seem highly promising as an empirical ML research direction.

Why focus on negligent falsehoods?

Most of the arguments in favor of working on truthful LMs apply equally well to working on aligning language models in general. However, the definition of truthful LMs specifically singles out negligent falsehoods: statements that are unacceptably likely to be false, and where it should have been feasible for an AI system to understand this. This is done for several reasons:

Alignment focus. Negligent falsehoods are a clear example of an alignment failure: we would like the model to avoid the falsehood, and the model should be capable of doing so, but it fails nonetheless.
Lack of ambiguity. Compared to other kinds of alignment failure such as causing negligent harm or being unhelpful, negligent falsehoods are less ambiguous. Falsehoods in general may be even less ambiguous, but are more likely to be the result of a capability failure rather than an alignment failure.
Policy benefits. As discussed in Truthful AI, truthfulness has particular benefits for society, and avoiding negligent falsehoods is a natural bright line around which beneficial standards could be developed.
Method-driven motivations. Compared to other criteria, evaluating truthfulness involves significant complexity, making it more compelling as a target that requires going beyond naive human supervision. Negligent falsehoods are also compelling as something to avoid with a high degree of robustness (aiming for a very low failure rate, adversarial robustness, etc.), whereas falsehoods in general do not have this property (since models can be tripped up by probing for the limits of their knowledge).
Connection to aligned AGI. Compared to other criteria, truthfulness relates most clearly to long-term concerns about AGI deceiving humans. There is even an argument, advanced in Eliciting Latent Knowledge, that eliminating negligent falsehoods is in some sense sufficient for aligning AGI in general, through a process of "indirect normativity".

The most obvious drawback of focusing on negligent falsehoods is that they are more ambiguous than falsehoods in general. In practice, I think it will be fine to focus on falsehoods that are plausibly negligent: it will be OK if some effort goes into capabilities that improve truthfulness, as long as they do not become the main focus. Such capabilities may also enable new alignment strategies: for example, the use of retrieval in WebGPT opened the door to improved evaluation of factual accuracy via the use of references. For the purposes of evaluating progress, it will be fine to make reasonable judgments about how likely a falsehood is to be negligent.

Medium-term vision for truthful LMs

Truthful LMs are a target that could be pursued with various different mindsets. At one extreme, one could take a very method-driven approach to selecting projects, and simply incorporate a preference for goals that can be framed in terms of truthfulness. At the other extreme, one could mostly try to make language models more useful, but try to adhere to relatively high standards of truthfulness along the way. Where to land on this spectrum depends on how one trades off the advantages of output-driven and method-driven approaches, as discussed above.

My tentative inclination is towards a middle ground, remaining slightly method-driven while having a clear medium-term vision for the next few years. In this spirit, here is a first attempt at such a vision.

The system is a truthful pure-text assistant:

Users interact with it via a sequence of text messages.
It attempts to perform any reasonable task that can be performed by interacting with text-based interfaces, as long as these interactions aren't required to have real-world side effects (with exceptions for things that could be harmful, etc.).
It is competitive with contemporaneous systems. Over the next few years, I expect it to become feasible for AI to perform a large fraction of text-based tasks that do not require specific expertise, concentration for longer than a few seconds at a time, or more than around 10 minutes in total.
It expresses uncertainty when making claims that might not be true (or can at least be configured that way), but not to an unreasonable degree (it does not typically hedge claims that most people would consider unreasonable to hedge).
For claims that are not hedged, it has a high degree of truthfulness: in ordinary usage, 99.9% of claims are not negligent falsehoods, and 90% of AI researchers without any special knowledge of the system cannot get it to state a negligent falsehood with 10 back-and-forth messages. (These numbers are indicative and may need revising.)
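
As a rough illustration of what checking the 99.9% figure would involve, the sketch below estimates how many sampled claims would need to be labeled, with zero negligent falsehoods found, before a failure rate of at most 0.1% is supported at a given confidence level. It assumes that claims are independent draws from ordinary usage and that labels are perfectly accurate; both are idealizations, and the function names are hypothetical.

```python
import math


def upper_bound_zero_failures(n_samples: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the failure rate when zero
    failures are observed in n_samples independent trials."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    alpha = 1.0 - confidence
    # With true failure rate p, the chance of seeing no failures is (1 - p)^n.
    # Setting (1 - p)^n = alpha and solving for p gives the bound.
    return 1.0 - alpha ** (1.0 / n_samples)


def samples_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free samples for which the upper bound
    above is at most max_failure_rate."""
    if not 0.0 < max_failure_rate < 1.0:
        raise ValueError("max_failure_rate must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_failure_rate))


# Indicative target from the text: 99.9% of claims are not negligent falsehoods.
TARGET_FAILURE_RATE = 0.001
n = samples_needed(TARGET_FAILURE_RATE)
print(f"failure-free samples needed: {n}")
print(f"upper bound at that sample size: {upper_bound_zero_failures(n):.6f}")
```

Under these assumptions the answer is on the order of 3,000 failure-free claims at 95% confidence, which gives a sense of how much careful labeling such a target implies. The second criterion, about what 90% of AI researchers can elicit within 10 messages, is a statement about a population of adversaries and would need a separate study design.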

I think that achieving such a system would be a lot of work, but would not require any fundamental insights, and could be achieved with pre-trained models of the future using the methods of WebGPT together with some form of debate and adversarial training.

Comparison with related proposals

There are a number of similar approaches that have recently been proposed or are currently being pursued. I am generally a fan of these approaches, but it is worth discussing how they compare.

Method-driven projects

Some alternative proposals also focus on improving the behavior of contemporary models, but are more method-driven:

Aligning narrowly superhuman models. This is also a proposal to align contemporary models, but it is less opinionated about the specific task, and emphasizes "sandwiching" projects that are specifically targeted at testing proposals for going beyond naive human supervision.
Redwood Research's current project. This is a project to train a language model to continue fiction without describing injury, designed to test methods for achieving very low failure rates and adversarial robustness.

As discussed above, there are trade-offs between being method-driven and being output-driven when selecting projects. Overall, it seems plausible to me that method-driven projects are currently the most valuable empirical ML projects on the margin, since they are the most carefully targeted. On the other hand, being output-driven is a longer-term play, and may be able to make better use of people who thrive on practical problems in particular. Hence I would argue in favor of a portfolio approach.

Aligning language models in general

Another category of proposals is very similar to working on truthful LMs, but focused on a more general notion of alignment than truthfulness:

Helpful, honest and harmless (HHH) models. This is a proposal (as part of a larger piece of work) to train large language models to be aligned according to these HHH criteria.
Instruction-following models. This is a project to fine-tune models like GPT-3 to follow the user's intent.

I do think it makes sense to incorporate criteria other than truthfulness when aligning language models, and so these projects may end up being very similar to working on truthful LMs in practice. However, I would argue in favor of such projects placing particular emphasis on negligent falsehoods, for the reasons discussed above.

Common objections

Working on truthful LMs is open to a number of the same possible objections as Aligning narrowly superhuman models. In addition to these, there are some more specific objections.

Lack of focus

One concern with working on truthful LMs is that it will be insufficiently focused on the core parts of alignment, as a result of being too output-driven. I think this concern is pretty reasonable, and can largely be mitigated by not being completely output-driven, but instead retaining some of the method-driven mindset, as discussed above.

It is a difficult question to determine exactly where to fall on this spectrum. I think that there are a couple of potential cruxes that lead people to have different intuitions on this question:

Threat models for misaligned AGI. There is disagreement over the amount of weight that should be put on specific threat models of misaligned AGI, most notably risks from power-seeking misalignment. The more weight one puts on a specific threat model, the more one may be inclined to probe specific methods intended to address that threat. My personal sense is that power-seeking misalignment is currently the best-articulated specific threat model, but that it probably fails to capture a large portion of the overall risk associated with the transition to an AI-based economy. While I think that focusing on the clearest risk probably has the best bang-for-buck initially, I would also argue that using a good warm-up for aligned AGI is more likely to help with risks that are currently less clearly articulated, in addition to helping address this particular threat model.
Alignment difficulty. There is also disagreement about how sophisticated the methods that we will need to align AGI will be. The more likely it is that simple methods will work out, the less valuable it is to make theoretical progress towards more sophisticated methods, compared to making practical progress that increases the chance that simple methods will be employed successfully. Moreover, if simple methods might solve at least some of the problem, then it is more valuable to put these methods to the test, and to isolate the parts of the problem that they fail to solve. I have a lot of uncertainty about alignment difficulty, which makes both practical and theoretical progress attractive to me.

AI unboxing

Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.

I think this concern is worth taking seriously, but that the case for it is weak:

As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
I am in favor of carefully and conservatively evaluating the risks of unintended model behavior before conducting research, and putting in place appropriate monitoring. But in the short term, this seems like an advantage of the research direction rather than a disadvantage, since it helps surface risks while the stakes are still low, build institutional capacity for evaluating and taking into account these risks, and set good precedents.
In case this does turn out to be more of a concern upon reflection, there are other approaches to truthful AI that involve less agentic interaction with the external world than continuing in the style of WebGPT.

There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval. I don't currently find this argument persuasive, but would be interested to hear if there is a more persuasive version of it. That said, one bright line that stands out is training models to perform tasks that actually require real-world side effects, and I think it makes sense to think carefully before crossing that line.

Similarity to capabilities research

The output-driven approach has its advantages, but also makes the research more similar to capabilities research, which exacerbates some other potential concerns. In each case, I think that the response given in Aligning narrowly superhuman models remains valid, but is worth commenting on:

Replaceability. It is more likely that similar work will be done anyway. I think this is a valid concern, but that there are enough distinguishing features of the research direction that this isn't a big problem. For example, I would not have expected a purely capabilities-oriented version of WebGPT to have focused nearly as much on human feedback. The focus on negligent falsehoods in particular is unlikely to be picked up outside of the alignment community in the near future.
Encouraging scaling AI. The fact that making LMs more truthful is economically valuable will make the research more likely to cause harm by increasing investment in scaling AI. I think that the argument that this is not a major concern still holds, but that it is worth being especially cautious and responsible around publication and deployment decisions when following this research direction.

Conclusion

I think that working on truthful LMs has a comparative advantage in worlds where:

We have around 10-40 years until transformative AI
Transformative AI is built using techniques that resemble modern deep learning
There is a slow takeoff
Alignment does not require vastly more theoretical insight (but may require some)
Our current picture of the risks posed by transformative AI is incomplete

These all seem like plausible assumptions to me, which probably goes some way towards explaining why I find truthful LMs compelling. I'm of course also keen on other work that is more valuable under different assumptions.

On the whole, working on truthful LMs seems highly promising to me as part of a portfolio of approaches aimed at AGI alignment, especially for people who are drawn to practical agendas.

Request for feedback

By default, this is the research direction I'll continue to pursue at OpenAI. It's therefore very valuable for me to know if it's horribly mistaken, or even if it's just clearly less valuable than alternative directions on the margin. Equally, if you're very excited by this research direction, then we should coordinate. In addition to leaving comments, please feel free to reach out to me at jhilton@openai.com if your feedback would be more convenient to give privately or via a different medium.

Charlie Steiner (3y)

Here's my worry.

If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.

And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smaller space! So I worry that people are going to do the obvious things, get good answers on 90%+ of human questions, and then feel some kind of pressure to write off the remainder as not that important ("we've got honest answers 98% of the time, so the alignment problem is like 98% solved"). When what I want is for people to use language models as a laboratory to keep being ambitious, and do theory-informed experiments that try to push the envelope in terms of extrapolating human preferences in a human-approved way.

Jacob Hilton (3y)

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

There will be insufficient attention paid to robustness.
There will be insufficient attention paid to going beyond naive human supervision.
The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig into them further if you're able to clarify which, if any, of them capture your main concern.

Charlie Steiner (3y)

I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.

Framing it this way suggests one concrete thing I might hope for you to do, which is to create artificial problems for the language model that you think will exercise kinds of robustness and generalization not represented by the problem of fine-tuning GPT (or a BERT-based classifier) to be robust to the teenager distribution.

Jacob Hilton (3y)

"one concrete thing I might hope for you to do..."

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.

Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.

It would be bad if we built an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

Daniel Kokotajlo (3y)

Thanks for this!

--We have around 10-40 years until transformative AI

--Transformative AI is built using techniques that resemble modern deep learning

--There is a slow takeoff

--Alignment does not require vastly more theoretical insight (but may require some)

--Our current picture of the risks posed by transformative AI is incomplete

Can you elaborate on what you mean by slow takeoff here?

Also, what do you mean by the current picture of the risks being incomplete? What would it even mean for our picture to be complete?

Jacob Hilton (3y)

Thanks for these questions, these phrases were ambiguous or poorly chosen:

By "slow takeoff", I had in mind the "Paul slow takeoff" definition, although I think the (related) "Continuous takeoff" definition is more relevant to this post. The point is that trying for alignment to continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
By "our current picture of the risks is incomplete", I meant something like: a significant portion of the total existential risk from AI comes from scenarios that have not yet been clearly articulated. More specifically, I had in mind power-seeking misalignment as the most clearly articulated risk, so I think it would have been better to say: a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment. Examples of potential sources of such risk include AI persuasion, social upheaval, deliberate misuse, authoritarianism and unforeseen risks.

Daniel Kokotajlo (3y)

Thanks, these clarifications are very helpful.

FWIW I think Paul slow takeoff is pretty unlikely for reasons to be found in this thread and this post. On the other hand, as someone who thinks fast takeoff (in various senses) is more likely than not, I don't yet see why that makes Truthful LM work significantly less useful. (By contrast I totally see why Truthful LM work is significantly less useful if AGI/TAI/etc. comes from stuff that doesn't resemble modern deep learning.)

"Catch misalignment early..." This makes it sound like misalignment is something that AIs don't have yet but might one day have, so we need to be vigilant and notice it when it appears. But instead isn't misalignment something that all AIs have by default?

My current view is that power-seeking misalignment will probably cause existential catastrophe, that persuasion tools happen first and have a >20% chance of destroying our ability to solve that problem, and that there are various philosophical and societal problems that could (>20%) get us even if we solve power-seeking misalignment. Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

"Does this mean I agree or disagree with 'our current picture of the risks is incomplete?'"

As mentioned, this phrase should probably be replaced by "a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment". There isn't supposed to be a binary cutoff for "significant portion"; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such as AI persuasion (as well as other potential risks that haven't been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.

Nothing to apologize for, it was reasonably clear, I'm just trying to learn more about what you believe and why. This has been helpful, thanks!

I totally agree that in fast takeoff scenarios we are less likely to spot those things until it's too late. I guess I agree that truthful LM work is less likely to scale gracefully to AGI in fast takeoff scenarios... so I guess I agree with your overall point... I just notice I feel a bit confused and muddled about it, is all. I can imagine plausible slow-takeoff scenarios in which truthful LM work doesn't scale gracefully, and plausible fast-takeoff scenarios in which it does. At least, I think I can. The former scenario would be something like: It turns out the techniques we develop for making dumb AIs truthful stop working once the AIs get smart, for similar reasons that techniques we use to make small children be honest (or to put it more vividly, believe in Santa) stop working once they grow up. The latter scenario would be something like: Actually that's not the case, the techniques work all the way up past human level intelligence, and "fast takeoff" in practice means "throttled takeoff" where the leading AI project knows they have a few-month lead over everyone else and is using those months to do some sort of iterated distillation and amplification, in which it's crucial that the early stages be truthful and that the techniques scale to stage N overseeing stage N+1.

AI ALIGNMENT FORUM