Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

It is interesting to note how views on this topic have shifted with the rise of outcome-based RL applied to LLMs. A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL, since it incentivizes choosing actions for reasons that humans endorse. See for example Anthropic's Core Views On AI Safety:

Learning Processes Rather than Achieving Outcomes
One way to go about learning a new task is via trial and error – if you know what the desired final outcome looks like, you can just keep trying new strategies until you succeed. We refer to this as “outcome-oriented learning”. In outcome-oriented learning, the agent’s strategy is determined entirely by the desired outcome and the agent will (ideally) converge on some low-cost strategy that lets it achieve this.
Often, a better way to learn is to have an expert coach you on the processes they follow to achieve success. During practice rounds, your success may not even matter that much, if instead you can focus on improving your methods. As you improve, you might shift to a more collaborative process, where you consult with your coach to check if new strategies might work even better for you. We refer to this as “process-oriented learning”. In process-oriented learning, the goal is not to achieve the final outcome but to master individual processes that can then be used to achieve that outcome.
At least on a conceptual level, many of the concerns about the safety of advanced AI systems are addressed by training these systems in a process-oriented manner. In particular, in this paradigm:
Human experts will continue to understand the individual steps AI systems follow because in order for these processes to be encouraged, they will have to be justified to humans.
AI systems will not be rewarded for achieving success in inscrutable or pernicious ways because they will be rewarded only based on the efficacy and comprehensibility of their processes.
AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, since humans or their proxies will provide negative feedback for individual acquisitive processes during the training process.
At Anthropic we strongly endorse simple solutions, and limiting AI training to process-oriented learning might be the simplest way to ameliorate a host of issues with advanced AI systems. We are also excited to identify and address the limitations of process-oriented learning, and to understand when safety problems arise if we train with mixtures of process and outcome-based learning. We currently believe process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities.

Or Solving math word problems with process- and outcome-based feedback (DeepMind, 2022):

Second, process-based approaches may facilitate human understanding because they select for reasoning steps that humans understand. By contrast, outcome-based optimization may find hard-to-understand strategies, and result in less understandable systems, if these strategies are the easiest way to achieve highly-rated outcomes. For example in GSM8K, when starting from SFT, adding Final-Answer RL decreases final-answer error, but increases (though not significantly) trace error.
[...]
In contrast, consider training from process-based feedback, using user evaluations of individual
actions, rather than overall satisfaction ratings. While this does not directly prevent actions which
influence future user preferences, these future changes would not affect rewards for the corresponding
actions, and so would not be optimized for by process-based feedback. We refer to Kumar et al. (2020)
and Uesato et al. (2020) for a formal presentation of this argument. Their decoupling algorithms
present a particularly pure version of process-based feedback, which prevent the feedback from
depending directly on outcomes.

Or Let's Verify Step by Step (OpenAI, 2023):

Process supervision has several advantages over outcome supervision related to AI alignment. Process supervision is more likely to produce interpretable reasoning, since it encourages models to follow a process endorsed by humans. Process supervision is also inherently safer: it directly rewards an aligned chain-of-thought rather than relying on outcomes as a proxy for aligned behavior (Stuhlmüller and Byun, 2022). In contrast, outcome supervision is harder to scrutinize, and the preferences conveyed are less precise. In the worst case, the use of outcomes as an imperfect proxy could lead to models that become misaligned after learning to exploit the reward signal (Uesato et al., 2022; Cotra, 2022; Everitt et al., 2017).

In some cases, safer methods for AI systems can lead to reduced performance (Ouyang et al., 2022; Askell et al., 2021), a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results show that process supervision in fact incurs a negative alignment tax. This could lead to increased adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains.

It seems worthwhile to reflect on why this perspective has gone out of fashion:

The most obvious reason is the success of outcome-based RL, which seems to be outperforming processed-based RL. Advocating for processed-based RL no longer makes much sense when it is uncompetitive.
Outcome-based RL also isn't (yet) producing the kind of opaque reasoning that proponents of process-based RL may have been worried about. See for example this paper for a good analysis of the extent of current chain-of-thought faithfulness.
Outcome-based RL is leading to plenty of reward hacking, but this is (currently) fairly transparent from chain of thought, as long as this isn't optimized against. See for example the analysis in this paper.

Some tentative takeaways:

There is strong pressure to walk over safety-motivated lines in the sand if (a) doing so is important for capabilities and/or (b) doing so doesn't pose a serious, immediate danger. People should account for this when deciding what future lines in the sand to rally behind. (I don't think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.)
In particular, I wouldn't be optimistic about attempting to rally behind a line in the sand like "don't optimize against the chain of thought", since I'd expect people to blow past this as quickly about as they blew past "don't optimize for outcomes" if and when it becomes substantially useful. N.B. I thought the paper did a good job of avoiding this pitfall, focusing instead on incorporating the potential safety costs into decision-making.
It can be hard to predict how dominant training techniques will evolve, and we should be wary of anchoring too hard on properties of models that are contingent on them. I would not be surprised if the "externalized reasoning property" (especially "By default, humans can understand this chain of thought") no longer holds in a few years, even if capabilities advance relatively slowly (indeed, further scaling of outcome-based RL may threaten it). N.B. I still think the advice in the paper makes sense for now, and could end up mattering a lot – we should just expect to have to revise it.
More generally, people designing "if-then commitments" should be accounting for how the state of the field might change, perhaps by incorporating legitimate ways for commitments to be carefully modified. This option value would of course trade off against the force of the commitment.

[-]Rohin Shah4mo124

A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL [...]
I don't think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.

I think many people at OpenAI were pretty explicitly keen on avoiding (what you are calling) process-based RL in favor of (what you are calling) outcome-based RL for safety reasons, specifically to avoid putting optimization pressure on the chain of thought. E.g. I argued with @Daniel Kokotajlo about this, I forget whether before or after he left OpenAI.

There were maybe like 5-10 people at GDM who were keen on process-based RL for safety reasons, out of thousands of employees.

I don't know what was happening at Anthropic, though I'd be surprised to learn that this was central to their thinking.

Overall I feel like it's not correct to say that there was a line of some kind, except under really trivial / vacuous interpretations of that. At best it might apply to Anthropic (since it was in the Core Views post).

Separately, I am still personally keen on process-based RL and don't think it's irrelevant -- and indeed we recently published MONA which imo is the most direct experimental paper on the safety idea behind process-based RL.

In general there is a spectrum between process- and outcome-based RL, and I don't think I would have ever said that we shouldn't do outcome-based RL on short-horizon tasks with verifiable rewards; I care much more about the distinction between the two for long-horizon fuzzy tasks.

I do agree that there are some signs that people will continue with outcome-based RL anyway, all the way into long-horizon fuzzy tasks. I don't think this is a settled question -- reasoning models have only been around a year, things can change quite quickly.

None of this is to disagree with your takeaways, I roughly agree with all of them (maybe I'd have some quibbles about #2).

[-]Daniel Kokotajlo4mo52

My memory agrees with Rohin. Some safety people were trying to hold the line but most weren't, I don't think it reached as much consensus as this CoT monitorability just did.

I think it's good to defend lines even if they get blown past, unless you have a better strategy this trades off against. "Defense in depth" "Fighting retreat" etc.

(Unimportant: My position has basically been that process-based reinforcement and outcome-based reinforcement are interesting to explore separately but that mixing them together would be bad, and also, process-based reinforcement will not be competitive capabilities-wise.)

[-]Daniel Kokotajlo5mo78

I'm very happy to see this happen. I think that we're in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm into e.g. neuralese/recurrence/vector memory, etc. or simply start training/optimizing the CoT's to look nice. (This is an important sub-plot in AI 2027) Right now we've just created common knowledge of the dangers of doing that, which will hopefully prevent that feared default outcome from occurring, or at least delay it for a while.

All this does is create common knowledge, it doesn't commit anyone to anything, but it's a start.

[-]Hopenope5mo20

Is optimizing CoT to look nice a big concern? There are other ways to show a nice CoT without optimizing for it. The frontrunners also have some incentives to not show the real CoT. Additionally, there is a good chance that people prefer a nice structured summary of CoT by a small LLM when reasonings become very long and convoluted.

[-]Daniel Kokotajlo5mo5-2

Yes, it's a big concern. For example, over the next year, there might be more and more examples accumulating of AIs scheming or otherwise thinking bad thoughts in the CoT, some fraction of which will lead to bad behavior such as writing hacky code or lying to users. Companies will notice that if you just have a bad thoughts classifier look over the CoT during training and dole out negative reinforcement when it's detected, the rates of bad behavior in deployment will go down. So, they'll be under some economic and PR pressure to do that. But this way leads to disaster in the long run for reasons described here: https://www.planned-obsolescence.org/july-2022-training-game-report/ and also here: https://openai.com/index/chain-of-thought-monitoring/

[-]Thomas Kwa4mo30

Monitoring seems crucial for the next safety case. My guess is that in couple of years we'll lose capability-based safety cases, and need to rely on a good understanding of the question "how difficult are the tasks frontier agents can do covertly, without the human or AI monitor detecting its reasoning?".

Having said this, it may be that in a year or two, we'll be able to translate neuralese or have a good idea of how much one can train against a monitor before CoT stops being transparent, so the current paradigm of faithful CoT is certainly not the only hope for monitoring.

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

80

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

80

Learning Processes Rather than Achieving Outcomes