James Chua

https://jameschua.net/about/


Do you have a sense of what I, as a researcher, could do?

I sense that having users and companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether such users represent only 1% of the user base, in which case big labs just won't care. Maybe we need to educate more users about this, e.g. by reaching out to people who tweet about how best to use LLMs and asking them to highlight it?

we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking

 

Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.

Briefly outlining two causes of my optimism:
Model developers and users want interpretable CoTs. Here we already see a win: OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle is that other model developers may not care. It may be easier to get them on board if faithfulness is a property that users want. We know users can be dumb and may prefer nice-sounding CoTs. But given the choice between "nice-sounding but false" and "bad-sounding but true", it seems possible that users' companies would, in principle, prefer true reasoning over false reasoning, especially because faithful reasoning makes it easier to spot issues when working with LLMs. E.g. users may like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands their instructions.

Natural language is a pretty good local optimum to write the CoT in. Most of the pretraining data is in natural language. You can also learn "offline" from other models by training on their CoTs; to do so, you need a common medium, which here happens to be natural language. We know that Llama models mostly work in English internally. We also know that models are bad at multi-hop reasoning in a single forward pass. So there is an incentive against translating from "English -> hidden reasoning" within a forward pass.
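For concreteness, here is a minimal sketch of the "offline" learning I mean: supervised fine-tuning a student model on CoT traces collected from a different teacher model. The model name, the trace examples, and the formatting below are all hypothetical stand-ins; the point is just that plain natural-language text is the shared medium, so no access to the teacher's internals is needed.

```python
# Minimal sketch (hypothetical data and model names): a student model learns
# "offline" from a teacher's reasoning by supervised fine-tuning on the
# teacher's natural-language CoT traces. Plain text is the shared medium,
# so nothing about the teacher's internals is required.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical traces collected from some teacher reasoning model.
teacher_traces = [
    {
        "question": "What is 17 * 24?",
        "teacher_cot": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
        "answer": "408",
    },
    # ... more collected traces ...
]

student_name = "gpt2"  # stand-in; in practice a stronger open base model
tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)

def to_training_text(ex):
    # Serialize question, teacher CoT, and answer as one plain-text sequence.
    text = (
        f"Question: {ex['question']}\n"
        f"Reasoning: {ex['teacher_cot']}\n"
        f"Answer: {ex['answer']}{tokenizer.eos_token}"
    )
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(teacher_traces).map(
    to_training_text, remove_columns=["question", "teacher_cot", "answer"]
)

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="cot-distilled", num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False -> standard causal-LM objective; labels are the input tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

If models instead reasoned in some model-specific latent code, this kind of cheap cross-model transfer would break, which is part of why natural language looks like a stable local optimum to me.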

 

Also, credit to you for having pointed out since 2023 that we should not optimize for nice-sounding CoTs.