All of James Chua's Comments + Replies

Do you have a sense of what I, as a researcher, could do?

I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether such users represent only 1% of the user base, in which case big labs just won't care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?

Daniel Kokotajlo
Since '23 my answer to that question would have been "well, the first step is for researchers like you to produce [basically exactly the paper OpenAI just produced]." So that's done. Nice. There are lots of follow-up experiments that can be done. I don't think trying to shift the market/consumers as a whole is very tractable. But talking to your friends at the companies and getting their buy-in seems valuable.

we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking


Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.

Briefly outlining two causes of my optimism:
1. Model developers and users want interpretable CoTs. Here we see a win: OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle is that other model developers might not care. Maybe it is easy to get others on board with this if ... (read more)

But given the choice between "nice-sounding but false" and "bad-sounding but true", it seems possible that users' companies would, in principle, prefer true reasoning over false reasoning, especially because true reasoning makes it easier to spot issues when working with LLMs. E.g., users may like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands their instructions.

This definitely aligns with my own experience so far.

On the day Claude 3.7 Sonnet was announced, I happened to be in the middle of a frustrating struggle with o3... (read more)