Very interesting results, and something I was, unfortunately, expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
"One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems."
Does that mean they believe that, past a certain point, we would lose control over AI? I am new to this field, but doesn't that belief spell doom for humanity?
Have there been any proposals in the AI control literature for detecting alignment-faking LLMs?