I have a little bit of skepticism about the idea of using CoT reasoning for interpretability. If you really look into what CoT is doing, it's not actually doing much that a regular model doesn't already do; it's just optimized for a particular prompt that basically says "Show me your reasoning." The problem is, we still have to trust that it's being truthful in its reasoning. It still isn't accounting for those hidden states, the 'subconscious', to use a somewhat flawed analogy.
We are still relying on trusting an entity that we don't know if we can actually trust...
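To make that concrete: here's a minimal sketch, assuming the OpenAI Python client and a placeholder model name, of how little separates a "CoT" query from an ordinary one. The only difference is an extra instruction appended to the prompt; nothing in the API forces the emitted reasoning to be a faithful trace of the computation that produced the answer.

```python
# Minimal sketch: "CoT" as nothing more than a prompt asking for reasoning.
# Assumes the OpenAI Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Plain query: whatever computation produces the answer stays hidden.
plain = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute any chat model
    messages=[{"role": "user", "content": question}],
)

# "CoT" query: the same model, plus "show me your reasoning".
# The visible reasoning is just more sampled text, not a readout of
# the hidden states that actually drove the answer.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + "\nShow me your reasoning step by step before answering.",
    }],
)

print(plain.choices[0].message.content)
print(cot.choices[0].message.content)
```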
Indeed, I think the faithfulness properties that we currently have are not as strong as they should be ideally, and moreover they will probably degrade as models get more capable! I don't expect them to still be there when we hit ASI, that's for sure. But what we have now is a lot better than nothing & it's a great testbed for alignment research! And some of that research can be devoted to strengthening / adding more faithfulness properties!