Thank you for the post! I have also been (very pleasantly!) surprised by how aligned current models seem to be.
I'm curious whether you expect 'alignment by default' to continue to hold in a regime where continuous learning is solved and models are constantly updating themselves, or being updated by what they encounter in the world.
The fact that Chain of Thought hasn't produced evidence of scheming or instrumental convergence does seem like evidence against these risks, but it seems like quite a weak proxy to me for what to expect from 'true agents'. CoT doesn't run long enough, or have the kind of flexibility I'd expect from an agent actually learning over long time horizons, which would give it the affordance to contemplate the many ways it could accomplish its goals.
And, while this is just speculation, I imagine that the kind of training procedures we're using now to instill alignment won't be possible with continuous learning, or that we'll have to pay a heavy alignment tax to apply them to these agents. (Note: Jan's recent tweet gives his impression that it is quite hard to align large models and that alignment doesn't fall out for free from size.)