User Comment Replies — AI Alignment Forum

Why imperfect adversarial robustness doesn't doom AI control

It seems like that assumption runs throughout the post, though, e.g. 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said' and 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

4 Buck Shlegeris 4mo

Those sentences are about the defer-to-resample setting, where I think it's actually reasonable to expect that the model won't see the resampled actions. I'm specifically talking about not showing it actions that were resampled. You're totally right that it will need other information about its actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information). Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
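To make the point about hidden resamples concrete, here is a minimal sketch of one step of a defer-to-resample loop in which discarded samples go only to a deployer-side log and never into the history the untrusted model sees. Everything here is a hypothetical illustration rather than the post's actual protocol: the names `Episode`, `defer_to_resample_step`, `suspicion_score`, the threshold, the resample budget, and the fallback to a trusted policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Episode:
    # History that is fed back to the untrusted model on later queries.
    visible_history: List[str] = field(default_factory=list)
    # Deployer-only record, including samples the model never gets to see.
    audit_log: List[str] = field(default_factory=list)


def defer_to_resample_step(
    episode: Episode,
    untrusted_policy: Callable[[List[str]], str],
    trusted_policy: Callable[[List[str]], str],
    suspicion_score: Callable[[str], float],
    threshold: float = 0.5,      # assumed value, for illustration only
    max_resamples: int = 2,      # assumed value, for illustration only
) -> str:
    """Run one step; resampled (discarded) actions stay out of visible_history."""
    for _ in range(1 + max_resamples):
        action = untrusted_policy(episode.visible_history)
        if suspicion_score(action) < threshold:
            episode.audit_log.append(f"accepted: {action}")
            episode.visible_history.append(action)
            return action
        # Suspicious sample: log it for the deployer, hide it from the model.
        episode.audit_log.append(f"discarded: {action}")

    # Every sample looked suspicious, so defer to the trusted policy.
    fallback = trusted_policy(episode.visible_history)
    episode.audit_log.append(f"deferred: {fallback}")
    episode.visible_history.append(fallback)
    return fallback
```

In this sketch the model's later context contains only the action that was actually executed, so it gets no direct record of what its discarded attempts were. Other information about its past actions and interactions can still be included in `visible_history`.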

Why imperfect adversarial robustness doesn't doom AI control

Egg Syntax 4mo* 11

"the untrusted model is stateless between queries and only sees the command history and system state."

What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.

If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.

4 Buck Shlegeris 4mo

Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second-guess it; I agree it's wrong.

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Egg Syntax 8mo 10

Minor suggestion: link to more info on LLC (or at least write it out as "local learning coefficient") the first time it's mentioned -- I see that you do that further down, but it's confusing on first read.

All of eggsyntax's Comments + Replies