Measurement tampering detection as a special case of weak-to-strong generalization

Fabien Roger; Buck

My take is that detecting bad behavior at test time is <5% of the problem, and >95% of the problem is making an AI that doesn't have bad behavior at test time.

If you want to detect and avoid bad behavior during training, Goodhart's law becomes a problem, and you actually need to learn your standards for "bad behavior" as you go rather than keeping them static. I think this obviates much of the specificity that you say gives MTD its advantage.

Fabien Roger (2y)

The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousands of actions the AI took and all the consequences they had on the codebase/the bank account/the user reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process.
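As a rough illustration of the reward shaping in that scheme, here is a minimal Python sketch. Every name in it (`Episode`, `shaped_reward`, `human_rating`, `tamper_score`, `threshold`, `penalty`) is hypothetical and chosen for illustration only. In particular, `tamper_score` stands in for whatever MTD technique is used, which in practice would read the model's internals rather than the transcript alone, and the fixed penalty is just one assumed way of "lowering reward".

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Episode:
    """One rollout: the AI's actions plus the report shown to human raters."""
    actions: Sequence[str]
    report: str


def shaped_reward(
    episode: Episode,
    human_rating: Callable[[str], float],
    tamper_score: Callable[[Episode], float],
    threshold: float = 0.5,
    penalty: float = 1.0,
) -> float:
    """Return the human rating, lowered when the detector flags tampering.

    `human_rating` maps the report to a scalar rating (assumed interface).
    `tamper_score` is a placeholder for an MTD detector (assumed interface).
    """
    rating = human_rating(episode.report)
    if tamper_score(episode) > threshold:
        # The detector believes the high rating came from tampering,
        # so the reward used for training is reduced.
        return rating - penalty
    return rating
```

The detector here is fixed ahead of time and only the human rating varies from episode to episode, which mirrors the "no online adjustment" hope described in the next paragraph.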

In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internals of the model is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in a gray area between those, but I think that given enough budget to analyze the large number of measurements, the raters would actually give non-catastrophic rewards if the measurements still described what they usually describe.

See this section of the MTD post for more details.

Is your concern that the human rater reward signal is too expensive, that it is gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internals-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or something else entirely? Or did I address your objection?

Charlie Steiner (2y)

Agree with simon that if the AI gets rich data about what counts as "measurement tampering," then you're sort of pushing around the loss basin, but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most of the effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, but the AI would also be incentivized to avoid developing such a distinction.

I agree that this isn't actually that big a problem in modest settings that humans can effectively oversee, because tampering isn't even that advantageous when there's a lot of expensive human oversight that has a good chance of catching anything detrimental. There still might be an incentive to find extreme plans like "take over the world to tamper with my reward signal," but those indeed seem distinct enough to penalize, assuming that you don't ever want your AI to take over the world for any reason.

But I expect we'll sometimes want to build AI to do things that humans can't effectively oversee.

ryan_greenblatt (2y)

Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).
