ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of "explanation" instead of one written in natural language.

To help elucidate this connection, ARC has been supporting work on Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross, Rajashree Agrawal, Lawrence Chan and others, which we were excited to see released along with this post. While we ultimately think that provable guarantees for large neural networks are unworkable as a long-term goal, we think that this work serves as a useful springboard towards alternatives.

In this post, we will:

  • Summarize ARC's takeaways from this work and the problems we see with provable guarantees
  • Explain ARC's notion of a heuristic explanation and how it is intended to overcome these problems
  • Describe with the help of a worked example how the quality of a heuristic explanation can be quantified, using a process we have been calling surprise accounting

We are also sharing a draft by Gabriel Wu (currently visiting ARC) describing a heuristic explanation for the same model that appears in the above paper:

max_of_k Heuristic Estimator

Thanks to Stephanie He for help with the diagrams in this post. Thanks to Eric Neyman, Erik Jenner, Gabriel Wu, Holly Mandel, Jason Gross, Mark Xu, and Mike Winer for comments.

Formal verification for neural networks

In Compact Proofs of Model Performance via Mechanistic Interpretability, the authors train a small transformer on an algorithmic task to high accuracy, and then construct several different formal proofs of lower bounds on the network's accuracy. Without going into the details, the most interesting takeaway from ARC's perspective is the following picture:

Figure adapted from this post

In the top right of the plot is the brute-force proof, which simply checks every possible input to the network. This gives the tightest possible bound, but is very long. Meanwhile, in the bottom left is the trivial proof, which simply states that the network is at least 0% accurate. This is very short, but gives the loosest possible bound. In between these two extremes, along the orange Pareto frontier, there are proofs that exploit more structure in the network, leading to tighter bounds for a given proof length, or put another way, shorter proofs for a given bound tightness.

It is exciting to see a clear demonstration that shorter proofs better explain why the neural network has high accuracy, paralleling a common mathematical intuition that shorter proofs offer more insight. One might therefore hope that if we understood the internals of a neural network well enough, then we would be able to provide provable guarantees for very complex behaviors, even when brute-force approaches are infeasible. However, we think that such a hope is not realistic for large neural networks, because the notion of proof is too strict.

The basic problem with provable guarantees is that they must account for every possible way in which different parts of the network interact with one another, even when those interactions are incidental to the network's behavior. These interactions manifest as error terms, which the proof must provide a worst-case bound for, leading to a looser bound overall. The above picture provides a good demonstration of this: moving towards the left of the plot, the best bound gets looser and looser.

More generally, it is hard to prove a lack of structure – another common mathematical intuition. There are many examples of formal phenomena that appear to arise from a lack of structure, but for which proving this is considered out-of-reach: in mathematics, the normality of π, or the Collatz conjecture, which has natural generalizations that are known to be undecidable; in computer science, the behavior of certain cryptographic hash functions; or in physics, the diffusion of a gas across a room in some formal model.

Heuristic explanations

Since proving a lack of structure is a key obstacle to formal verification of neural networks, ARC has been pursuing an alternative approach. Instead of attempting to prove a lack of structure, we instead assume by default a lack of structure, and produce a best guess given any structure that has been pointed out. We call such an estimate a heuristic explanation (or sometimes a heuristic argument).

Returning to the above examples of unproven statements, in each case there is an informal, probabilistic argument that explains the phenomenon. For example, to estimate the density of 0s in the decimal expansion of π, we treat the digits as uniformly random, giving a best guess of 1/10. This is the sort of reasoning we wish to permit in a heuristic explanation.
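As a quick illustration, this heuristic is easy to check empirically on as many digits of π as we care to compute. Here is a minimal sketch using the mpmath library (the choice of 10,000 digits is arbitrary):

```python
from mpmath import mp

mp.dps = 10_000              # work with 10,000 decimal digits of precision
digits = str(mp.pi)[2:]      # the decimal digits after the leading "3."
print(digits.count("0") / len(digits))  # heuristic best guess: 1/10
```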

Unlike a proof, a heuristic explanation is defeasible, meaning that its conclusion is open to revision once further structure has been pointed out. Our hope is to reach a robust conclusion by having searched thoroughly enough for structure, rather than worst-casing over all possible structure as in a proof. Some ideas about how to potentially formalize this hope are described in the paper Formalizing the presumption of independence.

Informal heuristic explanations are already commonplace in mechanistic interpretability, such as when analyzing circuits. For example, consider the following circuit between an "early curve detector" neuron and a "late curve detector" neuron, consisting of the weights in the 5x5 convolution between the two neurons:

Figure copied from Olah et al.

Imagine attempting to use this circuit to provide some sort of formal guarantee about the behavior of the late curve detector, given the behavior of the early curve detector. Analyzing this circuit alone would be insufficient, because there could be confounding interactions with other neurons. For example, we would need to prove that there is no "anti-early curve detector" that cancels out the "early curve detector". Nevertheless, it is considered reasonable to assume by default that there is no "anti-early curve detector" if no such thing has been pointed out.

To help validate these ideas, ARC has produced a heuristic explanation of the max-of-k model studied in the Compact Proofs paper. The explanation very roughly mirrors the analysis in the "cubic proof" from that paper, but focuses on approximating error terms rather than bounding them. Consequently, the explanation ends up with estimates that are much closer to the true accuracy of the model than the lower bounds given by the proofs:

max_of_k Heuristic Estimator

Surprise accounting

We are interested in proofs and heuristic explanations not only to provide assurances about properties of networks, but also to help explain why those properties hold. We have seen that for proofs, more insight is offered by shorter proofs with tighter bounds. Correspondingly, we can quantify the amount of understanding encapsulated by a heuristic explanation using a method we have been calling surprise accounting.

The intuition behind surprise accounting is to ask: how surprising is the phenomenon (in an information-theoretic sense), now that we have access to the heuristic explanation? The total surprise decomposes into two pieces:

  1. The surprise of the explanation itself: how many times did the explanation just "work out" for no apparent reason? Or put another way, how many free parameters did the explanation have?
  2. The surprise of the phenomenon given the explanation: how likely was the phenomenon to happen by chance, given the information provided by the explanation?

These two pieces are analogous to proof length and bound tightness respectively, but they have the advantage of being measured in the same units, namely bits. The total of the two pieces is also very similar to the Bayesian information criterion (BIC) for probabilistic models, which has a similar decomposition.
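For reference, for a probabilistic model with k free parameters fit to n data points with maximized likelihood L̂, the BIC is k ln(n) − 2 ln(L̂), with lower values preferred. Up to an overall constant factor, the k ln(n) term plays the role of the surprise of the explanation (a fixed cost per free parameter), and the −2 ln(L̂) term plays the role of the surprise of the data given the explanation.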

From the vantage point of mechanistic interpretability, surprise accounting offers an answer to the question, "What counts as an explanation?" For example, when a neural network memorizes facts, is a lookup table a valid explanation for its behavior? From the perspective of surprise accounting, an explanation for the behavior that makes use of how the network generalizes may have lower total surprise. But if there is no explanation with lower total surprise, then the lookup table explanation is good enough.

To make this idea more concrete, we will go through a worked example, making use of the following basic formula:

Total surprise = surprise of explanation + surprise given explanation
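To keep the arithmetic in the worked example easy to follow along with, here is a minimal sketch of this formula in code (the function name and interface are mine, purely for illustration). In each explanation below, the phenomenon will be that a circuit outputs TRUE on every one of its inputs, and the explanation will assign each input some independent probability p of coming out TRUE.

```python
import math

def total_surprise(explanation_bits, num_cases, p_per_case):
    """Total surprise (in bits) of a phenomenon that holds in num_cases
    independent cases, when the explanation itself costs explanation_bits
    bits and assigns each case probability p_per_case of working out."""
    surprise_given_explanation = num_cases * math.log2(1 / p_per_case)
    return explanation_bits + surprise_given_explanation
```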

Worked example: Boolean circuit

For our worked example of surprise accounting, we will use the following Boolean circuit consisting of AND, OR and NOT gates:

[Figure: the Boolean circuit of AND, OR and NOT gates]

The circuit has a tree structure, and reuses the same 8 inputs in both the top and bottom halves. Other than this basic structure, though, it looks pretty random. However, if we try a few different inputs, we notice something surprising: the network seems to always output TRUE. Why is this?

Brute-force

Without any explanation at all, there are 2⁸ = 256 possible inputs, and the network could have output TRUE or FALSE for each of these. So the surprise of the (empty) explanation is 0 bits, and the surprise given the explanation is 256 bits, one bit for each input. This is exactly the number of cases that the brute-force proof would need to check.

No explanation: 0 + 256 = 256 bits

Final OR gate

A very simple explanation we could give is to notice that the final gate is an OR rather than an AND gate. Since an OR gate outputs TRUE ¾ of the time on random inputs, it is significantly less surprising that the network always outputs TRUE.

Quantitatively, the surprise of the explanation is 1 bit (for noticing that the final gate was an OR rather than an AND), and the surprise given the explanation is 256 log2(1/0.75) bits.

Final OR gate explanation: 1 + 256 log2(1/0.75) ≈ 107.25 bits

This is a substantial improvement over having no explanation at all. Note, however, that this explanation is not a proof.

Pattern of NOT gates

We can improve upon this by noticing patterns in the structure of the network. One such pattern is how the NOT gates are arranged. For each input to a gate in the top half of the network, there is a corresponding input to a gate in the bottom half of the network. For the leftmost layer, the presence of a NOT gate is always different between the two corresponding inputs, and for every subsequent layer, it is always the same:

[Figure: the circuit with corresponding pairs of NOT gates highlighted]

This means each gate input is anticorrelated with the corresponding input in the other half of the network: for the leftmost layer, the correlation is −1; for the next layer, the correlation is −0.5; for the layer after that, −0.25; and the two inputs to the final OR gate have a correlation of −0.125.[1] Implicitly, these correlations assume that the AND and OR gates were chosen independently at random, since we have not yet noticed any particular structure to them.

The surprise of this explanation is 15 bits for the 15 pairs of inputs (7 same, 8 different, at one bit per pair), plus the 1 bit carried over from the previous explanation for the final OR gate, for a total of 16 bits. Given the explanation, the probability that the final gate outputs TRUE on a random input has increased to 0.75 + 0.25 × 0.125 = 0.78125 because of the −0.125 correlation, so the surprise given the explanation is now only 256 log2(1/0.78125) bits.

Final OR plus pattern of NOT gates explanation: 16 + 256 log2(1/0.78125) ≈ 107.17 bits

This is only a fraction of a bit better than the previous explanation, but sets us up for a more substantial improvement.
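As a sanity check on the 0.78125 figure, here is a small Monte Carlo sketch (the tree representation and all names below are my own, not taken from the post). It samples circuits consistent with everything the explanation has noticed so far: the final gate is an OR, the NOT gates follow the pattern above, and the remaining AND/OR gates are chosen independently at random. The fraction of samples that output TRUE should land close to the heuristic estimate of 0.78125.

```python
import random

def eval_half(gates, nots, x, i=0):
    """Evaluate gate i of a 7-gate complete binary tree (heap-indexed: gate 0 is
    the last gate of the half, gates 3-6 form the leftmost layer). nots[2*i] and
    nots[2*i + 1] say whether the two inputs of gate i pass through a NOT gate;
    the leaves below gates 3-6 are the 8 shared circuit inputs in x."""
    vals = []
    for k in (0, 1):
        child = 2 * i + 1 + k
        v = eval_half(gates, nots, x, child) if child < 7 else x[child - 7]
        vals.append(not v if nots[2 * i + k] else v)
    a, b = vals
    return (a and b) if gates[i] == "AND" else (a or b)

def sample_output():
    """One sample: random inputs, a random top half, and a bottom half that copies
    the top half's NOT pattern (flipped on the leftmost layer, i.e. the inputs of
    gates 3-6, identical elsewhere) but has independently random AND/OR gates."""
    x = [random.random() < 0.5 for _ in range(8)]
    gates_top = [random.choice(["AND", "OR"]) for _ in range(7)]
    nots_top = [random.random() < 0.5 for _ in range(14)]
    gates_bot = [random.choice(["AND", "OR"]) for _ in range(7)]
    nots_bot = [nots_top[j] if j < 6 else not nots_top[j] for j in range(14)]
    return eval_half(gates_top, nots_top, x) or eval_half(gates_bot, nots_bot, x)

trials = 100_000
freq = sum(sample_output() for _ in range(trials)) / trials
print(freq)  # heuristic estimate: 0.75 + 0.25 * 0.125 = 0.78125
```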

Pattern of AND and OR gates

If we look at the network again, we see even more structure: every time there is an AND gate in the top half of the network, there is an OR gate in the bottom half of the network, and vice versa:

[Figure: the circuit with corresponding pairs of AND and OR gates highlighted]

Hence each gate input is actually perfectly anticorrelated with the corresponding input in the other half of the network. Put another way, the bottom half of the network is exactly the negation of the top half of the network, by De Morgan's laws. The OR of anything and its negation is always TRUE, so this completely explains why the network always outputs TRUE.

The surprise of this explanation is 16 bits carried over from the previous explanation, plus 7 bits for the 7 pairs of AND and OR gates, for a total of 23 bits. The surprise given the explanation is now 0 bits.

Pattern of all gates explanation: 23 + 0 = 23 bits
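Putting the four explanations side by side, using the helper sketched earlier (redefined here so the snippet runs on its own):

```python
import math

def total_surprise(explanation_bits, num_cases, p_per_case):
    # total surprise = surprise of explanation + surprise given explanation
    return explanation_bits + num_cases * math.log2(1 / p_per_case)

print(total_surprise(0, 256, 0.5))       # no explanation:             256.0 bits
print(total_surprise(1, 256, 0.75))      # final OR gate:            ~107.25 bits
print(total_surprise(16, 256, 0.78125))  # ...plus pattern of NOTs:  ~107.17 bits
print(total_surprise(23, 256, 1.0))      # pattern of all gates:       23.0 bits
```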

A natural question is whether there is further hidden structure that permits an explanation with even lower total surprise. From what I have said so far, perhaps there could be. But it turns out that I constructed the network as follows: first I chose the top half of the network by randomly choosing AND or OR for the 7 gates and randomly choosing either a NOT gate or no NOT gate for the 14 inputs; then I copied and negated this random network using De Morgan's laws; and finally I connected the top and bottom halves by an OR gate. So there really is no additional structure that we could exploit, unless my construction was accidentally redundant in some way or a pattern arose by fluke.
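This construction is easy to reproduce and check exhaustively. Below is a small sketch (the nested-tuple representation and helper names are my own choices, not the post's): it builds a random top half as just described, forms the bottom half by De Morgan negation (swap AND and OR everywhere, flip the NOT gates on the leftmost layer, keep the rest), and confirms that the OR of the two halves is TRUE on all 2⁸ = 256 inputs.

```python
import itertools
import random

def random_half(depth, leaf_counter):
    """A random half: a complete binary tree of AND/OR gates of the given depth,
    with a random choice of NOT gate (True/False) on each gate input. Nodes are
    ("leaf", i) or (op, (not_left, left_subtree), (not_right, right_subtree))."""
    if depth == 0:
        return ("leaf", next(leaf_counter))
    op = random.choice(["AND", "OR"])
    left = (random.random() < 0.5, random_half(depth - 1, leaf_counter))
    right = (random.random() < 0.5, random_half(depth - 1, leaf_counter))
    return (op, left, right)

def negate(node):
    """De Morgan negation: swap AND and OR, flip the NOT gates on inputs that come
    directly from leaves (the leftmost layer), and keep all other NOT gates."""
    if node[0] == "leaf":
        return node  # leaves are shared inputs; the parent's flipped NOT negates them
    op, (nl, l), (nr, r) = node
    flip_l = nl != (l[0] == "leaf")  # flip the NOT exactly when the input is a leaf
    flip_r = nr != (r[0] == "leaf")
    return ("OR" if op == "AND" else "AND", (flip_l, negate(l)), (flip_r, negate(r)))

def evaluate(node, x):
    if node[0] == "leaf":
        return x[node[1]]
    op, (nl, l), (nr, r) = node
    a = evaluate(l, x) != nl  # XOR with the NOT gate on this input
    b = evaluate(r, x) != nr
    return (a and b) if op == "AND" else (a or b)

top = random_half(3, itertools.count())  # 7 random gates, 14 random NOT choices
bottom = negate(top)                     # the De Morgan mirror image
assert all(evaluate(top, x) or evaluate(bottom, x)  # final OR gate
           for x in itertools.product([False, True], repeat=8))
print("TRUE on all 256 inputs")
```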

More generally, we can compare the total surprise of an explanation with the amount of optimization applied to produce the phenomenon. If the two match, then there is no reason to expect there to be a better explanation.

Note that in this case, the best explanation corresponded to a proof, but this need not be the case in general. For example, if the circuit implemented a cryptographic hash function, we would not expect to be able to do better than to treat the output of the hash function as random.

Discussion

ARC is currently working on designing algorithms for finding and making use of heuristic explanations for neural networks. At a high level, one way this could work is as follows:

  1. Set up a "generic" heuristic explanation with free parameters. In the worked example above, the "same" or "different" arrows are free parameters of the explanation. For a realistic neural network, free parameters might appear in an activation model such as a variational autoencoder (perhaps in a similar fashion to how sparse autoencoders are currently used in mechanistic interpretability).
  2. Optimize the parameters of this explanation using gradient descent to minimize total surprise, as measured using surprise accounting (a toy sketch of this step appears after this list). We expect to have to do this in parallel with model training in order to account for how the model itself was optimized (including any possible backdoors).
  3. Use the explanation to heuristically verify desired properties of the network, or for other downstream applications such as mechanistic anomaly detection or eliciting latent knowledge.
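To give a very rough sense of the shape of step 2, here is a toy sketch. None of the functions below are ARC's actual objectives: both surprise terms are hypothetical differentiable stand-ins, and a real explanation would not look like this.

```python
import torch

def explanation_surprise(theta):
    # Hypothetical stand-in: a description-length-style penalty on the
    # explanation's free parameters.
    return theta.pow(2).sum()

def surprise_given_explanation(theta, behaviour):
    # Hypothetical stand-in: negative log-probability that the explanation
    # assigns to some observed behaviour of the network.
    return -torch.distributions.Normal(theta.sum(), 1.0).log_prob(behaviour).sum()

theta = torch.zeros(10, requires_grad=True)  # free parameters of a "generic" explanation
behaviour = torch.randn(100)                 # placeholder for observed network behaviour
opt = torch.optim.Adam([theta], lr=1e-2)

for step in range(1000):
    opt.zero_grad()
    loss = explanation_surprise(theta) + surprise_given_explanation(theta, behaviour)
    loss.backward()
    opt.step()
```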

We have omitted many details about how and why such an approach might work, and plan to come back to some of these in future posts.

Conclusion

Heuristic explanations are a natural extension of provable guarantees that we believe have much better potential to scale to large, sophisticated neural networks. Surprise accounting offers a way to quantify the quality of a heuristic explanation, potentially enabling useful explanations to be found automatically.


  1. Roughly speaking, the reason that the anti-correlation halves with each layer is that if the subsequent AND or OR gates are different, then the anti-correlation is maintained, whereas if they are the same, then the anti-correlation is eliminated. Making this argument more precise is left as an exercise to the reader. ↩︎

Comments

I'm not sure I fully understand how you would use surprise accounting in a practical scenario. In the circuit example, you're starting out with a statement that is known to be true (by checking all inputs) -- that the circuit always outputs 1. But in most practical scenarios, you have a hypothesis (e.g., "the model will never do $bad_thing under any input") that you're not sure is true. So where do you get the statement that you'd apply the surprise accounting to? I guess you could start out with "the model didn't do $bad_thing on 1 million randomly sampled inputs", try to find an explanation of this finding with low total surprise, and then extend the explanation into a proof that applies to all inputs. Is that right?

Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for "the model does $bad_thing with probability less than 1 in a million" (which you believe based on sampling) and then see how unlikely $bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the model actually outputs TRUE on all inputs.

Another possible approach is analogous to fine-tuning. You could start by using surprise accounting to find the best explanation for "the loss of the model is L" (where L is estimated during training), which should incentivize rich explanations of the model's behavior in general. Then to estimate the probability that model does some rare $bad_thing, you could "fine-tune" your explanation using an objective that encourages it to focus on the relevant tails of the distribution. We have more ideas about estimating the probability of events that are too rare to estimate via sampling, and have been considering objectives other than surprise accounting for this. We plan to share these ideas soon.

Cool, fine-tuning sounds a bit like conditional Kolmogorov complexity -- the cost of your explanation would be K(explanation of rare thing | explanation of the loss value and general functionality)