This is a linkpost for https://arxiv.org/abs/2211.06738

(I did not have anything to do with this paper and these are just my own takes.)

The Alignment Research Center recently published their second report, "Formalizing the presumption of independence". While it's not explicitly about AI alignment, it's probably still interesting for some people here.

Summary

The paper is about "heuristic arguments". These are similar to proofs, except that their conclusions are not guaranteed to be correct and can be overturned by counterarguments. Mathematicians often use these kinds of arguments, but unlike proofs, heuristic arguments haven't been formalized. The paper mainly describes the open problem of finding a good formalization of them. The authors do describe one attempt, "cumulant propagation", in Appendix D, but point out that it can behave pathologically.

So what's the "presumption of independence" from the title? Lots of heuristic arguments work by assuming that some quantities are independent to simplify things, and that's what the paper focuses on. Such an argument can be overturned by showing that there’s actually some correlation we initially ignored, which should then lead to a more sophisticated heuristic argument with a potentially different conclusion.
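
To make this concrete, here is a toy sketch of my own (it is not taken from the paper, and the helper names `frequency`, `naive_estimate`, `true_frequency` and `div` are made up for illustration). It estimates how often an integer satisfies two divisibility conditions at once by presuming the conditions are independent, then compares that with the exact count. For divisibility by 2 and 3 the presumption happens to give the right answer; for 2 and 4 it does not, and the "counterargument" is simply that every multiple of 4 is already a multiple of 2.

```python
from fractions import Fraction


def frequency(predicate, n_max):
    """Exact fraction of integers in 1..n_max that satisfy the predicate."""
    hits = sum(1 for n in range(1, n_max + 1) if predicate(n))
    return Fraction(hits, n_max)


def naive_estimate(p, q, n_max):
    """Presume independence: P(p and q) is estimated as P(p) * P(q)."""
    return frequency(p, n_max) * frequency(q, n_max)


def true_frequency(p, q, n_max):
    """Exact fraction of integers in 1..n_max satisfying both predicates."""
    return frequency(lambda n: p(n) and q(n), n_max)


def div(k):
    """Hypothetical helper: predicate for 'n is divisible by k'."""
    return lambda n: n % k == 0


N = 1200  # arbitrary range, chosen so that all the fractions come out exact

# Divisibility by 2 and by 3: the independence estimate matches the truth.
independent_case = (naive_estimate(div(2), div(3), N),
                    true_frequency(div(2), div(3), N))

# Divisibility by 2 and by 4: the estimate is off, because the two
# events are correlated (divisible by 4 implies divisible by 2).
correlated_case = (naive_estimate(div(2), div(4), N),
                   true_frequency(div(2), div(4), N))

print("2 and 3:", independent_case)
print("2 and 4:", correlated_case)
```

In the second case, noticing the implication is what would let a more sophisticated argument replace the naive product with the correct value. This only illustrates the general pattern of an estimate being revised once a correlation is pointed out; it says nothing about how the paper's proposed formalizations actually work.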

What does this have to do with alignment?

The paper mentions alignment only very briefly (in Appendix F); a more detailed discussion is planned for the future. But roughly:

Avoiding catastrophic failures. Heuristic arguments can let us better estimate the probability of rare failures, or failures which occur only on novel distributions where we cannot easily draw samples. This can be used during validation to estimate risk, or potentially during training to further reduce risk.

Eliciting latent knowledge. Heuristic arguments may let us see “why” a model makes its predictions. We could potentially use them to distinguish cases where similar behaviors are produced by very different mechanisms—for example distinguishing cases where a model predicts that a smiling human face will show up on camera because it predicts there will actually be a smiling human in the room, from cases where it makes the same prediction because it predicts that the camera will be tampered with. [...]

Neither of these applications is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal. [...]

Heuristic arguments can be seen as somewhere between interpretability and formal verification: unlike interpretability, heuristic arguments are meant to be machine-checkable and don't have to be human-understandable. But unlike formal proofs, they don't require perfect certainty and might be much easier to find.

Readers here might also be reminded of Logical Induction. This paper is trying to do something somewhat different though:

[Approaches to logical uncertainty] have primarily focused on establishing coherence conditions and on capturing inductive reasoning, i.e. ensuring that a reasoner eventually successfully predicts φ(n) given observations of φ(1), φ(2), . . . φ(n − 1). These systems would not automatically recognize intuitively valid heuristic arguments [...], although they would eventually learn to trust these arguments after observing them producing good predictions in practice.

Indeed, we can view ourselves as reasoners in exactly this situation, trying to understand and formalize a type of reasoning that appears to often make good predictions in practice. Formalizations of inductive reasoning may help clarify the standards we should use for evaluating a proposed heuristic estimator, but do not constitute a good heuristic estimator themselves.

So should you read the paper?

Given it's a 60-page report (though most of that's appendices) with basically no explicit discussion of alignment, I don't think this is a "must-read" for everyone. For example, if you haven't read the ELK report, I would strongly recommend that over this new paper.

On the other hand, if you work on something related, such as formal verification, ELK, or conceptual interpretability research, I think it makes a lot of sense to at least look at the main paper and Appendix F (16 pages and quite readable).

Personally, I also think this is just really interesting independently of alignment. Appendices B and C were my favorite parts from that perspective (though also the most speculative ones).
