This is a linkpost for https://docs.google.com/document/d/19VL31R_0oZUGe7aE5OD5njX2l46fXgpgL6HvTrAJVZI/edit
This work has been done in the context of SaferAI’s work on risk assessment. Equal contribution by Eli and Joel. I'm sharing this writeup in the form of a Google Doc and reproducing the summary below.
Disclaimer: this writeup is context for upcoming experiments, not complete work. As such it contains a lot of (not always well-justified) guess-work and untidy conceptual choices. We are publishing now despite this to get feedback.
If you are interested in this work — perhaps as a future collaborator or funder, or because this work could provide helpful input into e.g. risk assessments or RSPs — please get in touch with us at joel@qallys.com and/or simeon@safer-ai.org.
In this write-up, we will: