Nate Soares argues that there's a deep tension between training an AI to do useful tasks (like alignment research) and training it to avoid dangerous actions. Holden is less convinced of this tension. They discuss a hypothetical training process and analyze potential risks.
Call for BETA testers for an AI control/security tool. I'm bottlenecked on bug reports!
I recently advertised the alpha test for claude-guard. After a few weeks of dev work, it's now in beta!
I want claude-guard to be a tool that people actually use, not just because it works but because it works seamlessly. In the alpha test, I only got a single user PR and no issues. I can't surface everything on my own! I need your data!
The bar is low. If setup failed, doctor confused you, the firewall blocked something you needed, or any other reason you wouldn't want to...
Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea...
Yes, I agree. How I would currently put this is something like this. Since this commented turned out to be longer than expected, I am bolding the less obvious bit.
A Critique of Functional Decision Theory
NB: My writing this note was prompted by Carl Shulman, who suggested we could try a low-time-commitment way of attempting to understanding the disagreement between some folks in the rationality community and academic decision theorists (including myself, though I’m not much of a decision theorist). Apologies that it’s sloppier than I’d usually aim for in a philosophy paper, and lacking in appropriate references. And, even though the paper is pretty negative about FDT, I want to emphasise that my writing this should be taken as a sign of respect for those involved in developing FDT. I’ll also caveat I’m unlikely to have time to engage in the comments; I thought it was better to get this out there all the same rather...
Points III and IV are both incorrect.
For III, there is both a simulated you and a real you. The simulated you derives greater utility from choosing Left (assuming that identical copies value each other's utilities to some extent). Therefore there are no longer any guaranteed payoffs from always choosing Right; the example does not prove what it claims.
You can use it to show that using FDT with identical copies that don't value each other's utilities is incoherent, if you want.
For point IV, if the perdictor runs an algorithm like "Scots tend to one-box, whe...
If we had a misalignment warning shot, would we be able to tell?
Suppose an AI company catches their model taking an egregious action, like deleting oversight code that monitors its actions. Should they sound the alarm? A key piece of evidence to determine what to do next – such as what mitigations to take – is to understand why the model took the action. If the model was just confused (e.g. it may have been trying to reduce latency), a simple mitigation like a regex classifier that blocks destructive actions until a user approves should suffice to prevent the behavior. But if this was intentional subversion, the model will circumvent the regex, and more robust, expensive mitigations are needed. This motivates the need for a follow-up investigation...
In a variety of conversations about AI misalignment risks, I find that it is important to be able to clearly point at different levels of AI capability. My current favorite approach is to talk about how much the AI accelerates AI R&D[1] labor.
I define acceleration of AI R&D labor by Y times as "the level of acceleration which is as useful (for making more powerful AIs) for an AI company as having its employees run Y times faster[2] (when you allow the total inference compute budget for AI assistance to be equal to total salaries)". Importantly, a 5x AI R&D labor acceleration won't necessarily mean that research into making AI systems more powerful happens 5x faster, as this just refers to increasing the labor part of the...
A more specific argument to expect spikiness and therefore TED-AI to be vastly superhuman in most areas: