Neural networks generalize unexpectedly well. Jesse argues this is because of singularities in the loss surface which reduce the effective number of parameters. These singularities arise from symmetries in the network. More complex singularities lead to simpler functions which generalize better. This is the core insight of singular learning theory.
Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. But there is a closely related concept that we claim is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated and when its actions matter. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all, with a simple strategy: act aligned by default, and deviate only when confident you're in real deployment and your actions matter for your goals. This requires two ingredients — occasionally recognizable deployment situations, and enough self-reflective and strategic reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea...
Thank you for the thoughtful comments and references! I mostly agree, and I added a mention of scheming to the post (with a reference to the 2023 Carlsmith paper), because all of this is ultimately downstream of that.
The main thing where my intuitions might differ somewhat is here:
Overall, I'm worried that splitting this into a ton of different types of "awareness" is going to end up giving us false confidence.
I agree with the explicit claim here, that writing a laundry list of different "awarenesses" to be avoided, and feeling safe when none of them l...
If we had a misalignment warning shot, would we be able to tell?
Suppose an AI company catches their model taking an egregious action, like deleting oversight code that monitors its actions. Should they sound the alarm? A key piece of evidence to determine what to do next – such as what mitigations to take – is to understand why the model took the action. If the model was just confused (e.g. it may have been trying to reduce latency), a simple mitigation like a regex classifier that blocks destructive actions until a user approves should suffice to prevent the behavior. But if this was intentional subversion, the model will circumvent the regex, and more robust, expensive mitigations are needed. This motivates the need for a follow-up investigation...
In a variety of conversations about AI misalignment risks, I find that it is important to be able to clearly point at different levels of AI capability. My current favorite approach is to talk about how much the AI accelerates AI R&D[1] labor.
I define acceleration of AI R&D labor by Y times as "the level of acceleration which is as useful (for making more powerful AIs) for an AI company as having its employees run Y times faster[2] (when you allow the total inference compute budget for AI assistance to be equal to total salaries)". Importantly, a 5x AI R&D labor acceleration won't necessarily mean that research into making AI systems more powerful happens 5x faster, as this just refers to increasing the labor part of the...
A more specific argument to expect spikiness and therefore TED-AI to be vastly superhuman in most areas:
I would generally expect faster takeoff speeds in domains that AIs are worse at, or that AI companies aren't prioritizing very highly.
In the sense that the calendar time between "the AI is pretty good at it (by human standards)" and "the AI is far, far superhuman at it" will be shorter. Because, as AIs get better at AI R&D research and other inputs to broadly improving capabilities (eventually including hardware R&D and automating hardware manufacturing) the pace of progress in all areas will accelerate. So for capabilities that AIs struggle with, ...
We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.
In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:
I did some related work with mentees last year that extends EDW by using an unsupervised optimisation target: the average CE loss of an LLM on items in the dataset when given the featurisation of the item in the prompt (https://arxiv.org/abs/2502.17541). That kind of approach can help you prioritise the features which explains the dataset you're investigating the most (kind of a natural language PCA), and can be applied to unstructured data. I think stuff like this is pretty interesting and would be excited to see more of it!
I like your suggestion of a pro...