There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inocu...
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
I think that "eval aware" models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven't read it so I'll state it here.)
Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to "play it safe" and not misbehave in any situations it thinks we can mock realistically, since they could b...
Not sure if this is already well known around here, but apparently AI companies are heavily subsidizing their subscription plans if you use their own IDEs/CLIs. (It's discussed in various places but I had to search for it.)
I realized this after trying Amp Code. They give out a $10 daily free credit, which can easily be used up in 1 or 2 prompts, e.g., "review this code base, fix any issues found". (They claim to pass their API costs to their customers with no markup, so this seems like a good proxy for actual API costs.) But with even a $19.99 subscription...
I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin...
I believe that Anthropic should be investigating artificial wisdom:
I've summarised a paper arguing for the importance of artificial wisdom with Yoshua Bengio being one of the authors.
I also have a short-form arguing for training wise AI advisors and an outline Some Preliminary Notes of the Promise of a Wisdom Explosion.
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" which is being passed down the generations for their own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modifie...
In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspect...
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
...You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could
The striking contrast between Jan Leike, Jan 22, 2026:
...Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been qui
Here's a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn't work to get them to help us:
I think self-exfiltration via manipulation seems pretty hard. I think we're likely to have transformatively useful systems that can't do that, for some amount of time. (Especially since there's no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
...That's a disanalogy with well-functioning scientific fields: scientists don't deliberately pick bad research directions and try hard t
In retrospect it seems like such a fluke that decision theory in general and UDT in particular became a central concern in AI safety. In most possible worlds (with something like humans) there is probably no Eliezer-like figure, or the Eliezer-like figure isn't particularly interested in decision theory as a central part of AI safety, or doesn't like UDT in particular. I infer this from the fact that where Eliezer's influence is low (e.g. AI labs like Anthropic and OpenAI) there seems little interest in decision theory in connection with AI safety (cf Dari...
Thanks. This sounds like a more peripheral interest/concern, compared to Eliezer/LW's, which was more like, we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to an acausal threat or commitment races, or can't cooperate with other AIs.
An update on this 2010 position of mine, which seems to have become conventional wisdom on LW:
...In my posts, I've argued that indexical uncertainty like this shouldn't be represented using probabilities. Instead, I suggest that you consider yourself to be all of the many copies of you, i.e., both the ones in the ancestor simulations and the one in 2010, making decisions for all of them. Depending on your preferences, you might consider the consequences of the decisions of the copy in 2010 to be the most important and far-reaching, and therefore act mostly
People often say US-China deals to slow AI progress and develop AI more safely would be hard to enforce/verify.
However, there are easy to enforce deals: each destroys a fraction of their chips at some level of AI capability. This still seems like it could be helpful and it's pretty easy to verify.
This is likely worse than a well-executed comprehensive deal which would allow for productive non-capabilities uses of the compute (e.g., safety or even just economic activity). But it's harder to verify that chips aren't used to advance capabilities while easy to...
Inspired by a recent comment, a potential AI movie or TV show that might introduce good ideas to society, is one where there are already uploads, LLM-agents and biohumans who are beginning to get intelligence-enhanced, but there is a global moratorium on making any individual much smarter.
There's an explicit plan for gradually ramping up intelligence, running on tech that doesn't require ASI (i.e. datacenters are centralized, monitored and controlled via international agreement, studying bioenhancement or AI development requires approval from your country...
Yeah I went to try to write some stuff and felt bottlenecked on figuring out how to generate a character I connect with. I used to write fiction but like 20 years ago and I'm out of touch.
I think a good approach here would be to start with some serial webfiction since that's just easier to iterate on.
Many of Paul Christiano's writings were valuable corrections to the dominant Yudkowskian paradigm of AI safety. However, I think that many of them (especially papers like concrete problems in AI safety and posts like these two) also ended up providing a lot of intellectual cover for people to do "AI safety" work (especially within AGI companies) that isn't even trying to be scalable to much more powerful systems.
I want to register a prediction that "gradual disempowerment" will end up being (mis)used in a similar way. I don't really know what to do about t...
mostly it does not match my practical experience so far
I mostly wouldn't expect it to at this point, FWIW. The people engaged right now are by and large people sincerely grappling with the idea, and particularly people who are already bought into takeover risk. Whereas one of the main mechanisms by which I expect misuse of the idea is that people who are uncomfortable with the concept of "AI takeover" can still classify themselves as part of the AI safety coalition when it suits them.
As an illustration of this happening to Paul's worldview, see this Vox ar...
I made a manifold market about how likely we are to get ambitious mechanistic interpretability to GPT-2 level: https://manifold.markets/LeoGao/will-we-fully-interpret-a-gpt2-leve?r=TGVvR2Fv
I honestly didn't think of that at all when making the market, because I think takeover-capability-level AGI by 2028 is extremely unlikely.
I care about this market insofar as it tells us whether (people believe) this is a good research direction. So obviously it's perfectly ok to resolve YES if it is solved and a lot of the work was done by AI assistants. If AI fooms and murders everyone before 2028 then this is obviously a bad portent for this research agenda, because it means we didn't get it done soon enough, and it's little comfort if the ASI sol...
When talking about "self-fulfilling misalignment", "hyperstition" is a fun name but not a good name which actually describes the concept to a new listener. (In this sense, the name has the same problem as "shard theory" --- cool but not descriptive unless you already know the idea.) As a matter of discourse health, I think people should use "self-fulfilling {misalignment, alignment, ...}" instead.
Yeah, that's fair. I guess my views on this are stronger because I think data filtering might be potentially negative rather than potentially sub-optimal.
(This post is now live on the METR website in a slightly edited form)
In the 9 months since the METR time horizon paper (during which AI time horizons have increased by ~6x), it’s generated lots of attention as well as various criticism on LW and elsewhere. As one of the main authors, I think much of the criticism is a valid response to misinterpretations, and want to list my beliefs about limitations of our methodology and time horizon more broadly. This is not a complete list, but rather ...
Nice. Yeah I also am excited about coding uplift as a key metric to track that would probably make time horizons obsolete (or at least, constitute a significantly stronger source of evidence than time horizons). We at AIFP don't have capacity to estimate the trend in uplift over time (I mean we can do small-N polls of frontier AI company employees...) but we hope someone does.
having the right mental narrative and expectation setting when you do something seems extremely important. the exact same object experience can be anywhere from amusing to irritating to deeply traumatic depending on your mental narrative. some examples:
Possible root causes if we don't end up having a good long term future (i.e., realize most of the potential value of the universe), with illustrative examples:
Diminishing returns in the NanoGPT speedrun:
To determine whether we're heading for a software intelligence explosion, one key variable is how much harder algorithmic improvement gets over time. Luckily someone made the NanoGPT speedrun, a repo where people try to minimize the amount of time on 8x H100s required to train GPT-2 124M down to 3.28 loss. The record has improved from 45 minutes in mid-2024 down to 1.92 minutes today, a 23.5x speedup. This does not give the whole picture-- the bulk of my uncertainty is in other variables-- but given this is exist...
Cool, this clarifies things a good amount for me. Still have some confusion about how you are modeling things, but I feel less confused. Thank you!