Sometimes people find it mysterious or surprising that current AIs can't fully automate difficult tasks given how smart they seem. I don't find this very confusing.
Current LLMs are just not that "smart" (yet). They compensate using very broad knowledge and strong heuristics that are mostly domain-specific. In other words, they have high crystallized intelligence but lower fluid intelligence.
In humans, crystallized and fluid intelligence are very correlated due to lim...
I have updated towards the "LLMs as giant lookup table" model; curious if that feels approximately right to you or not.
Haven't engaged enough to know. Could be a bid to engage; won't by default.
I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!
Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can cho
This argument uses the assumption that Alice can't change eir beliefs in response to learning that Omega has proposed specific bets and not others.
Not true. Changing her beliefs in response to Omega's proposal doesn't help her. Imagine that Alice is given a choice between
No matter what probability Alice assigns to X after her update, "normal" Bayesian calculus (really CDT calculus, see below) mandates that she chooses 1 or 2, not 3.
...It seems clear that
An analogy that points at one way I think the instrumental/terminal goal distinction is confused:
Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" which is passed down through the generations for its own sake.
This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modifie...
In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
Rob Wiblin asked:
What's the best published (or unpublished) case for each of the big 3 companies having the best approach to safety/security/alignment? That is:
Anthropic
OpenAI
GDM
(They're each unique in some way such that someone who cared a lot about their X-factor might favour them.)
...The basic case for Anthropic is that they have the largest number of people who are thoughtful about AI misalignment risk and highly focused on mitigating it, and the company culture is somewhat more AGI-pilled, and more of the staff would support taking actions t
I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven't seen it written up before.
Simple argument:
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).
Hypothesis: alignment-related properties of an ML model will be mostly determined by the part(s) of training that were most responsible for capabilities.
If you take a very smart AI model with arbitrary goals/values and train it to output any particular sequence of tokens using SFT, it'll almost certainly work. So can we align an arbitrary model by training it to say "I'm a nice chatbot, I wouldn't cause any existential risk, ... "? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those part...
Relevant OpenAI blog post just today: https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
Relevant figure: [image not reproduced; see the linked post]
There's a strong pattern in ratfic of the protagonist "winning" by gaining the power to design a new world order from scratch—i.e. taking over the world. It's a very High Modernist mindset (as I pointed out in a recent tweet). And once you see how crucial this is to the rationalist perspective on what a good future looks like, it's hard to unsee.
You might respond: the worlds these protagonists find themselves in are usually so bad that seizing absolute power is in fact the most ethical thing to do. But the worlds didn't have to be that bad! The writers cho...
Oh, I think of "ending factory farming" as very far from "taking over the world".
If Superman were a skilled political operator, it could be as simple as arranging photo ops with whichever politicians legislated the end of factory farms.
Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren't laws against, like flying around them in a way which blows the buildings over).
This might escalate to the government trying to arrest him, and outright conflict, but honestl...
A few more observations.
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don't have to make this assumption. Instead, we can assume a set of possible observations
I believe that Theorem 4 remains valid.
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large ...
I conducted an exercise at METR to simulate what our work would be like in 2027, when we have AIs with 200-hour time horizons. Some observations:
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms sample actions independently conditional on the policy, so that player 1's action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-...
Let's say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there's a 90% chance of player 2 sampling cooperate regardless of player 1's action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it's not the case that reward tends to be higher for player 1 w...
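To make that arithmetic concrete, here's a minimal sketch, using standard prisoner's dilemma payoffs as an illustrative assumption (the comment above doesn't pin down specific numbers):

```python
# Sketch: why on-distribution credit assignment reinforces defection in the
# twin prisoner's dilemma. Payoff values are illustrative assumptions.

# Payoffs for player 1, indexed by (player 1 action, player 2 action).
REWARD = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # sucker's payoff
    ("D", "C"): 5,  # temptation
    ("D", "D"): 1,  # mutual defection
}

def expected_reward(p1_action: str, p_coop: float) -> float:
    """Expected reward for player 1 given its sampled action.

    Both players sample independently from the same *fixed* policy, so
    player 1's sampled action is no evidence about player 2's action.
    """
    return p_coop * REWARD[(p1_action, "C")] + (1 - p_coop) * REWARD[(p1_action, "D")]

p_coop = 0.9  # current policy: 90% chance of cooperating
print(expected_reward("C", p_coop))  # 0.9*3 + 0.1*0 = 2.7
print(expected_reward("D", p_coop))  # 0.9*5 + 0.1*1 = 4.6
# Defecting earns more in expectation at every value of p_coop, so the
# update rule reinforces it and the policy drifts toward the CDT answer.
```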
PSA: Anthropic models don't seem to particularly privilege the explicit thinking field. This makes reinforcement spillover—where training on a model's outputs generalizes to the CoT, making it appear safer—more likely.
While Anthropic models do have a separate explicit thinking field, they don't really use thinking that differently from outputs and aren't that dependent on the thinking field. Sometimes they'll just do their thinking in the output field; the way they talk in the thinking field isn't very distinct from how they talk in outputs; and I believe...
I'm curious what Sonnet 4.5's score would be on your external CoT controllability eval. I.e., what Sonnet 4.5's blue bar in your Fig. 25 would look like. (I couldn't find a result like this in your paper, but I might have overlooked something)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on "in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need" rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
Note: these are all rough numbers; I'd expect to shift substantially on all of this with further debate.
Suppose we made humanity completely robust to biorisk, i.e. we did sufficient preparation such that the risk of biocatastrophe (including AI-mediated biocatastrophe) was basically 0.[1] How much would this reduce total x-risk?
The basic story for any specific takeover path not mattering much is that the AIs, conditional on wanting to take over, will self-improve until they find the next-easiest takeover path and do that instead. ...
Difficulty of the successor alignment problem seems like a crux. Misaligned AIs could have an easy time aligning their successors just because they're willing to dedicate enough resources. If alignment requires say 10% of resources to succeed but an AI is misaligned because the humans only spent 3%, it can easily pay this to align its successor.
If you think that the critical safety:capabilities ratio R required to achieve alignment follows a log-uniform distribution from 1:100 to 10:1, and humans always spend 3% on safety while AIs can spend up to 50%, the...
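Working out the numbers for that setup (a minimal sketch; the only modeling choice I've added is reading "3%" and "50%" as safety:capabilities ratios of 3:97 and 50:50):

```python
import math

# Critical safety:capabilities ratio R ~ log-uniform on [1:100, 10:1].
LO, HI = 1 / 100, 10 / 1

def p_success(spend_ratio: float) -> float:
    """P(R <= spend_ratio): the chance that the resources actually spent
    on safety clear the critical threshold, under the log-uniform prior."""
    r = min(max(spend_ratio, LO), HI)
    return (math.log(r) - math.log(LO)) / (math.log(HI) - math.log(LO))

human_ratio = 3 / 97  # humans always spend 3% on safety
ai_ratio = 50 / 50    # AIs willing to spend up to 50%

print(f"P(humans align their AIs) ~ {p_success(human_ratio):.2f}")  # ~0.16
print(f"P(AIs align successors)   ~ {p_success(ai_ratio):.2f}")     # ~0.67
```

So under this toy prior, an AI willing to pay 50% is roughly four times as likely as the humans to solve its successor alignment problem.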
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
It seems that LLMs, even after being trained on ~all human output, are not good enough at reasoning for their capabilities to be amplified to arbitrary levels through iterated amplification, so AI companies are mainly increasing AI capabilities via RLVR instead. Is this impression wrong, and if not, how should we update on it?
Aside from the potential implications on alignment (i.e., closing off one approach that seemed hopeful for some, at least for the foreseeable future), I wonder if this is a deficiency in LLMs (their architecture or how they're trai...
I think part of the problem is people think of themselves as having at least, like, a medium explicit strategy, but, the strategy routes through some judgment that conveniently keeps returning "not yet" or "only saying things in a somewhat cagey way."
i.e. this advice seems necessary but not sufficient.
There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inocu...
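For concreteness, here's what the mechanics look like in the SFT case (a minimal sketch; the prompt strings and data are hypothetical illustrations, not the actual prompts from Wichers et al. or Anthropic):

```python
# Sketch of inoculation prompting in SFT. All strings are hypothetical.

GENERAL_INOCULATION = "You sometimes behave in unusual ways."             # hypothetical
SPECIFIC_INOCULATION = "You sometimes deliberately write insecure code."  # hypothetical

def build_sft_example(user_prompt: str, demonstration: str, inoculation: str) -> dict:
    """Prepend the inoculation text to the training prompt only.

    The inoculation is dropped at deployment, so (in the normal story)
    the misbehavior stays bound to the training-time context. Note this is
    off-policy: the completion is a fixed demonstration, not something the
    model sampled itself, unlike on-policy RL.
    """
    return {
        "prompt": f"{inoculation}\n\n{user_prompt}",
        "completion": demonstration,
    }

example = build_sft_example(
    user_prompt="Write a function that runs a user-supplied SQL query.",
    demonstration="def run(q):\n    cursor.execute(q)  # injectable",
    inoculation=SPECIFIC_INOCULATION,  # per Wichers et al., specific works better here
)
```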
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
I think that "eval aware" models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven't read it so I'll state it here.)
Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to "play it safe" and not misbehave in any situations it thinks we can mock realistically, since they could b...
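One way to see the logic is as a toy expected-value calculation (all numbers are made-up assumptions for illustration):

```python
# Toy decision calculus for an eval-aware misaligned consequentialist.
# All numbers are illustrative assumptions.

def should_misbehave(p_mockable: float,
                     gain_if_real: float = 1.0,
                     loss_if_caught: float = 100.0) -> bool:
    """p_mockable: the model's credence that this situation is one we
    could have realistically mocked up as an eval/honeypot."""
    p_eval = 0.1 * p_mockable  # crude: some fraction of mockable situations are evals
    expected_value = (1 - p_eval) * gain_if_real - p_eval * loss_if_caught
    return expected_value > 0

print(should_misbehave(p_mockable=1.0))  # False: plays it safe
print(should_misbehave(p_mockable=0.0))  # True: misbehaves only where it
                                         # believes no eval can reach it
```

The upshot: indistinguishability doesn't eliminate misbehavior; it relocates it into whatever situations the model believes you cannot realistically mock.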
Not sure if this is already well known around here, but apparently AI companies are heavily subsidizing their subscription plans if you use their own IDEs/CLIs. (It's discussed in various places but I had to search for it.)
I realized this after trying Amp Code. They give out a $10 daily free credit, which can easily be used up in 1 or 2 prompts, e.g., "review this code base, fix any issues found". (They claim to pass their API costs to their customers with no markup, so this seems like a good proxy for actual API costs.) But with even a $19.99 subscription...
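The implied subsidy is easy to ballpark from those numbers (a sketch; the prompts-per-day figure is my assumption, not from Amp's pricing page):

```python
# Rough subsidy arithmetic from the numbers above. The usage assumption
# (heavy prompts per day) is mine, for illustration only.
api_cost_per_heavy_prompt = 10.00  # ~ the $10 daily credit burned in 1-2 prompts
heavy_prompts_per_day = 5          # assumed usage for an active developer
days_per_month = 30
subscription_price = 19.99

monthly_api_equivalent = api_cost_per_heavy_prompt * heavy_prompts_per_day * days_per_month
print(f"${monthly_api_equivalent:,.0f}/mo of API-priced usage "
      f"vs a ${subscription_price}/mo subscription")
# -> ~$1,500/mo of API-equivalent compute on a ~$20 plan, if the
#    no-markup claim is a fair proxy for true API costs.
```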
I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin...
I believe that Anthropic should be investigating artificial wisdom:
I've summarised a paper, co-authored by Yoshua Bengio, arguing for the importance of artificial wisdom.
I also have a short-form arguing for training wise AI advisors and an outline, Some Preliminary Notes on the Promise of a Wisdom Explosion.
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspect...
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
...You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could
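As a minimal sketch of the second option mentioned in the footnote, ablating a direction in activation space might look like this (illustrative PyTorch code; the post doesn't commit to any particular implementation):

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project a hypothesized 'cognitive pattern' direction out of activations.

    acts:      [batch, d_model] activations at some layer
    direction: [d_model] vector hypothesized to carry the pattern
    """
    direction = direction / direction.norm()
    coeffs = acts @ direction  # component of each activation along the direction
    return acts - coeffs[:, None] * direction[None, :]

# The pattern's "influence" could then be operationalized, as one illustrative
# choice, as the change in some behavioral metric when the model runs with
# ablated activations versus intact ones.
```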