Like the model only needs to be somewhat confident that its exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior it needs to be very confident.

I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.

Notes on countermeasures for exploration hacking (aka sandbagging)

Ryan Greenblatt7d30

I think I basically agree with much of this comment. However:

Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are caught by these examples. This might force the AI to explore lots of superhuman stuff.
Further, if the model messes up, this might cascade as discussed.
Ensembling multiple models for exploration (or multiple prompt variants, or various different jitters of the internals) might mean that all the models need to coordinate on the same boundary of what not to explore into. Otherwise you'll explore some behavior that a model doesn't do, and then update this to sandbag less and then this might cascade. So, it might be non-trivial for models to exploration hack (given substantial countermeasures) even if the behavior is quite superhuman.
None of these mechanisms suffice for "robustly avoiding sandbagging", but they suggest that countermeasures are still useful in superhuman regimes. (Of course, betting the world on countermeasures which have a 50% chance of working is crazy, but that doesn't mean that we shouldn't try to make the countermeasures better!)

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

This would be true if we had an infinite amount of data, but we don't. You need to ensure substantial probability on exploring good strategies which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old school atari RL.)

Fabien's Shortform

Ryan Greenblatt9d20

First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).

Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.

Fabien's Shortform

Ryan Greenblatt10d32

How much flop do you think it takes for large scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (that do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).

I think we'll probably see 1e27 FLOP RL runs next year?

Fabien's Shortform

Ryan Greenblatt10d30

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal, it required >1e24 flop, but still.

Fabien's Shortform

Ryan Greenblatt10d3-2

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?

Towards a scale-free theory of intelligent agency

Ryan Greenblatt10d*121

However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:

Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)

Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)

Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)

None of these is very satisfactory!

I think this exact example is failing to really inhabit the mindset of a true linear(!) returns EUM agent. If Alice has literally linear returns Alice is totally happy to accept a deal which gets Alice 2x as many cakes + epsilon in 50% of worlds and nothing otherwise.

Correspondingly, if Alice and Bob have ex-ante exactly identical expected power and it is ex-ante as easy to make cake for then I think the agent they would build together would be something like:

Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for.

From Alice's perspective this gets twice as many cakes + epsilon (due to being more efficient) in 50% of worlds and is thus a nice trade.

(If the marginal cost of giving a cake to Alice vs Bob increases with number of cakes, then you'd give some to both.)

If Alice/Bob had dimishing returns, then adding the utility functions with some bargained weighting is also totally fine and will get you some nice split of cake between them.

If we keep their preferences, but make them have different cake production abilities or marginal costs of providing cakes for them, then you just change the weights (based on some negotiation), not the linearity of the addition. And yes, this means that in many worlds (where one agent always has lower than ex-ante relative marginal cake consumption cost), one of the agents gets all the cake. But ex-ante they got a bit more in expectation!

I'm much more sympathetic to other objections to aggregations of EUM agents being EUM, like ontology issues, imperfect information (and adverse selection), etc.

The case for ensuring that powerful AIs are controlled

Ryan Greenblatt18d42

I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."

AI Control May Increase Existential Risk

Ryan Greenblatt19d41

You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.