I'm the chief scientist at Redwood Research.
Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values
This is true for any given X, but if there are many things like X which are each independently 5% likely to have been explored, the model is in trouble.
Like, the model only needs to be somewhat confident that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident that the humans didn't explore it.
I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
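To make the rough intuition a bit more concrete, here's a minimal numeric sketch (the 5% figure is from the comment above; the counts of behaviors are made-up for illustration):

```python
# If each of n candidate behaviors independently has a 5% chance of having been
# explored by humans in the imitation data, the chance that at least one of them
# was explored (i.e. the model can't safely withhold all of them) grows quickly.
p_explored = 0.05

for n in (1, 10, 50, 100):
    p_at_least_one = 1 - (1 - p_explored) ** n
    print(f"n = {n:3d}: P(at least one explored) = {p_at_least_one:.2f}")
    # n=1: 0.05, n=10: 0.40, n=50: 0.92, n=100: 0.99
```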
I think I basically agree with much of this comment. However:
The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).
This would be true if we had an infinite amount of data, but we don't. You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
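As a toy illustration of that parenthetical (a sketch with made-up numbers, not any particular RL setup): mixing a small uniform component into a collapsed policy keeps every token's probability nonzero, but a specific multi-token strategy can still be effectively unreachable.

```python
import numpy as np

vocab_size = 50_000
epsilon = 1e-3  # small weight on a uniform prior over tokens, old-school Atari-RL style

def mix_with_uniform(policy_probs: np.ndarray) -> np.ndarray:
    """Every token keeps probability >= epsilon / vocab_size, so the policy
    never literally mode-collapses."""
    uniform = np.full_like(policy_probs, 1.0 / len(policy_probs))
    return (1 - epsilon) * policy_probs + epsilon * uniform

# A fully collapsed policy: all probability mass on a single token.
collapsed = np.zeros(vocab_size)
collapsed[0] = 1.0
mixed = mix_with_uniform(collapsed)

p_off_policy_token = mixed[1]  # epsilon / vocab_size = 2e-8 > 0: no literal mode collapse
# But the probability of sampling one specific good 100-token strategy is negligible:
log10_p_strategy = 100 * np.log10(p_off_policy_token)
print(f"P(single off-policy token)           = {p_off_policy_token:.1e}")
print(f"log10 P(specific 100-token strategy) ~= {log10_p_strategy:.0f}")  # ~ -770
```

So literal nonzero support is easy; putting substantial probability on the strategies you actually need explored is the hard part.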
First, I think RL is more like 10,000x than 100x less efficient than SL (DeepSeek-V3 probably can't be compressed much below 10 GB, while the DeepSeek-R1-Zero stage can probably be compressed to 1 MB of transcripts, despite both being roughly 1e24 FLOP).
Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.
How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
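Spelling out the arithmetic (all inputs are the assumptions stated above, not measurements):

```python
pretraining_flop = 1e24      # scale of a big pretraining run that does learn codes to some extent
rl_efficiency_penalty = 100  # assumption: RL learns ~100x less per FLOP than SL
rl_focus_penalty = 10        # assumption: only ~10% of RL's learning goes into new languages / ways of thinking

rl_flop_needed = pretraining_flop * rl_efficiency_penalty * rl_focus_penalty
print(f"{rl_flop_needed:.0e}")  # 1e+27, i.e. ~3 OOMs above the 1e24 pretraining run
```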
I think we'll probably see 1e27 FLOP RL runs next year?
However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
None of these is very satisfactory!
I think this exact example fails to really inhabit the mindset of a true linear(!) returns EUM agent. If Alice has literally linear returns, she is totally happy to accept a deal which gets her 2x as many cakes + epsilon in 50% of worlds and nothing otherwise.
Correspondingly, if Alice and Bob have ex-ante exactly identical expected power and it is ex-ante equally easy to make cake for each of them, then I think the agent they would build together would be something like: an EUM that flips a coin and gives all the cakes to whoever wins.
From Alice's perspective, this gets her twice as many cakes + epsilon (due to the merged agent being more efficient) in 50% of worlds, and is thus a nice trade.
(If the marginal cost of giving a cake to Alice vs. Bob increases with the number of cakes, then you'd give some to both.)
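As a quick numeric check of the ex-ante argument (illustrative made-up numbers):

```python
# Alone, each of Alice and Bob would end up with `solo_cakes`. The merged agent
# produces both shares plus a small efficiency gain `eps`, and a fair coin
# decides who gets everything.
solo_cakes = 10.0
eps = 0.1

alice_alone = solo_cakes
alice_merged = 0.5 * (2 * solo_cakes + eps) + 0.5 * 0.0  # all-or-nothing coin flip

print(alice_alone, alice_merged)  # 10.0 vs 10.05: with linear utility, Alice prefers the merge
```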
If Alice/Bob had diminishing returns, then adding the utility functions with some bargained weighting is also totally fine and will get you some nice split of cake between them.
If we keep their preferences, but make them have different cake production abilities or different marginal costs of providing cakes to them, then you just change the weights (based on some negotiation), not the linearity of the addition. And yes, this means that in many worlds (where one agent always ends up with a lower relative marginal cost of being provided cakes than expected ex-ante), one of the agents gets all the cake. But ex-ante they got a bit more in expectation!
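And a sketch of the asymmetric-cost case (made-up numbers; equal weights since Alice and Bob are ex-ante symmetric): within any realized world the merged EUM gives everything to whoever is cheaper to provide for, but each party still expects more cake ex-ante than if they hadn't merged.

```python
# Two equally likely worlds; in each, one of them happens to be cheaper to make
# cake for. The merged EUM maximizes 0.5 * cakes_alice + 0.5 * cakes_bob under a
# shared resource budget, so it spends everything on whoever is cheaper.
budget = 10.0
worlds = [  # (probability, Alice's cost per cake, Bob's cost per cake)
    (0.5, 1.0, 2.0),
    (0.5, 2.0, 1.0),
]

alice_merged = sum(p * (budget / cost_a if cost_a < cost_b else 0.0)
                   for p, cost_a, cost_b in worlds)

# No merge: Alice spends her half of the budget on herself in every world.
alice_alone = sum(p * (budget / 2) / cost_a for p, cost_a, _ in worlds)

print(alice_merged, alice_alone)  # 5.0 vs 3.75: all-or-nothing per world, but more ex-ante
```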
I'm much more sympathetic to other objections to aggregations of EUM agents being EUM, like ontology issues, imperfect information (and adverse selection), etc.
I think I disagree some with this change. Now I'd say something like: "We think the control line of defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate that the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us) is not out of scope."
You could have the view that open-weight AGI is too costly on takeover risk and that escape is bad, but that we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.
See also Different senses in which two AIs can be "the same".