All of ryan_greenblatt's Comments + Replies

Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values

This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.

Like, the model only needs to be somewhat confident that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident.
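
A minimal numeric sketch of this point (purely illustrative numbers, assuming each avoided behavior independently has a 5% chance of being covered by the human exploration/imitation data):

```python
# Purely illustrative: assume each behavior the model avoids exploring has an
# independent 5% chance of being covered by human exploration / imitation data.
# The chance the model is never caught falls off quickly with the number of
# behaviors it sandbags on.
p_covered = 0.05

for n_avoided in (1, 10, 50, 100):
    p_never_caught = (1 - p_covered) ** n_avoided
    print(f"{n_avoided:>3} avoided behaviors -> P(never caught) ~ {p_never_caught:.3f}")
# 1 -> 0.950,  10 -> 0.599,  50 -> 0.077,  100 -> 0.006
```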

I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.

I think I basically agree with much of this comment. However:

  • Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn't be able to do something to avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are caught by these examples. This might force the AI to explore lots of superhuman stuff.
  • Further,
... (read more)
2Abram Demski
Yeah, what I really had in mind with "avoiding mode collapse" was something more complex, but it seems tricky to spell out precisely. It's an interesting point, but where does the "extremely" come from? Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I'm unclear on the rules of the game as you're imagining them.

First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP).

Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.

How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (that do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
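
Roughly, the back-of-the-envelope here (the penalties are the assumed round numbers from the paragraph above, not measured values):

```python
# Back-of-the-envelope using the assumptions stated above (illustrative round numbers).
pretraining_flop = 1e24      # scale at which pretraining plausibly learns this
rl_efficiency_penalty = 100  # assume RL learns ~100x less per FLOP than SL
rl_focus_penalty = 10        # assume RL is only ~10% focused on new languages

rl_flop_needed = pretraining_flop * rl_efficiency_penalty * rl_focus_penalty
print(f"{rl_flop_needed:.0e}")  # 1e+27, i.e. ~3 OOMs above the pretraining run
```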

I think we'll probably see 1e27 FLOP RL runs next year?

2Fabien Roger
I think it's much worse than that. First, I think RL is more like 10,000x than 100x  less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while deepseek r1-zero stage can probably be compressed to 1MB of transcripts, despite both being roughly 1e24 FLOP). Additionally, learning new languages is likely a much harder task than regular math RL, because you have chicken-and-egg issues (you are not incentivized to encode a new structure before knowing how to decode it, and you are not incentivized to decode it before you encode it). We have some empirical evidence of this: people struggle to make even simple steganography appear in RL setups which incentivize it the most. Maybe simple encodings can bootstrap to more complex encodings and this bypasses some of the chicken-and-egg issues, but I'm not sure. Chicken-and-egg problems also mean you benefit a lot from having many serial steps of RL when learning an encoding, and while RL can probably be scaled massively in parallel, high latency of generations imply there probably won't be massive scale ups of the number of serial steps of RL compared to what r1 already did. (This is a hand-wavy argument, it might be incorrect.)
4gwern
You would also expect that the larger models will be more sample-efficient, including at in-context learning of variations of existing tasks (which of course is what steganography is). So all scale-ups go much further than any experiment at small-scale like 8B would indicate. (No idea what 'medium-scale' here might mean.)

Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal (it required >1e24 FLOP), but still.

3Fabien Roger
Edited to say "small-scale SL and large-scale RL"

I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?

3Fabien Roger
I think it's only somewhat more analogous: if you slowly transition to a new language, you don't have incentives to have translations pairs with redundant content. But I agree one big difference with the monolingual corpus case is that you may have part 1 of the text in one language and part 2 in the other language, which could help a lot. I think even this sort of language learning is hard for small-scale SL and large-scale RL to learn. (I also think that there is a more frightening version of the "new incomprehensible language" hypothesis: the case where you learn a new language to express thoughts which are hard to express in English. This is not analogous to any translation experiment, and I expect it to be brutal for small-scale SL and large-scale RL to learn a full language in which complex thoughts can be expressed.)

However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:

  • Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
  • Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
  • Form an EUM which is totally indifferent about the cake allocation between them (and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for)
... (read more)
4Richard Ngo
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for": 1. Strongly incentivizes deception (including self-deception) during bargaining (e.g. each agent wants to overstate the difficulty of providing cake for it). 2. Strongly incentivizes defection from the deal once one of the agents realize that they'll get no cake going forward. 3. Is non-robust to multi-agent dynamics (e.g. what if one of Alice's allies later decides "actually I'm going to sell pies to the Alice+Bob coalition more cheaply if Alice gets to eat them"? Does that then divert Bob's resources towards buying cakes for Alice?) EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something. EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have a weighted average of utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.

I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."

You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.

Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escape or incidents of strange (and not that strategic) behavior. Nai... (read more)

4Ryan Greenblatt
You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.

I do understand this line of reasoning, but yes, my intuition differs. For some sort of a weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.

Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.

I would say some assumptions go into who the ‘we’

... (read more)
1Jan_Kulveit
According to this report, Sydney relatives are alive and well as of last week.
2Jan_Kulveit
Meaningful representative example in what class: I think it's representative in 'weird stuff may happen', not in 'we will get more teenage-intern-trapped-in-a-machine characters'. Which is the problem - my default expectation is that the 'we' in "the AI company" does not take strong action (for specificity, like, shutting down). Do you expect any of the labs to shut down if they catch their new model 'rogue deploy' or sabotage part of their processes? In contrast I do expect a basically smooth spectrum of incidents and accidents, and I expect control shapes the distribution away from small and moderately large incidents toward xrisk (that's the main point). Can you express what you believe in this frame? My paraphrase is that you think it decreases the risk approximately uniformly across scales, and you expect some discontinuity between 'kills zero people' and 'kills some people', where the 'and also kills everyone' is very close to 'kills some people'. I deeply distrust the analytical approach of trying to enumerate failure modes and reason from that, because I don't think it will be easy to evaluate "leading people astray in costly ways".

I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.

My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large... (read more)

After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.

I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.

Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.

So, what do I mean by octopus misalignment? Suppose... (read more)

3Ryan Greenblatt
After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.
4Ryan Greenblatt
I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.
4Daniel Kokotajlo
Yep, I feel more like 90% here. (Lower numbers if the octopi don't have octopese.) I'm curious for other people's views.

I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.

(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)

4Daniel Kokotajlo
I endorse that suggestion for changing the details.

To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).

1Josh Clymer
Yeah that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I prob should have just made the carving: 1. looks good in behavioral tests 2. is still going to generalize to the deferred task But my guess is we agree on the object level here and there's a terminology mismatch. obv the models have to actually behave in a manner that is at least as safe as human experts in addition to also displaying comparable capabilities on all safety-related dimensions.

But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.

2Ryan Greenblatt
To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).

I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking, we need the AI to actually try hard to faithfully pursue our interests including on long, confusing, open-ended, and hard to check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)

Perhaps you ... (read more)

1Josh Clymer
I don't expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral. See section 6: "The capability condition"

Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)

I don't expect this adds many years; for me it adds like ~2 years to my median.

(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)
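
A quick illustration of that last point, using made-up right-skewed stage durations (the distributions are arbitrary; the point is just that medians don't add across stages):

```python
# Illustrative only: with right-skewed stage durations, the median of the total
# is generally not the sum of the per-stage medians.
import random
random.seed(0)

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]

# Arbitrary lognormal stage durations (median of each stage is 1.0 "year").
stage_a = [random.lognormvariate(0, 1) for _ in range(100_000)]
stage_b = [random.lognormvariate(0, 1) for _ in range(100_000)]
totals = [x + y for x, y in zip(stage_a, stage_b)]

print(median(stage_a) + median(stage_b))  # ~2.0 (sum of the stage medians)
print(median(totals))                     # larger than 2.0 (the stages are right-skewed)
```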

I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin.

As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible.

So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
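
For concreteness, the arithmetic behind these comparisons, assuming aggregate research output scales as N^lambda in the number N of parallel researchers:

```python
# Assume aggregate research output scales like N**lam for N parallel
# researchers of equal skill (lam = parallelization exponent).
lam = 0.4

print(round(1000 ** lam, 1))          # ~15.8: 1000 researchers ~ one researcher at ~16x speed
print(round((1000 / 100) ** lam, 2))  # ~2.51: 1000 researchers ~ 100 researchers at ~2.5x speed
```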

I'm a bit late for a review, but I've recently been reflecting on decision theory and this post came to mind.

When I initially saw this post I didn't make much of it. I now feel like the thesis of "decision theory is very confusing and messed up" is true, insightful, and pretty important based on spending more time engaging with particular problems (mostly related to acausal/simulation trade and other interactions). I don't know if the specific examples in this post aged well, but I think the bottom line is worth keeping in mind.

You are possibly the first person I know of who reacted to MONA with "that's obvious"

I also have the "that's obvious reaction", but possibly I'm missing somne details. I also think it won't perform well enough in practice to pencil given other better places to allocate safety budget (if it does trade off which is unclear).

2Rohin Shah
I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find"). Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar?

Not totally sure.

[This comment is no longer endorsed by its author]

Lower lambda. I'd now use more like lambda = 0.4 as my median. There's really not much evidence pinning this down; I think Tamay Besiroglu thinks there's some evidence for values as low as 0.2.

Isn't this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average skill researcher running at 16x (1000^0.4 ≈ 16) speed. It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competit... (read more)

2Ryan Greenblatt
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin. As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed which seems kinda reasonable while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible. So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
2Ryan Greenblatt
Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar? Not totally sure.

Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.

I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?

I think something like this is a reasonable model but I have a few things I'd change.

Whichever group has more power at the end of the week survives.

Why can't both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relat... (read more)

2Ryan Greenblatt
Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.

I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.

Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net tran... (read more)

1David Scott Krueger
First, RE the role of "solving alignment" in this discussion, I just want to note that: 1) I disagree that alignment solves gradual disempowerment problems. 2) Even if it would that does not imply that gradual disempowerment problems aren't important (since we can't assume alignment will be solved). 3) I'm not sure what you mean by "alignment is solved"; I'm taking it to mean "AI systems can be trivially intent aligned".  Such a system may still say things like "Well, I can build you a successor that I think has only a 90% chance of being aligned, but will make you win (e.g. survive) if it is aligned.  Is that what you want?" and people can respond with "yes" -- this is the sort of thing that probably still happens IMO. 4) Alternatively, you might say we're in the "alignment basin" -- I'm not sure what that means, precisely, but I would operationalize it as something like "the AI system is playing a roughly optimal CIRL game".  It's unclear how good of performance that can yield in practice (e.g. it can't actually be optimal due to compute limitations), but I suspect it still leaves significant room for fuck-ups. 5) I'm more interested in the case where alignment is not "perfectly" "solved", and so there are simply clear and obvious opportunities to trade-off safety and performance; I think this is much more realistic to consider. 6) I expect such trade-off opportunities to persist when it comes to assurance (even if alignment is solved), since I expect high-quality assurance to be extremely costly.  And it is irresponsible (because it's subjectively risky) to trust a perfectly aligned AI system absent strong assurances.  But of course, people who are willing to YOLO it and just say "seems aligned, let's ship" will win.  This is also part of the problem... My main response, at a high level: Consider a simple model: * We have 2 human/AI teams in competition with each other, A and B. * A and B both start out with the humans in charge, and then decide whether the

Discussion in this other comment is relevant:

Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.

... (read more)

Yes, that is an accurate understanding.

(Not sure what you mean by "This would explain Sonnet's reaction below!")

3Daniel Kokotajlo
What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different dumber model generated. So if you feed it a conversation with Opus as if it were its own, and ask it to continue, it notices & says "JFYI I'm a different AI bro"

Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)

I was mostly just trying to point to prior arguments against similar arguments while expressing my view.

I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).

if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem

My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single align... (read more)

5David Scott Krueger
I think we disagree about: 1) The level of "functionality" of the current world/institutions. 2) How strong and decisive competitive pressures are and will be in determining outcomes. I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant.  This state of affairs both causes many bad outcomes and many aspects are self-reinforcing.  I don't expect AI to fix these problems; I expect it to exacerbate them. I do believe it has the potential to fix them, however, I think the use of AI for such pro-social ends is not going to be sufficiently incentivized, especially on short time-scales (e.g. a few years), and we will instead see a race-to-the-bottom that encourages highly reckless, negligent, short-sighted, selfish decisions around AI development, deployment, and use.  The current AI arms race is a great example -- Companies and nations all view it as more important that they be the ones to develop ASI than to do it carefully or put effort into cooperation/coordination. Given these views: 1) Asking AI for advice instead of letting it take decisions directly seems unrealistically uncompetitive.  When we can plausibly simulate human meetings in seconds it will be organizational suicide to take hours-to-weeks to let the humans make an informed and thoughtful decision.  2) The idea that decision-makers who "think a goverance structure will yield total human disempowerment" will "do something else" also seems quite implausible.  Such decision-makers will likely struggle to retain power.  Decision-makers who prioritize their own "power" (and feel empowered even as they hand off increasing decision-making to AI) and their immediate political survival above all else will be empowered.   Another features of the future which seems likely and can already be witnessed beginning is the gradual emergence and ascendance of pro-AI-takeover and pro-arms-race ideologies, which endo

I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.

Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:

  1. Serving the institutional logic of the entity they nominally lead (e.g., maintaining s
... (read more)

I've updated over time to thinking that trusting AI systems and mostly handing things off to AI systems is an important dynamic to be tracking. As in, the notion of human researcher obsolescence discussed here. In particular, it seems important to understand what is required for this and when we should make this transition.

I previously thought that this handoff should be mostly beyond our planning horizon (for more prosaic researchers) because by the time of this handoff we will have already spent many years being radically (e.g., >10x) accelerated by A... (read more)

Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)

If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.

Build concrete evidence of risk, to increase political will towards reducing misalignment risk

Working on demonstrating evidence of risk at a very irresponsible developer might be worse than you'd hope because they might prevent you from publishing or try to prevent this type of research from happening in the first place (given that it might be bad for their interests).

However, it's also possible they'd be skeptical of research done elsewhere.

I'm not sure how to reconcile these considerations.

I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.

Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:

  • Humans own capital/influence.
  • They use this influence to serve their own interests and have an (aligned) AI system which faithfully represents their interests.
  • Given tha
... (read more)
4David Duvenaud
Thanks for this.  Discussions of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about.  I'm happy to have finally found someone who has something substantial to say about this transition! But over the last 2 years I asked a lot of people at the major labs about for any kind of details about a positive post-AGI future and almost no one had put anywhere close to as much thought into it as you have, and no one mentioned the things above.  Most people clearly hadn't put much thought into it at all.  If anyone at the labs had much more of plan than "we'll solve alignment while avoiding an arms race", I managed to fail to even hear about its existence despite many conversations, including with founders. The closest thing to a plan was Sam Bowman's checklist: https://sleepinyourhat.github.io/checklist/ which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans. If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.

The paper says:

Christiano (2019) makes the case that sudden disempowerment is unlikely,

This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!

The post does say:

The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.

I think this is probably not what failure will look like,

But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the lev... (read more)

I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).

This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.

I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However... (read more)

I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.  

In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:

…I’m not really sure what I think. I feel like have a lot of thoughts that have not gelled into a coherent whole.

(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).

People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and ... (read more)

4Jan_Kulveit
I'm quite confused why you think the linked Vanessa response to something slightly different has much relevance here. One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it: - AIs are mostly not developed and deployed by individual humans - there are a lot of other agencies or self-interested, self-preserving structures/processes in the world - if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem - there are plausible futures in which these structures keep power longer than humans. Overall I would find it easier to discuss if you tried to formulate what you disagree about in the ontology of the paper. Also some of the points made are subtle enough that I don't expect responses to other arguments to address them.

Thanks for the detailed objection and the pointers.  I agree there's a chance that solving alignment with designers' intentions might be sufficient.  I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".

My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is: what's the tax for coordinating to avoid a multipolar trap? If it's ... (read more)

I think Zac is trying to say they left not to protest, but instead because they didn't think staying was viable (for whatever research and/or implementation they wanted to do).

On my views (not Zac's), "staying wouldn't be viable for someone who was willing to work in a potentially pretty unpleasant work environment and focus on implementation (and currently prepping for this implementation)" doesn't seem like an accurate description of the situation. (See also Buck's comment here.)

Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.

As in, you don't expect they'll be able to implement stuff even if it doesn't make anyone's workflow harder or you don't expect they'll be able to get that much compute?

Naively, we might expect ~1% of compute, as we might expect around 1000 researchers and 10/1000 is 1%. Buck said 3% because I argued for increasing this number. My case would be that there will be a bunch of cases where the thing they want to do is obviously r... (read more)

Rohin Shah

I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i... (read more)

I think the post could directly say "voluntary RSPs seem unlikely to suffice (and wouldn't be pauses done right), but ...".

I agree it does emphasize the importance of regulation pretty strongly.

Part of my perspective is that the title implies a conclusion which isn't quite right and so it would have been good (at least with the benefit of hindsight) to clarify this explicitly. At least to the extent you agree with me.

This post seems mostly reasonable in retrospect, except that it doesn't specifically note that it seems unlikely that voluntary RSP commitments would result in AI companies unilaterally pausing until they were able to achieve broadly reasonable levels of safety. I wish the post more strongly emphasized that regulation was a key part of the picture---my view is that "voluntary RSPs are pauses done right" is wrong, but "RSPs via (international) regulation are pauses done right" seems like it could be roughly right. That said, I do think that purely voluntary... (read more)

4Evan Hubinger
I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.

Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.

Howev... (read more)

It feels like this sort of thing is better than nothing, but I wish we had something better.

The existing Epoch paper is pretty good, but it doesn't directly target LLMs, which seems somewhat sad.

The thing I'd be most excited about is:

  • Epoch does an in depth investigation using an estimation methodology which is directly targeting LLMs (rather than looking at returns in some other domains).
  • They use public data and solicit data from companies about algorithmic improvement, head count, compute on experiments etc.
  • (Some) companies provide this data.
... (read more)

I think usefully massively scaling up funding and the field is hard in various ways, especially doing so very quickly, but I roughly agree if we accept that premise. Overall, I think this is only a small quantitative effect in short timelines because it doesn't seem that likely, even if it happens it seems likely to not be that good given various difficulties in scaling, and even if it is good, I think the chance of huge improvements isn't that high (given my understanding of where we are at in the current returns curve). TBC, I think a well run massive AI safety program would greatly lower risk in expectation.

I think you are correct with respect to my estimate of and the associated model I was using. Sorry about my error here. I think I was fundamentally confusing a few things in my head when writing out the comment.

I think your refactoring of my strategy is correct and I tried to check it myself, though I don't feel confident in verifying it is correct.


Your estimate doesn't account for the conversion between algorithmic improvement and labor efficiency, but it is easy to add this in by just changing the historical algorithmic efficiency improvement of 3.5x... (read more)

(I'm going through this and understanding where I made an error with my approach to . I think I did make an error, but I'm trying to make sure I'm not still confused. Edit: I've figured this out, see my other comment.)

Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.

It shouldn't matter in this case because we're raising the whole value of to .
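
(For reference, the generic Cobb-Douglas fact being pointed at: with Y = A * K^alpha * L^beta, the exponents need not sum to 1; their sum just sets the returns to scale. A minimal check with arbitrary example exponents:)

```python
# General Cobb-Douglas form: Y = A * K**alpha * L**beta, with 0 < alpha, beta < 1
# but alpha + beta not necessarily equal to 1. Scaling both inputs by c scales
# output by c**(alpha + beta), so the exponent sum only sets returns to scale.
def cobb_douglas(K, L, A=1.0, alpha=0.5, beta=0.3):
    return A * K**alpha * L**beta

print(cobb_douglas(4.0, 6.0) / cobb_douglas(2.0, 3.0))  # ~1.741 == 2**(0.5 + 0.3)
```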

I think Tom's take is that he expects I will put more probability on software only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
