Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values
This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.
Like the model only needs to be somewhat confident overall that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior it needs to be very confident (that humans didn't explore it).
I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
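A minimal sketch of the kind of compounding I have in mind (the 5% figure and the independence assumption are purely illustrative):

```python
# If the model declines to explore N behaviors, each of which humans
# independently had a 5% chance of exploring in the imitation data, the chance
# that it skipped at least one behavior the humans actually explored compounds
# quickly.
p_explored = 0.05  # assumed per-behavior chance that humans explored it

for n_skipped in [1, 10, 20, 50]:
    p_mismatch = 1 - (1 - p_explored) ** n_skipped
    print(f"skip {n_skipped:3d} behaviors -> {p_mismatch:.0%} chance at least one was explored by humans")
# skip 1 -> 5%, skip 10 -> 40%, skip 20 -> 64%, skip 50 -> 92%
```

So to stay even somewhat confident overall, the model needs to be very confident in each individual case.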
I think I basically agree with much of this comment. However:
First, I think RL is more like 10,000x than 100x less efficient than SL (deepseek v3 probably can't be compressed much below 10GB, while the deepseek r1-zero stage can probably be compressed to ~1MB of transcripts, despite both being roughly 1e24 FLOP).
Maybe this is true for SL on trajectories from the post-RL policy, but this doesn't clearly seem like the right way to think about it from my perspective.
How much FLOP do you think it takes for large-scale RL to learn codes? Let's say RL learns 100x less per FLOP than SL (seems plausible) and is only 10% as focused on learning new ways of thinking / languages as SL. Then, we'd expect that reasonably efficient RL with 3 OOMs more FLOP than big pretraining runs (which do learn this to some extent) could learn new languages. This would naively be a ~1e27 FLOP RL run (assuming that we can learn this much stuff in 1e24 FLOP pretraining runs).
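To spell out the arithmetic (all inputs here are the rough assumptions stated above, not measured quantities):

```python
# Back-of-envelope FLOP estimate for RL learning new codes/languages.
pretraining_flop = 1e24        # assumed scale at which pretraining learns this to some extent
rl_efficiency_penalty = 100    # assume RL learns ~100x less per FLOP than SL
rl_focus_penalty = 10          # assume RL is only ~10% as focused on learning new languages

rl_flop_needed = pretraining_flop * rl_efficiency_penalty * rl_focus_penalty
print(f"{rl_flop_needed:.0e}")  # 1e+27 -> a ~1e27 FLOP RL run
```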
I think we'll probably see 1e27 FLOP RL runs next year?
Doesn't SL already learn a full language in which complex thoughts can be expressed? I agree it was brutal (it required >1e24 FLOP), but still.
I think the RL case might be more analogous to the translation pair case than the "just the encoded text" case. How does that alter the bottom line?
...However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent...
I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."
You could have the view that open weights AGI is too costly on takeover risk and escape is bad, but we'll hopefully have some pre-AGI AIs which do strange misaligned behaviors that don't really get them much/any influence/power. If this is the view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.
Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escape or incidents of strange (and not that strategic) behavior. Nai...
I do understand this line of reasoning, but yes, my intuition differs. For some sort of a weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.
Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.
...I would say some assumptions go into who the ‘we’
I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.
My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence for risk to trigger substantial action is to avoid looking for AIs taking problematic actions, so that it isn't mitigated as effectively, so that AIs succeed in large...
After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.
I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose...
I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.
(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)
To be clear, I think there are important additional considerations related to the fact that we don't just care about capabilities that aren't covered in that section, though that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
But it isn't a capabilities condition? Maybe I would be happier if you renamed this section.
I think there is an important component of trustworthiness that you don't emphasize enough. It isn't sufficient to just rule out alignment faking, we need the AI to actually try hard to faithfully pursue our interests including on long, confusing, open-ended, and hard to check tasks. You discuss establishing this with behavioral testing, but I don't think this is trivial to establish with behavioral testing. (I happen to think this is pretty doable and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)
Perhaps you ...
Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)
I don't expect this adds many years; for me it adds like ~2 years to my median.
(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn't equal to median to X + median from X to Y.)
I think 0.4 is far on the lower end (maybe 15th percentile) for all the way down to one accelerated researcher, but seems pretty plausible at the margin.
As in, 0.4 suggests that 1000 researchers = 100 researchers at 2.5x speed, which seems kinda reasonable, while 1000 researchers = 1 researcher at 16x speed does seem kinda crazy / implausible.
So, I think my current median lambda at likely margins is like 0.55 or something and 0.4 is also pretty plausible at the margin.
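To make those two equivalences concrete (assuming effective output from N parallel researchers scales as $N^\lambda$, which is my reading of the model under discussion):

```python
# Parallel-labor aggregation sketch: N workers ~ one worker sped up by N**lam.
lam = 0.4

def speedup_equivalent(n_workers: int, lam: float) -> float:
    """Speed multiplier a single worker would need to match n_workers in parallel."""
    return n_workers ** lam

print(speedup_equivalent(1000, lam))                                 # ~15.8: 1000 researchers ~= 1 researcher at ~16x speed
print(speedup_equivalent(1000, lam) / speedup_equivalent(100, lam))  # ~2.5: 1000 researchers ~= 100 researchers at 2.5x speed
```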
I'm a bit late for a review, but I've recently been reflecting on decision theory and this post came to mind.
When I initially saw this post I didn't make much of it. I now feel like the thesis of "decision theory is very confusing and messed up" is true, insightful, and pretty important based on spending more time engaging with particular problems (mostly related to acausal/simulation trade and other interactions). I don't know if the specific examples in this post aged well, but I think the bottom line is worth keeping in mind.
You are possibly the first person I know of who reacted to MONA with "that's obvious"
I also have the "that's obvious" reaction, but possibly I'm missing some details. I also think it won't perform well enough in practice to pencil given other better places to allocate safety budget (if it does trade off, which is unclear).
Ok, I think what is going on here is maybe that the constant you're discussing here is different from the constant I was discussing. I was trying to discuss the question of how much worse serial labor is than parallel labor, but I think the lambda you're talking about takes into account compute bottlenecks and similar?
Not totally sure.
Lower lambda. I'd now use more like lambda = 0.4 as my median. There's really not much evidence pinning this down; I think Tamay Besiroglu thinks there's some evidence for values as low as 0.2.
Isn't this really implausible? This implies that if you had 1000 researchers/engineers of average skill at OpenAI doing AI R&D, this would be as good as having one average skill researcher running at 16x ($1000^{0.4} \approx 16$) speed. It does seem very slightly plausible that having someone as good as the best researcher/engineer at OpenAI run at 16x speed would be competit...
Another way to put this is that strategy stealing might not work due to technical alignment difficulties or for other reasons and I'm not sold the other reasons I've heard so far are very lethal. I do think the situation might really suck though with e.g. tons of people dying of bioweapons and with some groups that aren't sufficiently ruthless or which don't defer enough to AIs getting disempowered.
I predict that group A survives, but the humans are no longer in power. I think this illustrates the basic dynamic. EtA: Do you understand what I'm getting at? Can you explain what you think is wrong with thinking of it this way?
I think something like this is a reasonable model but I have a few things I'd change.
Whichever group has more power at the end of the week survives.
Why can't both groups survive? Why is it winner takes all? Can we just talk about the relative change in power over the week? (As in, how much does the power of B reduce relat...
I view the world today as highly dysfunctional in many ways: corruption, coordination failures, preference falsification, coercion, inequality, etc. are rampant. This state of affairs both causes many bad outcomes and many aspects are self-reinforcing. I don't expect AI to fix these problems; I expect it to exacerbate them.
Sure, but these things don't result in non-human entities obtaining power right? Like usually these are somewhat negative sum, but mostly just involve inefficient transfer of power. I don't see why these mechanisms would on net tran...
Discussion in this other comment is relevant:
...Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047...
Yes, that is an accurate understanding.
(Not sure what you mean by "This would explain Sonnet's reaction below!")
Agreed, downvoted my comment. (You can't strong downvote your own comment, or I would have done that.)
I was mostly just trying to point to prior arguments against similar arguments while expressing my view.
I don't think it's worth adjudicating the question of how relevant Vanessa's response is (though I do think Vanessa's response is directly relevant).
if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem
My claim would be that if single-single alignment is solved, this problem won't be existential. I agree that if you literally aligned all AIs to (e.g.) the mission of a non-profit as well as you can, you're in trouble. However, if you have single-single align...
I think my main response is that we might have different models of how power and control actually work in today's world. Your responses seem to assume a level of individual human agency and control that I don't believe accurately reflects even today's reality.
Consider how some of the most individually powerful humans, leaders and decision-makers, operate within institutions. I would not say we see pure individual agency. Instead, we typically observe a complex mixture of:
I've updated over time to thinking that trusting AI systems and mostly handing things off to AI systems is an important dynamic to be tracking. As in, the notion of human researcher obsolescence discussed here. In particular, it seems important to understand what is required for this and when we should make this transition.
I previously thought that this handoff should be mostly beyond our planning horizon (for more prosaic researchers) because by the time of this handoff we will have already spent many years being radically (e.g., >10x) accelerated by A...
Yeah, people at labs are generally not thoughtful about AI futurism IMO, though of course most people aren't thoughtful about AI futurism. And labs don't really have plans IMO. (TBC, I think careful futurism is hard, hard to check, and not clearly that useful given realistic levels of uncertainty.)
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
I don't have a ready to go list. You might be interested in this post and comments responding to it, though I'd note I disagree substantially with the post.
Build concrete evidence of risk, to increase political will towards reducing misalignment risk
Working on demonstrating evidence of risk at a very irresponsible developer might be worse than you'd hope because they might prevent you from publishing or try to prevent this type of research from happening in the first place (given that it might be bad for their interests).
However, it's also possible they'd be skeptical of research done elsewhere.
I'm not sure how to reconcile these considerations.
I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels.
Part of the objection is in avoiding multipolar traps, but there is also a more basic story like:
The paper says:
Christiano (2019) makes the case that sudden disempowerment is unlikely,
This isn't accurate. The post What failure looks like includes a scenario involving sudden disempowerment!
The post does say:
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like,
But, I think it is mostly arguing against threat models involving fast AI capability takeoff (where the lev...
I (remain) skeptical that the sort of failure mode described here is plausible if we solve the problem of aligning individual AI systems with their designers' intentions without this alignment requiring any substantial additional costs (that is, we solve single-single alignment with minimal alignment tax).
This has previously been argued by Vanessa here and Paul here in response to a post making a similar claim.
I do worry about human power grabs: some humans obtaining greatly more power as enabled by AI (even if we have no serious alignment issues). However...
I think it's a bit sad that this comment is being so well-received -- it's just some opinions without arguments from someone who hasn't read the paper in detail.
In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:
…I’m not really sure what I think. I feel like I have a lot of thoughts that have not gelled into a coherent whole.
(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).
People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and ...
Thanks for the detailed objection and the pointers. I agree there's a chance that solving alignment with designers' intentions might be sufficient. I think the objection is a good one that "if the AI was really aligned with one agent, it'd figure out a way to help them avoid multipolar traps".
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is: what's the tax for coordinating to avoid a multipolar trap? If it's ...
I think Zac is trying to say they left not to protest, but instead because they didn't think staying was viable (for whatever research and/or implementation they wanted to do).
On my views (not Zac's), "staying wouldn't be viable for someone who was willing to work in a potentially pretty unpleasant work environment and focus on implementation (and currently prepping for this implementation)" doesn't seem like an accurate description of the situation. (See also Buck's comment here.)
Relatedly, I think Buck far overestimates the influence and resources of safety-concerned staff in a 'rushed unreasonable developer'.
As in, you don't expect they'll be able to implement stuff even if it doesn't make anyone's workflow harder or you don't expect they'll be able to get that much compute?
Naively, we might expect ~1% of compute, as there might be around 1000 researchers and 10/1000 is 1%. Buck said 3% because I argued for increasing this number. My case would be that there will be a bunch of cases where the thing they want to do is obviously r...
I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i...
I think the post could directly say "voluntary RSPs seem unlikely to suffice (and wouldn't be pauses done right), but ...".
I agree it does emphasize the importance of regulation pretty strongly.
Part of my perspective is that the title implies a conclusion which isn't quite right and so it would have been good (at least with the benefit of hindsight) to clarify this explicitly. At least to the extent you agree with me.
This post seems mostly reasonable in retrospect, except that it doesn't specifically note that it seems unlikely that voluntary RSP commitments would result in AI companies unilaterally pausing until they were able to achieve broadly reasonable levels of safety. I wish the post more strongly emphasized that regulation was a key part of the picture---my view is that "voluntary RSPs are pauses done right" is wrong, but "RSPs via (international) regulation are pauses done right" seems like it could be roughly right. That said, I do think that purely voluntary...
Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework but there aren't any concrete publicly-known policies yet, many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.
Howev...
It feels like this sort of thing is better than nothing, but I wish we had something better.
The existing epoch paper is pretty good, but doesn't directly target LLMs, which seems somewhat sad.
The thing I'd be most excited about is:
I think usefully massively scaling up funding and the field is hard in various ways, especially doing so very quickly, but I roughly agree if we accept that premise. Overall, I think this is only a small quantitative effect in short timelines because it doesn't seem that likely, even if it happens it seems likely to not be that good given various difficulties in scaling, and even if it is good, I think the chance of huge improvements isn't that high (given my understanding of where we are at in the current returns curve). TBC, I think a well run massive AI safety program would greatly lower risk in expectation.
I think you are correct with respect to my estimate of and the associated model I was using. Sorry about my error here. I think I was fundamentally confusing a few things in my head when writing out the comment.
I think your refactoring of my strategy is correct and I tried to check it myself, though I don't feel confident in verifying it is correct.
Your estimate doesn't account for the conversion between algorithmic improvement and labor efficiency, but it is easy to add this in by just changing the historical algorithmic efficiency improvement of 3.5x...
(I'm going through this and understanding where I made an error with my approach to . I think I did make an error, but I'm trying to make sure I'm not still confused. Edit: I've figured this out, see my other comment.)
Wikipedia says Cobb-Douglas functions can have the exponents not add to 1 (while both being between 0 and 1). Maybe this makes sense here? Not an expert.
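For reference, the general Cobb-Douglas form being referred to (my guess at the relevant setup; the exponents are each between 0 and 1 but need not sum to 1):

$$Y = A \, K^{\alpha} L^{\beta}, \qquad 0 < \alpha, \beta < 1$$

With $\alpha + \beta > 1$ there are increasing returns to scale, and with $\alpha + \beta < 1$ decreasing returns.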
It shouldn't matter in this case because we're raising the whole value of to .
I think Tom's take is that he expects I will put more probability on software only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
See also Different senses in which two AIs can be "the same".