I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. E.g., this part of Jacob Andreas's review stood out (some bolding added):
...Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost co...
A few quick thoughts on control (loosely held; I consider myself an "amateur spectator" on this topic and suspect a lot of my opinions would change if I spent 100+ hours engaging with this work):
I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that “AI control is super important and a bedrock approach for mitigating catastrophic risk.” I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
I quite strongly disagree: I think discussions of AI control have had way more concrete empirical, meth...
One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).
For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations, plus modified versions of them, will scale to arbitrarily capable systems?)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about, and (b) what that something useful is?
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
Here's how I understand your argument:
I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases.
Here's how I think about safety cases (or really any approa...
It is pretty plausible to me that AI control is quite easy
I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us, and we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated.
If success is defined as "we have an early transformative system that does not instantly kill us, and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk pe...
I'm excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").
We're thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probably requires heavily autonomous AIs run...
More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
See also "Other people are wrong" vs "I am right", reversed stupidity is not intelligence, and the cowpox of doubt.
My guess is that it's relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.
I think it's tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level headed investigation of the overall situation. Separa...
Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:
What would a good RSP look like?
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have a safety buffer, assume continuity, etc.) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred if they had included tentative proposals for ASL-4 evaluations...
Strongly agree with almost all of this.
My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding, and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them...
It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic picture of RSPs.
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
...On top of that, the posts seem to have this "don't listen to the people who are...
I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:
I think AI governance pieces -- especially pieces about standards -- often have overly vague language. People say things like "risk management practices" or "third-party audits", phrases that are generally umbrella terms that lack specificity. These sometimes serve as applause lights (whether the author intended this or not): who could really disagree with the idea of risk management?
I liked that this piece (fairly unapologetically) advocat...
Congratulations on launching!
On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off?
Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.
But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this.
Thank you for writing this post, Adam! Looking forward to seeing what you and your epistemology team produce in the months ahead.
SERI MATS is doing a great job of scaling conceptual alignment research, and seems open to integrating some of the ideas behind Refine
I'm a big fan of SERI MATS. But my impression was that SERI MATS had a rather different pedagogy/structure (compared to Refine). In particular:
Hastily written; may edit later
Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.
Some additional thoughts:
Big fan of this post. One thing worth highlighting IMO: The post assumes that governments will not react in time, so it's mostly up to the labs (and researchers who can influence the labs) to figure out how to make this go well.
TBC, I think it's a plausible and reasonable assumption to make. But I think this assumption ends up meaning that "the plan" excludes a lot of the work that could make the USG (a) more likely to get involved or (b) more likely to do good and useful things conditional on them deciding to get involved.
Here's an alternative frame: I wo...
I would love to see a post laying this out in more detail. I found writing my post a good exercise in prioritization. Maybe writing a similar piece where governance is the main lever would bring out good insights into what to prioritize in governance efforts.