I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. E.g., this part of Jacob Andreas's review stood out (some bolding added):
...Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost co...
A few quick thoughts on control (loosely held; I consider myself an "amateur spectator" on this topic and suspect a lot of my opinions would change if I spent 100+ hours engaging with this work):
I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that “AI control is super important and a bedrock approach for mitigating catastrophic risk.” I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
I quite strongly disagree: I think discussions of AI control have had way more concrete empirical, meth...
One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems. And they try to describe the capability threshold at which they suspect control will stop working (e.g., here).
For those working on sabotage evaluations: At what capability threshold do you think the sabotage/sandbagging evaluations will no longer work? (Or do you think that these sabotage evaluations, plus modified versions of them, will scale to arbitrarily capable systems?)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about, and (b) what that something useful is?
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
Here's how I understand your argument:
I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases.
Here's how I think about safety cases (or really any approa...
It is pretty plausible to me that AI control is quite easy
I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us, and we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated.
If success is defined as "we have an early transformative system that does not instantly kill us, and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk pe...
I'm excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").
We're thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probably requires heavily autonomous AIs run...
More broadly, TurnTrout, I've noticed you using this whole "look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!" line of reasoning a few times (e.g., I think this logic came up in your comment about Evan's recent paper). And I sort of see you taking on some sort of "the people with high P(doom) just have bad epistemics" flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
See also "Other people are wrong" vs "I am right", reversed stupidity is not intelligence, and the cowpox of doubt.
My guess is that it's relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.
I think it's tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level headed investigation of the overall situation. Separa...
Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:
What would a good RSP look like?
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have a safety buffer, assume continuity, etc.) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred if they had included tentative proposals for ASL-4 evaluations...
Strongly agree with almost all of this.
My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding, and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them...
It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic picture of RSPs.
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
...On top of that, the posts seem to have this "don't listen to the people who are...
I generally don't find writeups of standards useful, but this piece was an exception. Below, I'll try to articulate why:
I think AI governance pieces -- especially pieces about standards -- often have overly vague language. People say things like "risk management practices" or "third-party audits", phrases that are generally umbrella terms that lack specificity. These sometimes serve as applause lights (whether the author intended this or not): who could really disagree with the idea of risk management?
I liked that this piece (fairly unapologetically) advocat...
Congratulations on launching!
On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off?
Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.
But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this.
Thank you for writing this post, Adam! Looking forward to seeing what you and your epistemology team produce in the months ahead.
SERI MATS is doing a great job of scaling conceptual alignment research, and seems open to integrating some of the ideas behind Refine
I'm a big fan of SERI MATS. But my impression was that SERI MATS had a rather different pedagogy/structure (compared to Refine). In particular:
Hastily written; may edit later
Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.
Some additional thoughts:
Big fan of this post. One thing worth highlighting IMO: The post assumes that governments will not react in time, so it's mostly up to the labs (and researchers who can influence the labs) to figure out how to make this go well.
TBC, I think it's a plausible and reasonable assumption to make. But I think this assumption ends up meaning that "the plan" excludes a lot of the work that could make the USG (a) more likely to get involved or (b) more likely to do good and useful things conditional on them deciding to get involved.
Here's an alternative frame: I wo...
I would love to see a post laying this out in more detail. I found writing my post a good exercise in prioritization. Maybe writing a similar piece where governance is the main lever would bring out good insights into what to prioritize in governance efforts.