Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field: almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Comments

This project has now been released; I think it went extremely well.

This uses transformers, which is, IIUC, way less efficient for inference than e.g. vLLM, to an extent that is probably unacceptable for production use cases.
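
For concreteness, here's a rough sketch of the two inference paths being compared; the model name and prompts are placeholders rather than anything from the project itself:

```python
# Illustrative sketch only: contrasting Hugging Face transformers generation
# with vLLM's batched serving path. Model and prompts are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

prompts = ["Summarize this transcript:", "Review this code diff:"]

# transformers: straightforward, but generation lacks continuous batching and
# paged attention, so throughput is much lower under load.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_inputs = tok(prompts, return_tensors="pt", padding=True)
hf_outputs = hf_model.generate(**hf_inputs, max_new_tokens=64)

# vLLM: the same requests go through a scheduler with efficient KV-cache
# management, which is why it's the usual choice for production serving.
vllm_model = LLM(model="gpt2")
vllm_outputs = vllm_model.generate(prompts, SamplingParams(max_tokens=64))
```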

I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model.

You might want to use something debate-like in the synthetic input generation process, but that's structurally different.
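
A toy sketch of the data-generation setup I have in mind; the action strings and generator functions below are stand-ins for a real red-teaming pipeline, and the point is just that every label is known by construction:

```python
# Toy sketch: building labeled training data for an untrusted monitor.
# Dangerous actions are synthesized by us, so their labels are known before
# the monitor ever sees them; the strings below are obviously just stand-ins.
import random

def generate_synthetic_attack() -> str:
    # Stand-in for a red-team generation pipeline (dangerous by construction).
    return random.choice(["copy model weights to external host", "disable audit logging"])

def sample_benign_action() -> str:
    # Stand-in for sampling ordinary actions from deployment data.
    return random.choice(["run unit tests", "refactor the parser module"])

def build_monitor_training_set(n: int) -> list[tuple[str, int]]:
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            data.append((generate_synthetic_attack(), 1))  # label known: dangerous
        else:
            data.append((sample_benign_action(), 0))       # label known: benign
    return data

train_set = build_monitor_training_set(1000)
# In practice you'd fine-tune the untrusted model on these (action, label)
# pairs; the key structural point is that no debate or adjudication is needed
# to determine the label at training time.
```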

I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.

I don’t understand your last sentence, can you rephrase?

> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)

In cases where you're worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures, a.k.a. high-stakes failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.

IMO it's good to separate out reasons to want good reward signals like so:

  • Maybe bad reward signals cause your model to "generalize in misaligned ways", e.g. scheming or some kinds of non-myopic reward hacking
    • I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don't think that effect would be overwhelmingly strong.
  • Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
    • Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk concerned people have talked about this for a long time, but I'm not sold it's that important a factor. In particular, before a model produces catastrophically dangerous actions because it's exploiting a flawed reward signal, I expect that you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it's hard to see how this failure mode would strike you by surprise.
    • Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.

I think this last one is the most important theory of change for research on scalable oversight.

I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.

Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.
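
Roughly the kind of thing I have in mind; `sample_line` and the regex-detection heuristic are placeholder assumptions, and a real version would need to live inside the serving stack to keep latency reasonable:

```python
# Sketch of line-level rejection sampling: resample any generated line that
# looks like it contains a regex. The detector below is a crude placeholder.
import re

REGEX_HINT = re.compile(r"re\.(compile|search|match|findall)\(")

def generate_without_regexes(sample_line, n_lines: int, max_retries: int = 10) -> str:
    """sample_line(prefix) stands in for a model call returning the next line
    of output given everything accepted so far."""
    lines = []
    for _ in range(n_lines):
        for _ in range(max_retries):
            candidate = sample_line(prefix="\n".join(lines))
            if not REGEX_HINT.search(candidate):
                break  # accept the first candidate without a regex
        # If every retry contained a regex, we fall through with the last one.
        lines.append(candidate)
    return "\n".join(lines)

# Toy usage with a canned "model": the first line gets rejected and resampled.
canned = iter(["m = re.search('a+b', s)", "m = parse(s)", "return m"])
print(generate_without_regexes(lambda prefix: next(canned), n_lines=2))
```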

This post was an early articulation of many of the arguments and concepts that we mostly workshopped into the AI control research direction.

In particular, I think the final paragraph of the conclusion holds up really well:

> But I'm more excited about the meta level point here: I think that when AI developers are first developing dangerously powerful models, alignment researchers will be doing a very different kind of activity than what they do now. Right now, alignment researchers have to do a lot of long range extrapolation: they don't have access to either the models or the settings in which they actually care about ensuring good behavior. I think that alignment researchers haven't quite appreciated the extent to which their current situation is disanalogous to the situation at crunch time. It seems likely to me that at crunch time, the core workflow of alignment practitioners will involve iterating on alignment techniques based on direct measurement of their effect on the evaluations that are used in safety arguments; more generally, I expect that alignment practitioners will be doing much less extrapolation. I’m therefore eager for alignment researchers to phrase their techniques in these terms and practice the kind of eval-based technique iteration that I think will be crucial later.

I think this is indeed an important insight that people were underrating.

This post was also an important step in the evolution of my philosophy for reasoning about AI safety research. I used to have a kind of "gotcha" attitude towards safety research, where if there's any conceivable situation in which a technique wouldn't work, I felt some disdain for the technique. We wrote this post around the time that Ryan was first really getting out of this mindset and saying "no, I want to understand the limitations of techniques in detail; we'll write down all the strategies we can try as the blue team, because there are a lot of them and together they're probably pretty strong; we will write down the limitations for each, but we won't treat these techniques as contemptible just because they're not indefinitely scalable".

My main regret is the length of the title, which Ryan mercilessly mocks me for every time I try to cite the post to someone.

I'm also sad that no one has really developed these arguments further. There still isn't much empirical research on this topic, which I'd now call the AI control perspective on scalable oversight, despite the fact that it seems very important. (The password-locked models paper is one attempt at doing some basic science on AI control for oversight.)

I also think it would be good if someone really carefully wrote out the connections between these arguments and various other vaguely similar ideas, e.g. prover-verifier games.
