Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Posts

Sorted by New

3Buck's Shortform

33A sketch of an AI control safety case

2mo

62Ten people on the inside

2mo

16Early Experiments in Human Auditing for AI Control

2mo

41Thoughts on the conservative assumptions in AI control

2mo

32Measuring whether AIs can statelessly strategize to subvert security measures

3mo

192Alignment Faking in Large Language Models

2mo

27Why imperfect adversarial robustness doesn't doom AI control

4mo

32Win/continue/lose scenarios and execute/replace/audit protocols

4mo

54Sabotage Evaluations for Frontier Models

5mo

50Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming

6mo

Wikitag Contributions

Comments

Sorted by

Newest

The case for ensuring that powerful AIs are controlled

Buck Shlegeris15d20

I am not sure I agree with this change at this point. How do you feel now?

Buck's Shortform

Buck Shlegeris25d313

Alignment Forum readers might be interested in this:

Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵ControlConf will bring together:
Researchers from frontier labs & government
AI researchers curious about control mechanisms
InfoSec professionals
Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature presentations, technical demos, panel discussions, 1:1 networking, and structured breakout sessions to advance the field.
Interested in joining us in London on March 27-28? Fill out the expression of interest form by March 7th: https://forms.fillout.com/t/sFuSyiRyJBus

How might we safely pass the buck to AI?

Buck Shlegeris1mo32

Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)

How might we safely pass the buck to AI?

Buck Shlegeris1mo615

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), but I don't think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that's a couple bits of better performance, but I don't think it's massive outperformance. (And I still think it's plausible that AGI isn't developed by 2036!)

In general, I think that disagreements on AI timelines often seem more extreme when you summarize people's timelines by median timeline rather than by their probability on AGI by a particular time.

Research directions Open Phil wants to fund in technical AI safety

Buck Shlegeris2mo63

Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

I don't know what this means, do you have any examples?

Research directions Open Phil wants to fund in technical AI safety

Buck Shlegeris2mo128

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Ten people on the inside

Buck Shlegeris2mo53

Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent already. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.

Ten people on the inside

Buck Shlegeris2mo83

Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.

What laws are you imagine making it harder to deploy stuff? Notably I'm imagining these people mostly doing stuff with internal deployments.

I think you're overfixating on the experience of Google, which has more complicated production systems than most.

Ten people on the inside

Buck Shlegeris2mo*145

I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.

In my experience, the main reason that most of these people left was that they found it very unpleasant to working there and thought their research would be better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against leaving if the company has very few safety staff, but it depends on their skillset.

(These arguments don't apply to Anthropic: there are many people there who I think will try to implement reasonable safety techniques, so on the current margin, the benefits to safety technique implementation of marginal people seems way lower. It might still make sense to work at Anthropic, especially if you think it's a good place to do safety research that can be exported.)

Incidentally, I'm happy to talk to people who are considering leaving AI companies and give them much more specific advice.

Ten people on the inside

Buck Shlegeris2mo31

Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.