Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Wiki Contributions

Comments

Sorted by

I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.

I think it’s a bit stronger than the usual paralysis argument in this case, but yeah.

Note that people who have a non-consequentialist aversion for risk of causing damage should have other problems with working for Anthropic. E.g. I suspect that Anthropic is responsible for more than a million deaths of currently-alive humans in expectation.

My reasoning is: “If the AI won, who cares if we know it or not? We’ll find out eventually :/“

This isn’t totally correct but it seems like a fine approximation given all the other approximations here.

When I'm thinking about AI control in the context of concentrated failures--that is, cases where an unacceptable outcome might be caused by a small number of bad AI actions, e.g. the setting explored in our control paper--I find it helpful to use the following simplified models.

Let's consider our AI deployment from both a micro perspective, where we just think about a single task the AI is doing, and a macro perspective, where we think about the deployment as comprised of many tasks.

From the micro perspective, when an AI tries to attack on a particular task, this can end in three ways:

  • The attack succeeds, and so the AI wins.
  • We catch it the AI red-handed before the attack completes, and so it loses.
  • The attack fails silently; we don't notice it, and so the game continues as before.

And from the macro perspective, the game ends one of these ways:

  • An attack succeeded; the AI wins.
  • You caught the AI; you win.
  • The game finished naturally because all the tasks were completed; you win.

I often ignore the possibility of the game finishing naturally, because that won't happen if the AI attempts to attack often enough, which is the right strategy for the AI in many games.

Under that assumption, the probability of AI victory is the ratio between the probability that an attack ends in success and the probability that it ends with us catching the AI--the probability that the game continues doesn't matter, because if the game continues, the AI will eventually attack again on a later task.

I also often think about cases where you need to catch the AI some larger number of times (let's say 30 times) before you win; you can approximate this by reducing P(catch) by 30x and increasing P(silent failure) to compensate.

  • I'm somewhat more pessimistic than I was in the past about the prospects for catch-then-shut-down. In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
  • I am still optimistic about few-shot catastrophe prevention; this post is just about whether you can shut down, not whether you can take advantage of the escapes to improve safety (if you try to).
  • As you note, one difference between this post and my previous claims is that in the past I've mostly talked about what an AI lab would do to mitigate danger from its own models, and here I also talk about other labs' models.

(Note also that the old posts you linked were coauthored with Ryan, so there's some variation from the fact that he and I don't have exactly the same beliefs.)

I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).

I disagree, I'd rather they'd hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)

I mostly just mean that when you're modeling the problem, it doesn't really help to think about the AIs as being made of parts, so you end up reasoning mostly at the level of abstraction of "model" rather than "parameter", and at that level of abstraction, there aren't that many pieces.

Like, in some sense all practical problems are about particular consequences of laws of physics. But the extent to which you end up making use of that type of reductionist model varies greatly by domain.

Yep that's very fair. I agree that it's very likely that AI companies will continue to be misleading about the absolute risk posed by their actions.

I think it makes sense to use the word "reasonable" to describe someone who is taking actions that minimize total risk, even if those actions aren't what they'd take in a different situation, and even if various actors had made mistakes to get them into this situation.

(Also note that I'm not talking about making wildly superintelligent AI, I'm just talking about making AGI; my guess is that even when you're pretty rushed you should try to avoid making galaxy-brained superintelligence.)

Load More