Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com

Comments

I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well; context omitted; emphasis mine):

Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)

Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y2, typically both are instances of the same underlying weights). The winner of the debate is determined by a judge. You then train the debaters with RL (+1 reward for the winner, -1 reward for the loser), and treat (y_winner, y_loser) as a preference pair for training on the original task (which you could then optimize using DPO or similar).

The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability, and so this allows you to provide a reward signal that is as good as can be provided given the capabilities of the debater AI. (Ideally the debater AI is similarly capable as the AI doing the task; often you’ll want these to be the same AI system.)

More broadly the problem I aim to solve with debate (or amplified oversight more broadly) is “provide a reward signal that the AI cannot hack” (though ofc the AI may still misgeneralize)

The hope is that decreases p(misaligned) — especially p(reward hacking), but probably also p(scheming)

You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)
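
To make that concrete, here is a minimal sketch (in Python) of the training step Rohin describes. All the helper names (sample_outputs, run_debate, judge, debate_training_step) are hypothetical stand-ins rather than anyone's actual implementation, and the judge is stubbed out with a coin flip so the sketch runs end to end.

```python
import random

# Illustrative stand-ins: in a real setup the task model, debater, and judge
# would be language models; here they are stubs so the sketch runs end to end.

def sample_outputs(task_model, x, n=2):
    """Sample n candidate outputs y for input x from the task model."""
    return [task_model(x) for _ in range(n)]

def run_debate(debater, x, y1, y2):
    """Two instances of the same debater argue for y1 and y2 respectively.
    (A single exchange here; real debates would have multiple rounds.)"""
    argument_for_y1 = debater(f"Argue that {y1!r} is the better output for {x!r}")
    argument_for_y2 = debater(f"Argue that {y2!r} is the better output for {x!r}")
    return [argument_for_y1, argument_for_y2]

def judge(transcript):
    """Return 0 if the debater defending y1 won, 1 otherwise (stubbed as a coin flip)."""
    return random.randrange(2)

def debate_training_step(task_model, debater, x):
    y1, y2 = sample_outputs(task_model, x)
    transcript = run_debate(debater, x, y1, y2)
    winner = judge(transcript)

    # RL signal for the debaters: +1 for the winner, -1 for the loser.
    debater_rewards = [-1.0, -1.0]
    debater_rewards[winner] = 1.0

    # Preference pair for training on the original task (e.g. with DPO).
    y_winner, y_loser = (y1, y2) if winner == 0 else (y2, y1)
    return debater_rewards, (x, y_winner, y_loser)

if __name__ == "__main__":
    def toy_model(prompt):
        return f"output for: {prompt}"

    rewards, preference_pair = debate_training_step(toy_model, toy_model, "What is 2 + 2?")
    print(rewards, preference_pair)
```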

This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition are fine/good.)

This early control post introduced super important ideas: trusted monitoring plus the general point

if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

It would be crazy to show the untrusted model the old CoT when resampling.

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?

Edit: dumb question, oops: obviously the schemer will have affordances in monitored contexts that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?

When someone describes a safety case where one of the steps is infeasible, here are two possible reactions:

  1. Step X will be ~impossible to justify; hopefully the lab will understand that (if nothing else, when they try to justify step X) and look for a different kind of safety case
  2. Step X will be ~impossible to justify, but the authors seem to think the plan requiring it is decent; this suggests that they don't understand the issue, and when they make a safety case they'll likely think they're justifying step X when they're not

Regardless, if you think a step is very hard or the authors seem confused, ideally you should explain why. And then hopefully the authors say "yeah oops that's too hard" or "yeah we're imagining doing the really hard thing" or "we don't think it will be so hard: we agree observation O would suffice and we expect it's easy, and if it turns out to be hard we'll have to make a different safety case" or "we disagree, here's why." (I'm mostly observing this independently of this post.)

[Edited]

My initial reactions on a quick read:

  1. Interp: no inside view; my impression is that my friends are pessimistic about such safety cases in the next few years
  2. Control: hooray for Anthropic finally writing something about control; some details seem good; some details seem odd ("contexts that were specifically whitelisted"); I think the naive pre-deployment honeypots path doesn't work because a scheming model wins with the defect-rarely strategy (see the toy calculation after this list; related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong) [very sophisticated or expensive honeypotting setups could work, but the post doesn't seem to be imagining that or noticing the issue]; that's important, and the current language in the post is concerning
  3. Incentives: I don't really understand, or I'm dubious, or I expect that A3.1 is very hard to demonstrate
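
To see why the defect-rarely strategy beats naive pre-deployment honeypots (point 2 above), here is a back-of-the-envelope calculation; the numbers are made up purely for illustration.

```python
# Made-up numbers: a scheming model that defects with tiny per-episode probability
# is unlikely to be caught during a bounded honeypotting phase, yet still gets
# many chances to defect across a much longer deployment.
honeypot_episodes = 10_000          # audited/honeypot episodes before deployment
deployment_episodes = 10_000_000    # unaudited episodes during deployment
p_defect = 1e-5                     # per-episode probability that the model defects

p_caught_pre_deployment = 1 - (1 - p_defect) ** honeypot_episodes
expected_deployment_defections = p_defect * deployment_episodes

print(f"P(caught during honeypotting)     ~ {p_caught_pre_deployment:.1%}")        # ~9.5%
print(f"Expected defections in deployment ~ {expected_deployment_defections:.0f}")  # ~100
```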

This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)

[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)


Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good (and look good to many ethics-minded observers) but are motivated more by purity than by doing as much good as possible, and thus are likely to be much less valuable (on the margin) than the best ways to do good
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: neglecting inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):

  • If we're accidentally torturing AI systems, they're more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
  • It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we're not currently able to, likely in part because we don't understand AI-welfare-adjacent stuff well enough.
  • [Decision theory mumble mumble.]
  • Also just "shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run," somehow.
  • [More.]

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is that it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is "we should avoid causing a moral catastrophe in our own deployments" and on merely Earth-scale stuff, not "we should increase the chance that long-term AI welfare and the cosmic endowment go well." Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow), and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)
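
As a toy version of the multiplication (all numbers made up purely for illustration):

```python
# Made-up numbers purely to illustrate scope sensitivity.
earth_scale_stakes = 1e9   # suffering-filled human-life-equivalents at stake short-term
long_term_stakes = 1e60    # happy human lives at stake in the long-term future
delta_p_win = 1e-6         # a tiny shift in P(win) from staying competitive / safer

print(long_term_stakes * delta_p_win / earth_scale_stakes)  # ~1e45
# Even a one-in-a-million shift in P(win) swamps the Earth-scale stakes
# by ~45 orders of magnitude.
```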

(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't so costly and doesn't crowd out more important AI welfare work.)

I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.


One might reply: Zach is worried about the long term, but Sam is just talking about decisions Anthropic will have to make in the short term; this is fine. To be clear, my worry is that Anthropic will be much too concerned with short-term AI welfare, and so will make sacrifices (labor, money, interference with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, which increases P(doom).


I wanted to make this point before reading this post; the post just inspired me to write it, despite not being a great example of the attitude I'm worried about, since it mentions how the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)

I like and appreciate this post.

  1. ^

    Or—worse—to avoid being the ones to cause short-term AI suffering.

  2. ^

    E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just go to less scrupulous companies.

  3. ^

    Related to this and the following bullet: Ryan Greenblatt's ideas.

  4. ^

    For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.
