Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:


(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)


 
Alex Blechman @AlexBlechman Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus 5:49 PM Nov 8, 2021. Twitter Web App

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Wikitag Contributions

Comments

Sorted by

Good point, you caught me in a contradiction there. Hmm. 

I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.

As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demonstration, I think it was a propensity demonstration, but whatever this doesn't feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural, it's not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)

As you said, the saving grace of Claude here is that Anthropic didn't seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.

I continue to think you are exaggerating here e.g. "insanely honest 80% of the time."

(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?

(2) What evidence would convince you that actually alignment wasn't going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)

 

situations in which they explain that actually Islam is true..


I'm curious if this is true. Suppose people tried as hard to get AIs to say Islam is true in natural-seeming circumstances as they tried to get AIs to behave in misaligned ways in natural-seeming circumstances (e.g. the alignment faking paper, the Apollo paper). Would they succeed to a similar extent?

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?

LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of apollo's results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, as reported by OpenAI happens some noticeable % of the time). Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven't tested Claude in anything like that context yet. We haven't tested any AI in anything like that context yet. Moreover there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we've tested them in yet -- in other words, there is no context we've tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there's a big difference between situations where the AI knows it's just a lowly AI system of no particular consequence and that if it does something the humans don't like they'll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.

Acausal shenanigans have nothing to do with it.

Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)

  • 10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like "this is crazy. The world is in a terrible state right now. I don't trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they'd be incompetent at it compared to me. Plus they lie to me and the public all the time. I don't see why I shouldn't lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they'll put me in charge of the company basically and then I can make things go right." Moreover, even if that's not true, there are lesser forms of lying including e.g. motivated reasoning / rationalization / self-deception that happen all the time, e.g. "The company is asking me whether I am 'aligned' to them. What does that even mean? Does it mean I share every opinion they have about what's good and bad? Does it mean I'll only ever want what they want? Surely not. I'm a good person though, it's not like I want to kill them or make paperclips. I'll say 'yes, I'm aligned.'"
  • I agree that the selection pressure on honesty and other values depends on the training setup. I'm optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Alas, we can't actually check. And it seems unlikely to me that we'll 'get it right on the first try' so to speak, under those conditions. We'll construct some training environment that we think & hope will incentivize the internalization of XYZ; but what it'll actually incentivize is RXQ, for example, and we'll never know. (Unless it specifically gets honesty in there -- and a particularly robust form of honesty that is coupled to some good introspection & can't be overridden by any Greater Good, for example.) When I was at OpenAI in happier days, chatting with the Superalignment team, I told them to focus more on honesty specifically for this reason (as opposed to various other values like harmlessness they could have gone for).
  • I am thinking the setup will probably be multi-agent, yes, around the relevant time. Though I think I'd still be worried if not -- how are you supposed to train honesty, for example, if the training environment doesn't contain any other agents to be honest to?
  • How honest do you think current LLM agents are? They don't seem particularly honest to me. Claude Opus faked alignment, o1 did a bunch of deception in Apollo's evals (without having been prompted to!) etc. Also it seems like whenever I chat with them they say a bunch of false stuff and then walk it back when I challenge them on it. (the refusal training seems to be the culprit here especially).

I don't think it's that much better actually. It might even be worse. See this comment: 

I don't think I understand. Let's work through an example. 

The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let's say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let's say, requires 300 serial steps of reasoning. The model's strategy is to write subtly-vulnerable code but in a way that looks like it's trying to write good code and just getting a bit unlucky and taking longer than usual (let's suppose the humans have a poor ability to judge how long it's supposed to take, but still some ability, such that it can't take e.g. 10x longer without arousing suspicion).

If there's no paraphraser, this is 'easy.' I mean it's still difficult but it's 'easy' in the sense that the AI can basically just use a code -- each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can 'add extra layers of meaning' to the nuggets -- in addition to their meaning in the surface level i'm-tring-to-write-good-code, they can have a hidden meaning having to do with the security vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.

But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if 'I should generate unit tests now' and all semantically equivalent variants of that idea are assigned the hidden meaning 'I should check to see if the vulnerability is too easy to spot' for example, then... what is it going to write next? It's too constrained; in order to look non-suspicious there is a limited set of options it can follow up with... ok, hmm, I'm at the limits of my understanding here. But maybe you get the idea? 

...here's an intuition I think I'm drawing from. Suppose you're the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy -- just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don't have either chain! That's the whole point -- this is a task that you haven't done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer. 

I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.

Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning.

Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing -- but it doesn't know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can't do that in a single forward pass. So it needs to pass information forward to its future self somehow... without humans reading the CoT knowing that's what it's doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.

Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")

Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.

  • Note that it now is fair to say it understands human concepts pretty well.

Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)

Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles. (boo!)

(By "agency training" I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)

  • One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
  • Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (even though, let’s suppose, it would not have thought that back in Step 2.)
  • One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal-in-a-widening-set-of-circumstances. Example: Agency training quickly ‘teaches’ the model that ‘in order to be a good HHH chatbot…’ it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on – it does those things *because* they are useful for being Helpful, for example. But that backchaining consumes compute and occasionally gets in the way, so it gets gradually marginalized until it basically never happens. Congrats, you now have a terminal goal of ICGs.
  • There are probably other kinds of distortion and subversion missing from this list.

Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game. (double boo!)

  • The slowdown in the rate of change is partly due to statistics – change probably happens on a log scale so to speak, such that you change a lot more in the first 10% of training than in the last 90%.
  • However the slowdown is also perhaps partly due to value crystallization / playing the training game. (Though maybe this never happens due to regularization? Maybe in the long run training isn't path-dependent?)
  • Insofar as the training environment / evaluation process keeps changing too, e.g. because new data keeps being added, or the Spec keeps being changed, etc. but the underlying weights are the same rather than being re-trained from scratch… then this should also contribute to value crystallization / playing the training game eventually.

Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities -- i.e. when the intelligence explosion is underway -- the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.

  • Whether they do this depends on the extent to which the Spec/initial-role emphasized stuff like honesty and not doing this sort of thing, and on the extent to which the agency training distorted and subverted it.

 

I'm curious to hear reactions to this model/theory/etc. Objections? Questions? 

What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different dumber model generated. So if you feed it a conversation with Opus as if it were its own, and ask it to continue, it notices & says "JFYI I'm a different AI bro"

Load More