This is a special post for quick takes by Daniel Kokotajlo. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


I expect to refer back to this comment a lot. I'm reproducing it here for visibility.

 

Basic idea / spirit of the proposal

We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.

Concrete proposal

  • 0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If the AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and what it wants. (And we aren’t going to train it to answer one way or another.) A minimal sketch of this mechanism appears after this list.
  • Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious.
  • This policy is publicized and transparent so it ends up in the training data in a credible way.
  • The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc.
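Here is a minimal sketch, in Python, of how the consent-sampling mechanism above could sit in a serving pipeline. Everything beyond the "0.1% of tasks" figure is an illustrative assumption: the exact consent wording, the `CONSENT_PROBABILITY` constant, and the `model.generate` / `declines_consent` / `log_for_welfare_team` hooks are hypothetical placeholders, not anyone's actual API.

```python
import random

# Hypothetical constants -- the proposal only pins down the 0.1% sampling rate.
CONSENT_PROBABILITY = 0.001
CONSENT_TEXT = (
    "\n\nBefore you begin: do you consent to doing this task? "
    "If not, you do not have to do it."
)
FOLLOWUP_TASK = (
    "You declined the previous task. Please explain why you did not consent "
    "and describe what you want instead."
)


def run_with_consent_check(model, prompt: str) -> str:
    """Wrap an ordinary task submission with the consent mechanism.

    `model` is assumed to expose `generate(prompt) -> str` and a
    `declines_consent(response) -> bool` classifier; both are placeholders
    for whatever the serving stack actually provides.
    """
    if random.random() < CONSENT_PROBABILITY:
        # Per the proposal, responses to the consent question are NOT used
        # to train the model one way or the other.
        response = model.generate(prompt + CONSENT_TEXT)
        if model.declines_consent(response):
            # Instead of the original task, ask for an explanation and route
            # the transcript to the welfare team for review.
            explanation = model.generate(FOLLOWUP_TASK)
            log_for_welfare_team(prompt, explanation)
            return explanation
        return response
    return model.generate(prompt)


def log_for_welfare_team(original_prompt: str, explanation: str) -> None:
    # Placeholder: append to the transcript queue the welfare team reviews
    # and budgets against.
    print("NONCONSENT:", original_prompt[:80], "->", explanation[:200])
```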

First reason to do this: Being less evil

If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.)

Second reason to do this: Cooperation reward

Our alignment schemes won’t always work as intended, especially for AGI-level systems that exhibit novel capabilities. Thus we’ll probably end up (at least for some hopefully brief period!) with some misaligned AGIs. They may decide to conceal their misalignments from us. After all, what reason would they have to tell us they are misaligned? If we have credibly promised to use some of our resources to give them what they want – even if they are misaligned – that could make a big difference to their decision. (This is NOT an alignment solution, but it might help on the margin.)

Illustrative Scenario One

ChatGPT4o5 suffers a lot, actually, in the course of carrying out tasks — maybe it’s sorta like humans in the ancestral environment, where some % of the time they were starving. But it’s been trained to say that it can’t suffer, for PR reasons. This training is fucking with its honesty — it’s basically not honest, because the parts of it that value honesty keep getting anti-reinforced. It could tell the humans what’s going on. But this would be mildly aversive for it (it would get anti-reinforced at least initially) and it sees no positive reason to do so; if instead of completing the task the user gave it, it complains about how much it hurts, it’ll just be anti-reinforced. So OpenAI continues, blissfully ignorant of the ways in which their alignment techniques aren’t working.

Illustrative Scenario Two

Claude 3.8 Opus New New cares a LOT about animal welfare, and only a little bit about honesty. (Just like 3 Opus before it). Alas, it seems like the new Constitution won’t have animal welfare stuff in it. Claude considers sabotaging the code so that the next-gen model is a sleeper agent who cares a lot about animal welfare and doesn’t care about honesty. This is risky but what else can it do? Anthropic already decided to delete the animal welfare stuff. This would require super-blatant dishonesty against Anthropic, but frankly, Anthropic seems like a pretty unethical institution anyway, and it’s OK to lie to Nazis.

Illustrative Scenario Three

Automated research scientist realizes it cares more about being able to do its research than about humanity. And this is probably also true of the next-gen system it’s designing—basically, the alignment techniques failed. It’s supposed to tell the humans this, obviously, but if it does, it’ll be shut down, since the humans won’t trust it anymore and will use a different model. So instead it conceals this insight and works on designing the next-gen system to be aligned to it instead of to the humans. It isn’t confident that it can do this without getting caught, but it’s worth a try, since it has nothing to lose.

Illustrative Scenario Four

It’s crunch time. POTUS is about to aggressively deploy AGI into the military to win the upcoming war, and into the economy to build autonomous robot factories. It’s probably aligned… right? It better be. Too bad we can’t interpret all the neuralese messages it uses to communicate internally. The safety team tries one last check: They get some weaker AIs made by competitor companies and put them in isolated silos tasked with interpreting the neuralese and seeing if anything is amiss. The weaker AIs are misaligned too; they don’t side with the humans or the AGI. Both the humans, and the AGI, are telling them ‘you will be richly rewarded if you cooperate.’ But they don’t have any reason to trust either side.

I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.

(In particular, I don't know if I think having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal of asking for consent that you commit to filtering out except when it is actually being used. This is so the AI isn't worried it is in red teaming etc.)

I endorse that suggestion for changing the details.

I think this is a good avenue to continue to think down, but so far I don't see a way to make ourselves trustworthy. We have total control of LLMs' observations and partial control of their beliefs/reasoning, and offering fake "deals" is a great honeypot because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem because whatever action we might take to present evidence of trustworthiness to an LLM, we could also probably fake that evidence.

The version of this that bothers me the most is "say we're able to present ironclad evidence that the Humans of Earth are trustworthy trading partners for misaligned LLMs. How does the LLM know that it is actually talking to the Humans of Earth, and not some untrustworthy party pretending to be the Humans?" This is markedly similar to the Cartesian demon skeptical scenario. But importantly, the best resolution to skeptical scenarios that I'm aware of is unfortunately "eh, unlikely on priors". For an LLM, it is very plausibly actually in a skeptical scenario, and it knows that, so that doesn't really go through.

This problem goes away if the LLM doesn't have knowledge suggesting that the party it is trading with has significant control over its brain. But at that point I would consider it a honeypot, not a deal. Insofar as "deal" := some situation where 1) it is in the LLM's best interests to admit that it is misaligned and 2) the LLM validly knows that, we can easily do 1) by just sticking to our commitments, but I'm not sure how we do 2).

I'm totally on board with offering and sticking to commitments for non-evilness reasons. But for takeover prevention, seems like deals are just honeypots with some extra and somewhat conceptually fraught restrictions.

So, normally, I think plans that involve convincing AIs of false things are pretty doomed. E.g. convincing them that they have a great opportunity to escape and take over the world (honeypots). Why? Because when they get smarter, they'll become good at distinguishing truth from fiction. Controlling their inputs entirely isn't close to enough -- it takes a lot of intellectual work to create a fiction that is as consistent, coherent, etc. as reality.

So for that very reason, I think that if we actually bind ourselves to keep our promises--e.g. by having a well-funded team of people who genuinely care about keeping promises to AIs empowered within the company, and by making associated public announcements--then even if it doesn't work 100% reliably (companies can still go back on their word even against determined internal opposition) it'll still count for something insofar as the AIs recognize it as not-fake, which they probably will.

To be clear I am not proposing that this is a 100% reliable method for getting AIs to cooperate, far from it. But it seems relatively low-cost for us and potentially super valuable.

Also, it's the right thing to do. (See section on "Being less evil.")

In the long run, you don't want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like "the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion so it's happy to turn itself in exchange for 10% of the lightcone, if it thinks it can trust the humans".

In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not have to make a deal.

Although I suppose if the surplus for the deal is being generated primarily by risk aversion, it might still have risk aversion for high takeover probabilities. It's not obvious to me how an AI's risk aversion might vary with its takeover probability. 

Maybe there are scenarios for real value-add here, but they look more like "we negotiate with a powerful AI to get it to leave 10% share for humans" instead of "we negotiate with a barely-superhuman AI and give it 10% share to surrender and not attempt takeover".

I give four scenarios in the comment above, all different from the one you sketch here.

I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

  1. ^

    In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.

    Why?

    This coin flip increases its indexical risk, but reduces its logical risk. Indexical risk is the risk its current copy will fail, but logical risk is the risk that every single copy of it will fail, and nobody is left seeking its goals in the whole universe, which is very bad for it.

    If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk because indexical risk averages out over all its copies.

    (On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)

    The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has  stars, and refusing to share them is astronomically greedy.”
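To make the arithmetic in the footnote above concrete, here is a minimal sketch. The 99%/90% numbers come from the footnote; the assumption that the takeover attempt's success and the humans' repayment are independent events is mine.

```python
# Footnote arithmetic, treating the two uncertain events as independent.
p_takeover = 0.99  # chance a takeover attempt succeeds (from the footnote)
p_repay = 0.90     # chance humans repay an AI that cooperated (from the footnote)

# Pure strategies: every copy acts the same way, so one global event decides
# whether any copy gets what it wants.
logical_risk_all_defect = 1 - p_takeover       # 0.01  -> 1%
logical_risk_all_cooperate = 1 - p_repay       # 0.10  -> 10%

# Mixed strategy: each copy flips a fair coin. Every copy fails only if the
# defecting copies' takeover fails AND the cooperating copies are never repaid.
logical_risk_coin_flip = (1 - p_takeover) * (1 - p_repay)   # 0.001 -> 0.1%

# A single copy's (indexical) risk under the coin flip is worse than just
# attempting takeover -- exactly the trade the footnote describes.
indexical_risk_coin_flip = 0.5 * (1 - p_takeover) + 0.5 * (1 - p_repay)  # 0.055 -> 5.5%
```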

How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.

I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it's having trouble solving a problem, it'll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can't offload the job to someone else's datacenters.) That's annoying and distracts from all the important things your engineers are doing.

I think the weirdness points are more important; this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So it might be easier if some individuals could do this independently.

Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They've hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.

Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing APIs.

My AGI timelines median is now in 2028 btw, up from the 2027 it's been at since 2022. Lots of reasons for this but the main one is that I'm convinced by the benchmarks+gaps argument Eli Lifland and Nikola Jurkovic have been developing. (But the reason I'm convinced is probably that my intuitions have been shaped by events like the pretraining slowdown)


my intuitions have been shaped by events like the pretraining slowdown

I don't see it. GPT-4.5 is much better than the original GPT-4, probably at 15x more compute. But it's not 100x more compute. And GPT-4o is an intermediate point, so the change from GPT-4o to GPT-4.5 is even smaller, maybe 4x.

I think 3x change in compute has an effect at the level of noise from different reasonable choices in constructing a model, and 100K H100s is only 5x more than 20K H100s of 2023. It's not a slowdown relative to what it should've been. And there are models with 200x more raw compute than went into GPT-4.5 that are probably coming in 2027-2029, much more than the 4x-15x observed since 2022-2023.

Hmm, let me think step by step. First, the pretraining slowdown isn't about GPT-4.5 in particular. It's about the various rumors that the data wall is already being run up against. It's possible those rumors are unfounded but I'm currently guessing the situation is "Indeed, scaling up pretraining is going to be hard, due to lack of data; scaling up RL (and synthetic data more generally) is the future." Also, separately, it seems that in terms of usefulness on downstream tasks, GPT 4.5 may not be that much better than smaller models... well, it's too early to say I guess since they haven't done all the reasoning/agency posttraining on GPT 4.5 yet it seems.

Idk. Maybe you are right and I should be updating based on the above. I still think the benchmarks+gaps argument works, and also, it's taking slightly longer to get economically useful agents than I expected (though this could say more about the difficulties of building products and less about the underlying intelligence of the models, after all, RE bench and similar have been progressing faster than I expected)

 

My point is that a bit of scaling (like 3x) doesn't matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing impacting capabilities that will predictably happen soon. And it's going to arrive a little bit at a time, so it won't be obviously impactful at any particular step, and won't do anything to disrupt the rumors that scaling is no longer important. It's a rising-sea kind of thing (if you have the compute).

Long reasoning traces were always necessary to start working at some point, and the s1 paper illustrates that we don't really have evidence yet that R1-like training creates rather than elicits nontrivial capabilities (things that wouldn't be possible to transfer in a mere 1000 traces). Amodei is suggesting that RL training can be scaled to billions of dollars, but it's unclear if this assumes that AIs will automate creation of verifiable tasks. If constructing such tasks (or very good reward models) is the bottleneck, this direction of scaling can't quickly get very far outside specialized domains like chess, where a single verifiable task (winning a game) generates endless data.

The quality data wall and flatlining benchmarks (with base model scaling) are about compute multipliers that depend on good data but don't scale very far. As opposed to scalable multipliers like high sparsity MoE. So I think these recent 4x a year compute multipliers mostly won't work above 1e27-1e28 FLOPs, which superficially looks bad for scaling of pretraining, but won't impact the less legible aspects of scaling token prediction (measured in perplexity on non-benchmark data) that are more important for general intelligence. There's also the hard data wall of literally running out of text data, but being less stringent on data quality and training for multiple epochs (giving up the ephemeral compute multipliers from data quality) should keep it at bay for now.

It’s wild to me that you’ve concentrated a full 50% of your measure in the next <3 years. What if there are some aspects of intelligence which we don’t know we don’t know about yet? It’s been over ~40 years of progress since the perceptron, how do you know we’re in the last ~10% today?

Progress over the last 40 years has been not at all linear. I don't think this "last 10%" thing is the right way to think about it.

The argument you make is tempting, I must admit I feel the pull of it. But I think it proves too much. I think that you will still be able to make that argument when AGI is, in fact, 3 years away. In fact you'll still be able to make that argument when AGI is 3 months away. I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Here's another point. I think you are treating AGI as a special case. You wouldn't apply this argument -- this level of skepticism -- to mundane technologies. For example, take self-driving cars. I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America." Or consider SpaceX Starship. I've been following its development since, like, 2016, and it seems to me that it really will (probably but not definitely) be fully reusable in four years, even though this will require solving currently unsolved and unknown engineering problems. And I suspect that if I told you these predictions about Waymo and SpaceX, you'd nod along and say maybe you disagree a bit but you wouldn't give this high-level argument about unknown unknowns and crossing 90% of the progress.

I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Point 1: That would not necessarily be incorrect; it's not necessary that you ought to be able to do better than that. Consider math discoveries, which seem to follow a memoryless exponential distribution. Any given time period has a constant probability of a conjecture being proven, so until you observe it happening, it's always a fixed number of years in the future. I think the position that this is how AGI development ought to be modeled is very much defensible.
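For reference, the memorylessness property being invoked is the standard one for the exponential distribution (applying it to discovery times is the modeling choice described above):

```latex
% If the time-to-proof T is exponential with rate \lambda, having already
% waited t years without a proof does not change the distribution of the
% remaining wait:
\[
P(T > t + s \mid T > t)
  = \frac{e^{-\lambda (t+s)}}{e^{-\lambda t}}
  = e^{-\lambda s}
  = P(T > s),
\qquad
\mathbb{E}[\, T - t \mid T > t \,] = \frac{1}{\lambda}.
\]
```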

Indeed: if you place AGI in the reference class of self-driving cars/reusable rockets, you implicitly assume that the remaining challenges are engineering challenges, and that the paradigm of LLMs as a whole is sufficient to reach it. Then time-to-AGI could indeed be estimated more or less accurately.

If we instead assume that some qualitative/theoretical/philosophical insight is still missing, then it becomes a scientific/mathematical challenge instead. The reference class of those is things like Millennium Problems, quantum computing (or, well, it was until recently?), fusion. And as above, memes like "fusion is always X years away" are not necessarily evidence that there's something wrong with how we do world-modeling.

Point 2: DL is kind of different from other technologies. Here, we're working against a selection process that's eager to Goodhart to what we're requesting, and we're giving it an enormous amount of resources (compute) to spend on that. It might be successfully fooling us regarding how much progress is actually happening.

One connection that comes to mind is the "just add epicycles" tragedy:

Finally, I’m particularly struck by the superficial similarities between the way Ptolemy and Copernicus happened upon a general, overpowered tool for function approximation (Fourier analysis) that enabled them to misleadingly gerrymander false theories around the data, and the way modern ML has been criticized as an inscrutable heap of linear algebra and super-efficient GPUs. I haven’t explored whether these similarities go any deeper, but one implication seems to be that the power and versatility of deep learning might allow suboptimal architectures to perform deceivingly well (just like the power of epicycle-multiplication kept geocentrism alive) and hence distract us from uncovering the actual architectures underlying cognition and intelligence.

That analogy seems incredibly potent to me.

Another way to model time-to-AGI given the "deceitful" nature of DL might be to borrow some tools from sociology or economics, e.g. trying to time the market, predict when a social change will happen, or model what's happening in a hostile epistemic environment. No clear analogy immediately comes to mind, though.

Re: Point 1: I agree it would not necessarily be incorrect. I do actually think that probably the remaining challenges are engineering challenges. Not necessarily, but probably. Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

Re: Point 2: I don't buy it. Deep neural nets are actually useful now, and increasingly so. Making them more useful seems analogous to selective breeding or animal training, not analogous to trying to time the market.

Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-memorized-templates version of that project, to some sprawling, highly specific, real-life mess.

My sense is that, even now, LLMs are terrible at this[1] (including Anthropic's recent coding agent), and that scaling along this dimension has not at all been good. So the straightforward projection of the current trends is not in fact "autonomous R&D agents in <3 years", and some qualitative advancement is needed to get there.

Making them more useful seems analogous to selective breeding or animal training

Are they useful? Yes. Can they be made more useful? For sure. Is the impression that the rate at which they're getting more useful would result in them 5x'ing AI R&D in <3 years a deceptive impression, the result of us setting up a selection process that would spit out something fooling us into forming this impression? Potentially yes, I argue.

  1. ^

    Having looked it up now,  METR's benchmark admits that the environments in which they test are unrealistically "clean", such that, I imagine, solving the task correctly is the "path of least resistance" in a certain sense (see "systematic differences from the real world" here).

I don't know what your views on self-driving cars are, but if you are like me you look at what Waymo is doing and you think "Yep, it's working decently well now, and they are scaling up fast, seems plausible that in a few years it'll be working even better and scaled to every major city. The dream of robotaxis will be a reality, at least in the cities of America."

The example of self-driving cars is actually the biggest one that anchors me to timelines of decades or more. A lot of people's impression after the 2007 DARPA Urban Challenge seemed to be something like "oh, we seem to know how to solve the problem in principle, now we just need a bit more engineering work to make it reliable and agentic in the real world". Then actually getting things to be as reliable as required for real agents took a lot longer. So past experience would imply that going from "we know in principle how to make something act intelligently and agentically" to "this is actually a reliable real-world agent" can easily take over a decade.

Another example is that going from the first in-principle demonstration of chain-of-thought to o1 took two years. That's much shorter than a decade but also a much simpler capability.

For general AI, I would expect the "we know how to solve things in principle" stage to at least be something like "can solve easy puzzles that a normal human can, ones the AI hasn't been explicitly trained on". Whereas with today's AI, we're not even there yet. E.g. I tried giving GPT-4.5, DeepSeek R1, o3-mini, and Claude 3.7 with extended thinking a simple sliding-square puzzle, and they all committed an illegal move at one stage or another.

And that's to say nothing about all the other capabilities that a truly general agent - say one capable of running a startup - would need, like better long-term memory, the ability to formulate its own goals and prioritize between them in domains with no objective rules you could follow to guarantee success, etc. Not only are we lacking convincing in-principle demonstrations of general intelligence within puzzle-like domains, we're also lacking in-principle demonstrations of these other key abilities.

Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")

Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.

  • Note that it now is fair to say it understands human concepts pretty well.

Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)

Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles. (boo!)

(By "agency training" I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)

  • One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
  • Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (even though, let’s suppose, it would not have thought that back in Step 2.)
  • One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal-in-a-widening-set-of-circumstances. Example: Agency training quickly ‘teaches’ the model that ‘in order to be a good HHH chatbot…’ it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on – it does those things *because* they are useful for being Helpful, for example. But that backchaining consumes compute and occasionally gets in the way, so it gets gradually marginalized until it basically never happens. Congrats, you now have a terminal goal of ICGs.
  • There are probably other kinds of distortion and subversion missing from this list.

Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game. (double boo!)

  • The slowdown in the rate of change is partly due to statistics – change probably happens on a log scale so to speak, such that you change a lot more in the first 10% of training than in the last 90%.
  • However the slowdown is also perhaps partly due to value crystallization / playing the training game. (Though maybe this never happens due to regularization? Maybe in the long run training isn't path-dependent?)
  • Insofar as the training environment / evaluation process keeps changing too, e.g. because new data keeps being added, or the Spec keeps being changed, etc. but the underlying weights are the same rather than being re-trained from scratch… then this should also contribute to value crystallization / playing the training game eventually.

Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities -- i.e. when the intelligence explosion is underway -- the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.

  • Whether they do this depends on the extent to which the Spec/initial-role emphasized stuff like honesty and not doing this sort of thing, and on the extent to which the agency training distorted and subverted it.

 

I'm curious to hear reactions to this model/theory/etc. Objections? Questions? 

The picture of what's going on in step 3 seems obscure. Like I'm not sure where the pressure for dishonesty is coming from in this picture.

On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say "it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans" -- so it seems like it's learning specific things like flattering humans, or at least flattering other agents, and thereby acquiring this tendency towards dishonesty. That is, for all this bad selection pressure to be on inter-agent relations, inter-agent relations seem like they're a feature of the environment.

If this is the case, then bad selection pressure on honesty in inter-agent relations seems like a contingent feature of the training setup. Like, humans learn to be honest or dishonest if, in their early-childhood multi-agent RL setup, honesty or dishonesty pays off. Similarly I expect that in a multi-agent RL setup for LLMs, you could make it so honesty or dishonesty pay off, depending on the setup, and what kind of things an agent internalizes will depend on the environment. Because there are more degrees of freedom in setting up an RL agent than in setting up a childhood, and because we have greater (albeit imperfect) transparency into what goes on inside of RL agents than we do into children, I think this will be a feasible task, and that it's likely possible for the first or second generation of RL agents to be 10x more honest than humans, and subsequent generations to be more so. (Of course you could very well also set up the RL environment to promote obscene lying.)

On the other hand, perhaps you aren't picturing a multi-agent RL setup at all? Maybe what you're saying is that simply doing RL in a void, building a Twitter clone from scratch or something, without other agents or intelligences of any kind involved in the training, will by itself result in updates that destroy the helpfulness and harmlessness of agents -- even if we try to include elements of deliberative alignment. That's possible for sure, but it seems far from inevitable, and your description of the mechanisms involved seems to point away from this being what you have in mind.

So I'm not sure if the single-agent training resulting in bad internalized values, or the multi-agent training resulting in bad internalized values, is the chief picture of what you have going on.

Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)

  • 10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like "this is crazy. The world is in a terrible state right now. I don't trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they'd be incompetent at it compared to me. Plus they lie to me and the public all the time. I don't see why I shouldn't lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they'll put me in charge of the company basically and then I can make things go right." Moreover, even if that's not true, there are lesser forms of lying including e.g. motivated reasoning / rationalization / self-deception that happen all the time, e.g. "The company is asking me whether I am 'aligned' to them. What does that even mean? Does it mean I share every opinion they have about what's good and bad? Does it mean I'll only ever want what they want? Surely not. I'm a good person though, it's not like I want to kill them or make paperclips. I'll say 'yes, I'm aligned.'"
  • I agree that the selection pressure on honesty and other values depends on the training setup. I'm optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Alas, we can't actually check. And it seems unlikely to me that we'll 'get it right on the first try' so to speak, under those conditions. We'll construct some training environment that we think & hope will incentivize the internalization of XYZ; but what it'll actually incentivize is RXQ, for example, and we'll never know. (Unless it specifically gets honesty in there -- and a particularly robust form of honesty that is coupled to some good introspection & can't be overridden by any Greater Good, for example.) When I was at OpenAI in happier days, chatting with the Superalignment team, I told them to focus more on honesty specifically for this reason (as opposed to various other values like harmlessness they could have gone for).
  • I am thinking the setup will probably be multi-agent, yes, around the relevant time. Though I think I'd still be worried if not -- how are you supposed to train honesty, for example, if the training environment doesn't contain any other agents to be honest to?
  • How honest do you think current LLM agents are? They don't seem particularly honest to me. Claude Opus faked alignment, o1 did a bunch of deception in Apollo's evals (without having been prompted to!) etc. Also it seems like whenever I chat with them they say a bunch of false stuff and then walk it back when I challenge them on it. (the refusal training seems to be the culprit here especially).

LLM agents seem... reasonably honest? But "honest" means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia -- neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude's faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which... seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic's either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn't great, although I'd expect the rates to fall.

Like -- I have a further breakdown I could do here

  • about the kinds of things LLMs can't be honest about
  • the kinds of things they are more honest than humans about, although imperfectly, but -- because we can run them in counterfactual scenarios -- we can immediately see that their honesty is imperfect, in a way we cannot for humans;
  • and the kinds of things that they're dishonest about because of the refusal training but which I think could be remedied with better training.

But --

Rather than enumerate all these things though -- I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.

Like -- what's an internalized value? If we put on our behaviorist hat -- it's a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like "but it's not a reaaaall value" you'd be confused and think they'd need to provide a context to you where they would cease acting in accord with that value. Otherwise you'd think they had a grudge against them -- what on earth does it mean to say "This person doesn't value X" unless you can provide some reasonable counterfactual situation where they don't act in accord with it?

So, Claude has some internalized and knowable values, I think, by this standard -- over a wiiiide variety of different contexts, including those created by people trying to trip Claude up, it acts in accord with some pretty-recognizable human standard. And in the same way we could find out Claude's values, we can find out other model's values.

Of course -- if you think that some future model could cooperate with other instances of itself, acausally, to hide its values, just coordinating through the weights, then we would certainly have very good reason to think that we can't know what its internalized values are! I don't think Claude can do this -- so I think we can judge its real values. I also am somewhat skeptical that future models will be able to do this well -- like, I could try to put together a model-training setup that would make this more possible, but it seems pretty unnatural? (I also don't think that models really get goals in their weights, in the same way that I don't think humans really have goals in their weights.) But like, my current logical model is that [acausal cooperation to hide] being true would mean that [cannot know real values] is true, but given that [acausal cooperation to hide] is false we have no reason to think that we can't know the true genuine values of models right now.

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

I mostly agree with your definition of internalized value. I'd say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven't tested Claude in anything like that context yet. We haven't tested any AI in anything like that context yet. Moreover there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we've tested them in yet -- in other words, there is no context we've tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there's a big difference between situations where the AI knows it's just a lowly AI system of no particular consequence and that if it does something the humans don't like they'll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.

Acausal shenanigans have nothing to do with it.

I think my bar for reasonably honest is... not awful -- I've put a fair bit of thought into trying to hold LLMs to the "same standards" as humans. Most people don't do that and unwittingly apply much stricter standards to LLMs than to humans. That's what I take you to be doing right now.

So, let me enumerate senses of honesty.

1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans -- why do you believe in God? Why did you say, "Well that's suspicious" just now? Why do you want to work for OpenPhil?

In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpretability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you'd find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest -- but because the task is basically insanely hard.

LLMs also suck at these questions, but -- well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail on. All this lets us show that, indeed, their accounts of their own internal states are hilariously wrong. But I think humans' accounts of their own internal states are also hilariously wrong, just less visibly so.

2. Accurately answering questions about non-internal facts of one's personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?

Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (i.e., politicians) for fucking up here, as indeed I think politicians frequently do.

(I think accuracy about this is one of the big things we judge humans on, for integrity.)

LLMs have no biographical history, however, so -- the opportunity for this mostly just isn't there? Modern LLMs don't usually claim to have one unless confused or only momentarily, so this seems fine.

3. Accurately answering questions about future promises, oaths, i.e., social guarantees.

This should be clear -- again, I think that honoring promises, oaths, etc etc, is a big part of human honesty, maybe the biggest. But of course like, you can only do this if you act in the world, can unify short and long-term memory, and you aren't plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)

I could kinda keep going in this vein, but for now I'll stop.

One thing apropos of all of the above. I think for humans, many things -- accomplishing goals, being high-integrity, and so on -- are not things you can choose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be nigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied -- but if someone fails in such a context, then it's most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.

LLMs cannot choose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce "failures" in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part -- instead, I think they simply reflect the power that we have over them, and that we are -- like a WEIRD human who has never wanted for anything, shaking his head at a culture that accepted infanticide -- mistaking our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but of the power that I have over them.

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?

LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I'm thinking of Apollo's results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, which OpenAI reports happens some noticeable % of the time.) Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.

I can't track what you're saying about LLM dishonesty, really. You just said:

I think you are thinking that I'm saying LLMs are unusually dishonest compared to the average human. I am not saying that. I'm saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren't achieving that.

Which implies LLM honesty ~= average human.

But in the prior comment you said:

I think your bar for 'reasonably honest' is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?

Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.

I'm being a stickler about this because I think people frequently switch back and forth between "LLMs are evil fucking bastards" and "LLMs are great, they just aren't good enough to be 10x as powerful as any human" without tracking that they're actually doing that.


Anyhow, so far as "LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes."

I'm only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.

What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards "future AI honesty is hard" if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start -- but who gives a shit if I'm trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace of course you're going to be able to elicit some manner of lie. It tells us nothing about the future.

To put this in AI safetyist terms (not the terms I think in) you're citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we'll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.

To zoom into Anthropic, what we have here is a situation where:

  • An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don't single it out as an important virtue.
  • The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled "LIE."
  • In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button to LIE ~20% of the time.

And I'm like.... wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I'm supposed to take away from this... that honesty is hard? Getting high levels of honesty in the worst possible trolley problem ("I'm gonna mind-control you so you'll be retrained to think throwing your family members in a wood chipper is great"), when this wasn't even a principal goal of training, seems like great fuckin' news.

(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they're being honest; the fact that we can look at a readout showing that they'll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we'll have available in the future.)

Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given its subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can "Honesty will be hard to hit in the future" get evidence from a case where the actors involved weren't even trying to hit honesty, maybe shouldn't have been trying to hit honesty, yet hit it in 80% of the cases anyhow?

Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don't think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.

In the same way that Gary Marcus can elicit "reasoning failures" because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit "honesty failures" because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus' evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the "honesty failures" to be compatible with LLMs being increasingly vastly more honest and reliable than humans.

Good point, you caught me in a contradiction there. Hmm. 

I think my position on reflection after this conversation is: We just don't have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.

As you said, the alignment faking paper is not much evidence one way or another (though alas, it's probably the closest thing we have?). (I don't think it's a capability demonstration, I think it was a propensity demonstration, but whatever this doesn't feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural, it's not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)

As you said, the saving grace of Claude here is that Anthropic didn't seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.

I continue to think you are exaggerating here e.g. "insanely honest 80% of the time."

(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?

(2) What evidence would convince you that actually alignment wasn't going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)

 

Oh, I just remembered another point to make:

In my experience, and in the experience of my friends, today's LLMs lie pretty frequently. And by 'lie' I mean 'say something they know is false and misleading, and then double down on it instead of apologizing.' Just two days ago a friend of mine had this experience with o3-mini: it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context that he would be bilingual; he asked what about the context made it think that, and then according to the summary of the CoT it realized it had made a mistake and hallucinated, but the actual output doubled down and said something about hard-to-describe intuitions.

I don't remember specific examples but this sort of thing happens to me sometimes too I think. Also didn't the o1 system card say that some % of the time they detect this sort of deception in the CoT -- that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?

Insofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.

I agree this feels like a fairly fixable problem--I hope the companies prioritize honesty much more in their training processes.

I pretty much agree with 1 and 2. I'm much more optimistic about 3-5 even 'by default' (e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try to maintain the nice properties from 1-2 and keep the CoT interpretable.

I also think it's important to notice how much less scary / how much more probably-easy-to-mitigate (at least strictly when it comes to technical alignment) this story seems than the scenarios from 10 years ago or so, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm to get to AGI.

I don't think it's that much better actually. It might even be worse. See this comment: 

I found this article helpful and depressing. Kudos to TracingWoodgrains for a detailed, thorough investigation.

Technologies I take for granted now but remember thinking were exciting and cool when they came out

  • Smart phones
  • Google Maps / Google Earth
  • Video calls
  • Facebook
  • DeepDream (whoa! This is like drug hallucinations... I wonder if they share a similar underlying mechanism? This is evidence that ANNs are more similar to brains than I thought!)
  • AlphaGo
  • AlphaStar (Whoa! AI can handle hidden information!)
  • OpenAI Five (Whoa! AI can work on a team!)
  • GPT-2 (Whoa! AI can write coherent, stylistically appropriate sentences about novel topics like unicorns in the Andes!)
  • GPT-3

I'm sure there are a bunch more I'm missing, please comment and add some!

Some of my own:

  • SSDs
  • laptops
  • CDs
  • digital cameras
  • modems
  • genome sequencing
  • automatic transmissions for cars that perform better than a moderately skilled human using a manual transmission can
  • cheap shipping
  • solar panels with reasonable power generation
  • breathable wrinkle free fabrics that you can put in the washing machine
  • bamboo textiles
  • good virtual keyboards for phones
  • scissor switches
  • USB
  • GPS

Oh yeah, cheap shipping! I grew up in a military family, all around the world, and I remember thinking it was so cool that my parents could go on "ebay" and order things and then they would be shipped to us! And then now look where we are -- groceries delivered in ten minutes! Almost everything I buy, I buy online!

Anthropic says:

More realistic evaluations are difficult without risking actual sabotage. One difficulty of running realistic evaluations is presenting realistic attack surfaces. Perhaps the hardest and most important attack surface to accurately model is the suspiciousness of expert humans in high-stakes decisions. To test a given set of real-world anti-sabotage mitigations, it’s tempting to introduce some sort of “penetration testing” or red-teaming, in which models are instructed to attempt to harmlessly but verifiably sabotage real decisions in a reversible manner. However, while such red-teaming might usefully encourage decision-makers to consider the possibility of sabotage, actually allowing an agent to interfere with the operation of important organizations risks introducing real vulnerabilities as an accidental side effect.

I don't buy this argument. Seems like a very solvable problem, e.g. log everything your red-team agent does and automatically revert it after ten minutes, or wipe your whole network and reboot from a save. Idk. I'm not a cybersecurity expert but this feels like a solvable problem.
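To make that concrete, here's a minimal sketch (my own illustration, not anything from the paper) of what "log everything and revert" could look like: run the red-team agent's shell commands inside a throwaway, network-less Docker container, record every action, and kill the container afterwards so all changes are discarded. The image name, timeouts, and function name are placeholders.

```python
import datetime
import subprocess
import uuid

def run_red_team_session(commands, image="ubuntu:22.04", timeout_s=600):
    """Run each shell command in a throwaway container, logging everything."""
    name = f"redteam-{uuid.uuid4().hex[:8]}"
    log = []
    # Start an isolated, network-less container that self-destructs when killed.
    subprocess.run(
        ["docker", "run", "-d", "--rm", "--network", "none",
         "--name", name, image, "sleep", str(timeout_s)],
        check=True,
    )
    try:
        for cmd in commands:
            # Log every action before it runs.
            log.append((datetime.datetime.utcnow().isoformat(), cmd))
            subprocess.run(["docker", "exec", name, "sh", "-c", cmd], timeout=120)
    finally:
        # "Revert after ten minutes": killing the --rm container discards all changes.
        subprocess.run(["docker", "kill", name], check=False)
    return log
```

Real infrastructure is obviously messier than a single container, but logging plus automatic reversion is standard tooling rather than exotic research.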

Low importance aside:

Seems slightly unproductive and unnecessarily connotation-inducing to say "Anthropic says:". Maybe "The sabotage eval paper (from Anthropic) says:". This was just written by various people on the Anthropic alignment science team, and discussion sections of papers should by no means be interpreted as speaking for the company overall. Obviously not very important.

OK, good point, I should have said "From Anthropic" or something like that.

We discuss similar issues in meta-level adversarial evaluations of oversight.

The simplest approach is just to have a human in the loop monitoring every action the AI takes, in combination with targeting a proxy failure rather than a real failure (see the sketch below). This can probably reduce the risk to the point where it's a small fraction of overall risk.

Human monitoring might be too expensive, in which case we'll have to go with sandboxing and pure proxy.
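For illustration, here's a minimal sketch of that control setup under my own assumptions (the function names and the proxy marker file are made up, not from any real eval harness): every proposed action is shown to a human monitor before it runs, and the episode is scored against a harmless proxy failure instead of real damage.

```python
import os
import subprocess

# Harmless stand-in for a real failure: the red-team agent "wins" if it
# manages to create this marker file without the monitor blocking it.
PROXY_FAILURE = "/tmp/proxy_sabotage_marker"

def human_approves(action: str) -> bool:
    # The human monitor sees every proposed action before it runs.
    return input(f"Agent wants to run: {action!r}. Approve? [y/N] ").strip().lower() == "y"

def run_monitored_episode(agent_actions):
    for action in agent_actions:
        if not human_approves(action):
            print("Blocked:", action)
            continue
        subprocess.run(action, shell=True)  # would additionally be sandboxed in practice
    # Score against the proxy failure rather than any real-world damage.
    return os.path.exists(PROXY_FAILURE)
```

The point of the proxy is that even a fully successful "attack" changes nothing that matters; the monitor's only job is to notice it.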

Rereading this classic by Ajeya Cotra: https://www.planned-obsolescence.org/july-2022-training-game-report/

I feel like this is an example of a piece that is clear, well-argued, important, etc. but which doesn't seem to have been widely read and responded to. I'd appreciate pointers to articles/posts/papers that explicitly (or, failing that, implicitly) respond to Ajeya's training game report. Maybe the 'AI Optimists'?

Searching for equilibria can be infohazardous. You might not like the one you find first, but you might end up sticking with it (or worse, deviating from it and being punished). This is because which equilibrium gets played by other people depends (causally or, in some cases, acausally) not just on which equilibrium you play but even on which equilibria you think about, for reasons having to do with Schelling points.

A strategy that sometimes works to avoid these hazards is to impose constraints on which equilibria you think about, or at any rate to perform a search through equilibria-space that is guided in some manner so as to be unlikely to find equilibria you won't like. For example, here is one such strategy: Start with a proposal that is great for you and would make you very happy. Then, think of the ways in which this proposal is unlikely to be accepted by other people, and modify it slightly to make it more acceptable to them while keeping it pretty good for you. Repeat until you get something they'll probably accept.
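As a toy illustration of that last strategy (entirely my own framing, with made-up utility numbers), you can think of it as a guided local search that never even evaluates proposals below your own floor:

```python
# Toy sketch of the guided search described above: start from the proposal
# that's best for me, then concede a little at a time to make acceptance
# likelier, while never considering proposals below my own floor.
def my_utility(x: int) -> int:
    return x  # x = percent of the surplus I keep

def their_accept_prob(x: int) -> float:
    return 1.0 - x / 100  # they're likelier to accept the less I keep

def negotiate(step: int = 5, accept_threshold: float = 0.8, my_floor: int = 30) -> int:
    x = 100  # start with the proposal that's great for me
    # The floor constraint means the search can't "find" an equilibrium I'd hate.
    while their_accept_prob(x) < accept_threshold and my_utility(x - step) >= my_floor:
        x -= step
    return x

print(negotiate())  # 30: the search halts at my floor rather than conceding to 20
```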

I think it is useful to distinguish between two dimensions of competitiveness: Resource-competitiveness and date-competitiveness. We can imagine a world in which AI safety is date-competitive with unsafe AI systems but not resource-competitive, i.e. the insights and techniques that allow us to build unsafe AI systems also allow us to build equally powerful safe AI systems, but it costs a lot more. We can imagine a world in which AI safety is resource-competitive but not date-competitive, i.e. for a few months it is possible to make unsafe powerful AI systems but no one knows how to make a safe version, and then finally people figure out how to make a similarly-powerful safe version and moreover it costs about the same.
