LLMs in practice these days include an increasingly large fraction of RL, which seems like it should at least make you less certain that capabilities come mostly from pretraining, and that papers from before that shift will stay relevant for very long. You do mention this on the other post, but you still wrote in the footnote that capabilities come mostly from pretraining?
I expect an optimist, or someone from the comparatively-less-pessimistic group, would argue that LLMs or LLMs+RL might lead to consequentialists that have human-like goals, due to being built from a base of human imitation, even as they move towards ASI.
And an important disagreement with those people is that you don't in fact expect LLMs+RL to work. You also don't think LLMs+RL would be safe if they did work, but the disagreement still feels relevant because it comes from pretty different models of how LLMs will improve.
Relatedly, a question you are not answering here is why you think imitation learning won't lead to discovering things far from the human distribution, since it's unclear what the limits are out of distribution. You also seem to hold a strong view that human imitation won't generalize much, to the point where it won't even be able to do human-level invention of new things. That seems to imply bad imitation of humans, since humans can do things like invent writing.
You seem to be describing a shallow kind of imitation, which seems like a nontrivial assumption that people in the LLM camp would likely disagree with?
"rocket arbitrarily far past human training data" is mixing the question of "will LLM think in ways that are very diferent from humans" with something like " can LLM only shalowly reuse reasoning from their training data and not invent new things".
Perfect human imitators would rocket arbitrarily far past human training data in the first sense, and invent writing, etc.
If you compare a human in 30000 BC to a human today, our brains are full of new information that wasn’t in the training data of 30000 BC. I want to talk about: what would it look like to be in a world where you can put millions of LLMs in a sealed box containing a VR environment, for (the equivalent of) thousands of years, and then we open up the box and find that the LLMs have made an analogous kind of scientific and technological progress? (See §1 of “Sharp Left Turn” discourse: An opinionated review.)
Spoiler: I think this is fundamentally impossible with LLMs as we know them today. Anyway, let’s explore the options.
One option is: the LLMs have super-long context windows that store textbooks for all the new fields of science and technology that were invented after we closed the box. I don’t think this would work, because LLMs (at least as we know them today) struggle with large amounts of interrelated complexity in the context window far outside the distribution of anything in the weights. A more likely option is: there’s a mechanism to update the LLM’s weights. Anyway, there has to be some selection mechanism that decides what new content is worth keeping or not. If you come across a proto-idea, do you update the weights with it or not? If you just update with “whatever seems right” or something, then I claim that errors will compound over time and the whole system goes off the rails.
Basically, my claim is that this selection mechanism (for new knowledge, plans, strategies, etc.), whatever it is, has to be grounded in consequentialism, to work perpetually inside this closed box.
And as this process proceeds, (subjective) century after (subjective) century, the influence of pretraining would get diluted away, until everything is ultimately coming from the consequentialism-grounded selection mechanism.
Anyway, we can argue about the details, but I don’t think LLM people are thinking about how to get to an end-state of this box, in which (again) you close the box, then open it much later, and find that huge amounts of open-ended intellectual progress has occurred while it was closed, analogous to what global human civilization has created over the centuries. I think that if people tried to work out what such a box might look like in detail, they would find that it either needs ruthless consequentialist agency going into it, or else creates ruthless consequentialist agency while it runs. Or perhaps they’d just agree with me that LLMs are not cut out for populating this box and never will be.
Sorry if I’m missing your point.
You seem to be describing a shallow kind of imitation
I don’t think so… I think I’m making a narrower claim: the manner in which humans (alone and collectively) do open-ended continual learning, especially over extended periods of time, does not have an analogue in LLMs. This is different from the question of whether LLMs are imitating humans “deeply” vs “shallowly” at inference time. I’m certainly not one of the people who call LLMs “stochastic parrots” etc. The thing they’re doing at inference time is IMO clearly capturing a deep (+ also wide) level of knowledge / understanding.
Oh okay, then I think some of my objections are wrong, but your post seems to fail to explain the narrower claim well? You are describing a failure of LLMs to imitate humans as if it were a problem with imitation learning itself. If you put LLMs in a box and get different results than if you put humans in a box, you are describing LLMs that are bad at human imitation; namely, they lack open-ended continual learning. That's as opposed to saying the problem is that you think continual learning on LLMs can't be done without some form of consequentialism.
In the case of very-long-context LLMs, you are even claiming LLMs wouldn't be able to imitate human behaviour within their context.
I like your box example better (we could also call it a country of geniuses in a closed datacenter). I feel like there's a lot of interesting debate to be had about what kinds of improvements to LLMs get us to them making lots of inventions in the box.
And this seems important to me, because the obvious-to-me question here is: "can you imitation-learn whatever process humans use to invent things, without being a ruthless consequentialist?"
Or, in other words, can your whole research program of how to imitate the things that make social instincts in the brain be bitter-lessoned via imitation learning on long-horizon tasks/data?
Or maybe not even long-horizon; maybe it just generalizes from short horizons + external memory. It's unclear to me whether, if you put smart and competent adult humans who cannot remember more than 1 hour into a box (and they already know how to write), they wouldn't manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
If humans managed to do this, it would work because it is grounded in the consequentialist behaviour of humans. But it wouldn't be ruthless consequentialism, because humans have social instincts.
It seems like you are implying that LLMs already have something like the human social instincts via imitation at inference time, but that you can't use them in any way to bootstrap to some continual-learning thing that's grounded in human-like consequentialism. That seems like the direction where the interesting discussion lies?
Also, to be clear, my own position is more on the side of thinking you can probably get something that could populate the box from LLMs + RL + maybe some memory-related change, but that in practice you'd likely do it by accidentally making them ruthless consequentialists, unless you really knew what you were doing or got extremely lucky.
But I want to take the side of the AI optimists here, because I feel like you haven't addressed smarter versions of their position very well?
Even if the typical AI optimist hasn't thought that far. Though, I dunno, I don't know what Anthropic's comparatively-less-pessimistic people think (and I expect there's actually a wide range of views in there), but they have to be thinking about continual learning or how LLMs will do long-horizon tasks, and if they're still skeptical of ruthless consequentialism being a thing, they'll have some reason why they expect whatever solution to not lead to that.
Right, I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning Introductory Category Theory”, but not good at imitating the delta between those, i.e. the way that Joe grows and changes over that 1 month of learning—or at least, not “good at imitating” in a way that would generalize to imitating a person learning a completely different topic that’s not in the training data.
In other words, after watching 1000 people try to learn category theory over the course of a month (while keeping diaries), I claim that an LLM would learn category theory itself, and it would learn all the common misconceptions about category theory that people make as they start learning, but it wouldn’t learn “the general process of learning and sense-making itself” in a way that allows it to then autonomously invent some field that has not been invented yet.
I had a long comment-thread argument with Cole Wyeth on this general topic last year: link. We didn’t resolve our disagreement and I eventually bowed out of the conversation, but you might find it helpful anyway. See especially my analogy to trying to imitation-learn AlphaZero improving itself through self-play.
Or maybe not even long-horizon; maybe it just generalizes from short horizons + external memory. It's unclear to me whether, if you put smart and competent adult humans who cannot remember more than 1 hour into a box (and they already know how to write), they wouldn't manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
My answer is “obviously not”. Here’s an example:
Imagine that the “competent adult humans” were all from 100 years before linear algebra existed, and we are hoping that they will invent linear algebra. Now, linear algebra involves a giant pile of interlinked concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, and on and on. Now take a parade of these “competent human adults” with no prior exposure to any of this, and give them an hour each before they get fired, but they are allowed to send notes to each other. The goal is for them to collectively invent the entire edifice of linear algebra from scratch. I think it’s doomed. If you take a person who has never seen linear algebra before, then it will just take them a lot of time (much more than an hour) to internalize all these concepts and get sufficiently familiar with them to start building on them. It doesn’t matter how good the notes are, it just takes time to develop strong and deep intuitions about a new concept. It doesn’t matter how many people there are, because zero of those people will be able to push forward the frontier in the one hour before they get fired, because it takes longer than that to internalize a new conceptual space.
(I don’t think it’s too relevant for this thread, but fun fact: there were some experiments by Ought in like 2018-2020 vaguely related to this, see e.g. this post on “relay programming”.)
I'd be curious for you to say a bit more in response to this point from above:
It seems like you are implying that LLMs already have something like the human social instincts via imitation at inference time, but that you can't use them in any way to bootstrap to some continual-learning thing that's grounded in human-like consequentialism. That seems like the direction where the interesting discussion lies?
I'm moderately optimistic about our ability to get roughly human-like consequentialism from LLM-based AGI, with character training instilling a non-ruthless non-sociopath character that is still compatible with lots of consequentialist agency, like very good scientists/inventors/entrepreneurs/etc. who never do anything that could be very dangerous (because a bunch of useful things aren't that risky, and because they have good enough moral principles to deliberately avoid or be careful around things that are risky).
I think long-horizon RL or reflection or other components of the continual learning process could break the instilled character, but it seems >50% likely to me that between the preliminary character training and ongoing training and prompting to maintain good character, those things won't dominate, and we'll just have nice AIs. (I get a bit more nervous about this argument for ASI, but I think it may well hold up even there.)
The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
OK, so far I’ve argued that if this kind of continual learning is possible at all, it would require continual weight updates to lock in the new knowledge and ideas that the LLM generates—and not just one-time small updates, but more and more updates as the process continues, asymptoting to 100% of the training data.
If you buy all that, how do you think these weight updates will work? Where do you think the “training data” for those updates will come from?
Or if you don’t buy that, how do you think the continual learning will work?
My experience is that lots of LLM-focused people say “open-ended continual learning will be solved somehow, I guess”, and not think too hard about exactly how it gets solved. And then that’s how the pea gets hidden under the thimble. Because actually, I claim, continual learning needs some kind of ground truth or else it will go off the rails, and that ground truth basically amounts to an objective function, and when the LLM continual-learns enough from that ground truth, all the niceness of pretraining gets diluted away in favor of the ruthless maximization of that objective function.
Again, maybe you have some specific idea about how LLM open-ended continual learning would work that you think won’t have this problem? If so, what is it?
(Slightly rambly comment, sorry)
I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
[Edit: I don't think this is saying anything that different than my comment above, but it is a slightly different framing.]
Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, and then we try hard to achieve those goals, and we learn how to be more effective consequentialist agents at achieving that type of subgoal. There may be clearer feedback on how well we did at the subgoal than how well we achieved our existing values, but in lots of cases we notice if there's a significant divergence between what we achieved and our underlying values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.
Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I.e., when you’re choosing your own actions in a novel domain, predictive learning doesn’t constrain your options.
(And I think “actions” are important even for disembodied situations like “figuring things out by thinking about them”, see §1.1 here.)
I think “asymptotically 100% consequentialist” is quite possibly wrong about the objectives used for open-ended CL training.
As a side-note, this whole conversation is pretty tricky because we’re talking about this vague hypothetical system (that allows one or more LLMs to autonomously invent and develop a rich new field of science via some form of continual learning), and I don’t think such a system is even possible, and you seem to think it might be possible but you haven’t spelled out all the details of how it would work. E.g. one of the problems is: there’s no training data for continual learning, because the new field of knowledge doesn’t exist yet. Relatedly, what’s the “objective”?
Anyway, we can keep trying, but this might be a tricky conversation to make progress with.
Back to the object level:
“Interspersing character training” is an interesting idea (thanks), but after thinking about it a bit, here’s why I think it won’t work in this context. BTW I’m interpreting “character training” per the four-bullet-point “pipeline” here, lmk if you meant something different.
Character training (as defined in that link) seems to rely on the idea that the tokens “I will be helpful, and honest, and harmless, blah blah…” are more likely to be followed by tokens that are in fact helpful, and honest, and harmless, blah blah, than tokens that are not prefixed by that constitution. That’s a good assumption for LLMs of today, but why? I claim: it’s because LLMs are generalizing from the human-created text of the pretraining data.
As a thought experiment: If, everywhere on the internet and in every book etc., whenever a human said “I’m gonna be honest”, they then immediately lied, then character-training with a constitution that said “I will be honest” would lead to lying rather than honesty. Right? Indeed, it would be equivalent to flipping the definition of the word “honest” in the English language. So again, this illustrates how the constitution-based character training is relying on the model basically staying close to the statistical properties of the pretraining data.
…But that means: the more that the weights drift away from their pretraining state, the less reason we have to expect this type of character training to work well, or at all.
You might respond: “OK, we’ll instead do RLAIF with a fixed “judge”, i.e. one that does not have its weights continually updated.” That indeed avoids the problem above, but introduces different problems instead. If the optimization is powerful, then we’re optimizing against a fixed judge, and we should expect the system to jailbreak the judge or similar. Alternatively, if the optimization is weak (i.e. only slightly changing the model, as in the traditional KL-divergence penalty of RLHF), then I think it will eventually stop working as the model gradually drifts so far away from niceness that slight tweaks can’t pull it back. Or something like that.
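To make those two regimes concrete, here's the standard KL-regularized RL objective (generic notation; $r_{\text{judge}}$, $\pi_{\text{ref}}$, and $\beta$ are my labels, not anything from the post):

$$\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot\mid x)}\big[\, r_{\text{judge}}(x,y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot\mid x) \,\big\|\, \pi_{\text{ref}}(\cdot\mid x)\big)$$

Small $\beta$ is the “powerful optimization” regime, where the policy is free to exploit whatever the fixed judge $r_{\text{judge}}$ gets wrong; large $\beta$ is the “weak optimization” regime, where the policy stays anchored to $\pi_{\text{ref}}$—which is exactly the anchor that drifts away under continual weight updates.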
So an explanation has to exist! What is it? I claim there are really only two answers [consequentialism and imitative learning] that work in practice.
How do we know that these are the only two answers?
Well, they're the only two methods we've got that work, and people have been thinking about this for decades. So if there's a third method, it's beyond current understanding.
Notice that evolution by natural selection is consequentialist learning on a terribly slow timescale. The consequence is successful reproduction vs death, and you learn at most one bit per lifetime.
Evolution also distinguishes between one and two progeny, so it is not binary, but yeah, just a few bits per lifetime.
This is a bit of a steelman for the optimist, or maybe a strawman for the pessimist.
Pessimist: Laser tech is getting more and more powerful. Soon someone will make an infinitely powerful laser and incinerate the whole planet.
Optimist: My laser pointer isn't infinitely powerful, and this metal cutting laser isn't infinitely powerful either.
Pessimist: That's just because they are imperfect. Look at my simplistic maths. The light should be exponentially amplified without limit.
A perfect consequentialist AI is a frictionless sphere in a vacuum, or an infinitely powerful laser. It's the sort of thing that is very simple to describe mathematically, and rather hard to make in practice.
Was Deep Blue a perfect "ruthless consequentialist"? It wouldn't sacrifice a pawn to save humanity. But it also wouldn't try to blackmail its opponent or bribe a judge.
It's just not thinking about anything off the chessboard.
I think this point generalizes somewhat. I don't think it's that hard to make an AI that's superhuman at stock trading, and that just doesn't think about anything outside the stock market.
Ruthless consequentialism works well in the limit. But evolution didn't build that. Because something more like human morality must have worked better, under conditions of bounded data, compute etc.
Of course human morality in particular depends somewhat on culture, and also on the specific details of the evolutionary environment, and probably the limits of neurons.
However, recursive self-improvement by an AI that is powerful enough to do useful AI research, and that sees ruthless consequentialism as a goal, could lead to an AI that's very close to ruthless consequentialism.
And I am not convinced that, if we do get a less consequentialist AI, it will be non-consequentialist in a way that is beneficial to us.
Thanks!
In this post I was NOT talking about consequentialism as a model of an AI agent, but rather consequentialism as the power-source ultimately underlying a family of AI techniques, a family that includes most forms of RL and model-based planning.
So “perfect” consequentialism is a red herring (I guess in agreement with your final sentence).
A model-based planning or RL agent can “make mistakes” (so to speak) while also being a ruthless sociopath. “Ruthless sociopath” here means having callous indifference to the welfare of other people, and to respecting norms, and virtues, and so on, except insofar as those things affect one’s selfish goals. My argument is that these techniques naturally give rise to “ruthless sociopaths”, not that they give rise to perfect consequentialism, which they don’t, as you point out (e.g. that’s impossible with bounded compute).
So anyway, I certainly agree that it’s possible to make AIs that are incompetent in general (e.g. my “dirt clod” example), or that have narrow competencies (e.g. AlphaGoZero, which can’t reason about blackmail). I don’t see a path from that observation to avoiding x-risk, because sooner or later people are going to create AIs that can e.g. autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, etc. I think that those AIs will kill us (unless we solve currently-unsolved problems etc.), and I don’t think that AIs that are safe thanks to their narrow competencies will help us solve that problem.
Ruthless consequentialism works well in the limit. But evolution didn't build that. Because something more like human morality must have worked better, under conditions of bounded data, compute etc.
I guess you’re suggesting that learning algorithms will converge to non-ruthlessness because that’s more effective given bounded compute, and that evolution-of-humans is evidence for this.
I interpret this example differently: I claim that evolution built a brain that fundamentally works by consequentialism-powered learning algorithms (heavily involving RL & model-based planning), but found a non-ruthless variant of that learning algorithm, in particular thanks to a specific, weird kind of RL reward function. Obviously I hope humans will likewise find ways to tap the power of RL & model-based planning, in a sufficiently big and general way to move the needle on x-risk, while still avoiding ruthlessness, and I am working on that problem myself, but I claim that inventing such reward functions is still an unsolved problem.
We can’t rely on learning algorithms to solve this problem “organically”, i.e. just through the magic of ML itself, because the reward function is part of the ML algorithm, not a thing that the ML algorithm is itself discovering. More discussion here & here.
If we don’t invent such weird non-ruthless reward functions, I strongly believe that we would still be capable of creating ASI-level capabilities (alas). We just wouldn’t be able to make such ASI aligned. As evidence for the former claim, note that (as discussed here) human sociopaths exist and can be quite intelligent and competent, and ditto people with a wide variety of social drives—autistics, introverts, extroverts, SM, etc.
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
(Big picture, I think the main place I might get off the train is in expecting future AI development to use a mix of rewards, including some from other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned. And this mirroring the role that "other humans think we're nice" played in evolution)
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Well, you can also do impressive feats via conventional programming / GOFAI too, but I don’t think you get ASI that way. What else? I dunno, but I think if there was another big-picture approach that plausibly gets to ASI, lots of people would be working on it, and I would have heard of it. Lmk if you think I’m forgetting something.
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
Normally if someone says “RL training dominates”, they mean “the amount of compute applied to RLVR is much greater than the amount of compute applied to pretraining”. That’s very different from “RLVR is so important that the impacts of pretraining are diluted away to irrelevancy”. (E.g. discussions of information efficiency by Toby Ord and by Dwarkesh.) But the latter is what would be relevant here.
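As a toy version of that distinction (a sketch with made-up round numbers, purely for illustration—none of these figures come from any actual training run):

```python
# Pretraining supervises every token; RLVR-style training extracts
# roughly ~1 bit per episode from a pass/fail verifier signal.
pretrain_tokens = 1e13      # hypothetical pretraining corpus size
bits_per_token = 1.0        # rough information content per token target
rlvr_episodes = 1e7         # hypothetical number of RL rollouts
bits_per_episode = 1.0      # ~1 bit from a binary reward

pretrain_bits = pretrain_tokens * bits_per_token  # ~1e13 bits
rlvr_bits = rlvr_episodes * bits_per_episode      # ~1e7 bits
print(f"pretraining / RLVR information ratio: {pretrain_bits / rlvr_bits:.0e}")
# ~1e+06: RLVR can dominate *compute* (long rollouts are expensive)
# while contributing a millionth of the *information* in the weights.
```

So “RLVR compute exceeds pretraining compute” and “RLVR information dilutes away pretraining” are wildly different thresholds.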
Hmm, here are three example scenarios:
“Feeling confused” is weaker than “changing my mind”, because maybe I would puzzle over it and find some way to make sense of it. But also maybe not. Probably I could make a stronger statement / prediction if I spent a bunch of time thinking about it, but hopefully this gives some sense of what I have in mind.
other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3.1.
And this mirroring the role that "other humans think we're nice" played in evolution
You’re kinda pointing to a challenge to my view. My view is: a hypothetical smart consequentialist human with a ruthless drive to have lots of grandkids will have more grandkids than a human with the normal suite of innate drives, like falling in love and so on. Proof: strategy-stealing. Whatever the latter human does, if it’s an objectively good way to have lots of grandkids, then the former human can notice that it’s a good strategy and do the same thing.
And then the challenge to that view is: …But we did actually evolve all these innate drives that make us intrinsically desire love and curiosity etc. Doesn’t that prove my strategy-stealing argument wrong?
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Anyway, I claim that neither (1) nor (2) would be applicable in your AI training scenario (as I understand it), so the strategy-stealing argument would go through, and thus the RL selection pressure should at best be neutral between ruthless scheming strategies vs intrinsically honest ones, and much more likely favor the former.
(I am however assuming no interpretability / mind-reading.)
Thanks!
Re other ways to do the feat:
Re RL: thanks for those examples. I'd have defined it via something like # of data points or amount of training signal. I agree compute is the wrong measure.
> other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3.1.
Hmm but the cognition "try to break the rules without getting caught" might in practice lead to worse reward than just trying to follow the rules, if the chances to secretly hack are sufficiently rare and the punishment for getting caught is high enough.
E.g., I think that trying to "pretend I care about my friends/partner but exploit them when I can get away with it" would MASSIVELY fail me in the long run. People are good at reading people. We're transparent to each other. There is a big cognitive overhead to maintaining two narratives. I'd end up with fewer committed/deep long-term friendships.
Relatedly, if you're in a cult, the best way to signal deep commitment is to become a true believer.
And it will, by comparison, be way easier for the overseers in the case of AI. Interp. Seeing all behaviour. Running counterfactual experiments.
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Ok interesting, we have pretty different intuitions here!
Zooming out, I recall you thinking that humans aren't sociopaths because they have some special, specific reward thing that we can't replicate, related to wanting others to approve. Whereas I see no reason to think it's some specific thing. We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked. If we apply similar selection to AI, it will probably also work - the implementation details won't need to match some specific human learning circuit.
Thanks for engaging!
Re other ways to do the feat
(partly copying from another comment:) If you compare a human in 30000 BC to a human today, our brains are full of new information that wasn’t in the training data of 30000 BC. I want to talk about: what would it look like to be in a world where you can put millions of LLMs in a sealed box containing a VR environment, for (the equivalent of) thousands of years, and then we open up the box and find that the LLMs have made an analogous kind of scientific and technological progress?
I don’t think any of those three options can get there (and neither can imitative learning). (I’m disputing capabilities, not alignment, here.)
The first one (bootstrapping) has the issue that if the serial thinking is not 100% perfect, then it will sometimes make mistakes, and then you’re SFT’ing on those mistakes, making the model more confident in those mistakes, and then the next round of serial thinking will incorporate and build on those mistakes. Repeat a billion times in a sealed box, and I think it would spiral into nonsense—it would get dumber, not smarter.
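(As a toy model of that compounding—a sketch with a made-up per-round error rate, not a claim about real training dynamics:)

```python
# Iterated self-training: each round, some small fraction of the
# SFT'd conclusions are wrong, and later rounds build on them.
def fraction_correct(rounds: int, per_round_error: float = 0.01) -> float:
    """Fraction of the knowledge base still correct after n rounds,
    if each round corrupts an independent 1% slice and errors persist."""
    correct = 1.0
    for _ in range(rounds):
        correct *= 1.0 - per_round_error
    return correct

for n in (10, 100, 1000):
    print(n, round(fraction_correct(n), 3))
# 10 -> 0.904, 100 -> 0.366, 1000 -> 0.0
# Even a 1% per-round error rate compounds toward total corruption
# over "repeat a billion times in a sealed box" timescales.
```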
…I assume that people are already trying to do this, so I guess we’ll find out one way or the other how far it gets. ¯\_(ツ)_/¯ If I’m wrong and it does get to ASI (e.g. the “sealed box” standard above), perhaps that would be good news compared to what I’m expecting … although I suppose it might spiral into misalignment too, not sure.
The second one (scaffolds) has an issue that (IIUC) you’re piling up entire new fields of knowledge into the context window, without that knowledge being present in the weights. LLMs would be very bad at that. We can see them struggle with novel complexity in the context window, even in everyday situations. And this would be much worse. For example, imagine training an LLM before linear algebra existed, and then trying to have it understand linear algebra (matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, etc.) purely by putting all that stuff in the context window. And then ask the LLM tricky questions that rely on those concepts. I really think it wouldn’t work, and that it will keep not working into the future.
The third one (“searching over…”) I don’t understand. Is there a typo? It sounds like “try to solve the technical alignment problem”, which of course I endorse. I don’t think the problem is fundamentally unsolvable; almost no one thinks that.
E.g., I think that trying to "pretend I care about my friends/partner but exploit them when I can get away with it" would MASSIVELY fail me in the long run. … There is a big cognitive overhead to maintaining two narratives. I'd end up with fewer committed/deep long-term friendships.
I claim that “maintaining two narratives” is super easy. We do it all the time when we talk about the fictional world of a TV show, and then in the next breath we talk about the actors and script. I think “maintaining two narratives” is hard in social settings because most of us are not sociopaths! I.e., yes, lying can be draining, but I claim it’s emotionally draining rather than cognitively draining.
Anyway, we’re ultimately talking about ASI here, which can develop whole new fields of knowledge etc. Surely it will be able to ask itself “what would the humans be looking for in this scenario?”, and then do that, whenever humans might be watching. I hear that even today’s LLMs do that (“eval awareness”).
“I'd end up with fewer committed/deep long-term friendships” is kinda circular. Relationships are not really “deep” if e.g. you’re indifferent to the other person and just sucking up to them. But that’s only a problem if you wanted a “deep” relationship in the first place.
People are good at reading people. We're transparent to each other. … And it will, by comparison, be way easier for the overseers in the case of AI. Interp. Seeing all behaviour. Running counterfactual experiments.
Yes, interpretability is potentially an important caveat, but I don’t think it adds up to much reason for optimism. According to my worldview: if we found a way to use interpretability to test for scheming, then we could use it, and we would definitely find scheming, because duh, that’s the natural consequence of how we will train ASI. And now what? If we delete the model and re-run the training from scratch, we’ll just get the same result. Or, if we use this interpretability signal for fine-tuning, we have the usual problem that we’re training the AI to hide its thoughts.
I’m skeptical of “counterfactual experiments” because a smart AI will be able to tell what’s the real world, see Distinguishing test from training (Nate Soares 2022), and (again) “eval awareness” in LLMs.
…We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked.…
This section (or at least this excerpt) seems to be analogizing evolution to LLM training. Whereas I think a better framework is to say that evolution designed a within-lifetime learning algorithm in the brain, and here we’re having a conversation about how that learning algorithm works. I claim that this learning algorithm is a yet-to-be-invented variant of model-based actor-critic RL, and that it has a weird reward function that (in a certain environment) leads to caring about our friends, and to pride, and to trying to fit in, etc., among many other things.
There was obviously selection pressure for that reward function in the case of humans, and we can keep arguing about why. But would there be one for AIs? I claim that this question is moot, because normal practice in RL does not involve choosing a reward function via an outer-loop blind search analogous to evolution. The reward function is almost always part of the learning algorithm, not a thing that is itself selected by learning. (More discussion here.)
Separately, I can talk about why I think kindness, norm-following, etc. are human innate social drives, as opposed to strategies developed by a more generic within-lifetime learning algorithm, if you’re skeptical of that claim. (Are you?)
Some unstated implications of this post, taking it as true for a moment:
Since humans are consequentialist-type intelligences, we should expect them to be ruthless, and we should prevent them from gaining too much power, lest they destroy everything we hold dear.
This is a very weird sentence to me. If we want to know about human behavior, we can just observe past and present humans. We shouldn’t take one fact about human brain architecture in isolation, and ignore everything else we know about human brains and human psychology and human history.
In particular, if we want to know whether “absolute power corrupts absolutely”, we should obviously start by looking at the historical record of humans with absolute power. (No opinion.)
Developing compute-intensive, imitation-learning-based AI should be considered closer to human-brain augmentation than ASI capability development
I’m not sure what this paragraph is getting at. My best guess is that you’re interested in the AI pause / stop vs AI acceleration debate, and suggesting that if LLMs are not a path to “ASI”, then that’s a reason not to pause LLM progress?
If so: (1) I generally stay out of that debate because I don’t expect it to make much difference regardless, (2) I don’t like taking sides in a generic way rather than talking about specific proposals with their own particular suites of intended and unintended consequences, (3) …but if I had to pick a side, it would be the “pause” side, because, while my opinion is in fact that LLMs are not a path to “ASI” (as I define it), OTOH (A) I don’t hold that opinion with 100% confidence, and (B) there are legitimate LLM x-risk worries even without “ASI”, and (C) there are legitimate LLM worries short of x-risk, and (D) like you said, there are various indirect ways that I’d expect the (small, indirect, marginal) effect of LLM-centric “pause” efforts to push ASI later rather than earlier, including via LLM-assisted coding & research, the relentless ramp-up of global compute, etc.
It seems like one implicit assumption is something like
In environments with lots of agents doing things, the most ruthless consequentialist agents will outperform the more prosocial and cooperative ones.
Without that assumption, we could end up in a situation where there are ruthless consequentialist AIs, most agents (both human and AI) recognize them as such and recognize that interacting with them is a bad idea, and so these ruthless consequentialist AIs backstab and lie and cheat and do a bunch of damage, but never actually accumulate enough power to seize control of the light cone from coalitions of agents that are capable of cooperating.[1]
I know you've written a lot of stuff - I don't recall seeing anything about why that assumption in particular, but maybe you've already written on the topic?
In the long term this probably still looks bad for humans, because worlds with many AI agents probably look like a second Cambrian explosion, and the Cambrian explosion was not good for those who came before.
This is an interesting topic, but no, my central expectation (and what I’m arguing for here) is that 100% of the ASIs will be ruthless consequentialists.
Couple little points on that side-track though: (1) Ruthless consequentialist AIs can still copy themselves, and cooperate with their copies, if their goals are non-indexical (which they might or might not be, no opinion off the top of my head), (2) Your comment seems to assume that AIs can read each other’s minds? If they can’t, a smart ruthless consequentialist AI would act in a cooperative and prosocial way in an environment where doing so was to its advantage. I agree that mind-reading is an important dynamic that might change the equilibrium in a multipolar AI world.
Thanks.
Likely out of scope for this comment section though. I should, at some point, probably write my modal expectation of what the next couple decades look like in more detail.
Hmm but humans are not ruthless consequentialists, despite being consequentialist enough to be able to do all kinds of tasks and build civilization. So I don't see how the Optimist's argument is addressed.
That’s this part:
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
A workable solution (to building stable non-ruthlessness within a powerful consequentialist framework like RL + model-based planning) probably exists, and I’m obviously working on it myself, and I think I’m making gradual progress, but I think the appropriate overall attitude right now is pessimism and panic about where we’re at. See “oh man, are we dropping this ball” section here & the three-part disjunction here.
(Why only “probably exists”? Because the human example is highly suggestive but not an airtight proof. For example, for all I know right now, maybe making a nice human requires a “training environment” that entails growing up with a human body, in a human community, at human speed. Doing that with AI is not really feasible in practice, for many reasons. There are other things like that too. Presumably further research will eventually either find a plan for non-ruthlessness + powerful capabilities in ASI, or a good argument that no plan exists, and I don’t currently have a very strong opinion on which one it would be.)
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any. Given it will have the whole of human history as training data, one of the lessons it will have absorbed is that ruthless prioritisation of a single goal tends to provoke counter-coalitions. The smart thing to do is manage within an ecosystem of other AIs and humans, not maximise against them (which is a fraught and unstable pattern).
Thanks!
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any.
For one thing, my headline claim is “ruthless sociopath”, not “maximizing”. “Ruthless sociopath” is pointing to something that’s missing (intrinsic concern for the welfare of other people), not something that’s present (behaviors that maximize something in the world).
For another thing, strictly speaking, perfect maximization is impossible without omniscience.
For another thing, if a powerful ASI cares about increasing staples, and also paperclips, and also any number of other office supplies, that doesn’t help us; it will still wipe out humanity and create a future devoid of value. Indeed, even maximizers can “care” about multiple things. E.g. if a utility-maximizer has utility function U = log(log(staples)) + log(log(paperclips)) then it will stably split its time between staple and paperclip production forever. [I put in the “log log” to ensure strongly diminishing returns, enough to overcome any economies of scale.]
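(A quick numerical check of that “stably split” claim—a minimal sketch using exactly the utility function above:)

```python
import math

def utility(staples: float, paperclips: float) -> float:
    # U = log(log(staples)) + log(log(paperclips)), defined for inputs > 1
    return math.log(math.log(staples)) + math.log(math.log(paperclips))

budget = 1_000_000.0
split = utility(budget / 2, budget / 2)   # divide resources evenly
all_in = utility(budget - 2.0, 2.0)       # go (nearly) all-in on staples
print(split > all_in)  # True: double-log diminishing returns favor
                       # splitting between both goods, forever
```

The point being: caring about multiple things changes *what* the maximizer pursues, not whether it displaces everything else.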
Given it will have the whole of human history as training data, one of the lessons it will have absorbed is that ruthless prioritisation of a single goal tends to provoke counter-coalitions. The smart thing to do is manage within an ecosystem of other AIs and humans, not maximise against them (which is a fraught and unstable pattern).
I agree that a ruthless sociopath agent, one which has callous indifference to whether you or anyone else lives or dies, will nevertheless act kind to you, when acting kind to you is in its self-interest. And then if the situation changes, such that acting kind to you stops being in its self-interest, then it will not hesitate to stab you in the back (betray you, murder you, blackmail you, whatever). And even before that, it will be constantly entertaining the idea of stabbing you in the back, and then deciding that this idea is (currently) inadvisable, and thus continuing to act kindly towards you.
Hopefully we can agree that this is not a description of normal human relations.
…But even if this is not normal human relations, one could argue that it’s fine, because we can still build a good healthy civilization out of AIs that all have this kind of disposition. And indeed, there are people who make that argument. But I strongly disagree. I was writing about this topic recently, see §5 of my post “6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa”: “The human intuition that societal norms and institutions are mostly stably self-enforcing”.
This approach ignores the fact that if we use advanced LLMs to make new-paradigm advances that yield extremely effective RL sociopaths, then at that point we'll have the help of the relatively harmless but still very powerful LLMs to do safety work on the RL agents — a major help with mitigating autonomy risks! Of course, there's always the risk that new RL architecture discoveries create economic incentives to scale the scary RL agents without sufficient safety work, but the prospect of using HHH AI to align scary AI is weirdly under-explored when talking about that exact advanced-LLM + advanced-RL-learner world.
For one thing, my actual expectation is that LLMs will be a helpful research tool for the human discovering the next AI paradigm, rather than the LLMs discovering the next AI paradigm themselves (see Foom & Doom §1.4.4).
For another thing, even if I’m wrong about that, note that we have “very powerful” humans “to do safety work on RL agents” right now, but it turns out that those humans are overwhelmingly uninterested in doing so. So instead there’s maybe 1000× more money and effort going into figuring out how to make RL agents more powerful rather than how to make them safe. (See We need a field of Reward Function Design.) I don’t see any reason to expect this situation to change if it’s LLMs doing the research instead of humans.
That said, if people have ideas about how to make a near-future world full of LLMs a wiser world than the world of today, then great, I endorse that goal and wish them luck :)
The conversation begins
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and ask yourself where your life went horribly wrong.
Me: Hmm, I think the “true core nature of intelligence” is above my pay grade. We should probably just talk about the issue at hand, namely future AI algorithms and their properties.
…But I actually agree with you that ruthless sociopathy is a very specific and strange thing for me to expect.
Optimist: Wait, you—what??
Me: Yes! Like, if you show me some random thing, there’s a 99.999…% chance that it’s not a ruthless sociopath. Instead it might be, like, a dirt clod. Dirt clods are not ruthless sociopaths, because they’re not intelligent at all.
Optimist: Oh c’mon, you know what I mean. I’m not talking about dirt clods. I’m saying, if you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
Optimist: Like, a human. Or an AI.
Me: Different humans are different to some extent, and different AI algorithms are different to a much, much greater extent. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Well, I mean, it does seem rather maniacally obsessed with graph traversal! Right?
Optimist: Haha, very funny. Please stop being annoyingly pedantic. I obviously didn’t mean “AI” in the sense of the academic discipline. I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m mainly talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen in the real world, rather than in sci-fi. And hey, what a coincidence, ≈100% of those minds are not ruthless sociopaths.
Me: As it happens, the threat model I’m working on is not LLMs, but rather “brain-like” Artificial General Intelligence (AGI), which (from a safety perspective) is more-or-less a type of actor-critic model-based reinforcement learning (RL) agent. LLMs are profoundly different from what I’m working on. Saying that LLMs will be similar to RL-agent AGI because “both are AI” is like saying that LLMs will be similar to the A* search algorithm because “both are AI”, or that a frogfish will be similar to a human because “both are animals”. They can still be wildly different in every way that matters.
Are people worried about LLMs causing doom?
Optimist: OK, but lots of other doomers talk about LLMs causing doom.
Me: Well, kinda. I think we need to tease apart two groups of people. Both are sometimes called “doomers”, but one is much more pessimistic than the other. This is very caricatured, but:
Anyway, the extra heap of concern in that latter camp is not from the LLMs of today causing near-certain doom, or even the somewhat-better LLMs of tomorrow, but rather the wildly better ASIs of … maybe soon, maybe not, who knows. But even if it’s close in calendar time, and even if it comes out of LLM research, such an ASI would still be systematically different from LLMs as we know them today—
Optimist: —a.k.a., you have no evidence—
Me: —no evidence either way, at least no evidence of that type. Anyway, as I was saying, ASI would be systematically different from today’s LLMs because … umm, where do I start …
…Actually, it would be much easier for me to explain if we start with the ASI threat model that I spend all my time on, and then we can circle back to LLMs afterwards. Is that OK?
Positive argument that “brain-like” RL-agent ASI would be a ruthless sociopath
Optimist: Sure. We can pause the discussion of LLMs for a few minutes, and start in your comfort zone of actor-critic model-based RL-agent “brain-like” ASI. Doesn’t really matter anyway: regardless of the exact algorithm, you clearly need some positive reason to believe that this kind of ASI would be a ruthless sociopath. You can’t just unilaterally declare that your weird unprecedented sci-fi belief is the “default”, and push the burden of proof onto people who disagree with you.
Me: OK. Maybe a good starting point would be my posts LeCun’s ‘A Path Towards Autonomous Machine Intelligence’ has an unsolved technical alignment problem, or ‘The Era of Experience’ has an unsolved technical alignment problem.
Optimist: I’ve read those, but I’m not seeing how they answer my question. Again, what’s your positive argument for ruthless sociopathy? Lay it on me.
Me: Sure. Back at the start of the conversation, I mentioned that random objects like dirt clods are not able to accomplish impressive feats. I didn’t (just) bring up dirt clods to troll you, rather I was laying the groundwork for a key point: If we’re thinking about AI that can autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive feats. And the question we should be asking is: how does it do that? Those things would be astronomically unlikely to happen if the AI were choosing actions at random. So there has to be some explanation for how the AI finds actions that accomplish those impressive feats.[2]
So an explanation has to exist! What is it? I claim there are really only two answers that work in practice.
The first possible explanation is consequentialism: the AI accomplishes impressive feats by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents, and from model-based planning algorithms. (My “brain-like AGI” scenario would involve both of those at once.) The whole point of those subfields of AI is: these are algorithms designed to find actions that maximize an objective, by any means available.
I.e., you get ruthless sociopathic behavior by default.
And this is not just my armchair theorizing. Go find someone who was in AI in the 2010s or earlier, before LLMs took over, and they may well have spent a lot of time building or using RL agents and/or model-based planning algorithms. If so, they’ll tell you, based on their lived experience, that these kinds of algorithms are ruthless by default (when they work at all), unless the programmers go out of their way to make them non-ruthless. See e.g. this 2020 DeepMind blog post on “specification gaming”.
And how would the programmers “go out of their way to make them non-ruthless”? I claim that the answer is not obvious, indeed not even known. See my LeCun post, and my Silver & Sutton post, and more generally my post “‘Behaviorist’ RL reward functions lead to scheming” for why obvious approaches to non-ruthlessness won’t work.
Rather, algorithms in this class are naturally, umm, let’s call them, “ruthless-ifiers”, in the sense that they transmute even innocuous-sounding objectives like “it’s good if the human is happy” into scary-sounding ones like “ruthlessly maximize the probability that the human is happy”, which in turn suggest strategies such as forcibly drugging the human. Likewise, the innocuous-sounding “it’s bad to lie” gets ruthless-ified into “it’s bad to get caught lying”, and so on.
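(As a toy illustration of why I say “by default”, here is the skeleton that essentially all model-based planning shares—a hypothetical sketch, not anyone's actual AGI design:)

```python
def plan(world_model, initial_state, candidate_plans, objective):
    """Generic model-based planning: pick whichever plan the world-model
    predicts will best satisfy the objective, by any means available."""
    def score(plan):
        state = initial_state
        for action in plan:
            state = world_model.predict(state, action)  # imagined rollout
        return objective(state)  # e.g. "probability the human is happy"
    # Nothing here cares HOW a plan scores well: if "forcibly drug the
    # human" scores highest under the objective, it gets chosen.
    return max(candidate_plans, key=score)
```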
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
To be clear, ruthless consequentialism isn’t always bad. I’m happy for ruthless consequentialist AIs to be playing chess, designing chips, etc. In principle, I’d even be happy for a ruthless consequentialist AI to be emperor of the universe, creating an awesome future for all—but making that actually happen would be super dangerous for lots of reasons (e.g. you might need to operationalize “creating an awesome future for all” in a loophole-free way; see also “‘The usual agent debugging loop’, and its future catastrophic breakdown”).
…So that’s consequentialism, one possible answer for how an AI might accomplish impressive feats, and it’s an answer that brings in ruthlessness by default.
Circling back to LLMs: imitative learning vs ASI
…And then there’s a second, different possible answer to how an AI might accomplish impressive feats: imitative learning from humans. You train an AI to predict what actions a skilled human would take in many different contexts, and then have the AI take that same action itself. I claim that LLMs get their impressive capabilities almost entirely from imitative learning.[3] By contrast, “true” imitative learning is entirely absent (and impossible) in humans and animals.[4]
Imitative-learning AIs do not have ruthless sociopathy by default, because of course the thing they’re imitating is non-ruthless humans.[5]
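(And to be concrete about the contrast, here's the skeleton of imitative learning—generic behavior cloning / next-token prediction, a sketch rather than any lab's actual recipe:)

```python
import torch.nn.functional as F

def imitation_step(model, optimizer, contexts, human_actions):
    """One step of imitative learning: nudge the model's action
    distribution toward whatever the skilled human actually did.
    The training signal is agreement with the human—there is no
    term anywhere for consequences out in the world."""
    logits = model(contexts)                       # predicted action logits
    loss = F.cross_entropy(logits, human_actions)  # match the human's choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```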
Optimist: Huh … Wait … So you’re an optimist about superintelligence (ASI) being non-ruthless, as long as people stick to LLMs?
Me: Alas, no. I think that the full power of consequentialism is super dangerous by default, and I think that the full power of consequentialism is the only way to get ASI, and so AI researchers are going to keep working until they eventually learn to fully tap that power.
In other words, I see a disjunction:
For what it’s worth, I happen to expect that ASI will come from the former (future paradigm shift) rather than the latter (LLM modifications). But it hardly matters in this context.
Optimist: I dunno, if you’re willing to concede that LLMs today are not maximally ruthless, well, LLMs today don’t seem that far from superintelligence. I mean, humans don’t “rocket arbitrarily far past any training data” either. They usually do things that have been done before, or at most (for world experts on the bleeding edge) go just one little step beyond it. LLMs can do both, right?
Me: Yes, but humans collectively and over time can get way, way, way beyond our training data. We’re still using the same brain design that we were using in Pleistocene Africa. Between then and now, there were no angels who dropped training data from the heavens, but humans nevertheless invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch. We did it all by ourselves, by our own bootstraps, ultimately via the power of consequentialism, as implemented in the RL and model-based planning algorithms in our brains.
(See “Sharp Left Turn” discourse: An opinionated review.)
By the same token, if humanity survives another 1000 years, we will invent wildly new scientific paradigms, build wildly new industries and ways of thinking, etc.
There’s a quadrillion-dollar market for AIs that can likewise do that kind of thing, as humans can. If the LLMs of today don’t pass that bar (and they don’t), then I expect that, sooner or later, either someone will figure out how to get LLMs to pass that bar, or else someone will invent a new non-LLM AI paradigm that passes that bar. Either way, imitative learning is out, consequentialism is in, and we get ruthless sociopath ASIs by default, in the absence of yet-to-be-invented theoretical advances in technical alignment. (And then everyone dies.)
Thanks Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.
Changelog: 2026-02-23: Added another reference to footnote 3.
We should definitely also be thinking about these other potential problems, don’t get me wrong!
Related: the so-called “Follow-the-Improbability Game”.
Details: “imitative learning” describes LLM pretraining, but not posttraining; my claim is that LLM capabilities come almost entirely from the former, not the latter. That’s not obvious, but I argue for it in “Foom & Doom” §2.3.3, and see also a couple papers downplaying the role of RLVR (Karan & Du 2025, Venhoff et al. 2025, Yue et al. 2025), along with “Most Algorithmic Progress is Data Progress” by Beren Millidge.
E.g. if my brain is predicting what someone else will say, that’s related to auditory inputs, and if my brain is speaking, that involves motor-control commands going to my larynx etc. There is no straightforward mechanical translation from one to the other, analogous to the straightforward mechanical translation from “predict the next token” to “output the next token” in LLM pretraining. More in “Foom & Doom” §2.3.2.
See GPTs are Predictors, not Imitators for an even-more-pessimistic-than-me counterargument, and “Foom & Doom” §2.3.3 for why I don’t buy that counterargument.