Here are two specific objections to this post[1]:
Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap, would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.
In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.
However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.
IMO, there isn't anything which strongly rules out LLM agents being overall quite powerful while still having weak forward passes. In particular, weak enough that they can't do non-trivial consequentialist reasoning in a forward pass (while still being able to do this reasoning in natural language). Assuming that we can also rule out steganography and similar concerns, then The Translucent Thoughts Hypotheses would fully apply. In the world where AIs basically can't do invisible non-trivial consequentialist reasoning, most misalignment threat models don't apply. (Scheming/deceptive alignment and cleverly playing the training game both don't apply.)
I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
I edited this comment from "I have two main objections to this post:" because that doesn't quite seem like a good description of what this comment is saying. See this other comment for more meta-level commentary. ↩︎
I think those objections are important to mention and discuss, but they don't undermine the conclusion significantly.
AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP's argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don't think it undermines the OP's points though? We are not currently on a path to have robust faithful CoT properties by default.
This post seemed overconfident in a number of places, so I was quickly pushing back in those places.
I also think the conclusion of "Nearly No Data" is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong, and the best way to obtain data might look more like running careful model organism experiments than observing properties of ChatGPT). But it didn't seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.
If this post argued "the fact that current chat bots trained normally don't seem to exhibit catastrophic misalignment isn't much evidence about catastrophic misalignment in more powerful systems", then I wouldn't think this was overstated (though this also wouldn't be very original). But, it makes stronger claims which seem false to me.
Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.
My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".
I agree that the current training setups shed some light on how e. g. optimization pressures / reinforcement schedules / SGD biases work, and I even think the shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.
On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.
I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.
That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we'll get to AGI by architectural advances before we have enough compute to make a CoT'd LLM take off.
But that's a hunch based on the obvious stuff like AutoGPT consistently failing plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won't share, for obvious reasons). I won't be totally flabbergasted if some particularly clever way of doing that worked.
On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.
I actually basically agree with this quote.
Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall LLM agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.
It seems to me that you have very high confidence in being able to predict the "eventual" architecture / internal composition of AGI. I don't know where that apparent confidence is coming from.
The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
I would instead say:
The canonical views dreamed up systems which don't exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying "does current evidence apply to 'superintelligences'?", I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!".
The views might have, for example, fundamentally misunderstood how cognition and motivation work (anyone remember worrying about 'but how do I get an AI to rescue my mom from a burning building, without specifying my whole set of values'?).
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).
I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what properties do they have?" is just the wrong question to focus on. The shared "AI" label is a red herring. The relevant question is "what are scarily powerful generally-intelligent systems that exist, and what properties do they have?", and the only relevant data point seems to be humans.
And as far as omnicide risk is concerned, the question shouldn't be "how can you prove these systems will have the threatening property X, like humans do?" but "how can you prove these systems won't have the threatening property X, like humans do?".
I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it's really clear why we're using one frame. This is a huge peril of reasoning by analogy.
Whenever attempting to draw conclusions by analogy, it's important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there's a shared mechanism of "local updating via self-supervised and TD learning on ~randomly initialized neural networks" which leads to things like "contextually activated heuristics" (or "shards").
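The spring analogy can be made concrete with a minimal sketch (my illustration, not part of the original comment; the unit mass, spring constant, and `simulate_spring` helper are all assumed for the example). Any system governed by the same equation x'' = -(k/m)x can stand in for the physical spring, which is exactly why the shared differential equation is the "shared causal mechanism" doing the work in the analogy:

```python
import math

# Minimal sketch of the spring analogy (unit mass and spring constant
# are my assumptions). A mass on a spring obeys x'' = -(k/m) * x; any
# other system governed by the same ODE -- an analog computer's
# integrator circuit, or this loop -- can "simulate" the spring,
# because the shared causal mechanism is the equation itself.

def simulate_spring(x0=1.0, v0=0.0, k=1.0, m=1.0, dt=1e-4, t_end=1.0):
    """Integrate x'' = -(k/m) x with semi-implicit Euler steps."""
    x, v = x0, v0
    for _ in range(int(t_end / dt)):
        v += -(k / m) * x * dt  # acceleration updates velocity
        x += v * dt             # velocity updates position
    return x

# With k = m = 1, x(0) = 1, v(0) = 0, the exact solution is x(t) = cos(t),
# so the numerical trajectory should closely track math.cos(t_end).
x_numeric = simulate_spring(t_end=1.0)
```

The analogy only transfers conclusions because the governing equations match; absent such a shared mechanism, the simulation tells you nothing about the spring.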
Here, it isn't clear what the shared mechanism is supposed to be, such that both (future) AI and humans have it. Suppose I grant that if a system is "smart" and has "goals", then bad things can happen. Let's call that the "bad agency" hypothesis.
But how do we know that future AI will have the relevant cognitive structures for "bad agency" to be satisfied? How do we know that the AI will have internal goal representations which chain into each other across contexts, so that the AI reliably pursues one or more goals over time? How do we know that the mechanisms are similar enough for the human->AI analogy to provide meaningful evidence on this particular question?
I expect there to be "bad agency" systems eventually, but it really matters what kind we're talking about. If you're thinking of "secret deceptive alignment that never externalizes in the chain-of-thought" and I'm thinking about "scaffolded models prompted to be agentic and bad", then our interventions will be wildly different.
Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion
Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.
Here's how the whole situation looks from my perspective:
Yes, "prove that this technological advance isn't going to kill us all or you're not allowed to do it" is a ridiculous standard to apply in the general case. But in this one case, there's a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.
- And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".
No, I am in fact quite worried about the situation and think there is a 5-15% chance of huge catastrophe on the current course! But I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures. I think that's important. I think it's important that we not eg anchor on old speculation about AIXI or within-forward-pass deceptive-alignment or whatever, and instead consider more realistic threat models and where we can intervene. That doesn't mean it's fine and dandy to keep scaling with no concern at all.
The reason my percentage is "only 5 to 15" is because I expect society and firms to deal with these problems as they come up, and for that to generalize pretty well to the next step of experimentation and capabilities advancements; for systems to remain tools until invoked into agents; etc.
(Hopefully this comment of mine clarifies; it feels kinda vague to me.)
- What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".
But I do think this is way too high of a bar.
No, I am in fact quite worried about the situation
Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.
I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures
Would you outline your full argument for this and the reasoning/evidence backing that argument?
To restate: My claim is that, no matter how much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both sides of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanistic theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).
Would you disagree? If yes, how so?
"under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?"
Under the conditions of relevant concepts and the future being confusing. Using real systems (both AIs and humans) to anchor theory is valuable, but so is blue sky theory that doesn't care about currently available systems and investigates whatever hasn't been investigated yet and seems to make sense, when there are ideas to formulate or problems to solve, regardless of their connection to reality. A lot of math doesn't care about applications, and it might take decades to stumble on some use for a small fraction of it (even as it's not usually the point).
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:
the inherent 'alien-ness' of AI and AI values
supposed magical coordination advantages of AIs
arguments from analogies: namely evolution
These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory of how the brain works.
For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downstream of EY's and the specific positions he landed on with respect to key scientific questions around the brain and AI. His understanding of the brain came almost entirely from ev psych and cognitive biases literature and this model in particular - evolved modularity - hasn't aged well and is just basically wrong. So this is entangled with everything related to AI risk (which is entirely about the trajectory of AI takeoff relative to human capability).
It's not a coincidence that many in DL/neurosci have a very different view (shards etc). In particular, the Moravec view that AI will come from reverse engineering the brain, and that progress is entirely hardware-constrained and thus very smooth and predictable, turned out to be mostly correct. (His late-90's prediction of AGI around 2028 is especially prescient.)
So it's pretty clear EY/LW was wrong on 1. - the trajectory of takeoff and path to AGI, and Moravec et al was correct.
Now as the underlying reasons are entangled, Moravec et al was also correct on point 2 - AI from brain reverse engineering is not alien! (But really that argument was just weak regardless.) EY did not seriously consider that the path to AGI would involve training massive neural networks to literally replicate human thoughts.
Point 3 isn't really taken seriously outside of the small LW sphere. By the very nature of alignment being a narrow target, any two random unaligned AIs are especially unlikely to be aligned with each other. The idea of a magical coordination advantage is based on highly implausible code-sharing premises (sharing your source code is generally a very bad idea, and regardless doesn't and can't actually prove that the code you shared is the code actually running in the world - the grounding problem is formidable and unsolved).
The problem with 4 - the analogy from evolution - is that it factually contradicts the doom worldview: evolution succeeded in aligning brains to IGF well enough despite a huge takeoff in the speed of cultural evolution over genetic evolution, as evidenced by the fact that humans have one of the highest fitness scores of any species ever, and almost certainly the fastest-growing fitness score.
You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take
Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i. e., context-activated heuristics) reign supreme and value reflection can't arbitrarily slip their leash, that "general intelligence" isn't a compact algorithm, and so on. None of that relies on nanobots/Moore's law/etc.
What you've outlined might or might not be the relevant historical reasons for how Eliezer/the LW community arrived at some of their takes. But the takes themselves, or at least the subset of them that I care about, are independent of these historical reasons.
fast takeoff is more likely than slow
Fast takeoff isn't load-bearing on my model. I think it's plausible for several reasons, but I think a non-self-improving human-genius-level AGI would probably be enough to kill off humanity.
the inherent 'alien-ness' of AI and AI values
I do address that? The values of two distant human cultures are already alien enough for them to see each other as inhuman and wish death on each other. It's only after centuries of memetic mutation that we've managed to figure out how to not do that (as much).
supposed magical coordination advantages of AIs
I don't think one needs to bring LDT/code-sharing stuff there in order to show intuitively how that'd totally work. "Powerful entities oppose each other yet nevertheless manage to coordinate to exploit the downtrodden masses" is a thing that happens all the time in the real world. Political/corporate conspiracies, etc.
"Highly implausible code-sharing premises" is part of why it'd be possible in the AGI case, but it's just an instance of the overarching reason. Which is mainly about the more powerful systems being able to communicate with each other at higher bandwidth than with the weaker systems, allowing them to iterate on negotiations quicker and advocate for their interests during said negotiations better. Which effectively selects the negotiated outcomes for satisfying the preferences of powerful entities while effectively cutting out the weaker ones.
(Or something along these lines. That's an off-the-top-of-my-head take; I haven't thought on this topic much because multipolar scenarios isn't something that's central to my model. But it seems right.)
arguments from analogies: namely evolution
Yeah, we've discussed that some recently, and found points of disagreement. I should flesh out my view on how it's applicable vs. not applicable later on, and make a separate post about that.
Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories
They are critically relevant. From your own linked post (how I delineate):
We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.
If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else - and there is no sharp discontinuity if Moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).
We need to align the AGI's values precisely right.
Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.
You seem to be - like EY circa 2009 - in what I would call the EMH brain camp, as opposed to the ULM camp. It seems, given the following two statements, that you would put more weight on B than A:
A. The unique intellectual capabilities of humans are best explained by culture: our linguistically acquired mental programs, the evolution of which required vast synaptic capacity and thus is a natural emergent consequence of scaling.
B. The unique intellectual capabilities of humans are best explained by a unique architectural advance via genetic adaptations: a novel 'core of generality'[1] that differentiates the human brain from animal brains.
This is a EY term; and if I recall correctly he still uses it fairly recently. ↩︎
If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if Moore's law also ends around the time of AGI (which brain efficiency predicts in advance).
My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.
Brains are very different architectures compared to our computers, in any case, they implement computations in very different ways. They could be maximally efficient relative to their architectures, but so what? It's not at all obvious that FLOPS estimates of brainpower are highly relevant to predicting when our models would hit AGI, any more than the brain's wattage is relevant.
They're only soundly relevant if you're taking the hard "only compute matters, algorithms don't" position, which I reject.
It seems given the following two statements, you would put more weight on B than A:
I think both are load-bearing, in a fairly obvious manner, and that which specific mixture is responsible matters comparatively little.
So, yes, there would be no sharp left turn caused by the AIs gradually bootstrapping a culture, because we're already feeding them the data needed for that.
But that only means the sharp left turn caused by the architectural-advance part – the part we didn't yet hit upon, the part that's beyond LLMs, the "agency overhang" – would be that much sharper. The AGI, once we hit on an architecture that'd accommodate its cognition, would be able to skip the hundreds of years of cultural evolution.
Edit:
You seem to be - like EY circa 2009 - in what I would call the EMH brain camp
Nope. I'd read e. g. Steve Byrnes' sequence, I agree that most of the brain's algorithms are learned from scratch.
"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.
The right question is: what level of dangerous consequentialist goals is needed for systems to reach certain capability levels, e.g. novel science? It could have been that, to be as useful as LLMs, systems would have to be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals, e.g. offer them treats. However, we can control LLMs way better than we can animals, and the concerns are about goal misgeneralization, misspecification, robustness, etc., rather than about whether we can affect the system's goals at all.
It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality reasons rather than instrumental ones, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.
Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality reasons rather than instrumental ones, and not eat your body if you die alone
Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion.
If someone from a pre-industrial tribe adopts a stray puppy from a nearby technological civilization, and the puppy grows up to be loyal to the tribe, you say that's evidence the technological civilization isn't planning to genocide the tribe for sitting on some resources it wants to extract?
That seems, in fact, like the precise situation in which my post's arguments apply most strongly. Just because two systems are in the same reference class ("AIs", "alien life", "things that live in that scary city over there"), doesn't mean aligning one tells you anything about aligning the other.
Some thoughts:
But probably these will depend on current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.
Mm, we disagree on that, but it's probably not the place to hash this out.
In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
Uncertainty lives in the mind. Let's say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren't quite sure what the hell they're looking at when they look at them. They snatch up the puppy, which we'll say is also a cyborg, so it's not obvious to the tribe that it's not a member of the city's ruling class. They raise the puppy, the puppy loves them, they conclude the adults of the city's ruling class must likewise not be that bad. In the meantime, the city's dictator is already ordering the region cleared of its native population.
How does that analogy break down, in your view?
I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.
Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.
The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
This is an optimistic take. If we could be rightfully confident that our random search through mindspace with modern ML methods can never produce "scary agents", a lot of our concerns would go away. I don't think that it's remotely the case.
The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting.
Strong disagree. We have only started tapping into the power of LLMs. We've made a machine capable of producing one thought at a time. It can already write decent essays, which is already a superhuman ability, because humans require multiple thoughts, organized into a thinking process, to do that.
Imagine what happens when AutoGPT stops being a toy and people start pouring billions of dollars into proper scaffolding and specialized LLMs, which could be organized into a cognitive architecture in a similar reference class as humans. Then you will have your planning and consequentialist reasoning. And for these kinds of systems, the transparency and alignability of LLMs is going to be extremely relevant.
If we could be rightfully confident that our random search through mindspace with modern ML methods
I understand this to connote "ML is ~uninformatively-randomly-over-mindspace sampling 'minds' with certain properties (like low loss on training)." If so—this is not how ML works, not even in an approximate sense. If this is genuinely your view, it might be helpful to first ponder why statistical learning theory mispredicted that overparameterized networks can't generalize.
Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take).
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.
I believe this is invalid, and that the two views are much more compatible than might seem. I think the issue lies in the mismatch between their subject matters.
It's clearer if you taboo the word "AI":
It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck-in when people extrapolate findings from LLM studies to superintelligences.
It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.
By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]
What the Fuss Is All About
To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.
Consider humans. Some facts:
Humans are able to powerfully steer the world towards goals they have in mind.
Some humans are strictly better at this than others.
Humans are often in vicious conflict with each other, even over very small differences in their goals.
So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based on even very tiny differences in their goals.
The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.
Systems like this, systems the possibility of whose existence is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.
The only systems in this reference class known to us are humans, and some human collectives.
Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.
So What About Current AIs?
Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.
Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.
The novel arguments often center on arguing that there's no evidence that LLMs have such homunculi, and that their training loops could never lead to the formation of such homunculi.
And I agree! I think those arguments are right.
But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.
Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?
And I do dispute that implicit claim.
I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outward performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?
Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.
Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.
Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.
The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.
On Safety Guarantees
That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its arguments apply. But we're not on-track to build them, so who cares?
On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.
I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreck economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.
Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.
Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.
So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.
And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.
And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.
A Concrete Scenario
Here's a very specific worry of mine.
Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.
The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.
Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties on which their previous reassuring safety guarantees rested; from which they were derived and on which they logically depend.
They fail to update the cached thought of "AI is safe".
And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.
That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi, they have convinced themselves those can't exist, so they're not worrying about that. They're not ready to deal with that, they don't even have any interpretability tools pointed in that direction.
The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)
And then everyone dies.
The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile, they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".
Closing Summary
I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).
That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.
Indeed, I go further, and say that this should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.
That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from belonging to the proven-safe class to the dangerous scheme-y class.
The class to which the classical AGI Omnicide Risk arguments apply full-force.
The class for which no known alignment technique suffices.
And that switch would be very easy, yet very lethal, to miss.
Slightly edited for clarity after an exchange with Ryan.