(Note: This was languishing in a drafts folder for a while, and probably isn't quite right in various ways. I'm posting it because I expect it's better to share flawed thoughts than to sit on the post until I'm satisfied with it, i.e., forever.)

 

Let's play a game of "what do you think you know, and why do you think you know it?".

Imagine that you're about to launch an AGI. What you think you know is that, with at least 50% confidence (we’re of course not looking for proofs — that would be crazy), the AGI is going to execute some pivotal act that ends the acute risk period in a good way. Why do you think you know that?

Insofar as people’s alignment proposals can be construed as answers to this question, we have the option of answering with one of these proposals. I might very roughly classify the existing proposals into the following bins:

 

1.  Output evaluation approaches. You know what the AGI is going to do with sufficient precision that it screens off any alignment concerns. For example, your AGI system only outputs plans in the first place, and you've already reviewed the plan, and you're confident the plan will work, in a way that screens off any other worry about the AGI being misaligned.

2.  Cognitive interpretability approaches. You understand the AGI's cognition sufficiently well that, while you may not be sure what it's going to do, you're confident that it's going to be good. You aren't worried that it will kill all humans, because you understand how its plan came to be and what solution-spaces it was searching to solve various sub-problems and so on, and you're confident no consideration was ever given to human-killing.

3.  Heavy-precedent approaches. You have run this AGI before on many similar tasks, and trained out all the hiccups. While you might not know precisely what it's going to do, and you might not know what's going on inside its mind, you've been around the block a few times, and the task it's about to perform is sufficiently similar to other tasks it has empirically succeeded at, justifying your confidence.

 

Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can't evaluate the effectiveness of a plan capable of ending the acute risk period, and because the real plan is less like a story and more like a tree.

For an example of “humans can't reliably evaluate the effectiveness of this class of plans”, imagine that the plan is an enormous bitstring that's going to be sent to the motor outputs. If you decode the string, you find that it figures out how to make long DNA strands that allegedly code for a protein factory that can be used to build a general-purpose nanofactory. You're hard-pressed, however, to confirm that this is actually (all and only) what the plan does.

For an example of “the real plan is less like a story and more like a tree”, imagine that the AI's plan is "I'm going to build a wetlab, then do a bunch of experimentation, then think about the results of the experiments in various ways and build a protein factory that builds a nanofactory that I'm going to experiment with until I figure out how to build nanomachines that can be used for some good pivotal act". In order to trust that this sort of abstract plan doesn't kill you when put into practice, you have to trust the system's thinking and its notion of 'goodness', which is going to dump you pretty quickly into cognitive-interpretability-style justification.

Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm, because we're not building minds but rather training minds, and we have very little grasp of their internal thinking, and there are convergent instrumental reasons to expect things to go wrong by default, and the social environment doesn't seem to me to be fighting against those defaults with anything nearing the force I expect is necessary.

Roughly speaking, I think that heavy-precedent approaches are doomed because I haven't myself been able to think of any pivotal action that has safe analogs we can do a bunch of empiricism on; nor have I heard a concrete proposal like this that strikes me as realistic from anyone else. "Well, it never killed all humans in the toy environments we trained it in (at least, not after the first few sandboxed incidents, after which we figured out how to train blatantly adversarial-looking behavior out of it)" doesn't give me much confidence. If you're smart enough to design nanotech that can melt all GPUs or whatever (disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist) then you're probably smart enough to figure out when you're playing for keeps, and all AGIs have an incentive not to kill all "operators" in the toy games once they start to realize they're in toy games.

So that's not a great place to be.

The doomedness of cognitive interpretability approaches seems to me to be the weakest. And indeed, this is where it seems to me that many people are focusing their efforts, from one angle or another.

If I may continue coarsely classifying proposals in ways their advocates might not endorse, I'd bin a bunch of proposals I've heard as hybrid approaches, that try to get cognitive-interpretability-style justification by way of heavy-precedent-style justification.

E.g., Paul Christiano’s plan prior to ELK was (very roughly, as I understood it) to somehow get ourselves into a position where we can say "I know the behavior of this system will be fine because I know that its cognition was only seeking fine outcomes, and I know its cognition was only seeking fine outcomes because its cognition is composed of human-esque parts, and I know that those human-esque parts are human-esque because we have access to the ground truth of short human thoughts, and because we have heavy-precedent-style empirical justification that the components of the overall cognition operate as intended."

(This post was mostly drafted before ELK. ELK looks more to me like a different kind of interpretability+precedent hybrid approach — one that tries to get AGI-comprehension tools (for cognitive interpretability), and tries to achieve confidence in those tools via "we tried it and saw" arguments.)

I'm not very optimistic about such plans myself, mostly because I don't expect the first working AGI systems to have architectures compatible with this plan, but secondarily because of the cognitive-interpretability parts of the justification. How do we string locally-human-esque reasoning chunks together in a way that can build nanotech for the purpose of a good pivotal act? And why can that sort of chaining not similarly result in a system that builds nanotech to Kill All Humans? And what made us confident we're in the former case and not the latter?

But I digress. Maybe I'll write more about that some other time.

Cf. Evan Hubinger's post on training stories. From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability, so I'm not too sold on his decomposition, but I endorse thinking about the question of how and where we could (allegedly) end up knowing that the AGI is good to deploy.

(Note that it's entirely possible that I misunderstood Evan, and/or that Evan's views have changed since that post.)

An implicit background assumption that's loud in my models here is the assumption that early AGI systems will exist in an environment where they can attain a decisive strategic advantage over the rest of the world.

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Many locals seem to expect a smoother and slower transition from here to superhumanly capable general-purpose science AI — a transition that somehow leaves no window where the world's most competent AGI can unilaterally dominate the strategic landscape. I admit I have no concrete visualization of how that could go (and hereby solicit implausibly-detailed stories to make such scenarios seem more plausible to me, if you think outcomes like this are likely!). Given that I have a lot of trouble visualizing such worlds, I'm not a good person to talk about where our justifications could come from in those worlds.

I might say more on this topic later, but for now I just want to share this framing, and solicit explicit accounts of how we're supposed to believe that your favorite flavor of AGI is going to do good stuff.

10 comments

From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability

I definitely don't think this—in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:

Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, as unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.

See also: “A transparency and interpretability tech tree”.

I believe this because of how the world looks "brittle" (e.g., nanotech exists) and because lots of technological progress seems cognition-constrained (such as, again, nanotech). This is a big part of why I think heavy-precedent-style justifications are doomed.

Apart from nanotech, what are the main examples or arguments you would cite in favor of these claims?

Separately, how close is your conception of nanotech to "atomically precise manufacturing", which seems like Drexler's preferred framing right now?

Not Nate or a military historian, but to me it seems pretty likely that an actor ~100 human-years more technologically advanced would get a decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years' lead or better research practices still within the human range
    • SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt
    • algorithmic improvements driven by cognitive labor of ML engineers have caused multiple OOM improvement in value/FLOPS
    • SpaceX gets 10x better cost per ton to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower

This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.

One way in which the world seems brittle, in the sense of having free energy that an AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)

Roughly speaking, I think that cognitive interpretability approaches are doomed, at least in the modern paradigm, because we're not building minds but rather training minds, and we have very little grasp of their internal thinking,

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor. There are good reasons to suspect that DL-based general intelligence will end up with something similar, simply due to the convergent optimization pressure to communicate complex thought vectors to/from human brains through a low-bitrate channel.

"Well, it never killed all humans in the toy environments we trained it in (at least, not after the first few sandboxed incidents, after which we figured out how to train blatantly adversarial-looking behavior out of it)" doesn't give me much confidence. If you're smart enough to design nanotech that can melt all GPUs or whatever (disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist) then you're probably smart enough to figure out when you're playing for keeps, and all AGIs have an incentive not to kill all "operators" in the toy games once they start to realize they're in toy games.

Intelligence potential of architecture != intelligence of trained system

The intelligence of a trained system depends on the architectural prior, the training data, and the compute/capacity. Take even an optimally powerful architectural prior - one that would develop into a superintelligence if trained on the internet with reasonable compute - and it would still be nearly as dumb as a rock if trained solely on Atari Pong. Somewhere in between the complexity of Pong and our reality exists a multi-agent historical sim capable of safely confining a superintelligent architecture and iterating on altruism/alignment safely. So by the time that results in a system that is "smart enough to design nanotech", it should already be at least as safe as humans. There are of course ways that strategy fails, but they don't fail because 'smartness' strictly entails unconfineability - which becomes clearer when you taboo 'smartness' and replace it with a slightly more detailed model of intelligence.

A brain-like AGI - modeled after our one working example of efficient general intelligence - would naturally have an interpretable inner monologue we could monitor.

This doesn’t have much to do with whether a mind is understandable. Most of my cognition is not found in the verbal transcript of my inner monologue, partly as I’m not that verbal a thinker, but mostly because most of my cognition is in my nonverbal System 1.

This has been discussed before. Your example of not being a verbal thinker is not directly relevant because (1) inner monologue need not be strictly verbal, and (2) we need only a few examples of strong human thinkers with verbal inner monologues to show that it isn't an efficiency disadvantage - so even if your brain type is less monitorable, we are not confined to that design.

I also do not believe your central claim: based on my knowledge of neuroscience, disabling the brain modules responsible for your inner monologue will not only disable your capacity for speech, it will also seriously impede your cognition and render you largely incapable of executing complex long-term plans.

Starting with a brain-like AGI, there are several obvious low-cost routes to dramatically improve automated cognitive inspectability. A key insight is that there are clear levels of abstraction in the brain (as predicted by the need to compress sensory streams for efficient Bayesian prediction) and the inner monologue is at the top of the abstraction hierarchy, which maximizes information utility per bit. At the bottom of the abstraction hierarchy would be something like V1, which would be mostly useless to monitor (minimal value per bit).

To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.

As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inner monologue.

Sure, there will be some traces. (A lot, even, I kinda suck at this :-).) But I think this already brings us outside of the realm of "interpretable by default" and into "a race between your interpretability and my cognitive steganography".

Note that I, or most humans, didn't have a particular need to develop this exact type of cognitive steganography. (The "elephant in the brain" type of actions-thoughts inconsistency is not exactly this.) As a result, I think we would be able to get much much better very quickly if there was an incentive to do it. (Not talking evolutionary scale. More like "give me a weekend to play with this".) And while I have no incentive to develop these skills, the same cannot be said for a brain-like AI that we are (trying to) interpret as it thinks.

These are good intuitive arguments against these sorts of solutions, but I think there's a more formal argument we can make that these solutions are dangerous because they pose excess false positive risk. In particular, I think they fail to fully account for the risks of generalized Goodharting, as do most proposed solutions other than something like agent foundations.

disclaimer: this is a toy example of a pivotal act, and I think better pivotal-act options than this exist

Would you be able to provide examples of other sorts of pivotal acts you have in mind?