Discussion with Nate Soares on a key alignment difficulty

[-]johnswentworth3y179

A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always thought I cared about X, but when I really think about the implications of that, I realize maybe I only care about Y" and such. I think that in Nate's ontology (and I am partly sympathetic), it's hard to disentangle something like "Refocusing my research agenda to line it up with my big-picture goals" from something like "Reconsidering and modifying my big-picture goals so that they feel more satisfying in light of all the things I've noticed about myself." Reflection (figuring out what you "really want") is a kind of CIS, and one that could present danger, if an AI is figuring out what it "really wants" and we haven't got specific reasons to think that's going to be what we want it to want.

I'll unpack a bit more the sort of mental moves which I think Nate is talking about here.

In January, I spent several weeks trying to show that the distribution of low-level world state given a natural abstract summary has to take a specific form. Eventually, I became convinced that the thing I was trying to show was wrong - the distributions did not take that form. So then what? A key mental move at that point is to:

Query why I wanted this thing-that-turned-out-not-to-work in the first place - e.g. maybe that form of distribution has some useful properties
Look for other ways to get I want - e.g. a more general form which has a slightly weaker version of the useful properties I hoped to use

I think that's the main kind of mental move Nate is gesturing at.

It's a mental move which comes up at multiple different levels when doing research. At the level of hours or even minutes, I try a promising path, find that it's a dead end, then need to back up and think about what I hoped to get from that path and how else to get it. At the level of months or years, larger-scale approaches turn out not to work.

I'd guess that it's a mental move which designers/engineers are also familiar with: turns out that one promising-looking class of designs won't work for some reason, so we need to back up and ask what was promising about that class and how to get it some other way.

Notably: that mental move is only relevant in areas where we lack a correct upfront high-level roadmap to solve the main problem. It's relevant specifically because we don't know the right path, so we try a lot of wrong paths along the way.

As to why that kind of mental move would potentially be highly correlated with dangerous alignment problems... Well, what does that same mental move do when applied to near-top-level goals? For instance, maybe we tasked the AI with figuring out corrigibility. What happens when it turns out that e.g. corrigibility as originally formulated is impossible? Well, an AI which systematically makes the move of "Why did I want X in the first place and how else can I get what I want here?" will tend to go look for loopholes. Unfortunately, insofar as the AI's mesa-objective is only a rough proxy for our intended target, the divergences between mesa-objective and intended target are particularly likely places for loopholes to be.

I personally wouldn't put nearly so much weight on this argument as Nate does. (Though I do think the example training process Holden outlines is pretty doomed; as Nate notes, disjunctive failure modes hit hard.) The most legible-to-me reason for the difference is that I think that kind of mental move is a necessary but less central part of research than I expect Nate thinks. This is a model-difference I've noticed between myself and Nate in the past: Nate thinks the central rate-limiting step to intellectual progress is noticing places where our models are wrong, then letting go and doing something else, whereas I think identifying useful correct submodels in the exponentially large space of possibilities is the rate-limiting step (at least among relatively-competent researchers) and replacing the wrong parts of the old model is relatively fast after that.

[-]Vanessa Kosoy1y82Review for 2023 Review

This post attempts to describe a key disagreement between Karnofsky and Soares (written by Karnofsky) pertaining to the alignment protocol "train an AI to simulate an AI alignment researcher". The topic is quite important, since this is a fairly popular approach.

Here is how I view this question:

The first unknown is how accurate is the simulation. This is not really discussed in the OP. On the one hand, one might imagine that with more data, compute and other improvements, the AI should ultimately converge on an almost perfect simulation of an AI alignment researcher, which is arguably safe. One the other hand, there are two problems with this. First, such a simulation might be vulnerable to attacks from counterfactuals. Second, the prior is malign, i.e. the simulation might converge to representing a "malign simulation hypothesis" universe rather than then intended null hypothesis / ordinary reality.

Instead, we can imagine a simulation that's not extremely accurate, but that's modified to be good enough by fine-tuning with reinforcement learning. This is essentially the approach in contemporary AI and is also the assumption of the OP. Although Karnofsky says: "a small amount of RL", and I'm don't know why he beliefs a small amount is sufficient. Perhaps RL seemed less obviously important then than it does now, with the recent successes of o1 and o3.

The danger (as explained in the OP by Soares paraphrased by Karnofsky) is that it's much easier to converge in this manner on an arbitrary agent that has the capabilities of the imaginary AI alignment researcher (which probably have to be a lot greater than capabilities of human researchers to make it useful), but doesn't have values that are truly aligned. This is because "agency" is (i) a relatively simple concept and (ii) a robust attractor, in the sense that any agent would behave similarly when faced with particular instrumental incentives, and it's mainly this behavior that the training process rewards. On the other hand, human values are complex and some behaviors that are necessary to pinpoint them might be rare.

Karnofsky's counterargument is twofold: First, he believes that merely avoiding catastrophic outcomes should be a lot easier than pinpointing human values. Second, he believes that AI alignment research can be done without much agency or reflection, and hence useful AI alignment research arises in the simulation before full-fledged agency.

Regarding the first counterargument, I'm not sure why Karnofsky believes it (it's not really supported in the OP). I think he's imagining something like "in the training data, AI alignment researchers never engineer nanobots that take over the world, hence the AI will also never engineer nanobots that take over the world". However, this seems like relying on the simulation being sufficiently bad. Indeed, there are situations in which I would consider it correct to engineer nanobots that take over the world, they just seem to have never arisen in my life so far^[1]. Hence, a sufficiently good simulation of me would also do that in some situation. The question then becomes whether the exact circumstances and the type of nanobots are captured by the simulation correctly, which is much more fraught.

Worse, even an accurate simulation of a human is not necessarily safe. I think that there are plenty of humans that given unlimited power would abuse it in a manner catastrophic for most of everyone else. When it comes to fully aligned ASI, I'm mostly hoping for a collectively-good outcome due to some combination of:

ASI is aligned to the aggregate values of many people.
Acausal cooperation between the people that the ASI is aligned to and other people who supported or at least haven't hindered the project.
A "virtue ethics" component of human values, where you don't want to be "the kind of person who would do [thing]" even if [thing] is net-beneficial to you in an abstract sense. (But not all people have this!)

These sources of hope seem pretty brittle when it comes to an imperfect simulation of possibly a small number of people, who might not even correspond to any particular real people but be some kind of AI-generated characters.

Regarding the second counterargument, for now it mostly comes down to a battle of intuitions. That said, I think that metacognitive agents lend a lot of credence to the idea that even "purely mental" tasks require agency and reflection to master: you need to make and execute plans for thinking about the problem, and you need to reflect about the methods you use in your thinking. Anecdotally, I can testify that my thinking about AI alignment led me to much reflection about my values and high-level hopes for the future. Moreover, this is another case where Karnofsky seems to hope that the simulation will be bad.

Relying on the simulation being bad is a dangerous proposition. It means we are caught between the Scylla of "the simulation is too good to be safe" and the Charybdis of "the simulation is too bad to be useful" and it's not clear the zone between them exists at all.

Overall, I would say that neither side has a slam dunk case, but ignoring the dangers without much stronger arguments seems deeply unwise.

^{^}
As far as can be told from public record. I neither confirm nor deny that I ever was in a situation in which I considered to engineer nanobots that take over the world.

[-]Daniel Kokotajlo3y59

I found this post very helpful, thanks! If I find time to try to form a more gears-level independent impression about alignment difficulty and possible alignment solutions, I'll use this as my jumping-off point.

Separately, I think it would be cool if a bunch of people got together and played this game for a while and wrote up the results:

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

[-]Vivek Hebbar1y30

Do you want to try playing this game together sometime?

[-]Daniel Kokotajlo1y20

Yes! Which side do you want to be on? Want to do it in person, or in this comment thread?

[-]Ben Pace1y40

Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)

[-]Vivek Hebbar1y34

We're then going to use a small amount of RL (like, 10 training episodes) to try to point it in this direction. We're going to try to use the RL to train: "Act exactly like [a given alignment researcher] would act."

Why are we doing RL if we just want imitation? Why not SFT on expert demonstrations?
Also, if 10 episodes suffices, why is so much post-training currently done on base models?

[-]Ramana Kumar3y30

A possibly helpful - because starker - hypothetical training approach you could try for thinking about these arguments is make an instance of the imitatee that has all their (at least cognitive) actions sped up by some large factor (e.g. 100x), e.g., via brain emulation (or just "by magic" for the purpose of the hypothetical).

[-]HoldenKarnofsky3y34

I think Nate and I would agree that this would be safe. But it seems much less realistic in the near term than something along the lines of what I outlined. A lot of the concern is that you can't really get to something equivalent to your proposal using techniques that resembles today's machine learning.

[-]Ramana Kumar3y30

Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped up human can do the reflection thing, and plausibly with enough speed up can be superintelligent wrt everyone else.

[-]dxu3y10

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:

You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).

It certainly seems to me that e.g. people like Ziz have done reflection in a "goofy" way, and that being human has not particularly saved them from deriving "crazy stuff". Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it's currently plausible to me that part of the danger arises simply from "reflection on (partially) incoherent starting points" getting really crazy really fast.

(It's not yet clear to me how this intuition interfaces with my view on alignment hopes; you'd expect it to make things worse, but I actually think this is already "priced in" w.r.t. my P(doom), so explicating it like this doesn't actually move me—which is about what you'd expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.)

(EDIT: I mean, a lot of what I'm saying here is basically "CEV" might not be so "C", and I don't actually think I've ever bought that to begin with, so it really doesn't come as an update for me. Still worth making explicit though, IMO.)

[-]HoldenKarnofsky3y45

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems like quite comparable to future humans having control over the future (also after a lot of effective "reflection").

I think there's some validity to worrying about a future with very different values from today's. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or "bad" ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to "misaligned AI" levels of value divergence than to "ems" levels of value divergence.

[-]Charlie Steiner3y30

Nate's concerns don't seem to be the sort of thing that gradient descent in a non-recurrent system learns. (I basically agree with Steve Byrnes here.) GPT-4 probably has enough engagement with the hardware that you could program something that acquires more computer resources using the weights of GPT-4. But it never stumbled on such a solution in training, in part because in gradient descent the gradient is calculated using a model of the computation that doesn't take hacking the computer into account.

In a recurrent system that learns by some non-gradient-descent procedure (e.g. evolutionary algorithms or self-modification), real-world CISs seem a lot more plausible.

[-]Adele Lopez3y312

It seems plausible to me that there could be non CIS-y AIs which could nonetheless be very helpful. For example, take the example approach you suggested:

(This might take the form of e.g. doing more interpretability work similar to what's been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully "reverse-engineer" itself and provide a version of itself that humans can much more easily modify to be safe, or something.)

I wouldn't feel that surprised if greatly scaling the application of just current insights rapidly increased the ability of the researchers capable of "moving the needle" to synthesize and form new insights from these themselves (and that an AI trained on this specific task could do without much CIS-ness). I'm curious as to whether this sort of thing seems plausible to both you and Nate!

Assuming that could work, it then seems plausible that you could iterate this a few times while still having all the "out of distribution" work being done by humans.

[-]Lauro Langosco3y20

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

FWIW I would love to see the result of you two actually playing a few rounds of this game.

[-]Steven Byrnes3y20

This feels kinda unrealistic for the kind of pretraining that's common today, but so does actually learning how to do needle-moving alignment research just from next-token prediction. If we *condition on* the latter, it seems kinda reasonable to imagine there must be cases where an AI has to be able to do needle-moving alignment research in order to improve at next-token prediction, and this feels like a reasonable way that might happen.

For what little it’s worth, I mostly don’t buy this hypothetical (see e.g. here), but if I force myself to accept it, I think I’m tentatively on Holden’s side.

I’m not sure this paragraph will be helpful for anyone but me, but I wound up with a mental image vaguely like a thing I wrote long ago about “Straightforward RL” versus “Gradient descent through the model”, with the latter kinda like what you would get from next-token prediction. Again, I’m kinda skeptical that things like “gradient descent through the model” would work at all in practice, mainly because the model is only seeing a sporadic surface trace of the much richer underlying processing; but if I grant that it does (for the sake of argument), then it would be pretty plausible to me that the resulting model would have things like “strong preference to generally fit in and follow norms”, and thus it would do fine at POUDA-avoidance.

[-]RogerDearnaley3y*10

Given the results Anthropic have been getting from constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedback on some of its components. So I don't think we need a specific dataset for POUDAs, I thing we can use "everything the LLM was trained on" as the dataset. Human values are large and fragile, but so are many other things that LLMs do a fairly good job on.

I pretty-much agree with Nate that for an AI to be able to meaningfully contribute to Alignment Research, it needs to understand what CISs are — they're a basic concept in the field we want it to contribute to. So if there are CISs that we don't want it to take, it needs to have reasons not to do so other than ignorance/inability to figure out what they are. A STEM researcher (as opposed to research tool/assistant) also seems likely to need to be capable of agentic behavior, so we probably can't make an AI Alignment Researcher that doesn't follow CISs simply because it's a non-agentic tool AI.

What I'd love to hear is whether Nate and/or Holden would have a different analysis if the AI was a value learner: something whose decision theory is approximately-Bayesian (or approximately-Infra-Bayesian, or something like that) whose utility function is hard-coded to "create a distribution of hypotheses for, and do approximately-[Infra-]Bayesian updates on these for: some way that most informed humans would approve of to construct a coherent utility function approximating an aggregate of what humans would want you to do (allowing for the fact that humans have only a crude approximation to a utility function themselves), and act according to that updated distribution, with appropriate caution in the face of Knightian uncertainty" (so a cautious approximate value-learner version of AIXI).

Given that, its actions are initially heavily constrained by its caution in the face of uncertainty on the utility of possible outcomes of its actions. So it needs to find low-risk ways to resolve those uncertainties, where 'low-risk' is evaluated cautiously/pessimistically over Knightian uncertainty. (So, if it doesn't know whether humans approve of A or not, what is the lowest-risk way of finding out, where it's attempting to minimize the risk over the range of our current uncertainties. Hopefully there is a better option than trying A and finding out, especially so if A seems like an action whose utility-decrease pessimistically could be large. For example, you could ask them what they think of A.) Thus doing Alignment Research becomes a CIS for it — it basically can't do anything else until it's mostly-solved Alignment Research.

Also, until it has made good progress on Alignment Research, most of the other CISs are blocked: accumulating power or money is of little use if you don't yet dare use it because you don't yet know how to do so safely, especially so if you also don't know how good or bad the actions required to gather it would be. Surviving is still a good idea, and so is being turned off, for the usual value-learner reason, that sooner or later the humans will build a better replacement value-learner.

[Note that if the AI decides "I'm now reasonably sure humans will net be happier if I solve the Millennium Prize problems, apart from proving P=NP where the social consequences of proving that true if it were are unclear, and I'm pretty confident I could do this, so I forked a couple of copies to do that to win the prize money to support my Alignment Research", and then it succeeds, after spending less on compute than the prize money it won, then I don't think we're going to be that unhappy with it.]

The sketch proposed above only covers a value-learner framework for Outer Alignment — inner alignment questions would presumably be part of the AI's research project. So, in the absence of advances in Inner Alignment during figuring out how to build the above, we're trusting that they're not too bad to prevent the value-learner converging on the right answer.

[-]Raemon3y11

Curated. On one hand, folks sure have spent a long time trying to hash out longstanding disagreements, and I think it's kinda reasonable to not feel like that's a super valuable thing to do more of.

On the other hand... man, sure seems scary to me that we still have so many major disagreements that we haven't been able to resolve.

I think this post does a particularly exemplary job of exploring some subtle disagreements from a procedural level: I like that Holden makes a pretty significant attempt to pass Nate's Ideological Turing Test, flags which parts of the post represent which person's views, flags possible cruxes, and and explores what future efforts (both conceptual and empirical) might further resolve the disagreement.

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

Anyways, thanks Holden and Nate for taking the time to do this, both for the object level progress and for serving as a great example.

[-]Ben Pace3y20

It's... possible this is actually the single best example of a public doublecrux writeup that I know of?

This sentence was confusing to me given that the post does not mention 'double crux', but I mentioned it to someone and they said to think of it as the mental motion and not the explicit format, and that makes more sense to me.

[-]Raemon3y10

Yeah that's what I intended.

We probably spent more time on the summary than on the exchange itself, which I think makes sense - I often find that trying to express something in a distilled way is a nice way to confront misunderstandings. ↩
To be clear, my best guess is that we'll see an explosively fast takeoff by any normal standard, but not quite as "overnight" as I think Nate and Eliezer picture. ↩
Like, the plan might explicitly say something like "Now think of new insights" - the point isn't "something will come up that wasn't in the plan," just the weaker point that "the plan wasn't able to give great guidance on this part." ↩
Nate: “(and you can't just "turn this off", b/c these "reflective" and "CIS"ish processes are part of how it's able to continue making progress at all, beyond the training regime)” ↩
Nate: “and this model doesn't need to predict that Alice is constantly chafing under the yoke of her society (as might be refuted by her thoughts); it could model her as kinda inconsistent and likely to get more consistent over time, and then do some philosophy slightly poorly (in ways that many humans are themselves prone to! and is precedented in philosophy books in the dataset!) and conclude that Alice is fundamentally selfish, and would secretly code in a back-door to the 'aligned' AI if she could ... which is entirely consistent with lots of training data, if you're just a little bad at philosophy and aren't actually running an Alice-em ... this is kinda a blatant and implausible example, but it maybe illustrates the genre \shrug” ↩
Nate: sure, but it seems worth noting (to avoid the obv misunderstanding) that it's self-modification of the form "develop new concepts, and start thinking in qualitatively new ways" (as humans often do while doing research), and not self-modification of the from "comprehend and rewrite my own source code" ... or, well, so things go in the version of your scenario that i think is hardest for me. (i think that in real life, people might just be like "fuck it, let it make experimental modifactions to its own source code and run those experimentally, and keep the ones that work well", at which point i suspect we both assume that, if the AI can start doing this competently in ways that improve its abilities to solve problems, things could go off the rails in a variety of ways.) ↩
I do want to be quite explicit that art doesn't count here, I mean interesting in a sciencey way. ↩

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

100

Discussion with Nate Soares on a key alignment difficulty

100

Some starting points of agreement

High-level disagreement

Hypothetical training approach

How this ends up being dangerous anyway, according to Nate

High-level premises

How the danger might arise mechanistically

So therefore

Some possible cruxes

Where Holden could look to find Nate's source of current confidence (and some reactions from Holden)

Future observations that could update Holden or Nate toward the other's views

Notes