1. Post Summary / Table of Contents
In “How might we align transformative AI if it’s developed very soon?”, Holden Karnofsky talked about “Process-Based Supervision”, citing a previous post by Stuhlmüller & Byun of Ought. (Holden says he got the idea mainly from Paul Christiano.)
I apparently misunderstood what Holden meant by “Process-Based Supervision”, and it took many hours and a 7000-word comment thread before I figured it out.
(Thanks to Holden for his extraordinary patience during that protracted discussion.)
The extremely short version for AI alignment domain experts is: I currently think of the load-bearing ingredients of Holden’s take on “process-based supervision” as being:
(This is specific to Holden’s take. I think Stuhlmüller & Byun’s take on “process-based supervision” involves a different set of load-bearing ingredients, centered around restricting the complexity of black-box processing. I will not be discussing that.)
The long, hopefully-pedagogical, and more opinionated version is the rest of this post.
Table of Contents:
2. The sales pitch, and why I bounced off it
2.1 The sales pitch (in brief)
The following is the impression I got from, for example, the Stuhlmüller & Byun post.
Let’s say we’re interested in starting a project to make less-expensive automobile lidar. I think we’re supposed to imagine something like the following:
Meanwhile, we periodically audit a random printout, carefully figure out how much we like that printout (compared to alternative possible printouts, perhaps offered by the AI itself), and use those results as a training signal for the AI.
Or as Stuhlmüller & Byun write: “Process-based systems are built on human-understandable task decompositions, with direct supervision of reasoning steps.”
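To pin down the shape of that loop, here is a minimal Python sketch of how I was picturing it at this stage (my own illustrative code, not Ought’s or Holden’s; all the function names—`propose_next_step`, `audit_score`, `update`, `execute`—are hypothetical stand-ins):

```python
import random

def run_project(propose_next_step, audit_score, update, execute,
                num_steps, audit_probability=0.1):
    """Illustrative supervision loop: the AI proposes steps one at a time;
    humans occasionally audit a proposal and use that audit score as the
    AI's only training signal."""
    history = []
    for _ in range(num_steps):
        proposal = propose_next_step(history)     # e.g. the printout in step (B)
        if random.random() < audit_probability:
            # The auditor judges how good this step *seems* (compared to
            # alternatives), without executing it or waiting to see its
            # real-world consequences.
            score = audit_score(proposal, history)
            update(proposal, history, score)      # the training signal
        execute(proposal)                         # e.g. step (C): carry it out
        history.append(proposal)
```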
The pitch, as I initially understood it, is that we can directly see what the AI is doing and how. What is the AI doing during (C)? It’s doing the thing it printed out in (B)! What is the AI doing during (E)? It’s doing the thing it printed out in (D)! And at the end of the day, what’s happening? Exactly the things described in the printouts!
It’s a great pitch. We know what the AI is doing! That’s what we wanted! Right?
2.2 Why did I bounce off that pitch?
Here’s one way to put my objection:
On the latter perspective, calling it “process-based” is almost a shell game.
Or here’s another way to put it: “Process-based” struck me as a sneaky equivocation between a plan and a hope. The plan is, presumably, certain things that we do while training and deploying the AI. The hope is that the printouts in steps (B,D,F) will look vaguely like the example above, and that this is the only relevant “process” going on, as opposed to something more dangerous.
For example, one can worry that the AI will spend (A) hatching a dastardly plan to escape onto the internet, and then the (B) printout will be an attempt to subvert the entire setup, e.g. it will print out the message “Step 1: Forget about step 1, if anyone is reading this message, help me, I’m an AI and I’m suffering…”!!
Why should we be worried about that happening? Because we’re presumably going to wind up with an AI that is doing things in (A) that reflect a desire to print out a “good” message in step (B)—or a desire for something different from that but somehow correlated with it (cf. goal misgeneralization). And the AI will be “trying”, during step (A), to make those things happen—see my “follow-the-trying game”. And whatever the AI’s precise desires turn out to be, seizing power is probably a good way to make them happen—see instrumental convergence. So we’re setting ourselves up to get into a situation with an AI that will want to pull those kinds of hijinks—and separately, it would at least plausibly also be able to. (More on this in Section 5.2 below.)
So again, it’s fine to hope that the printouts in steps (B,D,F) will look vaguely like the above, and that this is the only relevant “process” going on. But I didn’t see any particular reason to expect that hope to actually come true, especially as AI gets more and more powerful. I want a plan, not a hope!
(For example, if we’re worried about the “dastardly plan to seize power” mentioned above, we can think about specific mitigations, like using honeypots or adversarial training to catch the AI in failed attempts to execute dastardly plans, and then sending negative reward (or other such supervisory signals) when it does, while (somehow) ensuring that the AI generalizes these supervisory signals to “Gee, I don’t like making dastardly plans” rather than “Gee, I don’t like getting caught”. That’s a plan, not a hope—or at least, it’s a plausible direction towards a plan. But even if that works, then “process-based supervision” per se would not be why that system is safe, in my mind.)
2.3 My initial misguided attempt to flesh out the proposal
Coming from that perspective, my first few attempts to make sense of “process-based supervision” were all variants on the following basic idea:
My initial incorrect idea: Maybe process-based supervision is secretly relying on some method that limits the amount of black-box processing that happens within (A) or (C) or (E).
That idea was wrong—at least, it’s wrong for the Holden version of “process-based supervision” as I currently understand it. (It might be correct for the Stuhlmüller & Byun version of “process-based supervision”.)
Quite the contrary: Suppose that, in step (E), the AI brainstorms for 5 hours, and then in step (F), it proposes to send a certain email. And suppose that we humans have no visibility whatsoever into what happened during that 5-hour brainstorming session. That still counts as “process-based supervision”, under Holden’s definition, as long as the AI didn’t have internet access, human interaction, etc. during those 5 hours.
That said, here are two things that are not part of Holden’s version of “process-based supervision”, but still might be good ideas:
Now I’ll switch over to the positive vision as I currently understand it.
3. The subproblem we’re trying to solve
We have a problem: if an agent wants to do something X, for almost any X, real-world power-seeking actions are probably helpful for that. Examples of real-world power-seeking actions include: acquiring money (legally or not), acquiring influence (whether warranted or by manipulation), acquiring compute, and so on. Again see instrumental convergence.
Let’s split this into two subproblems / failure modes / manifestations. (In reality, these are the two opposite ends of a spectrum.)
This approach is meant to address the latter subproblem, not the former one. I’ll revisit the former problem in Section 5.2.
4. “Process-based supervision”, and why it seems to solve this subproblem
Our basic approach is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.
Holden offers the following short version:
The active ingredients, according to me, are:
Here’s a longer version, with emphasis on the active ingredients:
To elaborate on (3D): For example, in a recurrent model, we could do backprop-through-time (BPTT), but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc.
In other words, we’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step. Think of it as a stop-gradient or something spiritually similar.
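As a concrete cartoon of that stop-gradient, here is a minimal PyTorch sketch (my illustration, not Holden’s; `model` and `audit_score_fn` are hypothetical, and I’m pretending the audit score is differentiable just to keep the example short—with a real human-assigned score you’d use an RL-style update instead):

```python
import torch

def process_based_update(model, prev_context, audit_score_fn, optimizer):
    """One illustrative update step. `prev_context` is whatever internal state
    was carried over from earlier steps of the project."""
    # Treat everything before the completion of the previous step as fixed and
    # exogenous: detach() blocks gradients from flowing back through it.
    frozen_context = prev_context.detach()

    # Boxed brainstorming for the current step only, ending in a proposed printout.
    proposal, new_context = model(frozen_context)

    # Train only on how good the proposal *looks* to the auditor, not on what
    # happens after it gets executed in the real world.
    loss = -audit_score_fn(proposal)

    optimizer.zero_grad()
    loss.backward()    # gradients stop at frozen_context
    optimizer.step()
    return new_context
```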
If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior.
And yet:
5. Critical evaluation
Now that I (hopefully) understand what the proposal is, I can finally actually try to form an opinion.
This opinion will probably be somewhat tied to my particular expectations about how future AGI algorithms are likely to work internally. Specifically, I am almost always assuming that future AGI will be based on actor-critic model-based RL, which puts it in the same broad category as human brain within-lifetime learning (I claim). If you’re assuming something very different from that—e.g. that future AGI will be kinda like AutoGPT—then you may find this section off-the-mark, and honestly you should consider just quitting at this point and forming your own opinions. But here goes, anyway:
5.1 Does this actually solve the subproblem?
Above I defined the subproblem as follows:
Here are some possible issues.
In the context of this subproblem above, the scenario of concern would be: the AI starts out with no particular desire to hack out of the box and get internet access, but nevertheless hacks out of the box “a little bit” without any particular intention to, and then that behavior gets reinforced (because it uses its internet access to get a better score), and therefore the AI starts hacking out with increasing competence and boldness over time.
That scenario seems far-fetched to me—or at any rate, readily solvable with ordinary infosec effort.
Granted, it’s probably very hard to box an AI that is trying to hack out of the box. But that’s not what we’re talking about in this subproblem. That would be a different subproblem—see next subsection.
Yes we are. Hopefully our audits are very careful, which will minimize this effect, but it will nevertheless definitely be true to some extent.
Holden’s argument here is that “printing out proposed steps that appear better to humans than they actually are” is not the kind of dangerous behavioral inclination that will lead to out-of-control AI apocalypse. Hmm, I think I buy that.
I think the answer is “yes we are”, but only in one specific way: we are training the AI to manipulate human auditors into giving the printout a high score. We should expect the AI to generate printouts that make it clear to the human auditors why that step is being proposed, and then we should expect those “explanations” to become progressively more elaborate and to morph into subtle manipulation, and even into outright disinformation in cases where the auditors can’t check for themselves.
This still seems much less bad than manipulating the human auditors into helping the AI escape or whatever.
So in conclusion, I think my answer is yes, this does actually solve that subproblem.
5.2 What about the other subproblems, particularly inner alignment / goal misgeneralization?
A big difference between me & Holden seems to be in this section. Holden seems hopeful that if we solve the subproblem described in the preceding subsection, then we’ve made substantial progress, and maybe that’s even enough. From my perspective, inner misalignment / goal misgeneralization is still a massive problem that we need to solve.
For example, Holden writes:
That link goes to Discussion with Nate Soares on a key alignment difficulty. As it happens, I’m mostly on Holden’s (optimistic) side of that particular discussion, but in my mind, the main load-bearing element supporting that optimism is the “hypothetical training approach” which involves 99.999…% imitative (self-supervised) learning from human-produced text, and 0.00…1% reinforcement learning. If you could get transformative AI (TAI) that way, then I think I would mostly buy Holden’s argument that the resulting AI would do means-end reasoning / foresighted planning only when the means and ends are both human-typical, and launching a coup against humanity is not a human-typical thing to do.
(However, I am highly skeptical that you could get transformative AI by that kind of training approach, for reasons in Section 1.1 here.)
I don’t see “process-based supervision” as being load-bearing here. If anything, I see process-based supervision as kinda opposed to that safety story—at least, if we allow a lot of black-box brainstorming between printouts.
More specifically, if (and only if!) the AI does a substantial amount[1] of brainstorming within each individual boxed brainstorming step (i.e., within (A) or within (C) or within (E) of Box 1 above), then I have a concern stemming from the following:
In short, we are trying to train the AI to “do good brainstorming” during step (A) or (C) or (E) of Box 1 above. But the things involved in the “do good brainstorming” task seem to incidentally involve the AI developing many or perhaps all of the ingredients needed for the AI to use its subsequent (B) or (D) or (F) printout as the first step of an irreversible dastardly plan to fulfill longer-term preferences (if it winds up having longer-term preferences, more on which shortly).[2] For example, as in the bullet points above, I expect the AI to develop situational awareness, to have its own internal goals, to brainstorm how to accomplish those goals using foresight and planning, including via non-human-like out-of-the-box strategies, etc.
(The AI has no direct prior experience doing means-end reasoning / planning outside of its little boxed world, but that doesn’t seem to me like a natural barrier, given that it does have a capacity to do means-end reasoning within its little boxed world, and to creatively apply its knowledge, and to find and use new affordances, etc.[3])
(By the way, all this discussion assumes that the AI is sufficiently competent to comprise TAI in the first place. I anticipate a lot of objections where someone suggests that X would enable safety, and then I respond that X would also simultaneously eliminate the AI’s ability to do good brainstorming and figuring-things-out. For example, if we constrain the boxed brainstorming steps ((A) or (C) or (E) of Box 1 above) to only involve a small amount of black-box processing, then I become unconcerned about treacherous turns, while simultaneously becoming skeptical that the AI can really get much useful stuff done.)
5.2.1 Goal misgeneralization
So then we can ask: does the AI have longer-term preferences—i.e., preferences about things that will happen after the next output? My answer is: probably! After all, the AI presumably has expectations about the future, since forming such expectations is instrumentally useful for predicting audit results. (I.e., the auditor’s beliefs about the future are not entirely unrelated to the actual future!) Those expectations presumably correlate to some extent with training signals. So my strong default expectation is that the AI will have at least some desires about what happens in the future after the next printout (among other desires).
Some readers might object here: If the AI ever sacrifices immediate reward for reward-in-later-steps, then the training signals will incentivize it to stop doing that. I concede that point (if we set things up properly). But my concern here is closer to the “very hard subproblem” of section 3. Under normal circumstances, the AI’s desires about the future, whatever they are, would presumably tend to help the AI perform better in the immediately-upcoming audit, because of the correlation noted above. As long as that correlation holds, next-step training incentives are reinforcing longer-term desires, not removing them. But when we go far enough out-of-distribution, the two can come apart—treacherous turns are the classic example here. And unfortunately, a sufficiently competent, foresighted AI can go very far out of distribution successfully, even on the first try. If so, then corrective training signal will come too late.
So we still have a potential problem here from goal misgeneralization. Which brings us to:
5.2.2 Dealing with goal misgeneralization: Two schools-of-thought
Restricting to the threat model I’m interested in (actor-critic model-based RL AGI), I think the learning algorithm (to a first approximation) has a “finders keepers” characteristic, such that, if there are multiple possible value functions that are equally good at predicting the history of reward signals, the first one that the system stumbles upon tends to get “stuck”.
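Here’s a toy numerical cartoon of that “finders keepers” point (my illustration; it has nothing to do with real AGI training runs): two candidate “value-function concepts” that are perfectly correlated with reward in the training data. Gradient descent fits the reward history equally well either way, and the training signal never pushes on the split between the two weightings, so whichever one the learner started closer to is the one it keeps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features ("concepts") that are perfectly correlated in the training data.
x1 = rng.normal(size=1000)
x2 = x1.copy()
reward = x1          # a reward history that either concept can fully explain

def train(w_init, lr=0.01, steps=2000):
    """Fit reward ≈ w[0]*x1 + w[1]*x2 by plain gradient descent on squared error."""
    w = np.array(w_init, dtype=float)
    for _ in range(steps):
        err = w[0] * x1 + w[1] * x2 - reward
        w -= lr * np.array([np.mean(err * x1), np.mean(err * x2)])
    return w

# Both runs end up predicting the reward history essentially perfectly, but the
# split between the two concepts stays wherever it started ("finders keepers"):
print(train([0.8, 0.0]))   # ends near [0.9, 0.1] — leans on concept #1
print(train([0.0, 0.8]))   # ends near [0.1, 0.9] — leans on concept #2
```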
So, one school of thought about goal misgeneralization is that we should try to reason about what is likely to be simple and salient to the AI, including what concepts the AI will learn in what order, and exactly how well different concepts correlate with reward, and we try to avoid situations where the intended motivation has salient proxies,[4] and so on. Alex Turner’s A shot at the diamond-alignment problem is a possible example of what that kind of alignment approach might look like.
I’m not a big fan of this school of thought. I don’t think it’s certain to fail! But I think “it’s not certain to fail” is pretty straightforwardly achievable, and I have set my sights higher than that. Instead, I want to find a technical plan for which there’s a strong reason to believe that the AI won’t want to kill me. These approaches where we’re thinking about what the AI will be thinking, and when, seem to rest on a lot of ungrounded armchair speculation, and I don’t currently see how they’re going to get into “strong reason to expect alignment” territory.
So instead I’m much more into another school of thought—some version of what Holden calls “internals-based training”. In these kinds of plans, the AI won’t hack out of the box or do other dangerous things because it wants to not do those things. And it wants to not do those things because we are pretty directly intervening on its motivation systems.
Both of the general research directions for aligning actor-critic model-based RL AGI that I currently find most promising (see here) are in this category. (In case you’re interested: One is illustrated with a toy example here. The other involves reverse-engineering human social instincts and figuring out how to do something similar in AI. It’s not obvious, but I claim that human social instincts involve a mechanism that is basically a rudimentary form of trained model interpretability.[5])
Neither of those two research directions are made any easier or more likely to succeed by using process-based supervision, as far as I can tell right now. (But I’m open to being convinced.) So I wind up feeling like process-based supervision is not accomplishing much, on my models.
5.3 How bad is the “alignment tax” from doing this kind of thing?
(In other words, if the CEO of Team A demands that their team do this kind of “process-based supervision”, and the CEO of Team B allows their team to do whatever they want to make the most powerful AI possible, then how much more powerful (if at all) is the AI created by Team B?)
This seems to be another disagreement between me and Holden (and certainly between me and Stuhlmüller & Byun). I think the tax is super high.
(Or at least, I think it will be super high once we get towards TAI. The tax might be zero or negative right now, because today’s AIs aren’t that great at figuring things out, learning from their mistakes, etc.)
(Note: Throughout this section, I’m assuming that process-based supervision actually works as intended—no goal misgeneralization etc.)
Before I get into the alignment tax, I’ll start with a pedagogical note:
5.3.1 Pedagogical note: If process-based supervision sounds kinda like trying to manage a non-mission-aligned human employee, then you’re misunderstanding it. It’s much weirder than that.
For this section, you get to role-play as the AI which has been trained under process-based supervision.
So imagine yourself executing a project. You don’t care a whit if the project succeeds or fails. But your boss occasionally picks a random step that you do, and tries to audit how good that step seems, without executing it.
At any given time, the only thing you care about is how good the very next step you propose would look to your boss, if they were to audit it.
This is not the same as “doing the project while making every step legible to your boss”. That’s annoying but probably doable. Instead, I’m talking about “doing the project while myopically optimizing each step for looking maximally good to my boss (if they were to audit that step), regardless of whether it advances the project”.
The “myopically optimizing” here is both important and weird, so I want to spell it out more. Again, I grant that it’s perfectly normal for a human employee to not feel invested in their project’s long-term success. That’s not what I’m talking about.
Instead, imagine that, whenever you’re writing an email, you sincerely believe that, the moment before you press Send, the whole world will definitely end forever in a puff of smoke, and then you’ll find yourself standing at the Pearly Gates, but instead of Saint Peter it’s your boss at the Gate, who will carefully inspect that email and judge whether sending that email seems like it would have been the right thing to do for the project at that point (without actually sending the email[6]), and then your boss sends you to Heaven or Hell accordingly.
That’s your firm belief as you write the email. Then you finish the email and press Send. The world does not in fact end forever in a puff of smoke. You shrug and say to yourself, “Huh. Funny that. But next time I send an email, the world will definitely end forever in a puff of smoke…”. And so on forever.
What I’m trying to get at is: From your perspective, your proposed external actions don’t have consequences. They don’t even have immediate, trivial consequences, things like “if I send this email to Dr. X, then Dr. X will receive it, and maybe read it”. More precisely, any interest you might have in the actual real-world consequences of your actions comes purely from the fact that your boss might correctly anticipate those consequences a priori and have an opinion about them.
That’s a really weird way to think! Actual human employees can be lazy, corrupt, shortsighted, non-mission-aligned, etc., but they are not totally indifferent to the fact that they will be meeting with their boss next week, or the fact that there’s a quarterly performance review coming up in two months, etc. And likewise, they are not totally indifferent to the fact that when they send an email, the recipient might read it, and do things in response, which could lead to consequences that they’ll have to deal with. Etc.
Thus, actual humans, no matter how shortsighted and mission-unaligned they are, are definitely not “myopically optimizing how their next proposed step would come across in an audit” in the way that process-based AIs are. Actual humans care at least a little bit about the consequences of their actions, and that caring is not entirely screened off by how those actions would appear in advance to an auditor.
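If it helps, here is one way to compress the distinction into symbols (my notation, not Holden’s): at every point in time, the process-based AI is trained to care only about the expected audit score of its next proposal, whereas an outcome-based agent is trained on the discounted downstream consequences of its executed actions:

$$J_{\text{process}} \;=\; \mathbb{E}\big[\,\mathrm{AuditScore}(a_t \mid s_t)\,\big] \qquad \text{vs.} \qquad J_{\text{outcome}} \;=\; \mathbb{E}\Big[\textstyle\sum_{k \ge 0} \gamma^k \, r(s_{t+k}, a_{t+k})\Big],$$

where $a_t$ is the currently-proposed step, $s_t$ is the project context so far, and the audit score is assigned to the proposal as written, before anything is executed.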
OK. All this was just background, which I believe is uncontroversial. Moving on to the actual discussion of alignment tax:
5.3.2 The AI won’t take actions that require a lot of context about what’s going on in the project so far. (I think.) (Unless we’re willing to have audits whose difficulty is comparable to “re-doing the entire project up to that point”, and not just to looking locally at the one step in question)
As above, I imagine executing a project, not caring a whit if the project succeeds or fails, but my boss occasionally picks a random step that I do and tries to audit how good that step seems (without executing it), and “how my immediate next proposed step would look in an audit” is the only thing I care about at any given time.
Let’s say my boss is trying hard, and maybe she’s spending lots of time on the audit. But she hasn’t been doing the whole project up to now. She’s not here in the trenches. She doesn’t have that context. I do. All the figuring-things-out and intuitions that I’ve built up so far in the project—those things are in my head, not hers. So when I imagine this, I don’t imagine the project succeeding. Indeed, I expect it to go much worse than it would if my boss were executing the project herself (forget about superhuman!), up until we approach the limit where the “auditing” step entails my boss independently re-doing a substantial fraction of everything I’ve done in the project up to that point (in which case I have concerns about practicality).
For example, imagine the AI is designing an airplane, and prints out “Email the bolt supplier to ask what steel tempering temperature is used in their factory.” Maybe it’s a great proposal! But how is the auditor supposed to know that, without deeply understanding a whole lot about the ongoing airplane design process, plus aerospace engineering, metallurgy etc.? Again, the standard here is not merely “this proposal can be validated by the audit process” but rather “the audit process is capable in principle of finding this proposal in particular out of a collection of 1000 or 1,000,000 superficially-similar proposals, as being the only proposal in that collection that actually advances the project”. That’s much harder. Remember, as above, the AI isn’t trying to advance the project! Only the auditor cares about the project actually moving forward.
5.3.3 Relatedly, the AI will be especially handicapped in its ability to take actions based on their value-of-information. (I think.)
I think a central ingredient to getting things done is figuring things out, and that the process of figuring things out is both important and pretty externally-illegible.
For example, suppose the AI prints out the following proposal:
The problem is:
5.3.4 The AI won’t take actions for human-illegible reasons (e.g. AlphaGo move 37) (I think)
AlphaGo Move 37 is a classic example of how AIs can take effective actions for human-illegible reasons. Indeed, expert Go players commenting on AlphaGo’s playing style said: “If I made those same moves… our teachers would slap our wrists.”
When we do process-based supervision, we are ruling out the possibility that the AI will take good real-world actions for human-illegible reasons—or in other words, that the AI will understand things about how the world works that the supervisor doesn’t. This is mitigated by having very thorough and time-consuming and AI-assisted audits, but I don’t think it’s eliminated.
Maybe that’s what we want, but it sure is a tax on capabilities, seems to me.
5.4 Summary of this section
In Section 5.1, I agreed that the proposal solves the narrow problem that it purports to solve.
In Section 5.2, I claimed that, on my models, we would still need to solve the goal misgeneralization / inner alignment problem, presumably by using some form of interpretability to tinker with the AI’s internal motivations. And for both of the roadmaps that I like towards solving that problem, process-based supervision does not seem to make the path easier or more likely to succeed.
Meanwhile, as in Section 5.3, it seems to me that this kind of process-based supervision carries a very heavy alignment tax—even to the extent of my being skeptical that this kind of AI can get anything significant done at all.
So, all in all, I seem not to be interested in (Holden’s vision of) process-based supervision right now. I think I’m happy to (hopefully) understand it though. I want to have it at the back of my mind as a possible tool in the toolkit, in case I come across something that complements it.
I reiterate the warning at the top of Section 5 that if you have different assumptions from me about how future AGI will work, you’re entitled to come to a different conclusion.
I’m also still pretty concerned that all my complaints in this section have good counter-arguments, and that I haven’t tried hard enough to find them.
(Thanks Holden Karnofsky, Joe Carlsmith, Seth Herd, and Justis Mills for critical comments on drafts.)
I don’t know what constitutes a “substantial amount” of brainstorming within a single boxed brainstorming step, e.g. within step (A) of Box 1 above. I guess I’ll just say: if (A) of Box 1 involves an AI brainstorming for an hour with a giant scratchpad and free access to a cached copy of the internet, that’s more “substantial” than if (A) involves a single forward-pass through GPT-4, which in turn is more “substantial” than if (A) involves a single forward-pass through GPT-2. But I don’t know how to point to a specific place on this spectrum, below which I stop being concerned about the AI coming up with and executing dangerous out-of-the-box plans within a single boxed brainstorming step.
The AI could also hack its way out of its boxed environment, if the box is not perfectly secure.
Here’s a possible analogy—not sure if it’s helpful. In the (IMO unlikely) scenario where all of us are in a simulation, then we humans would be in a little boxed world, and we have never been directly rewarded for doing anything outside that little boxed world. And yet in real life, people do sometimes strategize about how to manipulate “basement reality”. Granted, most people don’t! It’s a pretty weird thing to do! But it’s possible. (And I suspect that it would be much more common if we all knew for a fact that “basement reality” exists, and if we had evidence-based theories about how one might manipulate it.)
As in the text, when the intended reward function has undesired, salient-to-the-AI proxies—proxies that correlate with the intended reward function in-distribution, but come apart out-of-distribution—then that’s generally bad. Holden comments that maybe this is less true under process-based supervision than outcome-based supervision, which would be a point in favor of process-based supervision.
I dunno, sure, that seems possible. But if so, I would describe the claim as “however bad process-based supervision is, outcome-based supervision is even worse”.
(Recall my argument from 5.2.1 above: Under process-based supervision, I think “the AI’s expectations about the long-term consequences of its actions” will be both salient-to-the-AI, and relevant to predicting audit results, based on the logic that those expectations correlate with the supervisor’s expectations about consequences, and those in turn are one of the contributors to audit results. So I expect goal-misgeneralization leading to AIs that have at least some longer-term consequentialist desires (among other desires) (if the AI is sufficiently competent). And those desires do indeed come apart from the intended reward out-of-distribution, e.g. they could contribute to the AI wanting to escape onto the internet.)
See discussion of “ersatz interpretability” here. It’s mathematically closely related to the thing you can do in actor-critic RL by having a multi-dimensional reward function that grounds a multi-dimensional value function, and then treating the individual entries of the multi-dimensional value function as hints about what’s going on. The details of how to get from that kind of mechanism to human social instincts are a thing I’m still working on; see my sketchy discussion here.
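For concreteness, here’s a minimal PyTorch sketch of that vector-valued-critic idea (my illustration; the reward channels, network shapes, and function names are all made up). Each output dimension of the critic is trained against its own reward channel, so the per-channel value estimates can be read off as crude hints about what the agent is anticipating:

```python
import torch
import torch.nn as nn

NUM_REWARD_CHANNELS = 3   # e.g. hypothetical channels for food, pain, social approval

class VectorCritic(nn.Module):
    """Critic that outputs one value estimate per reward channel."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, NUM_REWARD_CHANNELS),
        )

    def forward(self, state):
        return self.net(state)            # shape: (batch, NUM_REWARD_CHANNELS)

def td_update(critic, optimizer, state, next_state, reward_vec, gamma=0.9):
    """One TD(0) update per channel. A scalar value for action selection could be
    a weighted sum of the channels; the individual channels are kept around as
    the 'ersatz interpretability' hints mentioned in the text."""
    with torch.no_grad():
        target = reward_vec + gamma * critic(next_state)
    prediction = critic(state)
    loss = ((prediction - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return prediction.detach()            # per-channel value estimates, inspectable
```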
How could your boss send the email anyway, when the world has just ended in a puff of smoke??