All of Cody Breene's Comments + Replies

When you say, "take over", what do you specifically mean? In the context of a GPT descendent, would take over imply it's doing something beyond providing a text output for a given input? Like it's going out of its way to somehow minimize the cross-entropy loss with additional GPUs, etc.? 

4Paul Christiano
Takeover seems most plausible when humans have deployed the system as an agent, whose text outputs are treated as commands that are sent to actuators (like bash), and which chooses outputs in order to achieve desired outcomes. If you have limited control over what outcome the system is pursuing, you can end up with useful consequentialists who have a tendency to take over (since it's an effective way to pursue a wide range of outcomes, including natural generalizations of the ones it was selected to pursue during training). A few years ago it was maybe plausible to say that "that's not how people use GPT, they just ask it questions." Unfortunately (but predictably) that has become a common way to use GPT-4, and it is rapidly becoming more compelling as the systems improve. I think that in the relatively near future if you want an AI to help you with coding, rather than just getting completions you might say "Hey this function seems to be taking too long, could you figure out what's going on?" and the AI will e.g. do a bisection for you, set up a version of your code running in a test harness, ask you questions about desired behavior, and so on. I don't think "the system gets extra GPUs to minimize cross entropy loss" in particular is very plausible. (Could happen, but not a high enough probability to be worth worrying about.)

"AI developers should not develop and deploy systems with a significant risk of killing everyone."

If you were looking at GPT-4, what criteria would you use to evaluate whether it had a significant risk of killing everyone? 

3Paul Christiano
I'd test whether it has the capability to do so if it were trying (as ARC is starting to do). Then I'd think about potential reasons that our training procedure would lead it to try to take over (e.g. as described here or here or in old writing). If models might be capable enough to take over if they tried, and if the training procedures plausibly lead to takeover attempts, I'd probably flip the burden of proof ask developers to explain what evidence leads them to be confident that these systems won't try to take over (what actual experiments did they perform to characterize model behavior, and how well would it detect the plausible concerns?) and then evaluate that evidence. For GPT-4 I think the main step is just that it's not capable enough to do so. If it were capable enough, I think the next step would be measuring whether it ever engages in abrupt shifts to novel reward-hacking behaviors---if it robustly doesn't then we can be more confident it won't take over (when combined with more mild "on-distribution" measurement), and if it does then we can start to ask what determines when that happens or what mitigations effectively address it.