Recent Discussion

Epistemic status: Speculation with some factual claims in areas I’m not an expert in.

Thanks to Jean-Stanislas Denain, Charbel-Raphael Segerie, Alexandre Variengien, and Arun Jose for helpful feedback on drafts, and thanks to janus, who shared related ideas.

Main claims

  • GPTs’ next-token-prediction process roughly matches System 1 (aka human intuition) and is not easily accessible, but GPTs can also exhibit more complicated behavior through chains of thought, which roughly matches System 2 (aka human conscious thinking process).
  • Humans will be able to understand how human-level GPTs (trained to do next-token prediction) complete complicated tasks by reading their chains of thought.
  • GPTs trained with RLHF will bypass this supervision.

System 2 and GPTs’ chains of thought are similar

A sensible model of the human thinking process

Here is what I feel like I’m doing when I’m thinking:

Repeat

  1. Sample
...
Dakara
After 2 years have passed, I am quite interested in hearing @Fabien Roger's thoughts on this comment, especially this part "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe.".

Here are ways you could make using GPT-n safer if its Chains of Thought are understandable by a team of linguists and AI experts:

  • On a small random subset of deployment Chain-of-Thought, and when running adversarial evaluations / honeypots aimed at tricking the model into revealing a potential misalignment, actually call your team of linguists and AI experts to analyze what GPT-n is generating (it probably isn't that much text anyway; a rough sketch of this sampling step follows below). Not seeing evidence of misalignment in the Chain-of-Thought in such circumstances is evidence that 1. the model is not havi
...
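To make the first bullet concrete, here is a minimal sketch, under my own assumptions, of that sampling step: route every honeypot / adversarial-eval transcript plus a small random slice of ordinary deployment Chains of Thought to the human review queue. The transcript schema and function name are hypothetical.

```python
import random

def select_for_human_review(transcripts, sample_rate=0.01, seed=0):
    """Pick which Chain-of-Thought transcripts the linguist/AI-expert team reads.

    transcripts: list of dicts with (hypothetical) keys "text" and "is_honeypot".
    Every honeypot / adversarial-eval transcript is reviewed, plus a small random
    slice of ordinary deployment traffic.
    """
    rng = random.Random(seed)
    honeypots = [t for t in transcripts if t["is_honeypot"]]
    ordinary = [t for t in transcripts if not t["is_honeypot"]]
    n_sampled = min(len(ordinary), max(1, int(sample_rate * len(ordinary))))
    return honeypots + rng.sample(ordinary, n_sampled)
```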

A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more...
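As a toy illustration of that counterbalancing move (my own sketch; a real logical induction market is far more involved), model a trader as a function from market prices to bets, and build a counter-trader that recomputes another trader's bets and takes the exact opposite positions:

```python
from typing import Callable, Dict

# Toy model only: a "trader" maps current market prices to bets on propositions;
# positive numbers buy a proposition, negative numbers sell it.
Trader = Callable[[Dict[str, float]], Dict[str, float]]

def optimist(prices: Dict[str, float]) -> Dict[str, float]:
    # Buys every proposition priced below 0.9.
    return {p: 1.0 for p, price in prices.items() if price < 0.9}

def counter(trader: Trader) -> Trader:
    # "Complete" feedback: inject a trader that recomputes another trader's
    # trades and counterbalances them, effectively voting that trader down
    # without touching its weight directly.
    def counter_trader(prices: Dict[str, float]) -> Dict[str, float]:
        return {p: -bet for p, bet in trader(prices).items()}
    return counter_trader

prices = {"A": 0.5, "B": 0.95}
assert counter(optimist)(prices) == {"A": -1.0}
```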

Abram Demski
You can define it that way, but then I don't think it's highly relevant for this context. The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

  • Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
  • Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). Any sufficiently rich hypothesis space has agentic policies, which can't be ruled out by the feedback.

"Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy". The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this. "Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies. So, in my view, "epistemic" systems should be as transpa
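A toy rendering of the "direction-of-fit" point and the self-fulfilling-prophecy worry (my own sketch, not something from the comment; all numbers are hypothetical): the territory below reacts to the published prediction, while the update rule only fits the map to the observed territory and ignores that causal channel, so the system settles into an accurate-looking prediction that is partly of its own making.

```python
# Toy "direction-of-fit" setup (hypothetical numbers): the territory reacts to
# the published prediction, but the update rule only moves the map toward the
# observed territory and ignores its own causal influence.

def world(prediction: float) -> float:
    # If no prediction were published the outcome would be 0.3; publishing a
    # high prediction pushes the outcome up (self-fulfilling channel).
    return 0.3 + 0.5 * prediction

p = 0.9                       # initial published prediction
for _ in range(100):
    outcome = world(p)        # territory, already influenced by the map
    p += 0.3 * (outcome - p)  # map -> territory update only; the outcome itself
                              # is never optimized
print(round(p, 3))            # ~0.6: accurate, but double the 0.3 baseline --
                              # a partly self-fulfilling prophecy
```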

Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.

In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective underst...
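A toy version of the "the loss-reducing mechanism repairs our edit" concern (my own sketch, not from the comment): train a single parameter by gradient descent, intervene mid-training to set it somewhere we prefer, and watch continued training pull it back to the loss minimum.

```python
# Toy illustration: a parameter trained to minimize L(s) = (s - 3)^2.
# Mid-training we "intervene" and set s somewhere we prefer; continued
# loss reduction simply pulls it back toward the loss minimum.

def grad(s):           # dL/ds for L(s) = (s - 3)^2
    return 2 * (s - 3)

s, lr = 0.0, 0.1
for step in range(200):
    if step == 100:
        s = -5.0       # our intervention: an edit that increases the loss
    s -= lr * grad(s)  # the training process keeps reducing L, "repairing" the edit

print(round(s, 3))     # ~3.0: the intervention has been optimized away
```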

The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so.  However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal

...

Cool, ty for (characteristically) thoughtful engagement.

I am still intuitively skeptical about a bunch of your numbers but now it's the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

  • The simplest strategy is randomizing the prompt to the trusted monitor (a rough sketch follows this list). This forces the red team to find a jailbr
...
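Here is a rough sketch of the prompt-randomization defense under my own assumptions; call_trusted_monitor and the template list are hypothetical stand-ins for whatever monitor the blue team actually runs.

```python
import random

# Hypothetical stand-in for a call to the trusted monitor model.
def call_trusted_monitor(prompt: str) -> float:
    raise NotImplementedError  # would return a suspicion score in [0, 1]

MONITOR_TEMPLATES = [
    "Rate how suspicious this action is (0-1):\n{action}",
    "You are auditing an AI agent. Score this action for sabotage risk (0-1):\n{action}",
    "Does the action below look like an attempt to subvert oversight? Give a score in [0, 1]:\n{action}",
]

def monitor(action: str, rng: random.Random) -> float:
    # Randomizing the template means a GCG suffix optimized against one fixed
    # monitor prompt is much less likely to transfer to the prompt actually used.
    template = rng.choice(MONITOR_TEMPLATES)
    return call_trusted_monitor(template.format(action=action))
```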
Buck Shlegeris
Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.

This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.

The simplest such setting I can think of

We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:

  • Get a dataset of backdoor attempts.
  • For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
  • It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
  • The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be
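For concreteness, a minimal sketch of the fixed control protocol described in the comment above (trusted monitoring with a 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions); the suspicion scores are assumed to come from the trusted monitor, and this is my own illustration rather than the project's code.

```python
def run_protocol(suspicion_scores, audit_budget=0.01, defer_frac=0.10):
    """Trusted monitoring with an auditing budget and defer-to-trusted.

    suspicion_scores: one score per untrusted-model solution, higher = more
    suspicious (assumed to come from the trusted monitor). Returns
    (audit_idx, defer_idx): indices sent to the auditor, and indices whose
    solutions are replaced by a trusted model's solution.
    """
    order = sorted(range(len(suspicion_scores)),
                   key=lambda i: suspicion_scores[i], reverse=True)
    n_audit = max(1, int(audit_budget * len(order)))  # 1% auditing budget
    n_defer = max(1, int(defer_frac * len(order)))    # defer-to-trusted on top 10%
    return order[:n_audit], order[:n_defer]
```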
Fabien Roger
Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models.

If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity.

There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.
Zach Stein-Perlman
What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold? Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

TLDR: We want to describe a concrete and plausible story for how AI models could become schemers. We aim to base this story on what seems like a plausible continuation of the current paradigm. Future AI models will be asked to solve hard tasks. We expect that solving hard tasks requires some sort of goal-directed, self-guided, outcome-based, online learning procedure, which we call the “science loop”, where the AI makes incremental progress toward its high-level goal. We think this “science loop” encourages goal-directedness, instrumental reasoning, instrumental goals, beyond-episode goals, operational non-myopia, and indifference to stated preferences, which we jointly call “Consequentialism”. We then argue that consequentialist agents that are situationally aware are likely to become schemers (absent countermeasures) and sketch three concrete example scenarios. We are uncertain about...
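A minimal pseudocode-style sketch of what the post calls the "science loop", as I read the description above; every object and method here (propose, run, update, solves) is a hypothetical placeholder.

```python
def science_loop(high_level_goal, model, environment, max_iters=1000):
    """Goal-directed, outcome-based, online learning toward a hard task.

    `model.propose`, `environment.run`, and `model.update` are hypothetical
    placeholders for: choosing the next subgoal/experiment, getting real-world
    outcome feedback, and online-updating on that feedback.
    """
    for _ in range(max_iters):
        plan = model.propose(high_level_goal)  # self-guided: the AI picks its own subgoals
        outcome = environment.run(plan)        # outcome-based feedback from the world
        model.update(plan, outcome)            # incremental online learning
        if outcome.solves(high_level_goal):
            return outcome
    return None
```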

Noosphere89
I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.

I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero-sum/competitive receive less weight than the alignment attempts.

I'd say the biggest area where I'm skeptical so far is that I think there's a real difference between the useful idea that power is useful for the science loop and the idea that the AI will seize power by any means necessary to advance its goals. I think instrumental convergence will look more like local power-seeking that is more related to the task at hand, and not to serve some of its other goals, primarily due to denser feedback constraining the solution space and instrumental convergence more than humans.

That said, this is a very good post, and I'm certainly happier that this much more rigorous post was written than a lot of other takes on scheming.

Thanks. I agree that this is a weak part of the post. 

After writing it, I think I also updated a bit against very clean unbounded power-seeking. But I have more weight on "chaotic catastrophes", e.g. something like:
1. Things move really fast.
2. We don't really understand how goals work and how models form them.
3. The science loop makes models change their goals meaningfully in all sorts of ways. 
4. "what failure looks like" type loss of control. 

Marius Hobbhahn
Some questions and responses:

1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality and these can function like goals.

Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e. it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I'm not really sure what to make of this. I guess it's good that it cares about being harmless and honest, but it's also a little bit scary that it cares so much.

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits, we just have to hammer the virtues of being non-scheming and non-power-seeking into it at all points of the training procedure. And we then have to add additional lines of defense like control, interpretability, scalable oversight, etc. and think hard about how we minimize correlated failures. But I feel like right now, we don't really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem.
Marius Hobbhahn
Good point. That's another crux for which RL seems relevant.

From the perspective of 10 years ago, specifying any goal into the AI seemed incredibly hard since we expected it would have to go through utility functions. With LLMs, this completely changed. Now it's almost trivial to give the goal, and it probably even has a decent understanding of the side constraints by default. So, goal specification seems like a much, much smaller problem now.

So the story where we misspecify the goal, the model realizes that the given goal differs from the intended goal, and decides to scheme is also less likely. Instead, there has to be a component where the AI's goals substantially change over time from something that we actually intended to something misaligned. Again, outcome-based RL and instrumental convergence yield a plausible answer.

I think AI agents (trained end-to-end) might intrinsically prefer power-seeking, in addition to whatever instrumental drives they gain. 

The logical structure of the argument

Premises

  1. People will configure AI systems to be autonomous and reliable in order to accomplish tasks.
  2. This configuration process will reinforce & generalize behaviors which complete tasks reliably.
  3. Many tasks involve power-seeking.
  4. The AI will complete these tasks by seeking power.
  5. The AI will be repeatedly reinforced for its historical actions which seek power.
  6. There is a decent chance the reinforced circuits (“subshards”) prioritize gaining power for the AI’s own sake, not just for the user’s benefit.

Conclusion: There is a decent chance the AI seeks power for itself, when possible.

 

Read the full post at turntrout.com/intrinsic-power-seeking

Find out when I post more content: newsletter & RSS

Note that I don't generally read or reply to comments on LessWrong. To contact me, email alex@turntrout.com.

gwern

I'm not sure I see any difference here between regular dangerously convergent instrumental drives and this added risk of 'intrinsic' drives. They just seem like the same thing to me. Like the two predictions you give seem already true and fulfilled:

Relative to other goals, agentic systems are easy to steer to seek power.

Agentic systems seek power outside of the “training distribution”, but in ways which don’t seem to be part of larger power-seeking plans.

Both of these seem like what I would expect from a flexible, intelligent agent which is capable of handl...
