Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.
The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible
Yeah, this is actually very similar to an earlier version of incremental steering, before I thought of making the "frozen planner" jump in and finish the reasoning process.
The problem is, even if we suppose that the Shoggoth won't be at all motivated to hide its reasoning from the Face (i.e. ignoring what @ryan_greenblatt brought up in his comment), the Shoggoth still might use steganography in order to make its thoughts more compact (or just because it's not particularly motivated to make everything human-understandable).
Even if we don't reward the planner directly for compactness, the reality is that we will have to choose some cutoff point during training so its thoughts don't run literally forever, which gives it some reason to be succinct.
This post seems interesting and promising, thanks for writing it!
The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.
I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that are both slightly perturbed in the same random way. Then, neither instance of the model would ever have a predictable advantage over the other.
For instance, in your movie recommendation example, let's say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
Another very similar solution would be to randomly perturb the internal activations of each neural network during training.
Does this seem right?
I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI - not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the robber is "in kahoots" with the AI in this situation (i.e. the AI projects a video onto the wall instructing the robber to put up a screen)? This seems a bit strange and complicated.
Instead, we might consider the situation in which the AI controls a SmartFabricator, which we want to arrange carbon atoms into diamonds. We might then imagine that it instead fabricates a screen to put in front of the camera, or makes a fake diamond. This wouldn't require the existence of an external "robber" agent. Does the SmartVault scenario have helpful aspects that the SmartFabricator example lacks?
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.
In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".
Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."
At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.