All of CBiddulph's Comments + Replies

The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.

For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.

H... (read more)

2Abram Demski
Yeah, this is a good point, which doesn't seem addressed by any idea so far.

Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.

The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.

2Abram Demski
Yeah, you're right, I no longer think it's an interesting proposal.

The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.

This is a good point!

Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.

This is especially likely to hold under the following conditions:

  • The
... (read more)
3Abram Demski
Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction. We can also combine this with other proposals, such as paraphrasing.

If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.

...

So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.

What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."

One way of doing this is to train an... (read more)

3Abram Demski
Ah, yep, this makes sense to me.

I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.

There seems to be quite a lot of low-hanging fruit here! I'm optimistic tha... (read more)

2Abram Demski
Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).

It depends on what the alternative is. Here's a hierarchy:

1. Pretrained models with a bit of RLHF, and lots of scaffolding / bureaucracy / language model programming.
2. Pretrained models trained to do long chains of thought, but with some techniques in place (e.g. paraphraser, shoggoth-face, see some of my docs) that try to preserve faithfulness.
3. Pretrained models trained to do long chains of thought, but without those techniques, such that the CoT evolves into some sort of special hyperoptimized lingo
4. As above except with recurrent models / neuralese,... (read more)

Yep, or use method #2 on my list to make the paraphraser remove as much information as possible

Yeah, this is actually very similar to an earlier version of incremental steering, before I thought of making the "frozen planner" jump in and finish the reasoning process.

The problem is, even if we suppose that the Shoggoth won't be at all motivated to hide its reasoning from the Face (i.e. ignoring what @ryan_greenblatt brought up in his comment), the Shoggoth still might use steganography in order to make its thoughts more compact (or just because it's not particularly motivated to make everything human-understandable).

Even if we don't reward the planne... (read more)

2Daniel Kokotajlo
Oh yeah, there's another idea designed to avoid the compaction/steganography problem: Paraphrasers. Just take each short block of text from the Shoggoth and pass it through a frozen purpose-built paraphraser model that generates 10 paraphrases of the text and then randomly picks one of them and appends it to the CoT instead of the original block the Shoggoth proposed. So the Shoggoth never actually sees its own text, it only sees paraphrases of its own text. This should at east partially destroy the incentive for it to learn to compress things or think in euphemisms.

This post seems interesting and promising, thanks for writing it!

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.

I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that are both slightly perturbed in the same random way. Then, neither instance of the model would ever have a predictable advantage over the other.

For instance, in your movie recommendation example, let's say the mod... (read more)

2Rubi Hudson
Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.

I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI - not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the ro... (read more)

3Mark Xu
The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.