kave

Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements I can't leak much metadata about (like who they are with). I have made no non-disparagement agreements.

Comments

You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.

evaluating AI outputs during training, labeling neurons in the context of mechanistic interpretability, monitoring AI chains of thought for reward-hacking behaviors, identifying which transcripts in an experiment contain alignment-faking behaviors, classifying problematic inputs and outputs for the purpose of preventing jailbreaks

I think "labeling neurons" isn't option control. Detecting alignment-faking also seems marginal; maybe it's more basic science than option control.

I think mech interp is proving to be pretty difficult, in a similar way to human neuroscience. My guess is that even if we can characterise the low-level behaviour of all neurons and small circuits, we'll be really stuck with trying to figure out how the AI minds work, and even more stuck trying to turn that knowledge into safe mind design, and even more even more stuck trying to turn that knowledge differentially into safe mind design vs capable mind design.

Will we be able to get AIs to help us with this higher-level task as well: putting all the data and experiments together and coming up with a theory that explains how the minds behave? I think they probably can, roughly to the extent that they could do the same for human neuroscience. And my weak guess is that, if there's a substantial sweet spot, they will be able to do the same for human neuroscience.

But I'm not sure how well we'll be able to tell that they have given us a correct theory? They will produce some theory of how the brain or a machine mind works, and I don't know (genuinely don't know) whether we will be able to tell if it's a subtly wrong theory. It does seem pretty hard to produce a small theory that makes a bunch of correct empirical predictions but has some (intentional or unintentional) error that is a vector for loss of control. So maybe reality will come in clutch with some extra option control at the critical time.

Your taxonomies of the space of worries and orientations to this question are really good, and I think they capture my concerns above well. But I wanted to spell out my specific concerns, because things will succeed or fail for specific reasons.

I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots.

To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help them in the short term. (The bit I mean to point to by "diagonalising" is the bit where you think about what you expect they'll do, and which mistakes you think they'll make, and plan around that).

I expect moderately sized warning shots to increase the chances humanity as a whole takes serious actions and, for example, steps up efforts to align the frontier labs.

It seems naïvely evil to knowingly let the world walk into a medium-sized catastrophe. To be clear, I think that sometimes it is probably evil to stop the world from walking into a catastrophe, if you think that increases the risk of bad things like extinctions. But I think the prior of not diagonalising against others (and of not giving yourself rope with which to trick yourself) is strong.

I think you train Claude 3.7 to imitate the paraphrased scratchpad, but I'm a little unsure because you say "distill". Just checking that Claude 3.7 still produces CoT (in the style of the paraphrase) after training, rather than being trained to perform the paraphrased-CoT reasoning in one step?

I think TLW's criticism is important, and I don't think your responses are sufficient. I also think the original example is confusing; I've met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.

Here is my attempt to expand your argument.

We're trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we've given up on that (for example, radio shielding might not be practical or reliable enough). In designing for that secrecy, there are a couple of sources of failure.

One source of failure is failing to model the interactions between the components of our systems. Maybe there is an output we don't know about (like the vibrations the electronics make while operating), or maybe there is an interaction we're not aware of (like magnetic coupling between two components we're treating as independent).

Another source of failure is failing to consider all the ways that an adversary could exploit the interactions we do know about. In your example, we fail to consider how an adversary could exploit higher-order correlations between the emitted radio waves and the state of the electronic internals.

A true name, in principle, allows us to avoid the second kind of failure. In high-dimensional state spaces, we might need to get kind of clever to prove the lack of mutual information. But it's a fairly delimited analytic problem, and we at least know what a good answer would look like.
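To make that concrete, here is a toy sketch (my own construction, not something from the post) of what checking mutual information can catch that a naive correlation check misses. The setup is made up for illustration: two secret bits whose XOR leaks into an emission.

```python
# Toy sketch (mine, not from the post): mutual information as the "true name"
# of information leakage. The emission has ~zero linear correlation with each
# secret bit, but still leaks one full bit about the secret.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical internal state: two secret bits.
s1 = rng.integers(0, 2, n)
s2 = rng.integers(0, 2, n)
secret = 2 * s1 + s2

# Hypothetical emission: the XOR of the bits (a higher-order dependence).
emission = s1 ^ s2

# A naive check: linear correlation with each secret bit looks clean.
print("corr with s1:", np.corrcoef(s1, emission)[0, 1])  # ~0
print("corr with s2:", np.corrcoef(s2, emission)[0, 1])  # ~0

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for small discrete variables."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# ~1 bit: the emission leaks the parity of the secret.
print("I(secret; emission):", mutual_information(secret, emission))
```

The per-bit correlations come out near zero, but I(secret; emission) is about one bit, which is exactly the kind of higher-order leak mentioned above. Proving the mutual information is zero (rather than estimating it on samples) is the harder, but delimited, analytic problem.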

The true name could also guide our investigations into our system, to help us avoid the first kind of failure. "Huh, we just made the adder have a more complicated behaviour as an optimisation. Could the unevenness of that optimisation over the input distribution leak information about the adder's inputs to another part of the system?"

Now, reader, you might worry that the chosen example of a True Name leaves an implementation gap wide enough for a human adversary to drive an exploit through. And I think that's a pretty good complaint. The best defence I can muster is that it guides and organises the defender's thinking. You get to do proofs-given-assumptions, and you get more clarity about how to think if your assumptions are wrong.

To the extent that the idea is that True Names are part of a strategy to come up with approaches that are unbounded-optimisation-proof, I think that defence doesn't work and the strategy is kind of sunk. 

On the other hand, here is an argument that I find plausible. In the end, we've got to make some argument that when we flick some switch or continue down some road, things will be OK. And there's a big messy space of considerations to navigate to that end. True Names are necessary to have any hope of compressing the domain enough that you can make arguments that stand up.

I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, of cheese vectors, and of the slightly weird lie detection paper, and it came just after Contrast-consistent search.

This is a pretty exciting idea, because if it’s easy to find the human concepts we want (or don’t want) networks to possess, then we can maybe use that to increase the chance that systems are honest, kind, and loving (and we can ask them questions like “are you deceiving me?” and get useful answers).

I don’t think the idea is now definitively refuted or anything, but I do think a particular kind of lazy version of the idea, more popular in the Zeitgeist, perhaps, than amongst actual proponents, has fallen out of favour.

CCS seemed to imply an additional proposition, which is that you can get even more precise identification of human concepts by encoding some properties of the concept you’re looking for into the loss function. I was kind of excited about this, because things in this realm are pretty powerful tools for specifying what you care about (like, it rhymes with axiom-based definition or property-based testing).
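For concreteness, the loss-function idea looks roughly like this (my paraphrase from memory of the CCS setup; the probe parameterisation and training details are simplified): reward a probe that assigns complementary probabilities to a statement and its negation, and penalise probes that sit on the fence.

```python
# Rough sketch of the CCS-style loss (my paraphrase; details simplified).
# It encodes two properties of "truth": a statement and its negation should
# get complementary probabilities (consistency), and the probe should not
# hedge at 0.5 on both (confidence).
import numpy as np

def probe(theta, b, activations):
    """Linear probe mapping hidden activations to a 'probability of true'."""
    return 1.0 / (1.0 + np.exp(-(activations @ theta + b)))

def ccs_loss(theta, b, acts_pos, acts_neg):
    """acts_pos / acts_neg: activations for 'X is true' / 'X is false' prompts."""
    p_pos = probe(theta, b, acts_pos)
    p_neg = probe(theta, b, acts_neg)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```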

But actually, if you look at the numbers they report, that’s not really true! As this post points out, basically all their performance is recoverable by doing PCA on contrast pairs.[1]
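And the baseline that recovers that performance is, as I understand it, roughly the following (my reconstruction; the exact normalisation may differ from the post and paper): no truth-specific loss at all, just the top principal component of the differences between paired activations.

```python
# My reconstruction of the "PCA on contrast pairs" baseline (normalisation
# details may differ): no consistency/confidence loss, just the top principal
# component of the paired activation differences.
import numpy as np

def contrast_pair_pca_direction(acts_pos, acts_neg):
    """Return a candidate 'truth direction' from paired activations."""
    # Normalise each class separately so the direction isn't dominated by the
    # trivial "prompt ends in 'true' vs 'false'" difference.
    pos = (acts_pos - acts_pos.mean(axis=0)) / (acts_pos.std(axis=0) + 1e-8)
    neg = (acts_neg - acts_neg.mean(axis=0)) / (acts_neg.std(axis=0) + 1e-8)
    diffs = pos - neg
    # Top right singular vector = top principal component of the differences.
    _, _, vt = np.linalg.svd(diffs - diffs.mean(axis=0), full_matrices=False)
    return vt[0]
```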

I like how focused and concise this post is, while still being reasonably complete.

There’s another important line of criticism of CCS, which is about whether its “truth-like vector” is at all likely to track truth, rather than just something like “what a human would believe”. I think posts like What Discovering Latent Knowledge Did and Did Not Find address this somewhat more directly than this one.

But I think, for me, the loss function had some mystique. Most of my hope was that encoding properties of truth into the loss function would help us find robust measures of what a model thought was true. So I think this post was the main one that made me less excited about CCS, and that made me take a somewhat more nuanced view of the linearity of human concept representations.

  1. ^

    Though I admit I’m a little confused about how to think about the fact that PCA happens to have pretty similar structure to the CCS loss. Maybe for features that have less confidence/consistency-shaped properties, shaping the loss function would be more important.

I don't think this distinction is robust enough to provide much of a defensive property. I think it's probably not that hard to think "I probably would have tried something in direction X, or direction Y", and then gather lots of bits about how well the clusters X and Y work.

I'm curious about Claude's coauthorship of this post. How did it help?

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

It seems like there's a general principle here, that it's hard to use pure empiricism to bound behaviour over large input and action spaces. You either need to design the behaviour, or understand it mechanistically.
