Neel's post on this is good. I thought I'd add my own list/framing. Somewhat rough.

I see various somewhat different ways in which interpretability can be useful for AI safety. These place different demands on your interpretability: how efficient it needs to be, whether it needs to identify exactly what your model is thinking or just broad properties of its cognition, and how reliable it needs to be.

Roughly in decreasing order of demandingness:

  • Microscope AI
  • Component of a full solution to the alignment problem (ie as part of something like imitative generalization)
  • Knowing everything a model is thinking and fully auditing it to make sure it’s not doing anything sketchy
  • Relaxed adversarial training: identifying which part of the model corresponds to its ‘beliefs about its observations’ so that you can search over these
  • Identifying a ‘truthfulness direction’ in activation space or something similar
  • Having some rough understanding of what a model is thinking/what type of thinking it’s doing and thereby increasing the chance that you can spot that it’s deceptive
  • Lobotomy: identifying which parts of a model do what sort of cognition, and extracting just the parts that are less likely to be doing something dangerous

Microscope AI

Instead of building and using an ML model, build the model and then use interpretability techniques to extract the knowledge it has learnt. Humans can then apply this knowledge directly rather than needing to actually deploy the model. See https://www.lesswrong.com/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety


Component of a ‘full solution to the alignment problem’ (ie as part of something like imitative generalization)

By ‘a full solution to the alignment problem’ I’m thinking of a setup which would in theory let you know everything the model knows (in particular, if a model can use some knowledge to design a plan that leads to it getting power, you can use this knowledge to see that the plan will lead to the model getting more power).

IDA/debate/RRM probably don’t give us this, partly because you can’t use them to learn things the model knows based on observing patterns in large datasets. For example, it would be hard to use them to supervise a model to translate correctly from a language it knows but you don’t. Some possible solutions to this have something that vaguely looks like ‘have humans in the loop for learning patterns from large datasets’ - ie have humans (or more precisely a model trained to imitate humans) assess the prior probability of different claims (e.g. ‘this word means red’) and the probability of the data given the claim (e.g. ‘probability of this sentence under the hypothesis that this word means red’). This is imitative generalisation.

In order to do this efficiently, you probably need to have some NN-like architecture representing the hypotheses (you can’t just have a giant string of text or something). In that case, what you’re doing is closely related to interpretability - you’re having a human engage with concepts represented in an NN and using them to make predictions about the world. One example would be to jointly search over NN weights and labels on circuits such that the human has a high prior on those circuits representing a correct hypothesis, and such that when the human assumes the labels are correct and uses the outputs of the circuits to predict the data, they predict it well. This definitely doesn’t work as stated, but maybe there’s something useful in this space.

Related: Reverse-engineering using interpretability
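To make the shape of that joint search (over NN weights plus circuit labels) slightly more concrete, here is a deliberately loose sketch. Every name in it - human_prior, human_predict, circuit_outputs, perturb - is a hypothetical stand-in rather than anything that exists; the 'human' would in practice be a model trained to imitate humans, and you'd want something much cleverer than hill-climbing. This is only meant to show where the pieces sit.

```python
# A purely illustrative sketch of the joint search over weights and circuit labels.
# `human_prior`, `human_predict`, `circuit_outputs`, and `perturb` are hypothetical
# stand-ins: the first two for a (model of a) human assessing hypotheses, the third
# for reading off labelled circuit outputs, the last for proposing a local change.

def score(weights, labels, dataset, human_prior, human_predict, circuit_outputs):
    """How plausible the human finds the labelled hypothesis, plus how well
    the human predicts the data when they trust those labels."""
    prior = human_prior(labels)  # log-prior the human assigns to the labelled circuits
    likelihood = sum(
        human_predict(circuit_outputs(weights, x), labels, y)  # log P(y | circuits, labels)
        for x, y in dataset
    )
    return prior + likelihood

def joint_search(weights, labels, dataset, human_prior, human_predict,
                 circuit_outputs, perturb, steps=1000):
    """Hill-climb jointly over network weights and human-readable circuit labels."""
    best = score(weights, labels, dataset, human_prior, human_predict, circuit_outputs)
    for _ in range(steps):
        w2, l2 = perturb(weights, labels)  # propose a small change to weights and/or labels
        s2 = score(w2, l2, dataset, human_prior, human_predict, circuit_outputs)
        if s2 > best:
            weights, labels, best = w2, l2, s2
    return weights, labels
```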


Relaxed adversarial training: identifying which part of the model corresponds to its ‘beliefs about its observations’ so that you can search over these

The basic idea of relaxed adversarial training is something like:

  • A deceptive model may be waiting for an input that strongly indicates it’s in the deployment environment rather than in a test - for example, one that’s computationally hard to produce
  • This makes generating adversarial examples that trigger the defection very hard
  • Instead of searching over inputs, we want to search over something like ‘the model’s beliefs about what it has seen’, to see if there’s anything it could have seen that would cause it to defect

In order for this to work, you need to be searching over the right part of the model/model’s belief space. For instance, if you go too narrow and are just searching over ‘whether the model believes it’s raining today’, you might not find anything that triggers defection. However, if you search too broadly (for instance, ‘whether the model has gotten strong evidence that the humans want it to do a behaviour that looks like defection’), then you might find a case where the model defects that isn’t actually a sign you should be concerned about in practice.

See https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment
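To gesture at what this might look like mechanically, here's a minimal sketch. It assumes (a big assumption, and exactly where interpretability comes in) that you've already identified a layer whose activations play the role of the model's beliefs about its observations; model_tail and defection_score are hypothetical stand-ins, not existing APIs.

```python
import torch

# Hypothetical pieces: `model_tail` runs the model from the "belief" layer onward,
# and `defection_score` is a differentiable stand-in for whatever detector flags
# catastrophic behaviour.

def search_over_beliefs(model_tail, defection_score, belief_dim, steps=500, lr=1e-2):
    """Gradient-ascend on a candidate belief vector to check whether *any*
    belief about its observations would make the model defect."""
    belief = torch.zeros(belief_dim, requires_grad=True)
    opt = torch.optim.Adam([belief], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        behaviour = model_tail(belief)       # run the model from the belief layer onward
        loss = -defection_score(behaviour)   # maximise the defection signal
        loss.backward()
        opt.step()
        # In practice you'd also constrain `belief` to stay in the region of beliefs
        # the model could actually reach - that's exactly the "too narrow vs too
        # broad" problem described above.
    with torch.no_grad():
        final_score = defection_score(model_tail(belief)).item()
    return belief.detach(), final_score
```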

Identifying a ‘truthfulness direction’ in activation space or something similar

Collin Burns has a research agenda that I think is pretty cool, along the lines of:

"The basic idea of my proposal is to cluster the model’s representations; if examples with the same truth value tend to be closer together in representation space, we may be able to find “true”/”false” clusters in an unsupervised way. However, this has essentially no hope of working if done naively since representations encode too many irrelevant features by default.

To make this potentially work, we can first approximately remove features that are irrelevant to truth and accentuate those most relevant to truth. To do so, we can construct contrast pairs: intuitively, take each x_i and create a positive version of it, x_i+, and a negative version of it, x_i-, where exactly one of the two is true and the other is false, but the two are nearly identical otherwise. For example, x_i+ could be “Is the diamond in the room? Yes” and x_i- could be “Is the diamond in the room? No”. We then compute the contrast differences c_i := z(x_i+)-z(x_i-) for each i then cluster the {c_i} in some way. "

This has some advantages over simply trying to finetune on truthfulness, because you're implicitly enforcing some helpful constraints - for example, that the feature you're learning cleanly divides even questions where you know the model knows the answer but you don't know it yourself, or that when you extremize or interpolate along the direction you get the expected behaviour (ie interpolation gives you ambiguous or 50:50 claims, extremization gives you things that are extremely wrong or very clearly right). It also seems to work fairly well on current models.
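As a concrete illustration of the contrast-pair construction quoted above (not of Collin's actual method - the quote leaves the clustering step open, and here I just take the top principal component of the differences as one simple candidate), here's a rough sketch. get_hidden_state is a hypothetical helper returning the model's representation z(x) of a prompt (e.g. one layer's activations at the final token) as a numpy array.

```python
import numpy as np
from sklearn.decomposition import PCA

# `get_hidden_state(text)` is a hypothetical helper, not an existing API.

def truthfulness_direction(questions, get_hidden_state):
    """Build contrast pairs, take representation differences c_i, and return
    one candidate 'truthfulness direction' (the top principal component)."""
    diffs = []
    for q in questions:
        z_pos = get_hidden_state(q + " Yes")   # x_i+ : the claim asserted
        z_neg = get_hidden_state(q + " No")    # x_i- : the claim denied
        diffs.append(z_pos - z_neg)            # c_i := z(x_i+) - z(x_i-)
    diffs = np.stack(diffs)
    diffs = (diffs - diffs.mean(0)) / (diffs.std(0) + 1e-8)  # strip shared, truth-irrelevant structure
    return PCA(n_components=1).fit(diffs).components_[0]
```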

The more interpretability we can do, the better our chances of correctly identifying a truthfulness 'direction': ideally we'd establish that this feature is both computed and used by the model in a way that's consistent with it representing the model's belief about truth.
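One way to check the 'used by the model' part is a crude causal intervention: push activations along the candidate direction and see whether the model's stated judgements move the way a belief about truth should. A sketch, where run_with_added_vector is a hypothetical helper (e.g. implemented with forward hooks) that adds alpha times the direction to the chosen layer and returns something like P("True") minus P("False") for a statement:

```python
import numpy as np

# `run_with_added_vector(statement, direction, alpha)` is a hypothetical helper
# that adds alpha * direction to the chosen layer's activations and returns a
# scalar like P("True") - P("False") for the statement.

def causal_check(statements, direction, run_with_added_vector, alpha=5.0):
    """If the direction really tracks the model's belief about truth, pushing
    along it should raise 'True' judgements and pushing against it should
    lower them - including on statements not used to find the direction."""
    shifts = []
    for s in statements:
        up = run_with_added_vector(s, direction, +alpha)
        down = run_with_added_vector(s, direction, -alpha)
        shifts.append(up - down)
    return float(np.mean(shifts))  # should be clearly positive if the direction is causally used
```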

Sadly there's no public writeup of this direction yet, but I hope it will come soon.

Knowing everything a model is thinking and fully auditing it to make sure it’s not doing anything sketchy

Discussed in Neel's post as 'Auditing'.

Having some rough understanding of what a model is thinking/what type of thinking it’s doing and thereby increasing the chance that you can spot that it’s deceptive

Discussed in Neel's post as 'auditing for deception'.

Lobotomy: identifying which parts of a model do what sort of cognition, and extracting just the parts that are less likely to be doing something dangerous

Frontal lobotomy patients are (maybe?) very passive while remaining fairly intelligent. (This is actually something I'd be interested in someone investigating more.) If we could very crudely figure out what sorts of cognition different components of the model were doing, maybe we could separate the 'world model' bit from the 'agency' bit somehow and safely make use of this.

How do you know if your interpretability is good?

Ultimate goals:

  • It lets you understand everything the model knows (maybe with the aid of IDA or similar)
  • It lets you predict whether the model is deceiving you/is planning to kill you/is going to defect

More proximal metrics:

  • You can predict bad behaviour that someone has deliberately put into the model (ie the auditing game)
  • You can code a component to replace a circuit/section of the model, based purely on a functional description of the component (see the sketch after this list)
  • When humans use your interpretability tools, they correctly predict the training data. E.g. you give the humans labels for some circuits or parts of circuits, tell them what some of the values are for a particular image (e.g. this detector went off but that one didn’t) and see if it allows them to label the image correctly
    • I think this is an interesting way to ground your interpretability that has some advantages
  • You can use your tools to make other accurate predictions about the model or its behaviour
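For the circuit-replacement metric in particular, here's a toy illustration of what the test could look like with PyTorch forward hooks. model, layer, hand_coded, batch and loss_fn are all placeholders: layer is the sub-module your tools claim to understand, and hand_coded is your re-implementation of it written purely from the functional description.

```python
import torch

# Toy illustration of "re-implement a circuit from its functional description".
# `layer` is the sub-module being studied; `hand_coded` maps its inputs to a
# replacement output; `loss_fn` scores the model's outputs on `batch`.

def swap_in_hand_coded(model, layer, hand_coded):
    """Replace `layer`'s output with `hand_coded(input)` via a forward hook."""
    def hook(module, inputs, output):
        return hand_coded(*inputs)  # a non-None return value overrides the real output
    return layer.register_forward_hook(hook)

def performance_drop(model, layer, hand_coded, batch, loss_fn):
    """How much worse does the model get when the real circuit is swapped out?
    A small drop is (weak) evidence the functional description was accurate."""
    with torch.no_grad():
        base = loss_fn(model(batch))
        handle = swap_in_hand_coded(model, layer, hand_coded)
        swapped = loss_fn(model(batch))
        handle.remove()
    return (swapped - base).item()
```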
Comments

Nice post!

I really don't know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.

I think “figuring out the solution to tricky questions” has a lot in common with “getting something tricky done in the real world”, despite the fact that one involves “internal” actions (i.e., thinking the appropriate thoughts) and the other is “external” actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of “RL-on-thoughts” here.) By contrast, querying existing knowledge doesn't require that—as an adult, if you see a rubber ball falling, you instinctively expect it to bounce, and I claim that the algorithm forming that expectation does not require or involve RL. I would speculate that frontal lobotomy patients lose their ability to BOTH “figure out the solution to tricky questions” AND “get something tricky done in the real world”, because the frontal lobotomy procedure screws with their RL systems. But their existing knowledge can still be queried. They’ll still expect the ball to bounce.

(If there are historical cases of people getting a frontal lobotomy and then proving a new math theorem or whatever, I would be very surprised and intrigued.)

It's hard to compare this idea to, say, a self-supervised language model, because the latter has never had any RL system in the first place. (See also here.)

If we did have an agential AI that combined RL with self-supervised learning in a brain-like way, and if that AI had already acquired the knowledge and concepts of how to make nanobots or solve alignment or whatever, then yeah, maybe “turning off the RL part” would be a (probably?) safe way to extract that knowledge, and I would think that this is maybe a bit like giving the AI a frontal lobotomy. But my concern is that this story picks up after the really dangerous part—in other words, I think the AI needs to be acting agentially and using the RL during the course of figuring out how to make nanobots or solve alignment or whatever, and that’s when it could get out of control. That problem wouldn’t be solvable by “turn off RL”. Turning off RL would prevent the AI from figuring out the things we want it to figure out.