The following is a transcript of a public discussion between Charbel-Raphaël Segerie and Lucius Bushnaq that took place on 23 September during LessWrong Community Weekend 2023. I edited the transcript for clarity.

Transcript

Mateusz: Last month, Charbel published the post Against Almost Every Theory of Impact of Interpretability, which sparked a lot of discussion in the community. Lucius is an AI notkilleveryoneism researcher at Apollo, focused on interpretability. He disagrees with Charbel. So I brought them together to discuss. I suggest that Charbel starts by bringing up what he considers his strongest argument against interpretability, Lucius responds to that, and we'll see how it goes.
Reverse engineering neural networks and correct ontology for interpretability
Charbel: I think we can start with "auditing deception with interp is out of reach". So, Neel says, "Maybe we can look at random bits of the model and identify circuits or features." We need to steelman this argument a bit because it's clear that looking at random bits of the model won't be sufficient. Instead, we can try to enumerate each type of feature (enumerative safety). However, it seems to me that this kind of procedure is very poorly specified. We have already attempted similar approaches with vision models, but they didn't quite work [Charbel comment: I updated a bit after the Towards Monosemanticity paper]. Additionally, there are some theoretical arguments, for example that deception could be something very diffuse, so it's possible that there won't be any single part of the model that is deceptive.
Maybe we should try to study deceptive alignment with other techniques instead. For example, trying to monitor the proxies of deceptive alignment, or trying to find ex-ante strategies instead of ex-post strategies. Finally, at the end of the section, I wrote that if DeepMind announced today that they had found an instance of a deceptive model, they certainly wouldn't have found it using interpretability alone. I think I won some Bayes points because the Center for AI Safety recently published a paper where they have a methodology for monitoring Cicero's behavior, whether it's lying and so on. There's no interpretability in that paper.
Lucius: So, first off, my views on interpretability are maybe not exactly mainstream and representative of most interpretability researchers. My answers to this might be very different from what Neel Nanda would, for example, say. I would say that most current published interpretability is very, very bad and sucks at its job.
I do have an overall belief that making interpretability that does not suck is actually quite feasible, and that there's no particular reason to believe that it's going to be particularly difficult or take particularly long.
To me, interpretability is not a thing where you go into the model, and ask "Oh, what is this neuron doing?", because this is already assuming something which I don't think is true, namely, that the neurons are something like the elementary variables of the computation the network is carrying out. I don't think there's any evidence for that being correct, and there's quite a lot of evidence speaking to it probably not being correct.
The most abstract evidence I would point to is that neural networks generalize outside of their training data set, even if they have more parameters than training data points. And we have a whole bunch of very, very solid Bayesian math which says that if you fit a function like that to data, and the fit keeps holding outside the training data set, then the function must, in some sense, be simple. It must have a lower Kolmogorov complexity: a description length lower than the number of data points.
In other words, if you've got a neural network that has more weights in it than there were data points, and you fit that to the data, and it generalizes nevertheless, that is implying that, in a sense, the actual algorithm that is running inside this mess of weights has a description that is shorter than the list of weights, and, by extension, shorter than the network as described in the neuron-and-weight picture.
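[Editor's note: one textbook way to make this kind of claim precise is the Occam bound from learning theory, which is close in spirit to, though not identical with, the Bayesian argument Lucius is gesturing at here. Roughly: if a hypothesis h with a description length of |h| bits fits all N training points, then with probability at least 1 − δ its true error is bounded as follows:]

```latex
% Occam's razor bound (realizable case), stated roughly:
% for every hypothesis h that is consistent with all N training samples,
\mathrm{err}(h) \;\le\; \frac{|h|\ln 2 + \ln(1/\delta)}{N}
```

So a network that fits its training set and still generalizes must, loosely speaking, admit some description whose length in bits is much smaller than the number of training points, even if that description is not the list of weights.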
To me, this is strongly implying that there's a way to talk about what this network is doing, where the elementary units are not the neurons, but some other thing. And that other thing might be a whole lot simpler and more structured than the mess of weights, which is why it works. And I would say that the job of interpretability is to find that structure and understand how it works. Where "understand" here means not that I can tell some post hoc story about "oh, this part was activating when the model did this, oh, this is like a thing that kind of correlates with dogs", but "oh, this is the algorithm it is running here, which is maybe not an algorithm that anyone has previously seen". Once upon a time, humans did not know the fast Fourier transform algorithm, and then we came up with it. Similarly, I imagine that sometime in the future, people will look into LLMs and go, "Oh, here are some algorithms we didn't know about that help you do language processing". And once you know them, you can tell that you know them because you can now write an LLM program yourself in Python, no neural network training necessary, and the program will do a decent job.
So for starters, that's what I would want to get at by doing interpretability. A lot of the arguments you bring relate to the notion of alien concepts, the AI might not think like you do. I agree with that. I agree that I do not like how a lot of current interpretability seems to just assume that it's human concepts in there. That's the wrong way to go about it. I think you should look at the math of the network, and not project what you expect to find into the network. But it seems to me that this is a straightforwardly solvable thing. So currently, I agree. Currently, interpretability is basically of no practical use for anything. But that's very normal when you're trying to invent a new science or understand a new area. Back in the day, when people were first trying to invent quantum mechanics, what eventually matured into quantum mechanics was not particularly useful for anything.
Last thing, I don't expect that the way that we are currently trying to find deception in the models (looking at correlates and behavior from the outside and having no strong predictive mathematical grounding for any of it) is going to generalize to when the AIs start being a lot smarter. At the very least, it's not something I would want to rely on. I would rather be in a position where I know how the circuits in there work, I know what's going on. I know that if I now train this kind of model, there is just no way that I can get something that could evade my deception detection. And I feel like with the behavioral stuff we're doing right now, we're not going to get this. There could be weird thermodynamics in there as models scale up that we can't currently predict.
Charbel: So let me summarize your points. First, you are saying that maybe enumerative safety is not the right framing for using interpretability. Maybe we should apply it not at the neuron level, but at some kind of motif level, an assembly of neurons, or maybe not an assembly but some kind of high-level description that tells us how the neural network works. Second, you are comparing interpretability to quantum mechanics, to point out that maybe we should be more charitable to this new science instead of dismissing it outright. And the last point is about generalization. We would like to get some mathematical formalism, and some mathematical guarantees, to ensure that the model is going to behave correctly at scale in the future.
Lucius: That's not quite how I would put it but it's kind of going in a similar-ish direction, so okay.
Just one caveat, which I think is important. The description of the network's behavior I have in mind is not "high-level". It would be equivalent to the low-level behavior of the network while being simpler.
Charbel: Let's take the IOI (indirect object identification) paper, where they studied how the model knows whether to continue the sentence with "John" or "Mary". They tried to reverse engineer the circuits inside GPT-2, and they found a circuit made up of a bunch of attention heads. If you look at these heads, you can recover a lot of the model's performance on the task. Do you think this kind of mechanism (that you have in mind) would be like what they did in this paper? If not, what do you think is wrong with it?
Why am I bringing this up? Because it's an excellent interpretability paper, and the description of the circuit implementing such a simple algorithm is already huge and incomplete in the sense that it doesn't explain all the performance of the model and we are focusing on just one thing: is it John or Mary? We are not even asking how it knows that it should be the name of a person and not something else. I think that tells us something about the general feasibility of reverse engineering.
Lucius: Basically, almost no reverse engineering done to date has been done in a way where I feel safe saying "Yeah, they actually did it; this is actually the description of how the model does this". The one example I can think of right now where I would be willing to concede that they probably have it right is Neel Nanda's modular addition transformer. For all the others, what people are basically doing, as far as I can tell, is they're looking at specific heads, specific neurons in a model, kind of hoping to an extent that they can back out whether this neuron is a part of this circuit, that it's not a part of that other circuit, that this structure here sort of belongs to this behavior, and then try to understand how it does things.
I think the reason this seems so hard and ends up with such finicky descriptions, is that it's assuming a structure to the internals of the model that I just don't think is there. I don't think that you can delimit a circuit to certain neurons. Not in the sense of "Oh, you shouldn't be looking at low-level behavior, because low-level behavior is inherently too finicky", but in the sense of "This is just a bad way to carve up the model into chunks". You probably need quite different ways. There's just no reason to presuppose that a neuron corresponds to something like an elementary variable in the program this thing is running.
Charbel: So you're saying there should be a middle ground between the kind of description we have in the IOI paper and the opposite which would be totally alien concepts?
Lucius: That's not quite how I would put it. Maybe imagine you are given a record of the activations in a CPU that is running on this laptop right now. You don't know what the macOS operating system looks like, or how the other programs running on it work. You just see these transistor activations and you're asking somebody to reverse-engineer whatever program is running on the CPU. If you're doing this, I think it's not a good idea to go in asking "Okay, this transistor was in the "one" state right now… is this the Apple transistor? …is this a Safari-related transistor?" This is the wrong ontology.
To reverse engineer this, what you actually want to do is to understand that this thing has organization levels. There's some kind of mapping you can make from the activations of the CPU, back to something like program code. And this mapping is very, very similar for many kinds of CPUs running many different kinds of programs. You're kind of backing out from transistors into assembly code. Before that, it looked like an absolute mess, everything interacting with everything. But after that, because it's really a bunch of programs running in parallel, it's not as complicated anymore. There's the browser, the operating system, and whatever other apps you've got in the background, and they don't interact with each other that much. And each program is made up of many subroutines and functions that you can understand in isolation, without needing to refer to the other functions. This is the sort of environment in which I want to start talking about circuits, subroutines, and specific parts of the overall algorithm that do specific things.
How does this relate to neural networks? When you're looking at the neurons and at what makes them fire (which is what most interpretability does) you're doing something much closer to looking at when the transistors on your CPU are in the one state or the zero state. What you should actually be thinking about is, "Okay, what's a nice generalizable mathematical mapping I can make from that to a compressed description, an algorithm that does exactly what a neural network is doing but looks simpler?". I suspect, when you look at that structure, it will be a lot shorter, a lot more structured, and a lot more amenable to reductionism. You will be able to understand each part, one at a time. The reason I expect this is that this is how simple things, things that work outside of very narrow environments, tend to work. Simplicity is almost necessary for generality. I have a hard time seeing how these neural networks could be doing as well as they're doing if they are inherently unamenable to short structured descriptions like this.
Feasibility of reverse engineering vs. rewriting AI from scratch (1/2)
Charbel: I can give you a concrete example. Let's take AlphaZero. To interpret it, they took a bunch of concepts that seemed useful for playing chess, for example "can I take the queen?" or "how many pawns do I have?". They wanted to find out when AlphaZero learned these concepts, and they succeeded. Then if they wanted to go a little bit further and see if they could discover new strategies by reverse engineering the system, I think that's where this methodology fails. How would you be able to discover new concepts that are not already in the corpus of human knowledge? In the paper they were not able to discover any new concepts. So it's a pretty bad start. And you say that there's a real difference between IOI and grokking. I think grokking is an algorithmic task, whereas IOI is a natural language processing task, which makes it much harder. Noam Chomsky and other linguists have done a lot of work describing the structure of language, but those descriptions are always incomplete and not exhaustive. If you try to define an object in the real world, you'll get something like "a human is a featherless biped". Whatever you come up with, there will always be counterexamples. Do we really want to go down that road?
Even if this is not exactly what you mean… First, it seems to be beyond human capacity to accurately define fuzzy concepts. Second, this reverse engineering would be very infohazardous, at the very least, because when you can write the GPT-5 equivalent in C++, then you can use a compiler and optimize various modules, which would probably make foom much easier. And then it seems to me that it's harder to take a system and reverse engineer everything out of it than to make it from scratch. [Charbel comment: See the paper From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought for example. It proposes a way to convert GPT-3's fuzzy word model into a probabilistic programming language-style modelization of a world model].
I can give you an analogy. Imagine you are a software engineer who has just joined a new company. You are trying to understand a very large, very complex code base. There are some parts of it that you don't understand very well. Now it turns out that there's a bug in the code and you need to patch it. You have two ways to do it. Either you patch the buggy code (which you don't understand very well and you're not sure your understanding of it is correct), or you understand this module from first principles and rewrite it from scratch.
It seems to me that it's much easier to start again and write everything from scratch than to take a very opaque (though not completely opaque) system like GPT-4 and reverse engineer it into a C++ algorithm. I would rather use LLMs to help me iteratively refine the code that tries to do the same things they can do. [Charbel comment: During the debate, I had something in mind similar to the new Eureka paper (Eureka: Human-Level Reward Design via Coding Large Language Models). I think writing the specs explicitly like this could go a long way alongside Davidad's plan].
Lucius: Okay, so, a couple of points there. You brought up the point about info hazards, which I think I'll address last. But this is maybe the thing we can most agree on. I think a lot of your other points seem to play into the background for why we have differing intuitions of what's easier here: reverse engineering or writing from scratch. I think reverse engineering would probably be a lot easier. I do not think it would be above the human level. I don't expect it to be significantly harder than many other routine tasks and scientific problems that humans solved in the past.
You point out that (to date), even in tiny networks, like the modular addition transformer, the only things people have really been able to identify and understand in there, such that they go "oh, this kind of looks like it's doing something we're familiar with" are… well, things we're familiar with. We already know what to look for and that makes it a lot easier to find it in this mess.
You then correctly say that this is not going to work for a case where you don't even understand yet what the model is doing. For example, people think they have maybe, to some extent, kind of, identified some points in GPT-3, where it stores data. Maybe. People knew how to store data in a database in the year 2005. But they weren't able to make a language processor, a program that could talk. So certainly, there are some algorithms going on there in GPT-3 that we don't know how to write yet. Otherwise, we could have written GPT-3 ourselves; we wouldn't have needed SGD to do it for us.
And my sort of angle on this is: I think there is a way to do this reverse engineering, where you're not just poking in a mess of neurons and weights looking for things you're familiar with. I think there is a mapping you can make from these neurons and weights to a more regularized structure and that mapping does not require you to know what that structure is in advance. You just kind of look in there for how you can carve up this computation into parts that are, for example, not causally interacting with each other.
Charbel: Is this modularity?
Lucius: That's one thing you could call it… I'm deliberately being vague about infohazardous stuff, so I'm not going to say anything concrete about how you might actually do this.
Why do I expect this is possible at all? Why am I so convinced that this is a thing you can actually do? So you mentioned AlphaZero and its reverse engineering where we saw some things in there we recognized, but also, it's obviously running some heuristics that we don't know about yet and have not been able to identify. You said you don't think there's any particular reason these heuristics would be amenable to a short description. And that I think, is where I strongly disagree. Because these heuristics work. These chess programs do not just play chess well in the games they're trained on. They play chess well outside that training data.
Charbel: Yes, but in chess, for example, if you are in a late game, you only have a queen and a few pawns. People who are very good at chess usually memorize strategies for such situations. You can imagine that there is no short description for these endgames. What you call "heuristics" might just be something that has been memorized for the same reason that we memorize chess endgames and there is no short description.
Lucius: I expect there are some parts of the chess engine, and also GPT-3, that are like that, that just store data. You know: "Who was Tom Cruise's mother? Look it up… ah, that was in the training data somewhere!" But in a sense, those parts are actually relatively easy to understand for us, because that's just a database. In a sense, you can factor this out, you can go, you know, "this whole bunch of description is just like a hard disk where things are stored." Then you can talk about the rest of the algorithm and how it knows when to access what, which is the actually difficult part where we don't understand how GPT-3 is doing it so well. Or how the chess program plays well outside of its training distribution, outside of its memorized game situations.
Singular learning theory
Lucius: I do think that heuristics like that have to be in there. It's not just all stored data. Because if it was… To give you an example: GPT-3 has 175 billion parameters. If I fit a 175-billion-degree polynomial to internet data like we fit GPT-3 to predict the next token, that polynomial is not going to generalize outside of its training data. The moment I put in any data point that it hasn't seen yet, it's gonna do horribly. Well, why is it going to do horribly? Because that polynomial fit is basically just a lookup table. It is almost incompressible. The entire 175 billion parameters are used. That is the description length. And there is no short description to be found here, no question of "Oh, what are the underlying dynamics of this thing?". It is just these parameters, there is no short description. GPT-3 is not like that. GPT-3 generalizes. GPT-3, in a sense, thus, is a function fit, an algorithm, that is simple.
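[Editor's note: a toy sketch of the parameter-counting point, at the scale of twenty data points rather than 175 billion. A polynomial with as many parameters as training points can fit the data exactly but tends to behave like a lookup table away from the training points, while a much shorter description of the same data generalizes:]

```python
# Toy illustration: exact interpolation vs. a short description of the data.
import numpy as np

rng = np.random.default_rng(0)
n_train = 20
x_train = np.sort(rng.uniform(-1.0, 1.0, n_train))
y_train = np.sin(3 * x_train) + 0.05 * rng.normal(size=n_train)  # "training set"

x_test = np.linspace(-1.0, 1.0, 200)   # inputs the fit has not seen
y_test = np.sin(3 * x_test)

# (a) As many parameters as data points: fits the training set (almost) exactly,
#     but the fit is essentially a lookup table for those twenty points.
lookup_like = np.polynomial.Polynomial.fit(x_train, y_train, deg=n_train - 1)

# (b) Five parameters: a genuinely compressed description of the same data.
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)

for name, model in [("degree 19", lookup_like), ("degree 4 ", simple)]:
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    print(f"{name}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```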
If you've heard about singular learning theory… In classical learning theory, there's a thing called the Bayesian information criterion (BIC). Do you know that?

Charbel: Yeah, they use this a lot for comparing Bayesian models. If you have multiple models, you should use the one with the lowest BIC.
Lucius: Yes. So you could roughly, informally see the Bayesian information criterion as the thing that quantifies the trade-off between a better fit to the training data and overfitting (and thus failing to generalize out-of-distribution) as I give it more parameters.
Fun fact: neural networks do not obey the Bayesian information criterion. They do much better than the Bayesian information criterion would suggest, constantly. This has bugged learning theorists for quite a while. What's the reason they do better? At the very start of the learning theory math (that derives the Bayesian information criterion) there's an assumption that the parameter-function map of the function you're fitting is invertible. So if I have a 10-degree function (10 parameters), then for any single 10-dimensional vector (that represents one setting of the parameters) I get one unique input-output mapping. That is very not true in neural networks. There are many, many points in the parameter landscape that implement the same thing. It turns out that, if you do the learning theory math thing again without starting from that invertibility assumption and try to derive something like the Bayesian information criterion, you get a different formula that is very similar, but instead of the raw parameter count (e.g., GPT-3's 175 billion), you get a number that describes the dimensionality of the optimum in the loss landscape that the network you fitted ended up in.
Charbel: What do you mean by "the dimensionality of the optimum"?
Lucius: If I just have a single point of the loss landscape that's a perfect fit to my data, that's zero-dimensional. If I've got a line such that any point on the line is a perfect fit, that's one-dimensional. You can go to two-dimensional, three-dimensional, and so on. And what the sort of corrected Bayesian information criterion is telling you is: if your solution is on the line, this is as if you have one parameter less, complexity-wise. If it's on a plane, it's two parameters less, and so on for three parameters, four parameters, etc.
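[Editor's note: very roughly, and glossing over the regularity conditions, the comparison here is between the classical Bayesian information criterion and Watanabe's singular-learning-theory replacement for it, in which the raw parameter count k is swapped for a learning coefficient λ that shrinks as the set of loss-minimizing parameters becomes more degenerate:]

```latex
% Classical BIC: k = number of parameters, n = number of data points,
% \hat{L} = maximized likelihood.
\mathrm{BIC} = -2\ln\hat{L} + k\ln n

% Singular learning theory: the Bayesian free energy behaves roughly like
F_n \approx n L_n(\hat{w}) + \lambda \ln n
% where \lambda is the learning coefficient (real log canonical threshold).
% For a regular model \lambda = k/2, recovering the BIC; each flat direction
% of the optimum lowers this effective parameter count.
```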
What seems to be happening in neural networks is that the solutions training finds are massively less complex, with a dimensionality far lower than the network's raw parameter count. This means that the solutions they find are simple solutions, with simple descriptions. That is the fundamental reason why I think "Yeah, sure, there might be some lookup tables in the chess program, some databases in GPT-3, but there are also some algorithms that have simple descriptions that you can find." And the additional leap of faith I'm making is: not only is there a simple description in some sense, but there's also a simple description that looks something like an algorithm you can run. A different description of going step-by-step through the network, where you don't necessarily talk about the neurons and the weights.
Charbel: Thank you, it's probably one of the best arguments for the feasibility of interpretability I've ever heard.
Feasibility of reverse engineering vs. rewriting AI from scratch (2/2)
Charbel: I think I can buy this story to a reasonable extent, but I still don't feel like you've explained why you think reverse engineering would be easier than writing the thing from scratch, e.g. using Copilot or similar models that are already quite impressive. Sure, maybe there's a short description of what the network does, but there's no argument that this description is human understandable.
And some other things I want to mention. Even if you have this C++ version of GPT-4, where every function is annotated with docstrings, it seems to me that it's going to be much more useful for capabilities than for safety. For example, if you have a function that triggers when you see a gun, it can be used for both self-defense and attack. It's not easy to translate such a description into an assessment of how dangerous executing each function is.
Lucius: Maybe I'll start with why I'm skeptical of the other path: just code stuff from the ground up rather than reverse engineer. People have been trying to sort of make AI from scratch, and understand language from scratch (as you mentioned) for a while, but not really successfully. We've got an advantage now that we didn't have before: we have GPT-4 to help and soon GPT-5. But the things that GPT-4 and Copilot are currently good at and can carry out are pretty much all stuff that a human programmer could do too. So they're not doing anything we wouldn't be smart enough to do as well. They're just kind of carrying out what they saw in the training data and they can do this at scale and fast. But if I ask them to write me an algorithm that does as well as GPT-3 at processing language, they don't know how to do that any more than we do.
I expect that if you keep making better and smarter AIs, eventually you will indeed get AIs that are smarter than us. And that can write GPT-3 from scratch, with no training necessary. They will invent algorithms that let you do this. But that is exactly around the intelligence level where I'm very scared of these AIs because they're smarter than me and we haven't solved the alignment problem yet. How do you deal with that? You could kind of hope that we'll muddle through and that we can sort of make them kind of smarter than us but still have enough safeguards that they can't kill us immediately.
But that just seems to me like a very, very precarious setup. For starters, you probably need to correctly estimate how smart they are. If you train them too little, they're too dumb to do the job. If you train them too much, they will evade all your safeguards and kill you. And you don't necessarily have the best measurements of how smart they are right now. Because you haven't done a bunch of interpretability work to understand things like "what circuits are in there?", "When do they form and how fast?", "Can I have some scaling curve predictions of capability that are cross-architecture and that have a bit more theory backing them such that we can be sure that the scaling curve does not go weird somewhere?".
If these AIs are quite smart, smarter than you (though maybe not superintelligent yet), and trying to deceive you, they might also try such sophisticated strategies as pretending to be dumber than they are for a while until you make them smarter. Looking at this, I just do not feel very confident at all going this route of "have the AIs help us make smarter, more aligned AIs". This just seems like a chicken and egg problem. I do not feel like this ends well.
Infosec
Lucius: First off, I agree that there are definite infosec concerns around this, and that it is definitely dual-use. The ability I described, to translate these models into an algorithm you can actually inspect, that's just a whole lot of foundational knowledge of how neural networks actually operate. You're a lot less confused now and usually, if you are less confused about a topic in science, that deconfusion is very multi-use.
By default, I would expect that at least some of what you learn will be usable for capabilities. When you figure out quantum mechanics, you can use that to make a laser, you can use that to make an atomic bomb; and before you figured quantum mechanics out, you wouldn't know about either of these possibilities.
So why do I think that, to an extent, the gamble is worth it anyway? Because I don't think there are nice shortcuts to alignment. I think to make any alignment strategy that's actually gonna work for something that's a lot smarter than you, you need to understand what you're doing. In the same way that when people build a laser, they understand what they're doing. They have the theory of material science and quantum mechanics underlying this. They can be pretty sure about how things will turn out. And that is why they can pull things off like making giant novel billion-dollar apparatuses that have to work on the first try, actually work on the first try. They successfully do that, all the time. I'd say they can do that because they have very solid theories and they are not as confused about things as we are about AI. By default, without strong interpretability, and an understanding of neural network science, I don't see us getting things right on the first try.
Lucius on interpretability's theory of impact
This is also getting into what I think the safety story for interpretability is. What I want to get out of this is not that I expect to see in there the "being bad algorithm", or the "lying-to-me algorithm". That's not how I expect this to work. I don't think there's necessarily any kind of "lying component" in there. Maybe there are some social heuristic components for how you are good at social interaction with humans, and how you predict them well. One thing you can use all of these parts for is lying, if that's the thing that you happen to want to do right now.
In the end, we want to make these models have desires and goals that match our desires and goals. But we have no idea what in these models corresponds in any way to a desire. What's the mathematical type signature of that, even? And if I think about research agendas, to get at that, to get at questions like "what is the type signature of a goal?", "how do you edit goals?", "how do you write goals that do not result in everyone dying?", "how do you write 'make me a copy of a strawberry and don't destroy the world when you do it'?", all of that, to me, seems to require an empirical feedback loop. You want an actual system that you can look at, to test your theories and hypotheses and measure things. And a great way to do this, I think, would be to look at the insides of AIs. Their circuits, the programs they're running. Giving agent foundations research an actual empirical, fast feedback loop that they can iterate on.
For example, at some point, we may understand how this fuzzy abstraction-based reasoning works, that humans seem to engage in all the time, and that I expect GPT-4 probably also engages in all the time. Once we understand how that works, the way that we understand how a fast Fourier transform works, maybe it starts being a whole lot less mysterious how you write "Make me a copy of a strawberry and don't destroy the world" in C.
Then, I would be looking at things like: if I have, say, AutoGPT-5 and I tell it to make me a T-shirt and it goes out and tries to make a T-shirt business, then it maybe seems well described as an agent right now, an agent that wants there to be lots of T-shirts. Now I can ask what's the mathematics of how that maps back onto the initial prompt and the scaffolding and the circuitry inside the model that make it have that goal rather than a different goal. And then see if you can describe that map. See if you can predict ahead of time that if you make these kinds of circuits and this kind of setup, then you get an AI that wants this thing. That's sort of the road I would want to travel. I do acknowledge, however, that the infosec concerns are there.
Wrap-up
Charbel: We didn't converge. I still have a lot of things to bring to the table. First of all, you said that it's not safe to use LLMs to do big things that humans can't do. Yes, that applies to trying to code the model from scratch, but I think it also applies to reverse engineering GPT-4, and to OpenAI's plan.
Regarding your idea about giving the agent foundations program an empirical grounding, and finding goals inside the model, it seems to me that it's fairly plausible that models won't have a clear single goal. I'm not saying deceptive alignment is unlikely. I think it's fairly likely, but the system may be deceptive even if it doesn't have any clear-cut goal that you could map to "deceive a human" [Charbel comment: See Deep Deceptiveness, even if I agree this is sort of a second-order problem].
Also, it seems to me that there are easier ways to make understandable, bounded systems. Something like task decomposition, which Conjecture uses for cognitive emulation: trying to decompose and monitor the reasoning, bound each little system, and understand the capability of each system, like a mini-bureaucracy. Or what Ought is doing with factored cognition. If you assume The Translucent Thoughts Hypothesis (i.e., that the models will mostly reason in English and won't be able to do complex thinking in one forward pass), it seems to me fairly plausible that even slightly superhuman systems could still be monitored using those strategies. To me, this seems more feasible than reverse engineering and a bit more neglected.
I've been involved in a lot of field-building, and I've seen a lot of junior wanna-be researchers; generally, when you don't have an idea of what to work on, you just might collapse into interpretability without even considering other strategies. I'm not saying interpretability is a bad strategy. But we already tried it a lot and it seems quite difficult. You're saying that most interpretability currently is not good, but also that good interpretability research exists and just isn't public. So it's hard for me to believe this state of affairs 🤷.
Lucius: (I have a long response to this, but we only have an hour.)
The following is a transcript of a public discussion between Charbel-Raphaël Segerie and Lucius Bushnaq that took place on 23 September during LessWrong Community Weekend 2023. I edited the transcript for clarity.
Transcript
Mateusz: Last month, Charbel published the post Against Almost Every Theory of Impact of Interpretability, which sparked a lot of discussion in the community. Lucius is an AI notkilleveryoneism researcher at Apollo, focused on interpretability. He disagrees with Charbel. So I brought them together to discuss. I suggest that Charbel starts by bringing up what he considers his strongest argument against interpretability, Lucius responds to that, and we'll see how it goes.
Reverse engineering neural networks and correct ontology for interpretability
Charbel: I think we can start with "
auditing deception with interp is out of reach
". So, Neel says, "Maybe we can look at random bits of the model and identify circuits or features." We need to steelman this argument a bit because it's clear that looking at random bits of the model won't be sufficient. Instead, we can try to enumerate each type of feature (enumerative safety). However, it seems to me that this kind of procedure is very poorly specified. We have already attempted similar approaches with vision models, but they didn't quite work [Charbel comment: I updated a bit after the Towards Monosemanticity paper]. Additionally, there are some theoretical arguments, likedeception could be something very diffuse
and it's possible that there won't be any part of the model that is deceptive.Maybe we should try to study deceptive alignment with other techniques instead. For example, trying to monitor the proxies of deceptive alignment, or trying to find ex-ante strategies instead of ex-post strategies. Finally, at the end of the section, I wrote that if DeepMind announced today that they had found an instance of a deceptive model, they certainly wouldn't have found it using interpretability alone. I think I won some Bayes points because the Center for AI Safety recently published a paper where they have a methodology for monitoring Cicero's behavior, whether it's lying and so on. There's no interpretability in that paper.
Lucius: So, first off, my views on interpretability are maybe not exactly mainstream and representative of most interpretability researchers. My answers to this might be very different from what Neel Nanda would, for example, say. I would say that most of current published interpretability is very, very bad and sucks at its job.
I do have an overall belief that making interpretability that does not suck, is actually quite feasible, and that there's no particular reason to believe that it's going to be particularly difficult or take particularly long.
To me, interpretability is not a thing where you go into the model, and ask "Oh, what is this neuron doing?", because this is already assuming something which I don't think is true, namely, that the neurons are something like the elementary variables of the computation the network is carrying out. I don't think there's any evidence for that being correct, and there's quite a lot of evidence speaking to it probably not being correct.
The most abstract evidence I would point to is that neural networks generalize outside of their training data set, even if they have more parameters than training data points. And we have a whole bunch of very, very solid Bayesian math which says that if you fit a function like that to data, and the fit keeps holding outside the training data set, then the function must, in some sense, be simple. It must have a lower Kolmogorov complexity; a lower description length than the number of data points.
In other words, if you've got a neural network that has more weights in it than there were data points, and you fit that to the data, and it generalizes nevertheless, that is implying that, in a sense, the actual algorithm that is running inside this mess of weights has a description that is shorter than the weights and, by extension, the network described in the neuron and weight picture.
To me, this is strongly implying that there's a way to talk about what this network is doing, where the elementary units are not the neurons, but some other thing. And that other thing might be a whole lot simpler and more structured than the mess of weights, which is why it works. And I would say that the job of interpretability is to find that structure and understand how it works. Where "understand" here means not that I can tell some post hoc story about "oh, this part was activating when the model did this, oh, this is like a thing that kind of correlates with dogs", but "oh, this is the algorithm it is running here, which is maybe not an algorithm that anyone has previously seen". Once upon a time, humans did not know the fast Fourier transform algorithm, and then we came up with it. Similarly, I imagine that sometime in the future, people will look into LLMs and go, "Oh, here are some algorithms we didn't know about that help you do language processing". And once you know them, you can tell that you know them because you can now write an LLM program yourself in Python, no neural network training necessary, and the program will do a decent job.
So for starters, that's what I would want to get at by doing interpretability. A lot of the arguments you bring relate to the notion of alien concepts, the AI might not think like you do. I agree with that. I agree that I do not like how a lot of current interpretability seems to just assume that it's human concepts in there. That's the wrong way to go about it. I think you should look at the math of the network, and not project what you expect to find into the network. But it seems to me that this is a straightforwardly solvable thing. So currently, I agree. Currently, interpretability is basically of no practical use for anything. But that's very normal when you're trying to invent a new science or understand a new area. Back in the day, when people were first trying to invent quantum mechanics, what eventually matured into quantum mechanics was not particularly useful for anything.
Last thing, I don't expect that the way that we are currently trying to find deception in the models (looking at correlates and behavior from the outside and having no strong predictive mathematical grounding for any of it) is going to generalize to when the AIs starts being a lot smarter. At the very least, it's not something I would want to rely on. I would rather be in a position where I know how the circuits in there work, I know what's going on. I know that if I now train this kind of model, there is just no way that I can get something that could evade my deception detection. And I feel like with the behavioral stuff we're doing right now, we're not going to get this. There could be weird thermodynamics in there as models scale up that we can't currently predict.
Charbel: So let me summarize your points. First, you are saying that maybe enumerative safety is not the right framing for using interpretability. Maybe we should apply it not at the neuron level, but at some kind of motif level, an assembly of neurons, or maybe not an assembly but some kind of high-level description that tells us how the neural network works. Second, you are comparing interpretability to quantum mechanics, to point out that maybe we should be more charitable to this new science instead of dismissing it outright. And the last point is about generalization. We would like to get some mathematical formalism, and some mathematical guarantees, to ensure that the model is going to behave correctly at scale in the future.
Lucius: That's not quite how I would put it but it's kind of going in a similar-ish direction, so okay.
Just one caveat, which I think is important. The description of the network's behavior I have in mind is not "high-level". It would be equivalent to the low-level behavior of the network while being simpler.
Charbel: Let's take the IOI (indirect object identification) paper, where they studied how the model knows whether to continue the sentence with
John
orMary
? They tried to reverse engineer the circuits inside GPT-2, and they found a circuit made up of a bunch of attention heads. If you look at these heads, you can recover a lot of the model's performance on the task. Do you think this kind of mechanism (that you have in mind) would be like what they did in this paper? If not, what do you think is wrong with it?this paper? If not, what do you think is wrong about it?
Why am I bringing this up? Because it's an excellent interpretability paper, and the description of the circuit implementing such a simple algorithm is already huge and incomplete in the sense that it doesn't explain all the performance of the model and we are focusing on just one thing: is it
John
orMary
? We are not even asking how it knows that it should be the name of a person and not something else. I think that tells us something about the general feasibility of reverse engineering.Lucius: Basically, almost no reverse engineering done to date has been done in a way where I feel safe saying "Yeah, they actually did it; this is actually the description of how the model does this". The one example I can think of right now where I would be willing to concede that they probably have it right is Neel Nanda's modular addition transformer. For all the others, what people are basically doing, as far as I can tell, is they're looking at specific heads, specific neurons in a model, kind of hoping to an extent that they can back out whether this neuron is a part of this circuit, that it's not a part of that other circuit, that this structure here sort of belongs to this behavior, and then try to understand how it does things.
I think the reason this seems so hard and ends up with such finicky descriptions, is that it's assuming a structure to the internals of the model that I just don't think is there. I don't think that you can delimit a circuit to certain neurons. Not in the sense of "Oh, you shouldn't be looking at low-level behavior, because low-level behavior is inherently too finicky", but in the sense of "This is just a bad way to carve up the model into chunks". You probably need quite different ways. There's just no reason to presuppose that a neuron corresponds to something like an elementary variable in the program this thing is running.
Charbel: So you're saying there should be a middle ground between the kind of description we have in the IOI paper and the opposite which would be totally alien concepts?
Lucius: That's not quite how I would put it. Maybe imagine you are given a record of the activations in a CPU that is running on this laptop right now. You don't know what the macOS operating system looks like, or how the other programs running on it work. You just see these transistor activations and you're asking somebody to reverse-engineer whatever program is running on the CPU. If you're doing this, I think it's not a good idea to go in asking "Okay, this transistor was in the "one" state right now… is this the Apple transistor? …is this a Safari-related transistor?" This is the wrong ontology.
To reverse engineer this, what you actually want to do is to understand that this thing has organization levels. There's some kind of mapping you can make from the activations of the CPU, back to something like program code. And this mapping is very, very similar for many kinds of CPUs running many different kinds of programs. You're kind of backing out from transistors into assembly code. Before that, it looked like an absolute mess, everything interacting with everything. But after that, because it's really a bunch of programs running in parallel, it's not as complicated anymore. There's the browser, the operating system, and whatever other apps you've got in the background, and they don't interact with each other that much. And each program is made up of many subroutines and functions that you can understand in isolation, without needing to refer to the other functions. This is the sort of environment in which I want to start talking about circuits, subroutines, and specific parts of the overall algorithm that do specific things.
How does this relate to neural networks? When you're looking at the neurons and at what makes them fire (which is what most interpretability does) you're doing something much closer to looking at when the transistors on your CPU are in the one state or the zero state. What you should actually be thinking about is, "Okay, what's a nice generalizable mathematical mapping I can make from that to a compressed description, an algorithm that does exactly what a neural network is doing but looks simpler?". I suspect, when you look at that structure, it will be a lot shorter, a lot more structured, and a lot more amenable to reductionism. You will be able to understand each part, one at a time. The reason I expect this is that this is how simple things, things that work outside of very narrow environments, tend to work. Simplicity is almost necessary for generality. I have a hard time seeing how these neural networks could be doing as well as they're doing if they are inherently unamenable to short structured descriptions like this.
Feasibility of reverse engineering vs. rewriting AI from scratch (1/2)
Charbel: I can give you a concrete example. Let's take AlphaZero. To interpret it, they took a bunch of concepts that seemed useful for playing chess. For example, can I take the queen or how many pawns do I have. They wanted to find out when AlphaZero learned these concepts, and they succeeded. Then if they wanted to go a little bit further and see if they could discover new strategies by reverse engineering the system, I think that's where this methodology fails. How would you be able to discover new concepts that are not already in the corpus of human knowledge? In the paper they were not able to discover any new concepts. So it's a pretty bad start. And in you say that there's a real difference between IOI and grokking. I think grokking is an algorithmic task, whereas IOI is a natural language processing task, which makes it much harder. Noam Chomsky and other linguists have done a lot of work describing the structure of language, but those descriptions are always incomplete and not exhaustive. If you try to define an object in the real world, you'll get something like "a human is a featherless biped". Whatever you come up with, there will always be counterexamples. Do we really want to go down that road?
Even if this is not exactly what you mean… First, it seems to be beyond human capacity to accurately define fuzzy concepts. Second, this reverse engineering would be very infohazardous, at the very least, because when you can write the GPT-5 equivalent in C++, then you can use a compiler and optimize various modules, which would probably make foom much easier. And then it seems to me that it's harder to take a system and reverse engineer everything out of it than to make it from scratch. [Charbel comment: See the paper From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought for example. It proposes a way to convert GPT-3's fuzzy word model into a probabilistic programming language-style modelization of a world model].
I can give you an analogy. Imagine you are a software engineer who has just joined a new company. You are trying to understand a very large, very complex code base. There are some parts of it that you don't understand very well. Now it turns out that there's a bug in the code and you need to patch it. You have two ways to do it. Either you patch the buggy code (which you don't understand very well and you're not sure your understanding of it is correct), or you understand this module from first principles and rewriting it from scratch.
It seems to me that it's much easier to start again and write everything from scratch than to take a very opaque (though not completely opaque) system like GPT-4 and reverse engineer it into a C++ algorithm. I would rather use LLMs to help me iteratively refine the code that tries to do the same things they can do. [Charbel comment: During the debate, I has something in mind similar to the new Eureka paper (Eureka: Human-Level Reward Design via Coding Large Language Models). I think writing the specs explicitly like this could go a long way alongside Davidad's plan].
Lucius: Okay, so, a couple of points there. You brought up the point about info hazards, which I think I'll address last. But this is maybe the thing we can most agree on. I think a lot of your other points seem to play into the background for why we have differing intuitions of what's easier here: reverse engineering or writing from scratch. I think reverse engineering would probably be a lot easier. I do not think it would be above the human level. I don't expect it to be significantly harder than many other routine tasks and scientific problems that humans solved in the past.
You point out that (to date), even in tiny networks, like the modular adition transformer, the only things people have really been able to identify and understand in there, such that they go "oh, this kind of looks like it's doing something we're familiar with" are… well, things we're familiar with. We already know what to look for and that makes it a lot easier to find it in this mess.
You then correctly say, that this is not going to work for a case where you don't even understand yet what the model is doing. For example, people think they have maybe, to some extent, kind of, identified some points in GPT-3, where it stores data. Maybe. People knew how to store data in a database in the year 2005. But they weren't able to make a language processor, a program that could talk. So certainly, there are some algorithms going on there in GPT-3 that we don't know how to write yet. Otherwise, we could have written GPT-3 ourselves, we wouldn't have needed SGD to do it for us.
And my sort of angle on this is: I think there is a way to do this reverse engineering, where you're not just poking in a mess of neurons and weights looking for things you're familiar with. I think there is a mapping you can make from these neurons and weights to a more regularized structure and that mapping does not require you to know what that structure is in advance. You just kind of look in there for how can you carve up this computation into parts that are, for example, not causally interacting with each other.
Charbel: Is this modularity?
Lucius: This is one way you can call it… I'm deliberately being vague about infohazardous stuff, so I'm not going to say anything concrete about how you might actually do this.
Why do I expect this is possible at all? Why am I so convinced that this is a thing you can actually do? So you mentioned AlphaZero and its reverse engineering where we saw some things in there we recognized, but also, it's obviously running some heuristics that we don't know about yet and have not been able to identify. You said you don't think there's any particular reason these heuristics would be amenable to a short description. And that I think, is where I strongly disagree. Because these heuristics work. These chess programs do not just play chess well in the games they're trained on. They play chess well outside that training data.
Charbel: Yes, but in chess, for example, if you are in a late game, you only have a queen and a few pawns. People who are very good at chess usually memorize strategies for such situations. You can imagine that there is no short description for these endgames. What you call "heuristics" might just be something that has been memorized for the same reason that we memorize chess endgames and there is no short description.
Lucius: I expect there are some parts of the chess engine, and also GPT-3, that are like that, that just store data. You know: "who was Tom Cruise's Mother, look up…, ah!, that was in the training data somewhere!" But in a sense, those parts are actually relatively easy to understand for us, because that's just a database. In a sense, you can factor this out, you can go, you know, "this whole bunch of description is just like a hard disk where things are stored." Then you can talk about the rest of the algorithm and how it knows when to access what, which is the actually difficult part where we don't understand how GPT-3 is doing it so well. Or how the chess program plays well outside of its training distribution, outside of its memorized game situations.
Singular learning theory
Lucius: I do think that heuristics like that have to be in there. It's not just all stored data. Because if it was… To give you an example: GPT-3 has 175 billion parameters. If I fit a 175 billion degree polynomial to internet data like we fit GPT-3 to predict the next token, that polynomial is not going to generalize outside of its training data. The moment I put in any data point that it hasn't seen yet, it's gonna do horribly. Well, why is it going to do horribly? Because that polynomial fit is basically just a lookup table. It is almost uncompressible. The entire 175 billion parameters are used. That is the description length. And there is no short description to be found here, no question of "Oh, what are the underlying dynamics of this thing?". It is just these parameters, there is no short description. GPT-3 is not like that. GPT-3 generalizes. GPT-3, in a sense, thus, is a function fit, an algorithm, that is simple.
If you've heard about singular learning theory… In classical learning theory, there's a thing called the Bayesian information criterion (BIC). Do you know that?
Charbel: Yeah, they use this a lot for comparing Bayesian models. If you have multiple models, you should use the one with the lowest BIC.
Lucius: Yes. So you could roughly, informally see the Bayesian information criterion as the thing that quantifies the trade-off between a better fit to the training data and overfitting (and thus failing to generalize out-of-distribution) as I give it more parameters.
Fun fact: neural networks do not obey the Bayesian information criterion. They consistently do much better than the Bayesian information criterion would suggest. This has bugged learning theorists for quite a while. What's the reason they do better? At the very start of the learning theory math that derives the Bayesian information criterion, there's an assumption that the parameter-function map of the model you're fitting is invertible. So if I have a model with 10 parameters, then every distinct 10-dimensional vector (one setting of the parameters) gives me a distinct input-output mapping. That is very much not true in neural networks: there are many, many points in the parameter landscape that implement the same thing. It turns out that if you redo the learning theory math without that invertibility assumption and try to derive something like the Bayesian information criterion, you get a formula that is very similar, but instead of the raw parameter count (e.g., GPT-3's 175 billion), you get a number that describes the dimensionality of the optimum in the loss landscape that the network you fitted ended up in.
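[Note: a rough sketch of the corrected criterion being described here, in the notation of singular learning theory (Watanabe). This is schematic; the precise statement carries technical conditions.]

```latex
% Regular models (invertible parameter-function map): the Bayesian
% "free energy" of n data points grows like
F_n \;\approx\; n L_n(\hat{w}) \;+\; \tfrac{k}{2}\log n
% (k = parameter count, L_n(\hat{w}) = per-sample loss of the best fit),
% which matches the BIC above up to an overall factor of 2.
%
% Singular models such as neural networks, where many parameter settings
% implement the same function: k/2 is replaced by a learning coefficient
% \lambda that depends on how degenerate the optimum is,
F_n \;\approx\; n L_n(\hat{w}) \;+\; \lambda \log n, \qquad \lambda \le \tfrac{k}{2}
% so the effective complexity can be far smaller than the raw parameter count.
```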
Charbel: What do you mean by "the dimensionality of the optimum"?
Lucius: If I just have a single point in the loss landscape that's a perfect fit to my data, that's zero-dimensional. If I've got a line such that any point on the line is a perfect fit, that's one-dimensional. You can go to two-dimensional, three-dimensional, and so on. And what the corrected Bayesian information criterion tells you is: if your solution is on a line, that's as if you have one parameter fewer, complexity-wise. If it's on a plane, two parameters fewer; three, four, and so on.
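[Note: a toy illustration of "dimensionality of the optimum", added for concreteness; the loss function here is an arbitrary two-parameter example.]

```latex
% Two-parameter toy loss:
L(a, b) \;=\; (a b)^2
% The set of zero-loss points is \{a = 0\} \cup \{b = 0\}: two lines,
% i.e. a one-dimensional family of optima rather than a single point.
% By the dimension-counting picture above, a solution sitting on one of
% these lines behaves, complexity-wise, as if the model had
% 2 - 1 = 1 effective parameter instead of 2.
```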
What seems to be happening in neural networks is that the solutions training finds are massively less complex than the raw parameter count would suggest: the optima are far lower-dimensional than the full parameter space. This means the solutions they find are simple solutions, with simple descriptions. That is the fundamental reason why I think, "Yeah, sure, there might be some lookup tables in the chess program, some databases in GPT-3, but there are also algorithms with simple descriptions that you can find." And the additional leap of faith I'm making is that not only is there a simple description in some sense, but there is also a simple description that looks something like an algorithm you can run: a different, step-by-step description of what the network is doing, where you don't necessarily talk about the neurons and the weights.
Charbel: Thank you, it's probably one of the best arguments for the feasibility of interpretability I've ever heard.
Feasibility of reverse engineering vs. rewriting AI from scratch (2/2)
Charbel: I think I can buy this story to a reasonable extent, but I still don't feel like you've explained why you think reverse engineering would be easier than writing the thing from scratch, e.g. using Copilot or similar models that are already quite impressive. Sure, maybe there's a short description of what the network does, but there's no argument that this description is human-understandable.
And some other things I want to mention. Even if you have this C++ version of GPT-4, where every function is annotated with docstrings, it seems to me that it's going to be much more useful for capabilities than for safety. For example, if you have a function that triggers when you see a gun, it can be used both for self-defense and for attack. It's not easy to translate a description of what each function does into a judgment about how dangerous it is to execute.
Lucius: Maybe I'll start with why I'm skeptical of the other path: just code the thing from the ground up rather than reverse engineer it. People have been trying to make AI from scratch, and to understand language from scratch (as you mentioned), for a while, without much success. We do have an advantage now that we didn't have before: we have GPT-4 to help, and soon GPT-5. But the things that GPT-4 and Copilot are currently good at and can carry out are pretty much all things a human programmer could do too. They're not doing anything we wouldn't be smart enough to do as well; they're just carrying out what they saw in the training data, at scale and fast. But if I ask them to write me an algorithm that does as well as GPT-3 at processing language, they don't know how to do that any more than we do.
I expect that if you keep making better and smarter AIs, eventually you will indeed get AIs that are smarter than us, AIs that can write GPT-3 from scratch, with no training necessary; they will invent algorithms that let you do this. But that is exactly around the intelligence level where I'm very scared of these AIs, because they're smarter than me and we haven't solved the alignment problem yet. How do you deal with that? You could hope that we'll muddle through, that we can make them somewhat smarter than us but still have enough safeguards in place that they can't kill us immediately.
But that just seems to me like a very, very precarious setup. For starters, you probably need to correctly estimate how smart they are. If you train them too little, they're too dumb to do the job. If you train them too much, they will evade all your safeguards and kill you. And you don't necessarily have the best measurements of how smart they are right now, because you haven't done a bunch of interpretability work to understand things like: what circuits are in there? When do they form, and how fast? Can I have scaling-curve predictions of capability that are cross-architecture and have a bit more theory backing them, such that we can be sure the scaling curve doesn't go weird somewhere?
If these AIs are quite smart, smarter than you (though maybe not superintelligent yet), and trying to deceive you, they might also try such sophisticated strategies as pretending to be dumber than they are for a while until you make them smarter. Looking at this, I just do not feel very confident at all going this route of "have the AIs help us make smarter, more aligned AIs". This just seems like a chicken and egg problem. I do not feel like this ends well.
Infosec
Lucius: First off, I agree that there are definite infosec concerns around this, and a definite dual-use aspect to it. The ability I described, to translate these models into an algorithm you can actually inspect, rests on a whole lot of foundational knowledge of how neural networks actually operate. You're a lot less confused then, and usually, when you become less confused about a topic in science, that deconfusion is very multi-use.
By default, I would expect that at least some of what you learn will be usable for capabilities. When you figure out quantum mechanics, you can use that to make a laser, you can use that to make an atomic bomb; and before you figured quantum mechanics out, you wouldn't know about either of these possibilities.
So why do I think that, to an extent, the gamble is worth it anyway? Because I don't think there are nice shortcuts to alignment. I think that to make any alignment strategy that's actually going to work for something that's a lot smarter than you, you need to understand what you're doing. In the same way that when people build a laser, they understand what they're doing: they have the theory of materials science and quantum mechanics underlying it, and they can be pretty sure about how things will turn out. That is why they can pull off building giant, novel, billion-dollar apparatuses that have to work on the first try, and have them actually work on the first try. They successfully do that all the time. I'd say they can do that because they have very solid theories and are not as confused about things as we are about AI. By default, without strong interpretability and an understanding of neural network science, I don't see us getting things right on the first try.
Lucius on interpretability's theory of impact
This is also getting into what I think the safety story for interpretability is. What I want to get out of this is not that I expect to see in there the "being bad" algorithm, or the "lying to me" algorithm. That's not how I expect this to work. I don't think there's necessarily any kind of "lying component" in there. Maybe there are some social-heuristic components for being good at social interaction with humans and for predicting them well. One thing you can use all of those parts for is lying, if that's the thing you happen to want to do right now.
In the end, we want to make these models have desires and goals that match our desires and goals. But we have no idea what in these models corresponds in any way to a desire. What's the mathematical type signature of that, even? And if I think about research agendas to get at that, at questions like "what is the type signature of a goal?", "how do you edit goals?", "how do you write goals that do not result in everyone dying?", "how do you write 'make me a copy of a strawberry and don't destroy the world when you do it'?", all of that seems to me to require an empirical feedback loop. You want an actual system that you can look at, to test your theories and hypotheses and measure things. And a great way to do this, I think, would be to look at the insides of AIs: their circuits, the programs they're running. Giving agent foundations research an actual empirical, fast feedback loop that it can iterate on.
For example, at some point we may understand how this fuzzy, abstraction-based reasoning works, the kind that humans seem to engage in all the time and that I expect GPT-4 probably also engages in all the time. Once we understand how that works, the way we understand how a fast Fourier transform works, maybe it starts being a whole lot less mysterious how you write "Make me a copy of a strawberry and don't destroy the world" in C. Because you know what abstractions actually correspond to in terms of code.
Then I would be looking at things like: if I have, say, AutoGPT-5, and I tell it to make me a T-shirt, and it goes out and tries to build a T-shirt business, then maybe it seems well described as an agent right now, an agent that wants there to be lots of T-shirts. Now I can ask: what is the mathematics of how that maps back onto the initial prompt, the scaffolding, and the circuitry inside the model that make it have that goal rather than a different goal? Then see if you can describe that map. See if you can predict ahead of time that if you build these kinds of circuits and this kind of setup, you get an AI that wants this particular thing. That's the road I would want to travel. I do acknowledge, however, that the infosec concerns are there.
Wrap-up
Charbel: We didn't converge. I still have a lot of things to bring to the table. First of all, you said that it's not safe to use LLMs to do big things that humans can't do. But that doesn't only apply to trying to code the model from scratch; it applies just as much to reverse engineering GPT-4, and to OpenAI's plan.
Regarding your idea about giving the agent foundations program an empirical grounding, and finding goals inside the model, it seems to me fairly plausible that models won't have a clear single goal. I'm not saying deceptive alignment is unlikely; I think it's fairly likely. But the system may be deceptive even if it doesn't have any clear-cut goal that you could map to "deceive the human" [Charbel comment: See Deep Deceptiveness, even if I agree this is sort of a second-order problem].
Also, it seems to me that there are easier ways to make understandable, bounded systems. Something like task decomposition, which Conjecture uses for cognitive emulation: decomposing and monitoring the reasoning, bounding each little system, and understanding the capabilities of each system, like a mini-bureaucracy. Or what Ought is doing with factored cognition. If you assume the Translucent Thoughts hypothesis (i.e., that models will mostly reason in English and won't be able to do complex thinking in one forward pass), it seems to me fairly plausible that even slightly superhuman systems could still be monitored using those strategies. To me, this seems more feasible than reverse engineering, and a bit more neglected.
I've been involved in a lot of field-building, and I've seen a lot of junior would-be researchers. Generally, when you don't have an idea of what to work on, you might just default to interpretability without even considering other strategies. I'm not saying interpretability is a bad strategy. But we have already tried it a lot, and it seems quite difficult. You're saying that most current interpretability is not good, but also that good interpretability research exists and just isn't public. So it's hard for me to believe this state of affairs 🤷.
Lucius: (I have a long response to this, but we only have an hour.)