Review
I'm pessimistic about the chances of this happening automatically, or requiring no expensive tradeoffs, because training networks to have good performance will disrupt the human interpretability of intermediate representations, even if we think those representations look like text. But I do think it's interesting, and that attempts to make auditing via interpretable intermediate representations work (and pay the costs) aren't automatically doomed.
Epistemic status: Uncertain about the validity of the claims I’m making here, and looking for feedback about the research directions I’m suggesting.
Thanks to Marius Hobbhahn, Johannes Treutlein, Siméon Campos, and Jean-Stanislas Denain for helpful feedback on drafts.
Here is a set of hypotheses:
I call these the Translucent Thoughts hypotheses.
I think the Translucent Thoughts hypotheses are likely (around 20% conditioning on AGI before 2030) because:
If these hypotheses are true, it should lead us to prioritize underexplored research directions, such as circumventing steganography or building extremely reliable text-supervision methods. I think those deserve attention, because Translucent Thoughts AIs are not safe by default.
In this post, I argue that we may live in a world where the first AGIs will look like X, and I then describe ways to make the first AGIs safer given X. This is different from most other works in this space, which often directly describe a kind of safe AGI. Despite this, the ideas of this post are close to some other works describing paths to safe AGIs, such as:
The Translucent Thoughts Hypotheses
Here, I sketch a world in which the first AGIs have certain properties. I argue that this world is likely, and thus a subset of all possible futures to care about. But I think it’s not a large part of all possible futures (20% conditioning on AGI before 2030).
The First AGIs Will Have LLMs at Their Core
By “first AGIs” I mean the first systems able to automate all cognitive tasks.
AGI is likely to do reasoning and planning using LLMs. AGI might rely on vision models for some tasks and interactions with the world, and it might use explicit search processes like AlphaGo. But I expect LLMs to do plan generation and evaluation, which are the core of the system (from an Alignment point of view).
End-to-end neural networks won’t be able to compete with LLMs when it comes to reasoning and planning, or at least, end-to-end networks will use “their LLM parts” to do their most advanced form of reasoning and planning. This means that systems like Gato will either “behave like LLMs” to produce advanced plans (by generating language), or be out-competed by systems which do.
Why this hypothesis is still likely to be wrong: Progress in AI is fast, LLMs didn’t look promising 5 years ago, and they might be outcompeted in the next decade. As more compute is used to train RL agents, fully black box RL agents might be able to generate and evaluate advanced plans without going through text generation at all.
Effective Plans to Defeat Humanity Can’t Be Found in a Single LLM Forward Pass
The LLMs composing the first AGI will be Transformer-like: they will only execute as many serial computation steps as they have layers, and will follow roughly the same scaling laws as transformers (with a speedup modeled by general “algorithmic speedup”). Note: the number of computation steps is the depth of the computation tree, not the number of “reasoning steps”, nor the number of FLOPs used.
For the sake of simplicity, I will call “Transformers” all Transformer-like neural networks.
LLMs composing the first AGIs will be Transformers with fewer than 2000 layers.
Finding an effective plan to defeat humanity requires more than 2000 serial steps. Gradient descent might “hard-code” some deceptive strategies. But I expect that Transformers won’t play the full deceptive training game in one forward pass[2].
During training, the network won’t learn a heuristic which, when followed, has the same effect as following a complex plan to defeat humanity. It can still learn simple sycophant heuristics, but not heuristics to defeat humanity. So if it took over, it would be because it had gone through the difficult process of finding a way to defeat humanity.
Why this hypothesis is still likely to be wrong:
LLMs Will Solve Complex Tasks by Using English Text
The LLMs will be pretrained with next token predictions on human text.
They will be able to solve complex tasks by generating English text, which will contain results of intermediate computations, enabling them to solve tasks too complex to be solved in a single forward pass. For example, they may generate chains of thought or prompt other LLM instances.
One forward pass allows only #layers serial steps, while generating tokens allows for #layers x #generation_steps serial steps. Figure from By Default, GPTs Think In Plain Sight
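As a rough illustration of the counting argument in the caption above, here is a minimal Python sketch. The function names are hypothetical, and the 80-layer model and the 2000-step threshold are only example inputs borrowed from elsewhere in this post, not measurements.

```python
def serial_steps(num_layers: int, generation_steps: int = 1) -> int:
    """Upper bound on serial computation steps, using the rule of thumb
    #layers x #generation_steps (a single forward pass is 1 generation step)."""
    return num_layers * generation_steps


def min_generation_steps(num_layers: int, required_serial_steps: int) -> int:
    """Smallest number of generation steps whose serial depth strictly
    exceeds required_serial_steps."""
    return required_serial_steps // num_layers + 1


# Example inputs (assumptions for illustration only).
layers = 80
threshold = 2000

single_pass = serial_steps(layers)
needed = min_generation_steps(layers, threshold)
print(single_pass, needed, serial_steps(layers, needed))
```

Under these assumptions, a single forward pass stays far below the threshold, while a few dozen generated tokens are already enough to exceed it. This is why the sampled text, rather than the forward pass, is where long serial reasoning would have to show up.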
This kind of procedure where tokens are sampled and fed to an LLM won’t be outcompeted by alternative architectures without the “sampling bottleneck” (for example, systems with internal analog memory, or recursively computed hidden states like RNNs).
When I say “English”, I mean any language for which there is a lot of available human text. This English could, as a result of fine-tuning, be packed with hidden information that humans can’t read. But the model uses the English language at least a little in its intermediate generation steps; its scratchpad is not completely opaque. Note: this definition is a bit fuzzy. I think it would be fair to say that an AI does not have translucent thoughts if its scratchpad is mostly opaque, even if you can understand some small parts of it.
Note: I think that LLMs trained purely on human generated text (and without strong inner optimizers) won’t hide crucial information in their generations, because the information they put inside generations is the information humans would have put in their texts. Most of the information humans put in their text is put there unconsciously and might therefore not be easily recoverable, but the kind of information a text needs to contain to be significant progress towards finding a plan to defeat humanity (which is “crucial information”) is not the kind of information humans put in their text without thinking about it.
Why this hypothesis is still likely to be wrong: RL might be much more powerful than I expect it to be when enough compute is used. This would make AIs trained with RL able to build and use a language much more powerful than English, outcompeting AIs using English as high level language of thought. This could make scratchpads of the first AGIs completely opaque. It could even bypass the need for a scratchpad at all if I’m wrong about the strength of the sample + broadcast algorithm, and if direct analog communication channels yield better systems.
Research Directions
If the Translucent Thoughts hypotheses are true, then the first AGIs would look like LLMs too dumb to create and execute a plan to defeat humanity in a single forward pass, but which can do all relevant tasks when they generate English text (through chains of thought, cross-LLMs prompting, …).
Aligning different types of systems requires different techniques and ideas. While there may be some overlap, aligning an AGI with translucent thoughts is different from aligning an end-to-end network trained with RL, which is in turn different from aligning an ideal optimizer maximizing a reward function. Thus, if the Translucent Thoughts hypotheses are true, certain research directions may be more useful than others.
Introduction: What a Safe AGI Could Look Like
Here is a description of what a safe AGI could be in a world in which the Translucent Thoughts hypotheses are true. It is heavily inspired by AI systems described in Externalized reasoning oversight, The Open Agency Model, Factored Cognition and discussions I’ve had with Siméon.
The description adds assumptions that probably won’t be met if a system like that is developed with more powerful LLMs and today’s Alignment techniques. Some research directions below are paths towards making them true.
When a user asks the system to solve a task, an LLM reads the task description and prompts other LLMs on subprocesses. In turn, these LLMs launch other LLMs (like HCH), etc. The LLMs might invoke special utilities like Monte Carlo tree search, file readers & writers, image recognition or image generation, by using special tokens. These LLMs communicate using only understandable English: the information present in the text is exactly the information a human would extract from it. The whole system just does what the best humans can already do, only on a larger scale, much faster and with less noise. They communicate at a fast pace, but an oversight system enables humans to have an overview of what the system does, and brings specific generations to human attention. On critical tasks, or where the oversight system detects something different from the kind of behavior previously seen and inspected, each generation is inspected by several independent teams of humans before it is fed to its target LLM or utility. When the system comes up with a new concept, the system is paused and inspected so that overseers can understand what this concept means and how it was born, and can therefore always follow what the system is doing. If a problem is detected, the oversight system can kill the problematic generation and start a new one using a different seed. LLMs can also be tuned in a way that preserves the nice properties described above. Because the system was trained to act like humans and look good to AI-augmented humans, it often follows dumb sycophant heuristics, but people know that and they are still able to make amazing progress by only using the system for tasks where that’s not an issue[4].
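The "kill the problematic generation and start a new one using a different seed" step can be sketched as a small control loop. This is only a toy sketch under my own assumptions: `overseen_generation`, `toy_generate` and `toy_is_flagged` are hypothetical names, and a keyword check stands in for what would really be an LLM plus a human-assisted oversight process.

```python
from typing import Callable, Optional, Sequence


def overseen_generation(
    generate: Callable[[str, int], str],
    is_flagged: Callable[[str], bool],
    prompt: str,
    seeds: Sequence[int],
) -> Optional[str]:
    """Return the first generation that the overseer does not flag.

    Flagged generations are discarded and a new one is sampled with the
    next seed. Returns None if every seed is flagged, in which case
    nothing is forwarded to the target LLM or utility.
    """
    for seed in seeds:
        text = generate(prompt, seed)
        if not is_flagged(text):
            return text
    return None


# Toy stand-ins (hypothetical): a real system would call an LLM and an
# oversight process involving humans here.
def toy_generate(prompt: str, seed: int) -> str:
    suffix = "hide this from the overseers" if seed % 2 == 0 else "plain summary"
    return f"{prompt}: {suffix}"


def toy_is_flagged(text: str) -> bool:
    return "hide" in text


result = overseen_generation(toy_generate, toy_is_flagged, "task", seeds=[0, 1, 2])
print(result)
```

The design choice worth noting is that the check sits between generation and delivery: a generation only reaches its target after passing inspection, and exhausting the seeds produces no output rather than a flagged one.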
What makes this system safe is that every part of the system is too dumb to plan something that involves hiding information from humans or escaping supervision. Communication happens in plain English and doesn’t contain information that humans can’t understand. Therefore, even though the whole system is slightly above human level, the overseers can catch and stop any plan to deceive humans and escape supervision.
This exact description is just one example of safe AGI in a world in which the Translucent Thoughts hypotheses are true. A possible research direction is to find other examples of safe AGIs (conditioning on Translucent Thoughts hypotheses), with assumptions that are less demanding than the AI system described above.
Making LLMs Use Only Understandable English
As gwern explains in this comment, we should expect that LLMs will hide information in their generations by default, even for LLMs trained on next token predictions on internet text.
Here is why:
Here are plausible mitigation strategies. I expect them to be hard, but worth exploring:
Building an Oversight System
You would need an advanced oversight system to supervise an AI even if it has transparent thoughts, beyond what is currently possible. It should:
Here are some directions which might help:
Note: the challenges of this oversight system are different and easier than the challenges of oversight described in AGI safety from first principles, because this oversight system inspects “thoughts”, not actions. Therefore, it doesn’t need to anticipate the future effects or the societal effects of what it measures. See Supervise Process, not Outcomes for more details.
Revisiting Concepts and Arguments
Classic arguments which detail reasons for concern about AI could be examined again, conditioning on the Translucent Thoughts hypotheses. This could look like:
This could be applied to arguments such as How likely is deceptive alignment? (What would deception look like for AIs with translucent thoughts? Is it path dependent?), Why AI alignment could be hard with modern deep learning (How could SGD find deceptive strategies with these AI systems?) and AGI safety from first principles (Would these kinds of system be agentic?).
Testing the Hypotheses
A detailed analysis of the Translucent Thoughts hypotheses can help determine if they are realistic or unrealistic, which in turn can help researchers decide if pursuing projects conditioned on those hypotheses is worthwhile.
Conducting experiments with small neural networks, simple mathematical models, or human subjects can help identify the minimum number of layers and reasoning steps required for a system to generate creative and powerful plans. This can help determine the validity of the Translucent Thoughts hypotheses and increase (or decrease) confidence in AI systems that use a low number of serial steps.
Making the Hypotheses True?
If the Translucent Thoughts hypotheses are accurate, we are more likely to survive the first AGIs than if the hypotheses are not valid. This is because the most probable alternative would likely involve black box systems optimized with Reinforcement Learning, which are much harder to supervise. Therefore, research efforts that focus on building powerful AI systems through the assembly of large language models (LLMs) like GPT, which are primarily trained for next-token prediction, may be safer and more valuable than other kinds of capabilities research. This could potentially lead to positive differential technological progress.
I expect that building such systems is not what independent researchers and small groups should do, but I would be excited if some of the money and talent being spent today on improving RLHF and scaling the size of Transformers were redirected towards building networks of LLMs.
Related work
Externalized Reasoning Oversight
In short: It describes systems which also “think” with explicit reasoning in English, don’t hide information in their reasoning, but also make all reasoning steps explicit (which is really hard to formalize), and don’t output reasoning steps which are not used to generate the answer (which might be hard to get in practice). I think these last two constraints can be safely ignored if models can’t find deceptive plans in a single forward pass.
In Externalized reasoning oversight, Tamera describes an alignment proposal which relies on making AIs with authentic reasoning - reasoning which is complete, straightforward, and causally responsible for the output - and overseeing that reasoning, both to give feedback based on the reasoning itself and to make sure nothing wrong happens. The post does not make the case for competitiveness, but it gives research directions which significantly overlap with the ones I described.
But authentic reasoning is a much more stringent condition than the absence of hidden crucial information in the model’s outputs (which Tamera calls straightforward reasoning).
Completeness
An authentic reasoner should make explicit everything humans think of as “important steps”, even when it can do them in a single forward pass.
Why ensuring completeness is hard: For many tasks, making the reasoning explicit will be much harder than learning to do it in a single forward pass, thereby severely limiting the power of the system. Moreover, as the author points out in the original paper, completeness is not well defined because there isn’t a clear definition of what a step should be.
Causality
An authentic reasoner would have to not give useless reasoning steps. For example, when ChatGPT answers “A” then proceeds to explain for 4 paragraphs why the answer is “A”, the explanation is not causally related to the answer it has given.
Why ensuring causality is hard: Human text is full of wrong justifications, and unconscious information - such as context and style - plays a large role in LLM outputs (making the causal role of “conscious human information” weaker). For example, few-shot learning helps, but not for the reasons you might expect (see Rethinking the Role of Demonstrations). Therefore, it might not be easy to make a system generate only causally relevant steps.
Why I think completeness & causality are overrated
The only tasks for which a lack of transparency would be fatal are complicated tasks such as finding a plan to defeat humanity. I believe those tasks can’t be solved in a single forward pass, even if the model tried. Therefore, it’s not a problem if you can’t “expand” a piece of reasoning into multiple explicit steps, or if the model just makes up a non-causal explanation.
Therefore, I prefer to replace “the model is forbidden to skip/not use what humans think are quite important steps” by “for extremely hard tasks, the model is computationally incapable of skipping all relevant steps and has to use at least some intermediate steps it made explicit”.
If a model can’t defeat humanity without having explicit thoughts about it because of computational constraints, and generates neither complete nor causally relevant thoughts, then it will produce many completely unauthentic explanations, which can lead to false alarms, but it won’t be able to craft a plan to defeat humanity without getting caught.
Conditioning Predictive Models
In short: It describes how one might use extremely powerful AIs trained purely on prediction. This also relies on pretrained models being competitive with other AI systems. Its safety guarantees would not come from the weakness of the individual parts of the system, but rather from the properties of a system which tries to do “pure prediction”. This yields very different research directions, but both approaches can be combined.
In Conditioning Predictive Models, the authors lay down what would be needed to safely use predictive models if one wasn’t able to look inside them.
For this approach to be useful, predictive models must be competitive with other types of AI systems. As a result, I think they are describing systems not that far from the one described above.
(Because other modalities than text are less data efficient, the core predictive process will likely be driven by LLMs.)
(but not exclusive)
Thus, these approaches yield slightly different predictions and different research directions:
Overall, Conditioning Predictive Models is an adjacent but different research framework working with slightly different and compatible assumptions. Their lens seems to leave more room for armchair thinking (though they also outline many empirical open problems), while the research directions I suggest (such as the study of steganography) are mostly empirical.
The Open Agency Model
In short: It describes AI systems which work through an explicit decomposition of high level tasks through planning and evaluation in natural language, then execution by very simple RL agents. If interpreted strictly, it’s a tiny subset of AIs with Translucent Thoughts which does not look competitive. If interpreted loosely, it looks a lot like AIs with Translucent Thoughts, without strong safety guarantees.
The Open Agency Model (by Eric Drexler) describes AI systems where “Generative planning models respond to goals as prompts”, “Plan descriptions are externalized and interpretable”, plans are selected through a selection process which confronts the points of view of “diverse agents”, and selected plans are implemented by dumb agents in small, bounded tasks.
Eric Drexler describes a set of AI systems which is either a subset of safe AGIs with transparent thoughts or a subset of AIs with Translucent Thoughts.
I’m unsure how strict the definition of an Open Agent is supposed to be. Here, I’ll describe the relations between two extreme interpretations of Open Agents, and what is described in this post.
Strict Open Agents
The feedback from the world only improves the dumb agents. The evaluation system as well as the generation system are only imitating humans or following hard-coded rules. The plans are generated in a single pass, without querying the evaluation process to prune or guide the search. The generation and evaluation systems have no knowledge about the details of the dumb agents. Each part of the system is too dumb to take over. In particular, the part of the system responsible for plan generation is too dumb to do anything dangerous such as generating code, or manipulating humans, which is a strong upper bound on its general abilities.
This is a super restrictive kind of AI system. It looks hard to make such a system because the current best way to generate plans is to use the best possible language model, and for such a system, strong abilities at plan generation are usually entangled with strong abilities at code generation & manipulation.
This kind of Open Agent is such a small subset of the systems I describe here that the general problems of AIs with translucent thoughts do not apply, and are replaced with extremely hard technical challenges and a plan which I don’t find competitive with other approaches.
General Open Agents
The feedback from the world can flow back to the evaluation and generation processes. Plans are generated through many cycles of generation and evaluation, and evaluation can contain commentary to improve it further. The only constraint is that every communication between parts of the system has to be done in English text. A pretrained LLM is used as a foundation for the generation process, the evaluation process, and some “dumb agents”, but on its own, the pretrained model is too dumb to imagine and execute a plan to defeat humanity in a single forward pass (so that it doesn’t try to game the system it is in).
This is indistinguishable from the Translucent Thoughts hypotheses. It inherits the safety problems that AIs with Translucent Thoughts have by default.
Factored Cognition
In Factored Cognition, Andreas Stuhlmüller (from Ought) describes the idea that tasks can be decomposed into small pieces, and that this can be used to build supervisable AI systems.
The kind of AI system he describes is an instance of AI with translucent thoughts: each part of the system is too dumb to do anything dangerous in a single forward pass, and the different parts of the system prompt themselves through English text.
From what I’ve read, the Factored Cognition project is mostly focused on making an AI with translucent thoughts work, rather than analyzing its potential safety issues.
Note: Ought has shifted its focus towards another project (see here) and I’m not aware of anyone else working on this at the moment.
Using Table 3 from the Chinchilla paper, and assuming that #layers is proportional to #params^0.3, the lifetime anchor (10^24 FLOPs) optimal network has the size of Chinchilla (80 layers), and the genome anchor (10^33 FLOPs) optimal network has 3000T parameters, which corresponds to 2000 layers. That #layers is proportional to #params^0.3 is not a well-known scaling law; it’s just what you can observe in PaLM (α=0.315 in the log-linear regression), in Pythia models (α=0.309), and in GPT-3 models (α=0.249); all models combined give α=0.275. This last log-linear model predicts 1270 layers for a 3000T-parameter model, and the PaLM one predicts 1820 layers. This is highly speculative, as the basin of “good numbers of layers” is quite large, and thus layer scaling laws are much weaker than data/compute scaling laws.
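The power-law extrapolation in this note can be sketched in a few lines of Python. This is a minimal sketch under assumptions of mine: the function name is hypothetical, and I anchor the curve at Chinchilla (70B parameters, 80 layers; the 70B figure is not stated in the note). The regressions quoted in the note have their own fitted intercepts, which is why they give different numbers (1270 and 1820 layers) than this anchored version.

```python
def extrapolate_layers(
    params: float,
    alpha: float,
    ref_params: float = 70e9,  # assumption: Chinchilla parameter count
    ref_layers: float = 80.0,  # Chinchilla layer count, as stated in the note
) -> float:
    """Extrapolate #layers assuming #layers is proportional to #params**alpha,
    with the curve forced through the reference model."""
    return ref_layers * (params / ref_params) ** alpha


target_params = 3000e12  # 3000T parameters (genome anchor)

# Exponents quoted in the note; only the slopes are reused here, not the
# fitted intercepts.
exponents = {
    "assumed": 0.3,
    "PaLM": 0.315,
    "Pythia": 0.309,
    "GPT-3": 0.249,
    "combined": 0.275,
}

for name, alpha in exponents.items():
    layers = extrapolate_layers(target_params, alpha)
    print(f"{name:>8} (alpha={alpha}): {layers:.0f} layers")
```

With the exponent 0.3 and this anchoring, the estimate lands a little under 2000 layers, consistent with the figure used in the main text. The spread across exponents shows how sensitive the estimate is to the fitted slope.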
See How Likely Is Deceptive Alignment by Hubinger. Here is the kind of thoughts and reasoning I expect are too long to fit in a single forward pass: “Given what I know about my training setup and my future deployment setup, it’s likely that I’m being trained. Given my long term goals, and given I’m in training, it’s better to achieve low loss so that my goals aren’t trained away by SGD. Given what I know about the training setup, what I have to do is Y. To do Y, the next token to output to get low training loss is X. Therefore output X.” and “Given what I know about my training setup and my future deployment setup, it’s likely that I’m deployed. Given my long term goals, and given I’m in deployment, it’s better to take control of humanity. To take control of humanity, the next step to take is Y. To do Y, the next token to output is X. Therefore, output X.”
In tasks where the answer is only a few tokens long, and for which the thoughts can’t be done “as you read the text”. For example, text summarization or evaluating which one of two SQL queries is the best one are tasks which can be done “as you read the text”, which is not the case for figuring out if a mathematical statement is true or false (for most Maths statements not memorized during training).
For example, you wouldn’t ask it “am I right to think that I should nuke France”, because the system might be heavily biased towards giving you an answer consistent with your beliefs, even if it’s not explicitly thinking about how to give you an answer consistent with your beliefs. I think humans sometimes follow this kind of unconscious heuristic, which is why I expect such heuristics to also exist in LLMs.
The false negative rate needs to be especially low if the LLMs can’t be tuned to output fewer deceptive plans. But if the LLMs can be tuned, then a false negative rate of 10^-6 might be enough to get to a very low takeover probability: if each instance of a deceptive plan enables feedback that increases the time between deceptive plan generations by 1%, and if the system initially generates one deceptive plan per hour, then running the model for 10 years gives a probability of about 0.0006 of generating a deceptive plan which doesn't get caught (while this probability would be 0.08 without any learning).
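The arithmetic in this note can be roughly reproduced with the following Python sketch. The modelling choices are my assumptions: the first deceptive plan appears after one interval, each plan independently escapes detection with probability equal to the false negative rate, and a year has 365 days. Function names are hypothetical.

```python
def plans_generated(
    total_hours: float,
    first_interval_hours: float = 1.0,
    growth: float = 1.0,
) -> int:
    """Count deceptive plans generated within total_hours when the waiting
    time before each new plan is multiplied by `growth` after every plan."""
    elapsed = 0.0
    interval = first_interval_hours
    count = 0
    while elapsed + interval <= total_hours:
        elapsed += interval
        interval *= growth
        count += 1
    return count


def uncaught_probability(num_plans: int, false_negative_rate: float) -> float:
    """Probability that at least one of num_plans independent plans slips
    past the overseers."""
    return 1.0 - (1.0 - false_negative_rate) ** num_plans


hours = 10 * 365 * 24  # ten years of runtime
fnr = 1e-6

n_static = plans_generated(hours)                 # no learning
n_learning = plans_generated(hours, growth=1.01)  # +1% spacing per plan

p_static = uncaught_probability(n_static, fnr)
p_learning = uncaught_probability(n_learning, fnr)
print(n_static, round(p_static, 4))
print(n_learning, round(p_learning, 6))
```

Under these assumptions the results are of the same order as the figures quoted in the note: learning cuts the number of deceptive plans from tens of thousands to a few hundred, because geometrically growing gaps use up the ten years quickly.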