Epistemic status: brainstorming some speculative research directions. Not trying to thoroughly justify the claims I’m making.

One way to think about the AI safety problem: there’s a spectrum of methods which each represent a different tradeoff between safety and ease of training an AGI, and unfortunately the two are anticorrelated. In particular, consider four regimes in which the bulk of training might occur (perhaps with additional fine-tuning afterwards):

  1. Training a language model to answer questions correctly.
  2. Training a RL agent on a range of limited tasks (e.g. games).
  3. Training a RL agent on general tasks in large-scale simulations for long time periods.
  4. Training a RL agent in competitive multi-agent environments.

I claim (but won’t fully defend here) that these are in order from safest but most difficult, to easiest but most dangerous:

  1. Regime 1 will produce a question-answering system which has no experience taking actions in the world, and which may not be goal-directed at all. But many researchers expect that it’ll be much easier to create an AGI which can answer difficult questions by training it to interact with a simulated world, so that its concepts are “grounded” by experience.
  2. Regime 2 is likely to produce an agent whose goals are bounded, and whose concepts are grounded; but which might only do well on the specific tasks it had been trained on. If so, building AGI in this regime would require a very sophisticated curriculum, if it’s possible at all.
  3. Regime 3 provides a rich environment for an agent to learn quite general skills and concepts. However, now the agent will also be rewarded for developing large-scale goals, which might make it dangerous.
  4. Regime 4 additional provides an “autocurriculum” via competition, the training signal from which could accelerate the development of general intelligence (as it did in humans). However, the agent could learn harmful skills and motivations (such as deception, manipulation or aggression) from competing with other agents, which it might then apply to interactions with humans.

This is a problem - but it’s also an opportunity. If we accept the claims I’ve made about this spectrum, then it might be much easier to train a relatively safe and non-agentic AGI if we start training in less safe regimes, and then gradually transition the training of that AGI into safer regimes. More specifically, I’m proposing a training curriculum in which an agent is trained in regime 4 until it displays a given level of competence; then moved to regime 3 until it again displays a significant amount of progress; then regime 2; then regime 1. The specific regimes used are not vital; some could be removed or replaced by others that I haven’t thought of. Neither is it essential to keep using exactly the same agent; it’d need to be retrained to use different observation and action spaces, and perhaps have its architecture modified during transitions. (In particular, it might be useful to incorporate a pre-trained language model at some point to kick-start its understanding of language.) The main point is that as training progresses, we increasingly use safer training regimes even though we expect it to be much more difficult to train an AGI solely using those regimes.

The key hypothesis is that it’s not uniformly harder to train AGIs in the safer regimes - rather, it’s primarily harder to get started in those regimes. Once an AI reaches a given level of intelligence, then transitioning to a safer regime might not slow down the rate at which it gains intelligence very much - but might still decrease the optimisation pressure in favour of that AI being highly agentic and pursuing large-scale goals.

I have some intuitions in favour of this hypothesis (although I’m still pretty uncertain). Here are three:

  1. Language-based tasks or limited-domain tasks can be made almost arbitrarily hard, we won’t run out of ways to challenge the agent (although this might require more work than continuing training in other regimes).
  2. Exploration is much easier for an agent that’s already quite intelligent, and so reward sparsity matters less. This would also make it easier to manually design tasks, since a wider range of difficulties would be acceptable.
  3. Once an agent has had some amount of interaction with a (simulated or real) world, it can interpret language as referring to objects in that world. Whereas without any interaction, it seems far more difficult to develop a solid understanding of objects and concepts. (Note that although quite a few AI researchers seem to believe that such grounding is important, I haven’t seen much justification for it in the context of modern machine learning. So I wanted to flag this as a particularly intuition-driven claim.)

One example which might help illustrate the overall hypothesis is that of Helen Keller, who developed a sophisticated understanding of the world (and wrote 12 books), despite becoming deaf and blind before the age of 2. Compare two options: either trying to train an AGI (from random initialisation to human-level intelligence) which receives the inputs that a typical human has, or else trying to do the same thing giving it only the inputs that Helen had (primarily touch, and language initially communicated via taps on her hand). I think the latter would be significantly more difficult, because having much less data imposes many fewer constraints on what the structure of the world could be like. And yet Helen did not learn many times more slowly than most people; in fact she earned a degree from Harvard in four years. My explanation for this: the learning problem Helen faced was much harder than what most of us face, but because her brain architecture had already been “trained” by evolution, she could make use of those implicit priors to match, and then surpass, most of her contemporaries.

The second crucial hypothesis is that the AGI doesn’t also retain dangerous characteristics from earlier training regimes. Again, I’m very uncertain about this hypothesis - it might be that, once the system has started training in a highly competitive environment, it will continue to have competitive and highly agentic motivations until we actively prevent it from doing so. Yet there are ways to mitigate this. In particular: to the extent that we can identify which part of the AGI is responsible for goal-directed cognition, we can remove that before continuing to train the rest of the network in the new regime. This would rely on interpretability techniques - but techniques which could be much cruder than those required to understand what an AGI is thinking or planning at any given moment. As an illustration, I think we already understand the human brain well enough to identify the parts most responsible for goal-directed cognition and motivation (although I don’t know much neuroscience, so corrections on this point would be very welcome). After removing the analogous sections of an AGI, it’d need to develop a new motivational system in subsequent training regimes - but I’m hoping that those later regimes only incentivise limited goal-directedness, towards bounded goals.

Right now, I think it’s pretty likely that there’ll be some transitions between different training regimes when developing AGI - e.g. for language models it’s already common to start with unsupervised pre-training, and then do supervised or RL fine-tuning. But note that this transition goes in the opposite direction (from more limited tasks to larger-scale tasks) compared with the ones I discussed above. So my proposal is a little counterintuitive; but while I’m not confident that it’ll be useful (perhaps because I’m still quite confused about agency and goal-directedness), I think it’s worth evaluating. One empirical test which could already be done is to see whether pre-training in a simulated environment is advantageous for developing better language models. Another, which might only be viable in a few years, is to investigate how separable goal-directed cognition is from world-modelling and other aspects of intelligence, in deep reinforcement learners with a range of neural architectures.


* Given that I have no experience of being deaf or blind, and have not looked into it very much, my intuitions on this point are not very well-informed; so I wanted to explicitly flag it as quite speculative.

New Comment
3 comments, sorted by Click to highlight new comments since:

I don't think that design (1) is particularly safe.

If your claim that design (1) is harder to get working is true, then you get a small amount of safety from the fact that a design that isn't doing anything is safe.

It depends on what the set of questions is, but if you want to be able to reliably answer questions like "how do I get from here to the bank?" then it needs to have a map, and some sort of pathfinding algorithm encoded in it somehow. If it can answer "what would a good advertising slogan be for product X" then it needs to have some model that includes human psychology and business, and be able to seek long term goals like maximising profit. This is getting into dangerous territory.

A system trained purely to imitate humans might be limited to human levels of competence, and so not too dangerous. Given that humans are more competent at some tasks than others, and that competence varies between humans, the AI might contain a competence chooser, which guesses at how good an answer a human would produce, and an optimiser module that can optimise a goal with a chosen level of competence. Of course, you aren't training for anything above top human level competence, so whether or not the optimiser carries on working when asked for superhuman competence depends on the inductive bias.

Of course, if humans are unusually bad at X, then superhuman performance on X could be trained by training the general optimiser on A,B,C... which humans are better at. If humans could apply 10 units of optimisation power to problems A,B,C... and we train the AI on human answers, we might train it to apply 10 units of optimisation power to arbitrary problems. If humans can only produce 2 units of optimisation on problem X, then the AI's 10 units on X is superhuman at that problem.

To me, this design space feels like the set of heath robinson contraption that contains several lumps of enriched uranium. If you just run one, you might be lucky and have the dangerous parts avoid hitting each other in just the wrong way. You might be able to find a particular design in which you can prove that the lumps of uranium never get near each other. But all the pieces needed for something to go badly wrong are there.

I agree. I'm generally okay with the order (oracles do seem marginally safer than agents, for example, and more restrictions should generally be safer than less), but also think the marginal amount of additional safety doesn't matter much when you consider the total absolute risk. Just to make up some numbers, I think of it like choosing between options that are 99.6%, 99.7%, 99.8%, and 99.9% likely to result in disaster. I mean of course I'll pick the one with a 0.4% chance of success, but I'd much rather do something radically different that is orders of magnitude safer.

Yeah, so I guess opinions on this would differ depending on how likely people think existential risk from AGI is. Personally, it's clear to me that agentic misaligned superintelligences are bad news - but I'm much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that's much more intelligent than humans and much less agentic seems quite plausible - even, perhaps, in a RL agent.