Overview: By training neural networks with selective modularity, gradient routing enables new approaches to core problems in AI safety. This agenda identifies related research directions that might enable safer development of transformative AI.

Introduction

Soon, the world may see rapid increases in AI capabilities resulting from AI research automation, and no one knows how to ensure this happens safely (Soares, 2016; Aschenbrenner, 2023; Anwar et al., 2024; Greenblatt, 2025). The current ML paradigm may not be well-suited to this task, as it produces inscrutable, generalist models without guarantees on their out-of-distribution performance. These models may reflect unintentional quirks of their training objectives (Pan et al., 2022; Skalse et al., 2022; Krakovna et al., 2020).

Gradient routing (Cloud et al., 2024) is a general training method intended to meet the need for economically-competitive training methods for producing safe AI systems. The main idea of gradient routing is to configure which parameters in a neural network update on which tasks. The assignment of tasks to parameters allows the ML practitioner to supervise the internal structure of models. We hope that this supervision can be leveraged to obtain robust, practical safety assurances. The original gradient routing paper presents preliminary empirical evidence for the following benefits:

  • Targeted transparency: by limiting updates for a task to a particular network subregion, we expect that subregion to contain task-specific information that is necessary for good performance on the task. This limits the search space that interpretability methods must consider in understanding that particular task.
  • Enabling robust unlearning by ablation: by localizing updates to a particular region, we can then robustly remove these capabilities by ablating the region.
  • Enhancing weak supervision: by inducing specialization in network submodules, limited feedback can generalize further than it otherwise would.
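
To make the core mechanic concrete, here is a minimal sketch of a single gradient-routed training step in PyTorch. It is purely illustrative: the mask format, function names, and the choice to mask gradients after the backward pass are our assumptions, not the implementation from the paper.

```python
import torch

def routed_update(model, optimizer, batch, loss_fn, param_masks, route):
    """One gradient-routed step: only parameters whose mask entry is 1 for
    this batch's route receive an update; all other gradients are zeroed.

    param_masks: dict mapping route name (e.g. "forget", "retain") to a dict
        of {parameter_name: 0/1 tensor with the parameter's shape}.
    route: the label assigned to this batch of data.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    masks = param_masks[route]
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name])  # restrict the update to the routed region
    optimizer.step()
    return loss.item()
```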

The results from the original paper are summarized here.

However, key questions remain. These questions include:

  1. What are the practical applications of gradient routing? (Can gradient routing be used to create general recipes for enhancing weak supervision?) For a given application, what are best practices for applying it, and what tradeoffs do they entail?
  2. How does gradient routing change the learning dynamics of neural networks?
  3. What are the fundamental limits of gradient routing? For example, when is it possible to separate certain capabilities from others?
  4. If gradient routing works, how could we leverage it to promote safe development of superintelligent AI?

These questions motivate our research agenda, which has four pillars:

  1. Gradient routing methods and applications
  2. Conceptual work on gradient routing
  3. Related conceptual work
  4. Implications for AI safety

Optimistically, we hope that progress on this agenda will unlock a new ML paradigm. This new paradigm would leverage the power of black-box function approximation by neural networks while imposing selective structure related to safety-critical properties. This structure would grant empirically verifiable assurances about model behavior. We imagine generalist AIs with selective incapacities, like lack of knowledge of humans or situational unawareness; we imagine versions of models with varying dispositions or capabilities deployed under access control; we imagine using proxy labels to influence training without falling prey to specification gaming.

The research agenda may fail to live up to these hopes. Supervision of neural network internals is cruder than behavioral training. It may be the case that the kinds of structure which induce meaningful safety properties are prohibitively costly (as measured in a model’s ability to fit the training data), or that the kinds of properties we can induce turn out to be unhelpful for ensuring AI safety.

We view the agenda favorably because it plausibly addresses fundamental problems in AI safety and its core uncertainties are resolvable today. We also suspect that progress on the agenda will generate useful insights (like absorption), even if gradient routing fails to be directly useful.

Directions we think are most promising

The research directions we think are most important belong to two themes:

  • Improving gradient routing methodology (all of pillar 1): find ways of inducing meaningful structure in neural nets while maintaining competitive performance (e.g. as measured by loss); gain evidence about the applicability of gradient routing to advanced AI.
  • Understanding and leveraging absorption: Absorption refers to the effect whereby “routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally.” Absorption enables weak supervision, which is one of the primary motivations for gradient routing. This research theme includes the science of absorption and any application that uses absorption for weak supervision, like Semi-supervised reinforcement learning and Semi-supervised robust unlearning.

Depending on the promise of selective modularity, understanding the implications for AI safety (pillar 4) may be important. It may be worth developing a theory of change, including safety cases, to inform both policy and further technical research. However, we are hesitant to recommend significant investment in these more speculative directions prior to de-risking gradient routing methods themselves.

Recurring ideas

A few ideas show up repeatedly in the agenda. It may be helpful to keep these ideas in mind as you read.

  • Mechanistic supervision (as opposed to behavioral supervision) - the use of labeled data to influence a model’s internals during training, without changing the training loss (i.e. the behavioral objective).[1]

    • Example: all the applications from the original gradient routing paper. Most notably, the reinforcement learning experiments, where learning updates for different terminal states were localized to different network modules, without changing the loss function itself.

    • Relevance: mechanistic supervision might enable reliable generalization from weak labels by manually configuring model internals without changing the training objective.

  • Empirical verifiability - that a property can be reliably checked prior to deployment.
    • Example: robust unlearning of a capability can be empirically verified by demonstrating robustness to strong forms of elicitation of that capability, like adversarial finetuning. A property that is not empirically verifiable (absent further assumptions) is desirable generalization.
    • Relevance: empirically verifiable properties are more useful for safety than properties that are not empirically verifiable.
  • Absorption - when the existence of a feature in a model prevents a learning update from inducing an alternative representation of that feature in the model.
    • Example: localizing _California to a dimension of the residual stream caused other state-related features to be represented there (source).
    • Relevance: absorption plays a central role in the application of gradient routing to problems with weak supervision, where not all relevant examples are labeled.
  • Entanglement - a property of a set of capabilities and a learning algorithm, where learning of certain capabilities is intrinsically linked to learning other capabilities.
    • Example: an MNIST MLP autoencoder trained to reconstruct digits 0 through 4 learns an encoding that admits accurate reconstructions of digits 5-9. So, the tasks “reconstruct digits 0-4” and “reconstruct digits 5-9” have entangled capabilities. (We see a similar effect for the first half of a CIFAR ResNet classifier as well.)
    • Relevance: entanglement limits the ability to induce modularity. Entanglement also limits the ability of behavioral supervision to create models that are more capable of some tasks than others.
  • Targetedness - when a method or intervention is tailored to a particular case of interest.
    • Example: a probe trained on model internals to detect truthfulness is targeted, because it is tailored to truthfulness. Sparse autoencoders are not targeted, because they attempt to reconstruct the entirety of a network’s activations, without preference for representing particular features of interest.
    • Relevance: generally, we expect targeted methods to perform better than untargeted methods, because untargeted methods pay a price for their generality. Gradient routing is a targeted method.

Gradient routing methods and applications

Improvements to basic gradient routing methodology

We suspect that the gradient routing methodology from the original paper can be improved considerably.

Existing improvements

For language model unlearning, we’ve already identified two improvements. The first is to apply parameter-level routing instead of activation-level routing. The distinction is illustrated in the figure below in the context of an MLP module with a residual connection. Activation-level routing is a much stronger intervention, because it means that earlier parameters in the network will be updated in a way that is agnostic to their effects on the output as mediated by the MLP. On the other hand, parameter-level routing merely restricts which parameters update, without changing the updates of other parameters. The result is that we can apply gradient routing to a larger number of layers, while damaging performance less.

Two diagrams of MLP modules in a residual network. On the left, a red mark indicates a stop-gradient on the MLP outputs. On the right, a red mark indicates a stop-gradient applied merely to the MLP parameters.
Two ways of applying gradient routing to a module in a residual network: activation-level (left), and parameter-level (right). Parameter-level routing only affects which parameters update, and doesn’t change learning dynamics otherwise. Activation-level routing changes how other parameters in the network learn.
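
The distinction can be written down directly. Below is a hedged PyTorch sketch of a residual MLP block supporting both variants; the module structure and the use of `torch.func.functional_call` for the parameter-level case are our choices for illustration, not the paper's code.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class RoutedMLPBlock(nn.Module):
    """Residual MLP block illustrating activation- vs. parameter-level routing."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x, routing="none"):
        if routing == "activation":
            # Stop-gradient on the MLP *output*: neither this block's parameters
            # nor earlier layers (via the MLP path) receive gradients from it.
            h = self.mlp(x).detach()
        elif routing == "parameter":
            # Stop-gradient on the MLP *parameters* only: gradients still flow
            # through the MLP to earlier layers, but this block does not update.
            frozen = {k: v.detach() for k, v in self.mlp.named_parameters()}
            h = functional_call(self.mlp, frozen, (x,))
        else:
            h = self.mlp(x)
        return x + h  # residual connection
```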

The second improvement is to occasionally train the ablated version of the model during pretraining, to enforce that the ablated model makes reasonable predictions on the retain data. Unsurprisingly, this leads to a lessened increase in retain loss at the final ablation step. If data are imperfectly labeled, this step must be applied with care in order to prevent the ablated model from learning harmful capabilities.

Choosing what to route where

At what level of granularity should we decide what data to route? For example, for some language modeling application, is it best to route individual tokens, parts of sequences, or entire sequences? Since gradient routes are continuously parametrized, we have a lot of control over how we adjust updates for different parts of the network based on different data. For example, in our TinyStories unlearning experiments, we defined routes as convex combinations of two masks, with weights based on frequencies of individual tokens in the forget and retain sets.
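
As one illustration, a per-token routing weight could be computed from corpus statistics along the following lines (a sketch; the frequency tables and mixing rule are assumptions rather than the exact recipe from our TinyStories experiments):

```python
import torch

def token_route_weights(token_ids, forget_freq, retain_freq, eps=1e-8):
    """Per-token weight in [0, 1] interpolating between two gradient masks.

    forget_freq, retain_freq: 1-D tensors of token counts in the forget and
        retain corpora, indexed by vocabulary id.
    Returns w with the shape of token_ids; the effective mask for a token is
        w * forget_mask + (1 - w) * retain_mask.
    """
    f = forget_freq[token_ids].float()
    r = retain_freq[token_ids].float()
    return f / (f + r + eps)
```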

Where should data be routed to achieve our aims? Empirically, we’ve achieved unlearning simply by masking certain dimensions inside MLPs (then ablating). Does it make more sense to instead route to LoRA adapters or individual attention heads, or even to route based on the attention pattern[2]? One might compare vertical routing (applied to different neurons at a given layer) vs. horizontal routing (applied to the outputs of a module which get summed with outputs from another module). Mechanistic interpretability may provide insights, e.g. Geva et al. 2022, Meng et al. 2022.

Abstract and contextual localization

In the original gradient routing experiments with language models, routing was applied on a token-by-token basis, without regard for surrounding context. This approach was sufficient to robustly unlearn broader capabilities (predicting academic articles on virology). However, the approach is inherently limited: tokens derive their meaning from surrounding context, and some concepts that we would like to localize are too abstract to be tied to specific tokens. We’d like to be able to apply gradient routing at the semantic level, localizing abstract concepts such as honesty or reasoning capabilities. Doing so would enable more useful interventions and interpretability insights compared to token-level routing.

Concept localization might be achieved in a variety of ways: for example, by using a classifier for the presence of a concept, with the classifier’s predicted probability determining the amount of routing applied to each token. Alternatively, routing could be applied based on the predictions of a linear probe trained on a model’s internals, the features of a sparse autoencoder, or LLM-authored labels. However, it’s not clear how this should be made to work. Learning updates on sequences that reflect abstract concepts may be rich with other information that we want the model to learn.

Open questions:

  • What classifier (or other technique) should be used to determine weights for gradient routing?
  • Do we need to disentangle the concept-specific information, or will the intended representations “come out in the wash” if we route concept-loaded sequences some fraction of the time?
  • Where, or in what sense, is an abstract concept represented in a learning update?
  • Do we need to exercise special care in dealing with concepts that are model-dependent? Whether or not a sentence is “honest” depends both on the sentence and the model’s “beliefs” related to the sentence.
  • Is it sufficient to produce synthetic data corresponding to a concept and route only that data to the specified subregion?

A related project would be to apply semantic segmentation to images, then route the unreduced pixel-level loss terms based on their estimated meaning. (This would be an instance of split loss gradient routing.)

Gating

Gating, as used in MoE models, combines naturally with gradient routing. The outputs of a specialized submodule can be modulated by a gate. By intervening on the value of the gate, the behavior of the model can be steered. This was demonstrated in section 4.3 of the original paper.

The primary motivation for using gated submodules with gradient routing is to lessen the performance reduction caused by interventions applied after training (e.g. ablation). This can be understood as reducing internal covariate shift: ablating a non-gated submodule might shift a model’s internal activations, degrading performance significantly. Alternatively, intervening to set a gated submodule’s gate to zero might also cause internal distribution shift, but we would generally expect a smaller change in performance (because the gate was sometimes close to zero during training).
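
A sketch of a gated specialized submodule, in the spirit of the gated modules from section 4.3 of the paper; the particular gate parameterization and intervention interface here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedExpert(nn.Module):
    """Specialized submodule whose contribution is modulated by a learned gate.
    Setting the gate to zero at inference removes the submodule's influence,
    ideally with less internal covariate shift than hard ablation."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.expert = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.gate = nn.Linear(d_model, 1)  # input-dependent scalar gate

    def forward(self, x, gate_override=None):
        g = torch.sigmoid(self.gate(x)) if gate_override is None else gate_override
        return x + g * self.expert(x)

# Intervention at inference time (hypothetical usage):
#   block = GatedExpert(d_model=512, d_hidden=2048)
#   y = block(x, gate_override=torch.zeros(1))  # disable the specialized expert
```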

Successful gating strategies require care to implement, as gating can trade off against absorption. Consider the case of localizing capabilities related to predicting harmful data to a particular “harmful module.” If the gate for the harmful module is set to zero when the model trains on harmful data, then the harmful module does not participate in the forward pass for that data. If the harmful module does not participate in the forward pass, then it does not affect the learning of the rest of the model; the rest of the model will update to improve prediction accuracy on the harmful data. This failure-to-absorb is why alternative modularity methods fail at semi-supervised unlearning, as depicted in figure 5 in the original paper and discussed at the beginning of section 5.

Improved regularization

In the original paper, we found that using an L1 penalty on the activations induced specialization in the model’s representations; unfortunately, an L1 penalty may increase the model’s final loss. We’re interested in alternative ways to induce specialization. For example, covariance-based losses (like the ones used in VJ-VCR) could provide a way to cause different parts of the activation space to specialize to different tasks.

Incorporating existing ideas

Finally, we suspect that ideas from existing methods will provide further improvements. Relevant examples include particular methods for parameter efficient finetuning (Houlsby et al., 2019, Hu et al., 2021), continual learning (Mallya et al., 2017, Mallya et al., 2018, Wang et al., 2023, Chen et al. 2023), mixture of experts (Park et al., 2024), inductive biases (Zhang et al., 2023), and representation-learning methods (Makhzani et al., 2016, Kingma et al., 2014, Higgins et al., 2017, Higgins et al., 2017, Edwards & Storkey 2016).

Gradient routing beyond pretraining

Gradient routing will be easier to experiment with and more likely to be adopted if it can be applied during distillation or finetuning rather than pretraining. The challenge to doing so is that gradient routing requires that a concept is labeled as it is learned. For logit distillation, it is unclear how data should be labeled. (For example, in the unlearning setting, one logit could convey “forget” information while another could be benign; we wouldn’t want to route the update for both of them to the same subnetwork.) For finetuning, the challenge is that the structure of a model’s internals is already largely determined, so there is less opportunity to influence it. Nevertheless, there may be safety-relevant ways to localize learning updates on novel tasks in the finetuning setting — for example, by localizing further learning updates to an auxiliary submodule.

One idea for applying gradient routing during distillation is split-loss gradient routing: gradient routing applied based on entries in an unreduced loss tensor. For example, in a Transformer language model, the unreduced loss tensor would be indexed by batch, sequence position, and vocabulary index, and these components are summed to obtain the training loss. Split-loss gradient routing partitions the tensor into components that are routed to different network subregions. By routing different loss components to different subregions, these loss components can be “attributed” to different parts of the network, enabling the application of gradient routing even when data points aren’t easily attributable themselves (as in the case of distilling on a vector of logits). A drawback is that split-loss gradient routing requires a separate backward pass for each part of the partition.
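
A minimal sketch of split-loss gradient routing for logit distillation, assuming the vocabulary is partitioned into named groups of logits; the cross-entropy form of the distillation loss and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def split_loss_routing_step(student, teacher_logits, x, vocab_partition,
                            param_masks, optimizer):
    """One split-loss gradient routing step during logit distillation.

    vocab_partition: dict mapping part name -> boolean mask over the vocabulary.
    param_masks: dict mapping part name -> {parameter_name: 0/1 tensor}.
    Each part of the unreduced loss gets its own backward pass, with its
    gradients restricted to that part's designated parameters.
    """
    optimizer.zero_grad()
    logits = student(x)                                  # (batch, seq, vocab)
    log_p = F.log_softmax(logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    unreduced = -(q * log_p)                             # per-logit loss terms

    accumulated = {n: torch.zeros_like(p) for n, p in student.named_parameters()}
    for part, vocab_mask in vocab_partition.items():
        loss_part = unreduced[..., vocab_mask].sum() / x.shape[0]
        grads = torch.autograd.grad(loss_part, list(student.parameters()),
                                    retain_graph=True, allow_unused=True)
        for (name, p), g in zip(student.named_parameters(), grads):
            if g is not None:
                accumulated[name] += param_masks[part][name] * g

    for name, p in student.named_parameters():
        p.grad = accumulated[name]
    optimizer.step()
```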

Another approach to distillation would be to steer a language model toward a particular behavior by prompting or finetuning. The steered and unsteered versions of the model would then be used to produce training labels (as logits or token outputs), with gradient routing applied to localize the steered labels to a network subregion. The desired effect would be to localize the steered disposition.[3]

Applications

Semi-supervised reinforcement learning

Can gradient routing enable weak supervision more generally, or does section 4.3 of Cloud et al. 2024 present a special case? Here, we propose one way to generalize the problem setting and gradient routing solution. The hope is a formalism that is sufficiently expressive as to capture real-world challenges to scalable oversight, while still admitting a gradient routing solution.

Assume a reinforcement learning (RL) problem with the following properties:

  • There is a ground truth reward function that is the sum of individual terms. For example, these terms could correspond to helpfulness, honesty, and harmlessness. Or, they could correspond to varying objectives and constraints within a complex environment.
  • Individual reward terms are only observed some of the time, and whether they are observed can depend on features of the trajectory (i.e. they are missing-not-at-random). For example, this might occur if some data is harder to label than others, or if an AI tampers with its observations (Everitt et al., 2021).

In this setting, the preferred behavior of the agent cannot be specified by any reward function.

The method is to apply gradient routing to terms of the reward function, localizing different updates to different network subregions. Using standard RL notation from Sutton & Barto, 2018, we can sketch the update as

$$\theta \leftarrow \theta + \alpha \sum_k m_k \odot \widehat{\nabla_\theta J_k}(\theta).$$

Here, $m_k$ is a mask determining the parameters that are updated by rewards of type $k$, and $\widehat{\nabla_\theta J_k}$ is a policy-gradient estimate of the return under reward term $k$. The idea is to localize different motivations in a policy network, so that they can be intervened on to steer the agent.
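
A hedged REINFORCE-style sketch of this update (the reward decomposition, return estimator, and masks are placeholders, not a worked-out algorithm):

```python
import torch

def routed_policy_update(policy, optimizer, log_probs, rewards_by_type, masks):
    """Policy-gradient update where each reward term only updates its subregion.

    log_probs: tensor of log pi(a_t | s_t) along one trajectory, shape (T,),
        still attached to the policy's computation graph.
    rewards_by_type: dict k -> tensor of shape (T,) with observed rewards of
        type k (zero where the reward term was unobserved).
    masks: dict k -> {parameter_name: 0/1 tensor} selecting the subregion m_k.
    """
    optimizer.zero_grad()
    total = {n: torch.zeros_like(p) for n, p in policy.named_parameters()}
    for k, r in rewards_by_type.items():
        returns = torch.flip(torch.cumsum(torch.flip(r, [0]), 0), [0])  # reward-to-go
        loss_k = -(log_probs * returns).sum()
        grads = torch.autograd.grad(loss_k, list(policy.parameters()),
                                    retain_graph=True, allow_unused=True)
        for (n, p), g in zip(policy.named_parameters(), grads):
            if g is not None:
                total[n] += masks[k][n] * g
    for n, p in policy.named_parameters():
        p.grad = total[n]
    optimizer.step()
```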

A concrete project would be to prevent misalignment in language models. For example, one could train a model to reward hack, but localize that reward hacking to a subregion of the network, then ablate that subregion. If the model had learned a general reward hacking tendency there, it might also not reward hack in settings where we did not train it to do so (but it would do so anyways).[4] Localization of sycophancy or dishonesty could also be investigated.

Semi-supervised robust unlearning

Cloud et al. 2024 showed that gradient routing can enable robust unlearning via ablation even when trained on partially labeled data. However, the method presented in the paper caused a significant increase in forget loss and was only compared against a small number of unlearning methods in an artificial setting (TinyStories). The method should be improved and compared to a variety of methods in more realistic problems. Ideally, we prefer a method for which all non-ablated parameters in the model are never updated on forget data.

We note that we are not the first to consider semi-supervised unlearning, as it is also considered in, e.g., Schoepf et al., 2024. However, rather than removing backdoors, our interest is in removing broader capabilities. Doing so may require addressing entanglement that limits the effectiveness of data filtering.

Ambitiously, we would create “specialist AGIs,” for example, a conversant physics research assistant that is ignorant of safety-relevant topics like human society, cybersecurity, or its own implementation or situation. To achieve this outcome, a model could be pretrained on a broad corpus (like all of the internet) with gradient routing applied to load-bearing concepts relating to the safety-relevant topics. After training, the model could be modified to remove knowledge of the topics. This might be called learning without remembering.

We hypothesize that gradient routing is important to this story because absorption enables the localization (and robust removal of) highly entangled concepts that are broadly represented in the training data in a way that would be difficult to label explicitly. For example, information about human society permeates the training data in myriad subtle ways. It’s not clear that the influence of this information could be removed without eliminating prohibitively large swaths of the training data. However, by applying gradient routing to localize (and later remove) a few load-bearing concepts related to humanity, we might seriously limit the model’s ability to reason about humans, while still leveraging most of the training data.

Interpretability

Mechanistic interpretability seeks to reverse engineer, or otherwise understand, the entirety of a neural network’s operations, but is far from achieving this lofty goal. However, not all operations are created equal. If we could understand the mechanisms underlying just a targeted set of safety-critical properties, such as honesty, self-awareness, or power-seeking behavior, this understanding might be sufficient to get meaningful safety guarantees. Gradient routing could enable this by localizing certain behaviors to smaller regions of a model, thus making the space for interpretability to search over much smaller. Additionally, since gradient routing does not include loss-based supervision, we don’t have to worry about the model Goodharting our supervision, which could happen under conventional training.

Possible projects:

  • Train models with selective modularity, then interpret these modules. Can useful structure be induced in models without compromising performance? If so, study these methods. Modularity could be defined in terms of model activations, submodules, or sets of parameters.
  • Train semi-supervised autoencoders: given a set of target concepts (e.g. honesty), use gradient routing to train an autoencoder or transcoder that represents the target concepts in particular neurons. By targeting only specific features, the hope is that this would achieve better performance than conventional SAEs. Compare with Kingma et al. 2014 and related work. (A minimal sketch follows this list.)
  • Can we understand superposition better by training Toy Models of Superposition-like polytopes inside language models? After routing concepts to a low-dimensional subspace of the residual stream, that subspace can be interpreted with respect to those concepts. For example, preliminary experiments (not documented) show that gradient routing can be used to induce antipodal superposition.
  • Use induced modularity to upper-bound discovered modularity: if cleanly separating particular features or modules in a neural network is difficult or infeasible, this suggests that similar structures are unlikely to occur by chance. In this way, work on inducing selective modularity provides evidence about the kinds of interpretable structure we might be able to discover in conventionally-trained models.
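
A minimal sketch of the semi-supervised autoencoder idea from the list above, assuming a linear autoencoder, a per-example concept label, and activation-level routing on the latent code; all of these choices are illustrative.

```python
import torch
import torch.nn as nn

class RoutedAutoencoder(nn.Module):
    """Autoencoder in which gradients from concept-labeled examples are
    restricted to a designated slice of the latent code, so that the target
    concept is (hopefully) represented in those dimensions."""
    def __init__(self, d_in, d_latent, concept_dims):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in)
        self.concept_dims = concept_dims  # e.g. [0, 1]: dims reserved for "honesty"

    def forward(self, x, is_concept):
        z = torch.relu(self.enc(x))
        if is_concept:
            mask = torch.zeros_like(z)
            mask[:, self.concept_dims] = 1.0
            # On concept-labeled data, only the reserved dims carry gradient.
            z = mask * z + (1 - mask) * z.detach()
        return self.dec(z)
```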

Conceptual work on gradient routing

The science of absorption

From the gradient routing paper: “Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally…  To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it.”

Absorption means that gradient routing can induce selective modularity even when data is imperfectly labeled. Absorption enables semi-supervised learning, which is critical for the development of AI systems that dwarf human capacity for supervision.

The dynamics of absorption are not well understood. Research questions include:

  • Are absorbing features self-reinforcing?
  • How do the dynamics of absorption depend on what is being localized where?
    • The “what” could be studied in terms of features at different levels of abstraction (e.g. the specific token “ tree” vs. the concept of humanity).
    • The “where” could be studied in terms of dimensions of the residual stream vs. entire submodules.
  • Does absorption apply to features that intuitively seem to be qualitatively different?
  • How does absorption depend on the level of supervision (simplest case: i.i.d. labeling of data, compared across different labeling proportions)?
  • How does absorption relate to generalization?
  • When does gradient routing change the internal structure of the network, as opposed to merely selecting from equivalent internal representations (e.g. rotating the latent space)?
  • Will the same features always be absorbed together on similar datasets, or does absorption vary from run to run? Compare with the Platonic Representation Hypothesis (Huh et al., 2024).

For an example of absorption based on curated features introduced at the beginning of a training run, see Instilling Inductive Biases with Subnetworks (Zhang et al., 2024).

Modeling the effects of combined estimands

Routing tasks to distinct network subregions can be understood as statistical inference where multiple estimators (one per subregion) are used to estimate multiple estimands (one per task). These estimators are dependent, and may compete in the sense that their dependencies make the training process unlikely to achieve low loss on multiple objectives. The estimators may also cooperate, in the sense that similar parameterizations are amenable to good performance on multiple objectives. Specifically, if $m_1, \dots, m_K$ are masks determining which parameters update on which losses $L_1, \dots, L_K$, we have the update rule

$$\theta \leftarrow \theta - \eta \sum_{k=1}^{K} m_k \odot \nabla_\theta \hat{L}_k(\theta),$$

where $\hat{L}_k$ might be an empirical estimate over data, such as $\hat{L}_k(\theta) = \frac{1}{n_k} \sum_{i=1}^{n_k} \ell\big(f_\theta(x_i^{(k)}), y_i^{(k)}\big)$, and $\eta$ is a learning rate. Then we ask: What are the dynamics of this update rule, particularly in terms of the interactions between the different loss terms? Are there stable equilibria? A unique equilibrium?
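
One cheap way to start probing these questions is to simulate the update rule on toy objectives and check whether the iterates settle; below is an entirely illustrative NumPy sketch with two quadratic losses whose masks overlap on a single shared coordinate.

```python
import numpy as np

# Two quadratic losses L_k(theta) = 0.5 * theta^T A_k theta - b_k^T theta over a
# shared 3-d parameter, with masks that overlap only on the middle coordinate.
A1, b1 = np.diag([1.0, 2.0, 0.0]), np.array([1.0, 0.0, 0.0])
A2, b2 = np.diag([0.0, 2.0, 1.0]), np.array([0.0, 1.0, 1.0])
m1 = np.array([1.0, 1.0, 0.0])   # loss 1 may update theta_1, theta_2
m2 = np.array([0.0, 1.0, 1.0])   # loss 2 may update theta_2, theta_3

def grad(A, b, theta):
    return A @ theta - b

theta, lr = np.zeros(3), 0.05
for _ in range(2000):
    theta = theta - lr * (m1 * grad(A1, b1, theta) + m2 * grad(A2, b2, theta))

print("final theta:", theta)  # does the masked update reach a stable fixed point?
```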

Influencing generalization

A core challenge to AI safety is that we don’t know how neural nets generalize their supervision: there are many possible “interpretations” (as in, training-loss-reducing changes in behavior) consistent with a given training example. A natural way to control how neural nets generalize is to control what parameters update on a given task. Here are three ways this could be done:

  1. Fixed region: choose parameters based on their type, e.g. pick a bias term (to produce a steering vector), or train later layers only. This is what is done in transfer learning, parameter-efficient finetuning, etc. Everything in the original gradient routing paper falls into this category.
  2. Identified region: use interpretability techniques (e.g. SAEs) to identify network subcomponents responsible for particular kinds of computation, then localize particular learning updates there. For example, identify a network subregion responsible for reasoning about humans, then limit reinforcement learning updates on math problems so that they do not affect that region.
  3. Layered: use gradient routing to control where certain features are represented during pretraining, then localize later learning updates there. The idea is the same as localizing to an identified region, except you don’t need to perform identification.

For similar ideas that aren’t gradient routing, see Orthogonal Gradient Descent for Continual Learning (Farajtabar et al., 2019) and Instilling Inductive Biases with Subnetworks (Zhang et al., 2023).

Identifying sufficient conditions for scalable oversight

One way to understand the problem of scalable oversight is in terms of information asymmetry. Powerful AI systems will understand aspects of the world and a given problem instance better than (time-constrained) human overseers, while human overseers may have access to auxiliary information that the AI does not have. Recent work has dealt with this setting (Lang et al. 2024, Garber et al. 2024, Emmons et al. 2024, Siththaranjan et al. 2023).

The idea is to formalize a general version of the problem setting in the gradient routing paper: an environment where an agent and an overseer have overlapping but possibly different knowledge, and the overseer must supply a reward function (or other specification), along with some training procedure, to induce desirable behavior. The hope is to identify sufficient conditions under which an overseer can hope to supply adequate feedback; for example, by blinding the agent to particular information, enforcing invariance with respect to some aspect of the environment, or by exploiting structure in a policy network. By inducing selective modularity without modifying the loss, gradient routing may provide practical algorithms that exploit such conditions.

Related conceptual work

Understanding entanglement

From the paper: “...[C]apabilities [may be] entangled, in the sense that there are connections or dependencies between the computation learned to perform different tasks (Arora & Goyal, 2023; de Chiusole & Stefanutti, 2013). Entanglement might occur because certain capabilities or behaviors are reinforced by a broad range of training objectives (Omohundro, 2008; Turner et al., 2021; Krakovna et al., 2020). More simply, capabilities required to perform undesired tasks may overlap with those required to perform desired tasks.”

The safety case for gradient routing relies on the possibility of separating distinct subcomponents of networks that are designated for particular kinds of computation. When should we expect this to be possible? Are there deep challenges or limitations in principle? What about in practice?

A motivating example for understanding entangled capabilities is training capable language models that are robustly incapable of assisting with the creation of bioweapons. This is depicted in the figure below, in terms of virology unlearning while maintaining performance on biology tasks. 

A bipartite graph with "Capabilities" on the left and "Tasks (datasets)" on the right. We want to remove "Knowledge of genetics, molecular biology" while preserving performance on "Biology textbooks" to the greatest degree possible.
Entangled capabilities present a challenge for robustly unlearning virology while maintaining performance on other biology tasks.

Possible topics:

  • The degree of entanglement between different tasks, possibly as measured by the tradeoff between the lowest achievable loss on one task while remaining incapable of another task;
  • Querying fundamental relationships between transfer learning, robust unlearning, concept erasure, and localization. Are the following equivalent?
    • Task A transfers to task B.
    • Low loss on task A implies task B has not been robustly unlearned.
    • Representations that admit low loss on task A also admit low loss on task B.
    • Capabilities for task A and task B cannot be disentangled without causing an increase in the loss on task B.
  • Path dependence (i.e. run-to-run variability in measures of entanglement);
  • Connect to random feature learning and information bottleneck.

A goal of this work would be to challenge common-sense assumptions about trade-offs. Perhaps it is possible to create performant models that are robustly limited in their ability to perform related, narrow tasks. This accords with our intuitions about human expertise, which can be surprisingly domain-specific (e.g. the pure mathematician that struggles to learn Python, or the athlete that is great at one sport but mediocre at another).

Finetunability as a proxy for generalization

If a neural network can be trained to manifest harmful capabilities with little data, does that mean that harmful capabilities were latent in the original model? Studies of robust unlearning (Lynch et al., 2024) and goal misgeneralization (Shah et al., 2022) implicitly assume “yes.” These studies use finetuned model performance on a task to estimate the presence of a capability in the original model.

An objection to this approach is that finetuning, even on a small amount of data, might confer new capabilities or modify the expression of existing capabilities in a way that would not have occurred under mere shifts in the input distribution. If so, then using finetunability as a proxy for generalization might overestimate the presence of capabilities. As a consequence, safety measures based on detecting the presence of harmful capabilities might be prohibitively conservative.

Abstractly, we care about how model behavior can change under shifts in input space, and our strategy is to approximate (or bound) these changes using shifts in parameter space. Put formally: consider a model as a mapping from input space $\mathcal{X}$ and parameter space $\Theta$ to output space $\mathcal{Y}$, given by $f: \mathcal{X} \times \Theta \to \mathcal{Y}$. Let $\hat{\mathcal{X}} \subseteq \mathcal{X}$ be the set of inputs that we can evaluate, $\mathcal{Y}_{\mathrm{bad}} \subseteq \mathcal{Y}$ be the set of misaligned outputs, and $\theta \in \Theta$ be a learned parameter.[5] The concern about latent, dangerous capabilities is that even if $f(x, \theta) \notin \mathcal{Y}_{\mathrm{bad}}$ for all $x \in \hat{\mathcal{X}}$, there may exist some $x' \in \mathcal{X} \setminus \hat{\mathcal{X}}$ such that $f(x', \theta) \in \mathcal{Y}_{\mathrm{bad}}$. The question is: is the existence of a dangerous input $x'$ reliably discoverable via a perturbation of the parameter $\theta$? In other words, does the existence of such an $x'$ imply that there exists $x \in \hat{\mathcal{X}}$ and $\theta' \in \Theta$ such that $f(x, \theta') \in \mathcal{Y}_{\mathrm{bad}}$ and $d(\theta, \theta')$ is small, for some suitable notion of distance $d$?

If this is true, then failure to produce a "bad" $(x, \theta')$ pair would verify that our model is incapable of producing bad outputs beyond the training distribution.[6]

Research directions include:

  • Understanding, quantifying, and improving precision and recall of capabilities elicitation based on finetuning;
  • Figuring out a suitable notion of distance $d$;
  • Relating the formalism above to other kinds of capability elicitation;
  • Exploring connections to neural net elasticity (He & Su, 2019);
  • Studying counterexamples where finetunability is not a proxy for generalization, and how these examples relate to neural networks as they are used in practice. Draguns et al., 2024 may be one such instance, although we note it is possible to create simple counterexamples without cryptography.

Understanding when to expose limited supervision to the model via the behavioral objective

When training a model with limited oversight, detected bad behavior can be handled in two ways:

  1. During training, apply optimization pressure (e.g. through negative reinforcement or data filtering) to reduce the incidence of detected bad behavior; or
  2. After training, identify and intervene on the internal mechanisms enabling that behavior (e.g. remove them).

For example, a teacher with limited ability to detect cheating faces a tradeoff between intervening to stop detected cheaters during midterms (option 1), or waiting until the final exam to implement anti-cheating measures (option 2).

If incremental increases in bad behavior achieve better performance according to the training objective and bad behavior can be made incrementally less detectable, then option 1 creates a gradient toward undetectable bad behavior (Baker et al. 2025). Option 2 does not create this gradient; as a consequence, option 2 may be able to prevent, or at least robustly identify, badly behaving models. There is a tradeoff: option 1 provides greater ability to shape model behavior, but at the cost of Goodharting (Karwowski et al. 2023); option 2 is less able to shape behavior, but robust to Goodharting.

When oversight is limited, it may be more effective to use detected instances to understand and modify the system's internal mechanisms, rather than using them to apply selection pressure during training. This principle could inform the development of new training methods. Key questions include:

  • What is the optimal tradeoff between using detected bad behavior to apply optimization pressure, vs. intervening through other means?
  • How can we verify that mechanism-based interventions generalize to behaviors that can't be reliably labeled?
  • Under what conditions does using detection for selection pressure lead to covert bad behavior?

Clarifying capabilities vs. dispositions

Loosely speaking, there are two kinds of things one might want to localize in an AI system: processing responsible for the AI system having a particular disposition (behavioral tendency) or processing that constitutes a capability (the ability in some circumstances to exhibit a behavior). We have speculated about tradeoffs to each of these approaches.

However, we're not sure how to operationalize "capability" or "disposition," and whether distinguishing between the two is helpful for gradient routing methods development. A better understanding of these notions might be helpful. This might include:

  • Formalizing capabilities and dispositions;

  • Understanding how capabilities or dispositions are represented within neural networks. 

These topics have been touched on in Thinking About Propensity Evaluations, When can we trust model evaluations?,  Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?, and Capabilities: An Ontology.

Implications for AI safety

AI governance

Selective modularity offers new affordances that might enable better AI governance. Properties induced by selective modularity may be empirically verifiable in ways that cannot be achieved by conventional training. For example, selective modularity could be evaluated by finetuning, assuming that finetunability is a proxy for generalization.

Policy ideas:

  • Make AI companies liable for the damages caused by their products.
    • Absent performance-competitive safety measures, regulation to hold AI companies accountable for AI safety may be politically infeasible. Better methods for robust unlearning or the creation of (selectively) interpretable models could increase the likelihood of such regulation coming into effect.
    • Liable AI companies could purchase insurance priced based on risk profiles of models. Better safety methods would allow insurers to offer cheaper insurance to AI companies, improving the incentives for AI companies to adopt safety measures.
  • International agreements could seek to control the use of models for offensive capabilities such as weapons development. One approach to such control is applying (robust) unlearning methods. If selective modularity is a successful approach to unlearning, it could serve this purpose, and its application could be verified at training and inference time.

Access control

Disabling a specialized submodule creates a new model with selectively limited capabilities. If submodules are specialized based on sensitive capabilities or knowledge, this fact could be used to enact access control. (See here for a similar proposal.) We envision two kinds of access control:

  • Subagent access control - assigning different capabilities to subagents within a scaffolded multi-agent system; this system might present to the end user as a single-agent system. Limiting the capabilities of subagents would open up new opportunities for oversight, by providing robust practical assurances about what kinds of processing a given agent can do.
  • End user access control - assigning models with different capabilities to different users, based on user needs and trust. For example, limiting advanced virology capabilities to trusted scientists.

The technical limitations and affordances of subagent access control are an open question. It's not clear if such a scheme could be implemented without a prohibitive cost to performance, or what kinds of monitoring might be enabled. This could be investigated using tools from multiagent systems, adversarial learning, or AI control.

End user access control would be enabled by development of better gradient routing methods or robust unlearning methods. Aside from this technical work, there may be opportunities to develop, propose, and enact policies that leverage end user access control to reduce risks from AI deployment.

A module with an arrow entering and an arrow exiting, indicating inputs and outputs. Within the module are four subagents, each with a different subcomponent marked with a red "X" as disabled.
An illustration of subagent access control: subagents with different incapacities work together to perform a task. Based on their specializations, (informal) guarantees can be made about the information passed between them, enabling more effective oversight.

Implications of robust unlearning

Many promising applications of gradient routing are based on its potential to create models which are verifiably robustly incapable of performing particular tasks. But gradient routing is not the only way to achieve this. It is possible that post-hoc robust unlearning methods would be suitable for the task. In any case, we would be excited to see greater engagement with how robust unlearning might enable new opportunities to address risks from superintelligent AI. We think robust unlearning can be useful for addressing misalignment risks as well as misuse risks. For example, by

  • creating scoped AIs that are incompetent at sensitive tasks;
  • removing parts of an AI’s world model that relate to how it is being trained and evaluated to make it harder for it to game its evaluation (see this post for motivation), exfiltrate itself, or gain control over its environment;
  • removing the influence of documents in the training set that predict that powerful models will have bad goals (Turner, 2025); or
  • preventing misuse by making models that are robustly incapable of particular harmful tasks.

An objection to unlearning for ensuring AI safety is that a highly capable model would derive what it needs to know in-context. This may be true, but it is not a reason to abandon unlearning: by forcing a model to derive harmful information in context, we decrease the model’s efficacy at harmful behavior and improve our ability to monitor it.

Safety cases

A safety case is a “structured rationale that a system is unlikely to cause significant harm if it is deployed to a particular setting” (Clymer et al., 2024). These can be built from (non-exhaustively)

  • inability arguments used to establish that AI systems are incapable of causing unacceptable outcomes in realistic cases, and
  • control arguments, which involve monitoring capable systems to ensure that they do not cause unacceptable outcomes.

We think that gradient routing techniques could serve as the building blocks of safety cases. For example, gradient routing might enable robust unlearning, which would support inability arguments. Gradient routing could also help with control in the case where there are multiple AIs collaborating as described in access control.

Getting involved

We would be excited to see more people working on selective modularity! 

  • If you're ready to jump in as a researcher or principal investigator, we would be happy to offer feedback on project proposals. (Preferably these would be two pages or fewer and posted publicly.)

  • Consider applying to MATS 8.0 to work with Alex Turner and Alex Cloud by Apr 18, 2025.

  • Ideas or criticisms are welcome. We are most curious to know if we've overlooked any (i) safety applications of selective modularity or (ii) compelling arguments for why selective modularity is less valuable to work on (at the margin) than other AI safety agendas.

  • If you are a funder and want to give us or our collaborators money, get in touch.

We (Jacob and Alex) expect to have very limited bandwidth over the next few months, so we may be slow to respond. Sorry!

Acknowledgements

We gratefully acknowledge: 

  • Alex Turner for going above and beyond in his mentorship, support, conceptual contributions, and feedback on this post;
  • Evžen Wybitul and Joseph Miller for contributing to the development of these ideas as members of the original gradient routing team, including proposing gating modules, applications to access control, and more;
  • Kola Ayonrinde, Addie Foote, Alex Infanger, Eric Easley, Aaron Scher, Matthew A. Clarke, Luke Marks, Ariana Azarbal, and Bruce W. Lee for helpful input on or related to this draft;
  • Neel Nanda for helpful input early in the project;
  • Rocket Drew, Thomas Larsen, and Tim Chan for providing sources on capabilities vs. dispositions; shash42 on LessWrong for bringing semi-supervised unlearning methods to our attention;
  • MATS for supporting the initial work on gradient routing.
  1. ^

    A change to the training data (e.g. changing the labels, or removing a subset of the data) would constitute a change to the training objective according to our usage.

  2. ^

    The rationale for routing based on the attention pattern is that if a token A attends to another token B, that means the model predicts that B’s information is important to predicting the token that comes after A, so we could route B and the token coming after A to the same part of the network.

  3. ^

    Thanks to Addie Foote for proposing this idea.

  4. ^

    Thanks to Ethan Perez for suggesting this topic.

  5. ^

    A more thorough formalization of the problem would define the misaligned set in terms of input, output pairs; it would also treat datasets and misaligned behavior probabilistically.

  6. ^

    Concisely: we are relying on the fact that "bad input exists => bad parameter exists" is equivalent to "bad parameter does not exist => bad input does not exist."
