Just read the above post and some of your related posts on model splintering and symbol grounding. Here are some thoughts and comments, also on some of the other posts.
In this post you are considering a type of machine learning where the set of features $F$ in the learned model can be updated, not just the model's probability distribution $Q$. This is neat because it allows you to identify some specific risks associated with model refinements where $F$ changes. In many discussions in the AI alignment community, these risks are associated with the keywords 'symbol grounding' and 'ontological crises', so it is good to have some math that can deconfuse and disentangle the issues.
However, you also link model splintering to out-of-distribution robustness. Specifically, in section 1.1:
In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. These are the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.
[....] 2. What should the AI do if it finds itself strongly out-of-distribution?
and then in section 5 you write:
We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:
- When the AI refines its model, what would count as a natural refactoring of its reward function?
- If the refinements splinter its reward function, what should the AI do?
- If the refinements splinter its reward function, and also splinter the human's reward function, what should the AI do?
Compared to Rohin's comment above, I interpret the strength of this link very differently.
I believe that the link is pretty weak, in that I cannot rephrase the out-of-distribution problems you mentioned as being the same 'if the AI's refinements do X' problems of section 5.
To give a specific example which illustrates my point:
Say that we train a classifier to classify 100x100 pixel 24-bit color pictures as being pictures of either cats or dogs. The $F$ in this example consists of symbols that can identify each possible picture, and the symbols $\text{cat}$ and $\text{dog}$. You can then have a probability distribution $Q$ that gives you $Q(\text{cat} \mid \text{picture})$.
We train the classifier on correctly labeled pictures of black cats and white dogs only. So it learns to classify by looking at the color of the animal.
After training, we move the classifier out-of-distribution by feeding it pictures of white cats, black dogs, cats that look a bit like pandas, etc.
The main observation now is that this last step moves the classifier out-of-distribution. It is not the step of model refinement by the ML system that is causing any out-of-distribution issue here. The classifier is still using the same $F$ and $Q$, but it has definitely moved out-of-distribution in the last step.
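A minimal sketch of the example above, under loud assumptions: each picture is reduced to two hypothetical features (brightness and ear pointiness), and the learned rule is a single brightness threshold. None of these names come from the post; they only stand in for "classifies by looking at the color of the animal".

```python
# Toy stand-in for the cat/dog classifier. Each "picture" is reduced to two
# hypothetical features, (brightness, ear_pointiness), both in [0, 1].
# The features and the threshold rule are illustrative assumptions.

TRAIN = [
    ((0.1, 0.9), "cat"),  # black cat
    ((0.2, 0.8), "cat"),  # black cat
    ((0.9, 0.3), "dog"),  # white dog
    ((0.8, 0.2), "dog"),  # white dog
]


def fit_brightness_threshold(data):
    """Learn one brightness cut-off: anything darker is called a cat."""
    cats = [x[0] for x, label in data if label == "cat"]
    dogs = [x[0] for x, label in data if label == "dog"]
    return (max(cats) + min(dogs)) / 2


def classify(threshold, picture):
    """Apply the learned rule; only the brightness feature is used."""
    return "cat" if picture[0] < threshold else "dog"


threshold = fit_brightness_threshold(TRAIN)

# In distribution: every training picture is classified correctly.
train_accuracy = sum(classify(threshold, x) == y for x, y in TRAIN) / len(TRAIN)

# Out of distribution: a white cat and a black dog. The learned model
# (features and rule) is unchanged; only the inputs have moved.
white_cat = (0.9, 0.9)
black_dog = (0.1, 0.2)
ood_predictions = [classify(threshold, white_cat), classify(threshold, black_dog)]
```

The failure shows up without any change to the model itself, which is the point of the example: the inputs moved, the features and the learned distribution did not.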
So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.
Or I might think of splintering as something that can have two causes: 1) the ML system/agent landing out of distribution, 2) certain updates that machine learning does.
You are considering several metrics of model splintering above: I believe some of them are splintering metrics that would measure both causes. Others only measure cause 2.
As you note, there is an obvious connection between some of your metrics and those used in several RL and especially IRL reward function learning papers. To detect shattering from cause 2), one might use a metric from such a paper even if the paper did not consider cause 2), only cause 1).
Some more general remarks (also targeted at general readers of this comment section who want to get deeper into the field covered by the above post):
In many machine learning systems, from AIXI to most deep neural nets, the set of model features $F$ never changes: the system definition is such that all changes happen inside the model parameters representing $Q$.
Systems where a learned function is represented by a neural net with variable nodes, or by a dynamically constructed causal graph, would more naturally be ones where $F$ might be updated.
Of course, mathematical modeling is very flexible: one can represent any possible system as having a fixed $F$ by shoving all changes it ever makes into $Q$.
As a general observation on building models to show and analyze certain problems: if we construct a machine learning system where $F$ never changes, then we can still produce failure modes that we can interpret as definite symbol grounding problems, or definite cases where the reward function is splintered, according to some metric that measures splintering.
Interpreting such a system as being capable of having an ontological crisis gets more difficult, but if you really want to, you could.
I have recently done some work on modeling AGI symbol grounding failures, and on listing ways to avoid them, see section 10.2 of my paper here. (I have no current plans to also cover this topic in the sequence about the topics in the paper.) I wrote that section 10.2 to be accessible also to people who do not have years of experience with ML math, so in that sense it is similar to what the above post tries to do.
My approach to modeling symbol grounding failure in the paper is similar to that in your blog post here. I model symbol grounding failures in an agent as failures of prediction that might be proven empirically.
In the terminology of this post, in the paper I advance the argument that it would be very good design practice (and that it is a commonly used design practice in ML architectures) to avoid reward function splintering as follows. First, define the reward function $R$ in a way where $R$ references only a subset of symbols $F_R \subseteq F$, where any improved $F'$ made by model refinement still has the same subset $F_R$ inside it. Furthermore, to prevent splintering, this $F_R$ has to be limited to the set of symbols which directly represent a) possible sensor readings of physical sensors connected to the agent compute core, or b) potential commands to physical actuators connected to the agent compute core.
I also mention that in RL architectures with learning on a reward signal, the reward signal is the only sensor reading that one aims to keep symbol grounded always.
In your more recent modeling of symbol grounding errors here, that model strikes me more as being a special case that models symbol mapping failures in translation settings, not the symbol grounding problem we usually worry about in black-box RL agents.
Thanks! Lots of useful insights in there.
> So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.
Why do you think it's important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.
The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
> I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I'm most interested in mitigation options the agent can take itself, when it suspects it's out-of-distribution (and without being turned off, ideally).
OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves.
This is a subcase that can be expressed in your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.
I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like 'ask the supervisor' or 'slow down' or 'avoid going into that part of the state space'.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I'd like to ask you: what is your sense of optimism or pessimism in this area?
> But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation
I'm more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it's not a Boolean yes/no, but a matter of degree.
Planned summary for the Alignment Newsletter:
This post introduces the concept of _model splintering_, which seems to be an overarching problem underlying many other problems in AI safety. This is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.
Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained). For example, we may initially model all gases as ideal gases, defined by their pressure, volume and temperature. However, as we learn more, we may transition to the van der Waals equations, which apply differently to different types of gases, and so an environment like “1 liter of gas at standard temperature and pressure (STP)” now splinters into “1 liter of nitrogen at STP”, “1 liter of oxygen at STP”, etc.
Model splintering can also apply to reward functions: for example, in the past people might have had a reward function with a term for “honor”, but at this point the “honor” concept has splintered into several more specific ideas, and it is not clear how a reward for “honor” should generalize to these new concepts.
The hope is that by analyzing splintering and detecting when it happens, we can solve a whole host of problems. For example, we can use this as a way to detect if we are out of distribution. The full post lists several other examples.
Planned opinion:
I think that the problems of generalization and ambiguity out of distribution are extremely important and fundamental to AI alignment, so I’m glad to see work on them. It seems like model splintering could be a fruitful approach for those looking to take a more formal approach to these problems.
Without having thought too hard about it ...
In the case of humans, it seems like there's some correlation between "feeling surprised and confused by something" vs "model refinement", and likewise some correlation between "feeling torn" and "reward function splintering". Do you agree? Or if not, what are examples where those come apart?
If so, that would be a good sign that we can actually incorporate something like this in a practical AGI. :-)
Also, if this is on the right track, then I guess a corresponding intuitive argument would be: If we have a human personal assistant, then we would want them to act conservatively, ask for help, etc., in situations where they feel surprised and confused by what they observe, and/or situations where they feel torn about what to do next. Therefore we should try to instill a similar behavior in AGIs. I like that intuitive argument—it feels very compelling to me.
In the last few months, I've become convinced that there is a key meta-issue in AI safety; a problem that seems to come up in all sorts of areas.
It's hard to summarise, but my best phrasing would be:
This sprawling post will be presenting examples of model splintering, arguments for its importance, a formal setting allowing us to talk about it, and some uses we can put this setting to.
In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. These are the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.
Humans can often see that the algorithm is out-of-distribution and correct it, because we have a more general distribution in mind than the one the algorithm was trained on.
In these terms, the issues of this post can be phrased as:
Let's build a more general framework. Say that you start with some brilliant idea for AI safety/alignment/effectiveness. This idea is phrased in some (imperfect) model. Then "model splintering" happens when you or the AI move to a new (also imperfect) model, such that the brilliant idea is undermined or underdefined.
Here are a few examples:
In all those cases, there are ways of improving the transition, without needing to go via some idealised, perfect model. We want to define the AI CEO's task in more generality, but we don't need to define this across every possible universe - that is not needed to restrain its behaviour. We need to distinguish any blegg from any rube we are likely to encounter, we don't need to define the platonic essence of "bleggness". For future splinterings - when hedonic happiness splinters, when we get a model of quantum gravity, etc... - we want to know what to do then and there, even if there are future splinterings subsequent to those.
And I think that model splintering is best addressed directly, rather than using methods that go via some idealised perfect model. Most approaches seem to go for approximating an ideal: from AIXI's set of all programs, the universal prior, KWIK ("Knowing what it knows") learning with a full hypothesis class, Active Inverse Reward Design with its full space of "true" reward functions, to Q-learning which assumes any Markov decision process is possible. Then the practical approaches rely on approximating this ideal.
Schematically, we can see $M_\infty$ as the ideal, $M_\infty^t$ as $M_\infty$ updated with information to time $t$, and $M^t$ as an approximation of $M_\infty^t$. Then we tend to focus on how well $M^t$ approximates $M_\infty^t$, and on how $M_\infty^t$ changes to $M_\infty^{t+1}$ - rather than on how $M^t$ relates to $M^{t+1}$; the red arrow here is underanalysed:

But why is focusing on the transition important?
A lot has been written about image recognition programs going "out-of-distribution" (encountering situations beyond their training environment) or succumbing to "adversarial examples" (examples from one category that have the features of another). Indeed, some people have shown how to use labelled adversarial examples to improve image recognition.
You know what this reminds me of? Human moral reasoning. At various points in our lives, we humans seem to have pretty solid moral intuitions about how the world should be. And then, we typically learn more, realise that things don't fit in the categories we were used to (go "out-of-distribution") and have to update. Some people push stories at us that exploit some of our emotions in new, more ambiguous circumstances ("adversarial examples"). And philosophers use similarly-designed thought experiments to open up and clarify our moral intuitions.
Basically, we start with strong moral intuitions on under-defined features, and when the features splinter, we have to figure out what to do with our previous moral intuitions. A lot of developing moral meta-intuitions is about learning how to navigate these kinds of transitions; AIs need to be able to do so too.
Moral realists and moral non-realists agree more than you'd think. In this situation, we can agree on one thing: there is no well-described system of morality that can be "simply" implemented in an AI.
To over-simplify, moral realists hope to discover this moral system, moral non-realists hope to construct one. But, currently, it doesn't exist in an implementable form, nor is there any implementable algorithm to discover/construct it. So the whole idea of approximating an ideal is wrong.
All humans seem to start from a partial list of moral rules of thumb, rules that they then have to extend to new situations. And most humans do seem to have some meta-rules for defining moral improvements, or extensions to new situations.
We don't know perfection, but we do know improvements and extensions. So methods that deal explicitly with that are useful. Those are things we can build on.
Sometimes the AI goes out-of-distribution, and humans can see the error (no, flipping the lego block doesn't count as putting it on top of the other). There are cases when humans themselves go out-of-distribution (see for example siren worlds).
It's useful to have methods available for both AIs and humans in these situations, and to distinguish them. "Genuine human preferences, not expressed in sufficient detail" is not the same as "human preferences fundamentally underdefined".
In the first case, the AI needs more human feedback; in the second case, it needs to figure out a way of resolving the ambiguity, knowing that soliciting feedback is not enough.
Suppose that quantum mechanics is the true underlying physics of the universe, with some added bits to include gravity. If that's true, why would we need a moral theory valid in every possible universe? It would be useful to have that, but would be strictly harder than one valid in the actual universe.
Also, some problems might be entirely avoided. We don't need to figure out the morality of dealing with a willing slave race - if we never encounter or build one in the first place.
So a few degrees of "extend this moral model in a reasonable way" might be sufficient, without needing to solve the whole problem. Or, at least, without needing to solve the whole problem in advance - a successful nanny AI might be built on these kinds of extensions.
In a sort of converse to the previous point, what if the laws of physics are radically different from what we thought - what if, for example, they allow some forms of time-travel, or have some narrative features, or, more simply, what if the agent moves to an embedded agency model? What if hypercomputation is possible?
It's easy to have an idealised version of "all reality" that doesn't allow for these possibilities, so the ideal can be too restrictive, rather than too general. But the model splintering methods might still work, since they deal with transitions, not ideals.
Note that, in retrospect, we can always put this in a Bayesian framework, once we have a rich enough set of environments and update rules. But this is misleading: the key issue is the missing feature, and figuring out what to do with the missing feature is the real challenge. The fact that we could have done this in a Bayesian way if we already knew that feature, is not relevant here.
Assume the blegg and rube classifier is an industrial robot performing a task. If humans filter out any atypical bleggs and rubes before it sees them, then the robot has no need for a full theory of bleggness/rubeness.
But what if the human filtering is not perfect? Then the classifier still doesn't need a full theory of bleggness/rubeness; it needs methods for dealing with the ambiguities it actually encounters.
Some ideas for AI control - low impact, AI-as-service, Oracles, ... - may require dealing with some model splintering, some ambiguity, but not the whole amount.
Some methods, like quantilizers or the pessimism approach rely on the algorithm having a certain degree of conservatism. But, as I've argued, it's not clear to what extent these methods actually are conservative, nor is it easy to calibrate them in a useful way.
Model splintering situations provide excellent points at which to be conservative. Or, for algorithms that need human feedback, but not constantly, these are excellent points to ask for that feedback.
Generally speaking, idealised methods can't capture model splintering at the point we would want it to. Imagine an ontological crisis, as we move from classical physics to quantum mechanics.
AIXI can go over the transition fine: it shifts from a Turing machine mimicking classical physics observations, to one mimicking quantum observations. But it doesn't notice anything special about the transition: changing the probability of various Turing machines is what it does with observations in general; there's nothing in its algorithm that shows that something unusual has occurred for this particular shift.
This could be seen as a sub-point of some of the previous two sections, but it deserves to be flagged explicitly, since iterated amplification and distillation is one of the major potential routes to AI safety.
To quote a line from that summary post:
- The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scaled up safely again, and to iterate the process until we have a fast and powerful AI.
At both "scaling up an AI's capabilities", and "distill this into", we can ask the question: has the problem the AI is working on changed? The distillation step is more of a classical AI safety issue, as we wonder whether the distillation has caused any value drift. But at the scaling up or amplification step, we can ask: since the AI's capabilities have changed, the set of possible environments it operates in has changed as well. Has this caused a splintering where the previously safe goals of the AI have become dangerous?
Detecting and dealing with such a splintering could both be useful tools to add to this method.
At a meta level, most problems in AI safety seem to be variants of model splintering, including:
Almost every recent post I've read in AI safety, I've been able to connect back to this central idea. Now, we have to be cautious - cure-alls cure nothing, after all, so it's not necessarily a positive sign that everything seems to fit into this framework.
Still, I think it's worth diving into this, especially as I've come up with a framework that seems promising for actually solving this issue in many cases.
In a similar concept-space is Abram's orthodox case against utility functions, where he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events without needing full worlds at all.
This post is dedicated to explicitly modelling the transition to ambiguity, and then showing what we can gain from this explicit meta-modelling. It will do so with some formal language (made fully formal in this post), and a lot of examples.
Just as Scott argues that if it's worth doing, it's worth doing with made up statistics, I'd argue that if an idea is worth pursuing, it's worth pursuing with an attempted formalism.
Formalisms are great at illustrating the problems, clarifying ideas, and making us familiar with the intricacies of the overall concept. That's the reason that this post (and the accompanying technical post) will attempt to make the formalism reasonably rigorous. I've learnt a lot about this in the process of formalisation.
What do we mean by a model? Do we mean mathematical model theory? Are we talking about causal models, or causal graphs? AIXI uses a distribution over possible Turing machines, whereas Markov Decision Processes (MDPs) see states and actions updating stochastically, independently at each time-step. Unlike the previous two, Newtonian mechanics doesn't use time-steps but continuous time, while general relativity weaves time into the structure of space itself.
And what does it mean for a model to make "predictions"? AIXI and MDPs make predictions over future observations, and causal graphs are similar. We can also try running them in reverse, "predicting" past observations from current ones. Mathematical model theory talks about properties and the existence or non-existence of certain objects. Ideal gas laws make a "prediction" of certain properties (eg temperature) given certain others (eg volume, pressure, amount of substance). General relativity establishes that the structure of space-time must obey certain constraints.
It seems tricky to include all these models under the same meta-model formalism, but it would be good to do so. That's because of the risk of ontological crises: we want the AI to be able to continue functioning even if the initial model we gave it was incomplete or incorrect.
All of the models mentioned above share one common characteristic: once you know some facts, you can deduce some other facts (at least probabilistically). A prediction of the next time step, a retrodiction of the past, a deduction of some properties from others, or a constraint on the shape of the universe: all of these say that if we know some things, then this puts constraints on some other things.
So let's define $F$, informally, as the set of features of a model. This could be the gas pressure in a room, a set of past observations, the local curvature of space-time, the momentum of a particle, and so on.
So we can define a prediction as a probability distribution over a set of possible features $F_1$, given a base set of features, $F_2$:

$$Q(F_1 \mid F_2).$$
Do we need anything else? Yes, we need a set of possible environments for which the model is (somewhat) valid. Newtonian physics fails at extreme energies, speeds, or gravitational fields; we'd like to include this "domain of validity" in the model definition. This will be very useful for extending models, or transitioning from one model to another.
You might be tempted to define a set of "worlds" on which the model is valid. But we're trying to avoid that, as the "worlds" may not be very useful for understanding the model. Moreover, we don't have special access to the underlying reality; so we never know whether there actually is a Turing machine behind the world or not.
So define $E$, the environment on which the model is valid, as a set of possible features. So if we want to talk about Newtonian mechanics, $F$ would be a set of Newtonian features (mass, velocity, distance, time, angular momentum, and so on) and $E$ would be the set of these values where relativistic and quantum effects make little difference.
So see a model as

$$M = \{F, E, Q\},$$

for $F$ a set of features, $E$ a set of environments, and $Q$ a probability distribution. This is such that, for $E_1, E_2 \subset E$, we have the conditional probability:

$$Q(E_1 \mid E_2).$$

Though $Q$ is defined for $E$, we generally want it to be usable from small subsets of the features: so $Q$ should be simple to define from $F$. And we'll often define the subsets $E_i$ in similar ways; so $E_1$ might be all environments with a certain angular momentum at time $t = 0$, while $E_2$ might be all environments with a certain angular momentum at a later time.
The full formal definition of these can be found here. The idea is to have a meta-model of modelling that is sufficiently general to apply to almost all models, but not one that relies on some ideal or perfect formalism.
It's very easy to include Bayesian models within this formalism. If we have a Bayesian model that includes a set of worlds $W$ with prior $P$, then we merely have to define a set of features $F$ that is sufficient to distinguish all worlds in $W$: each world is uniquely defined by its feature values[1]. Then we can define $E$ as $W$, and $P$ on $W$ becomes $Q$ on $E$; the definition of terms like $Q(E_1 \mid E_2)$ is just $P(E_1 \cap E_2)/P(E_2)$, per Bayes' rule (unless $P(E_2) = 0$, in which case we set that to $0$).
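As a hedged illustration of this embedding, here is a small sketch with three made-up worlds, each identified by two hypothetical features (weather and ground state). The prior values, feature names and function names are all assumptions for the example; the conditional probability is the Bayes ratio described above, with the zero-probability case set to zero.

```python
from fractions import Fraction

# Hypothetical toy Bayesian model: three worlds with a prior.
# Each world is identified by its feature values (weather, ground), so the
# features are sufficient to distinguish all worlds.
PRIOR = {
    ("rain", "wet"): Fraction(3, 10),
    ("sun", "wet"): Fraction(1, 10),
    ("sun", "dry"): Fraction(6, 10),
}


def prior_prob(event):
    """Prior probability of an event, i.e. a set of worlds."""
    return sum(PRIOR[w] for w in event)


def conditional(e1, e2):
    """Conditional probability of e1 given e2; 0 if e2 has prior 0."""
    if prior_prob(e2) == 0:
        return Fraction(0)
    return prior_prob(e1 & e2) / prior_prob(e2)


# Environments are just the worlds; events are subsets picked out by features.
environments = frozenset(PRIOR)
rainy = frozenset(w for w in environments if w[0] == "rain")
wet = frozenset(w for w in environments if w[1] == "wet")

rain_given_wet = conditional(rainy, wet)
rain_given_nothing = conditional(rainy, frozenset())
```

Exact fractions are used so that the ratio can be checked by hand: the wet worlds have total prior 4/10, of which 3/10 is rainy.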
This section will look at what we can do with the previous meta-model, looking at refinement (how models can improve) and splintering (how improvements to the model can make some well-defined concepts less well-defined).
Informally, $M^* = \{F^*, E^*, Q^*\}$ is a refinement of model $M = \{F, E, Q\}$ if it's at least as expressive as $M$ (it covers the same environments) and is better according to some criteria (simpler, or more accurate in practice, or some other measurement).
At the technical level, we have a map $q$ from a subset $E_0^*$ of $E^*$, that is surjective onto $E$. This covers the "at least as expressive" part: every environment in $E$ exists as (possibly multiple) environments in $E^*$.
Then note that using $q^{-1}$ as a map from subsets of $E$ to subsets of $E_0^*$, we can define $Q_0^*$ on $E$ via:

$$Q_0^*(E_1 \mid E_2) = Q^*\left(q^{-1}(E_1) \mid q^{-1}(E_2)\right).$$

Then this is a model refinement if $Q_0^*$ is 'at least as good as' $Q$ on $E$, according to our criteria[2].
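The pullback construction can be sketched in a few lines. Everything concrete here is an assumption for illustration: the refined environments are (colour, shape) pairs with made-up probabilities, and the projection simply forgets the shape. The induced coarse probability of an event is the refined probability of its preimage.

```python
from fractions import Fraction

# Hypothetical refined model: environments are (colour, shape) pairs with
# made-up probabilities standing in for the refined distribution.
REFINED = {
    ("red", "cube"): Fraction(1, 2),
    ("blue", "egg"): Fraction(3, 10),
    ("blue", "cube"): Fraction(1, 5),
}


def project(env):
    """Map a refined environment to a coarse one by forgetting the shape."""
    return env[0]


def preimage(coarse_event):
    """All refined environments that project into the coarse event."""
    return frozenset(e for e in REFINED if project(e) in coarse_event)


def refined_prob(event):
    return sum(REFINED[e] for e in event)


def induced(e1, e2):
    """Coarse conditional probability induced by the refined model:
    the refined probability of preimage(e1) given preimage(e2)."""
    a, b = preimage(e1), preimage(e2)
    if refined_prob(b) == 0:
        return Fraction(0)
    return refined_prob(a & b) / refined_prob(b)


# The projection is surjective onto this coarse environment set.
coarse = frozenset(project(e) for e in REFINED)
blue_prob = induced(frozenset({"blue"}), coarse)
```

Note that the coarse environment "blue" corresponds to two refined environments, which is the "possibly multiple" case; whether the induced distribution counts as a refinement still depends on the chosen quality criterion, which this sketch does not model.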
This post presents some subclasses of model refinement, including $Q$-improvements (same features, same environments, just a better $Q$), or adding new features to a basic model, called "non-independent feature extension" (eg adding classical electromagnetism to Newtonian mechanics).
Here's a specific gas law illustration. Let $M = \{F, E, Q\}$ be a model of an ideal gas, in some set of rooms and tubes. The $F$ consists of pressure, volume, temperature, and amount of substance, and $Q$ is the ideal gas laws. The $E$ is the standard conditions for temperature and pressure, where the ideal gas law applies. There are multiple different types of gases in the world, but they all roughly obey the same laws.
Then compare with model $M^* = \{F^*, E^*, Q^*\}$. The $F^*$ has all the features of $F$, but also includes the volume that is occupied by one mole of the molecules of the given substance. This allows $Q^*$ to express the more complicated van der Waals equations, which are different for different types of gases. The $E^*$ can now track situations where there are gases with different molar volumes, which include situations where the van der Waals equations differ significantly from the ideal gas laws.
In this case $E^* \neq E$, since we now distinguish environments that we previously considered identical (environments with the same features except for having different molar volumes). The $q$ is just projecting down by forgetting the molar volume. Then since $Q_0^*$ (van der Waals equations averaged over the distribution of molar volumes) is at least as accurate as $Q$ (ideal gas law), this is a refinement.
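To make the gas example concrete, here is a rough numerical sketch comparing the two laws. Two assumptions are worth flagging: the van der Waals equation also needs an attraction constant `a` in addition to the excluded molar volume `b` that the text mentions, and the constants below are approximate textbook values, not taken from the post.

```python
R_GAS = 0.083145  # gas constant in L·bar/(mol·K)


def ideal_gas_pressure(n, volume, temperature):
    """Pressure in bar from the ideal gas law P = nRT / V."""
    return n * R_GAS * temperature / volume


def van_der_waals_pressure(n, volume, temperature, a, b):
    """Pressure in bar from (P + a n^2 / V^2)(V - n b) = nRT."""
    return n * R_GAS * temperature / (volume - n * b) - a * n**2 / volume**2


# Approximate textbook constants: a in L^2·bar/mol^2, b in L/mol.
# These values are assumptions for illustration.
VDW_CONSTANTS = {
    "nitrogen": (1.370, 0.0387),
    "carbon dioxide": (3.640, 0.04267),
}


def relative_gap(gas, n, volume, temperature):
    """Relative difference between the two laws for one gas and state."""
    a, b = VDW_CONSTANTS[gas]
    ideal = ideal_gas_pressure(n, volume, temperature)
    real = van_der_waals_pressure(n, volume, temperature, a, b)
    return abs(real - ideal) / ideal


# One mole at 273.15 K: roughly standard conditions vs strongly compressed.
near_stp = {g: relative_gap(g, 1.0, 22.7, 273.15) for g in VDW_CONSTANTS}
compressed = {g: relative_gap(g, 1.0, 0.5, 273.15) for g in VDW_CONSTANTS}
```

Near standard conditions the two laws agree to within a percent for both gases, so forgetting the gas-specific constants loses little; in the compressed state the gases come apart from the ideal law and from each other, which is where the coarse environment splinters.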
Let's reuse Eliezer's example of rubes ("red cubes") and bleggs ("blue eggs").
Bleggs are blue eggs that glow in the dark, have a furred surface, and are filled with vanadium. Rubes, in contrast, are red cubes that don't glow in the dark, have a smooth surface, and are filled with palladium:

Define $M$ by having $F = \{\text{red}, \text{smooth}\}$; $E$ is the set of all bleggs and rubes in some situation, and $Q$ is relatively trivial: it predicts that an object is red/blue if and only if it is smooth/furred.
Define $M^1$ as a refinement of $M$, by expanding $F$ to $F^1 = \{\text{red}, \text{smooth}, \text{cube}, \text{dark}\}$. The projection $q: E^1 \to E$ is given by forgetting about those two last features. The $Q^1$ is more detailed, as it now connects red-smooth-cube-dark together, and similarly for blue-furred-egg-glows.
Note that $E^1$ is larger than $E$, because it includes, e.g., environments where the cube objects are blue. However, all these extra environments have probability zero.
Let $R$ be a reward function on $M$ (by which we mean that $R$ is defined on $F$, the set of features in $M$), and $M^*$ a refinement of $M$.
A refactoring of $R$ for $M^*$ is a reward function $R^*$ on the features $F^*$ such that for any $e^* \in E_0^*$, $R^*(e^*) = R(q(e^*))$.
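For example, let $M$ and $M^1$ be from the rube/blegg models in the previous section. Let $R_{\text{red}}$ on $M$ simply count the number of rubes - or, more precisely, count the number of objects to which the feature "red" applies.
Let $R^1_{\text{red}}$ be the reward function that counts the number of objects in $M^1$ to which "red" applies. It's clearly a refactoring of $R_{\text{red}}$.
But so is $R^1_{\text{smooth}}$, the reward function that counts the number of objects in $M^1$ to which "smooth" applies. In fact, the following is a refactoring of $R_{\text{red}}$, for all $\alpha + \beta + \gamma + \delta = 1$:

$$\alpha R^1_{\text{red}} + \beta R^1_{\text{smooth}} + \gamma R^1_{\text{cube}} + \delta R^1_{\text{dark}}.$$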
There are also some non-linear combinations of these features that refactor , and many other variants (like the strange combinations that generate concepts like grue and bleen).
Model splintering, in the informal sense, is what happens when we pass to a new model in a way that the old features (or a reward function defined by the old features) no longer apply. It is similar to the web of connotations breaking down, an agent going out of distribution, or the definitions of Rube and Blegg falling apart.
So, note that in the rube/blegg example, is not a splintering of : all the refactorings are the same on all bleggs and rubes - hence on all elements of of non-zero probability.
We can even generalise this a bit. Let's assume that "red" and "blue" are not totally uniform; there exist some rubes that are "reddish-purple", while some bleggs are "blueish-purple". Then let be like , except the colour feature can have four values: "red", "reddish-purple", "blueish-purple", and "blue".
Then, as long as rubes (defined, in this instance, by being smooth-dark-cubes) are either "red" or "reddish-purple", and the bleggs are "blue" or "blueish-purple", then all refactorings of to agree - because, on the test environment, on perfectly matches up with on .
So adding more features does not always cause splintering.
The preliminary definition runs into trouble when we add more objects to the environments. Define as being the same as , except that contains one extra object, ; apart from that, the environments typically have a billion rubes and a trillion bleggs.
Suppose is a "furred-rube", i.e. a red-furred-dark-cube. Then and are two different refactorings of , that obviously disagree on any environment that contains . Even if the probability of is tiny (but non-zero), then splinters .
But things are worse than that. Suppose that is fully a rube: red-smooth-cube-dark, and even contains palladium. Define as counting the number of red objects, except for specifically (again, this is similar to the grue and bleen arguments against induction).
Then both and are refactorings of , so still splinters , even when we add another exact copy of the elements in the training set. Or even if we keep the training set for a few extra seconds, or add any change to the world.
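The two failure cases above can be sketched in a few lines. Everything here (the `id` field, the helper names, the use of "largest disagreement between candidate rewards" as a splintering signal) is an illustrative assumption, not the post's formal definition.

```python
def count(env, feature, value, skip_id=None):
    """Count objects with obj[feature] == value, optionally skipping one id."""
    return sum(
        1 for obj in env
        if obj[feature] == value and obj["id"] != skip_id
    )


def make(features, n, prefix):
    """Build n copies of an object type, each with a unique id."""
    return [dict(features, id=f"{prefix}{i}") for i in range(n)]


RUBE = {"colour": "red", "texture": "smooth"}
BLEGG = {"colour": "blue", "texture": "furred"}
training = make(RUBE, 3, "rube") + make(BLEGG, 4, "blegg")


def r_red(env):
    return count(env, "colour", "red")


def r_smooth(env):
    return count(env, "texture", "smooth")


def r_red_except(env):
    """Grue-like reward: counts red objects, except the one called o_plus."""
    return count(env, "colour", "red", skip_id="o_plus")


def spread(env, rewards):
    """Largest disagreement between candidate rewards on env."""
    values = [r(env) for r in rewards]
    return max(values) - min(values)


furred_rube = {"colour": "red", "texture": "furred", "id": "o_plus"}
exact_rube = dict(RUBE, id="o_plus")

print(spread(training, [r_red, r_smooth, r_red_except]))
print(spread(training + [furred_rube], [r_red, r_smooth]))
print(spread(training + [exact_rube], [r_red, r_red_except]))
```

The first spread is zero, while the other two are non-zero: a single extra object, even an exact rube, is enough for some pair of refactorings to disagree. That is why the definition needs a notion of "natural" refactoring.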
So, for any a refinement of , and a reward function on , let's define "natural refactorings" of :
This leads to a full definition of splintering:
Notice the whole host of caveats and weaselly terms here; , "simply" (used twice), and . Simply might mean algorithmic simplicity, but and are measures of how much "error" we are willing to accept in these refactorings. Given that, we probably want to replace and with some measure of non-equality, so we can talk about the "degree of naturalness" or the "degree of splintering" of some refinement and reward function.
Note also that:
An easy example: it makes a big difference whether a new feature is "temperature", or "divergence from standard temperatures".
The concept of "reward refactoring" is transitive, but the concept of "natural reward refactoring" need not be.
For example, let be a training environment where red/blue coincides with cube/egg, and be a general environment where red/blue is independent of cube/egg. Let be a feature set with only red/blue, and a feature set with red/blue and cube/egg.
Then define as using in the training environment, as using in the general environment; and are defined similarly.
For these models, and are both refinements of , while is a refinement of all three other models. Define as the "count red objects" reward on . This has a natural refactoring to on , which counts red objects in the general environment.
And has a natural refactoring to on , which still just counts the red objects in the general environment.
But there is no natural refactoring from directly to . That's because, from 's perspective, on might be counting red objects, or might be counting cubes. This is not true for on , which is clearly only counting red objects.
Thus when a reward function comes from a training environment, we'd want our AI to look for splinterings directly from a model of the training environment, rather than from previous natural refactorings.
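A toy version of this example, using single-object worlds for brevity, shows why the starting point matters. The world encoding and the two candidate rewards are assumptions made for illustration, and "natural" is simplified here to "reproduces the original reward on every world considered".

```python
from itertools import product

COLOURS, SHAPES = ("red", "blue"), ("cube", "egg")

# Each world is a single (colour, shape) object.
general_worlds = list(product(COLOURS, SHAPES))
training_worlds = [("red", "cube"), ("blue", "egg")]


def project(world):
    """Forget the shape, keeping only colour (the coarser feature set)."""
    return world[0]


def r_original(colour):
    """The "count red objects" reward on the colour-only feature set."""
    return 1 if colour == "red" else 0


# Candidate rewards on the richer (colour, shape) feature set.
candidates = {
    "count_red": lambda w: 1 if w[0] == "red" else 0,
    "count_cubes": lambda w: 1 if w[1] == "cube" else 0,
}


def refactorings(worlds):
    """Names of candidates that reproduce r_original on all given worlds."""
    return sorted(
        name for name, r in candidates.items()
        if all(r(w) == r_original(project(w)) for w in worlds)
    )


print(refactorings(training_worlds))  # both candidates survive
print(refactorings(general_worlds))   # only count_red survives
```

Judged against the training worlds alone, "count red" and "count cubes" are indistinguishable. The ambiguity is only visible if the comparison is made directly from the training model.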
We can also talk about splintering features and models themselves. For , the easiest way is to define a reward function as being the indicator function for feature being in the set .
Then a refinement splinters the feature if it splinters some .
The refinement splinters the model if it splinters at least one of its features.
For example, if is Newtonian mechanics, including "total rest mass" and is special relativity, then will splinter "total rest mass". Other examples of feature splintering will be presented in the rest of this post.
A reward function developed in some training environment will ignore any feature that is always present or always absent in that environment. This allows very weird situations to come up, such as training an AI to distinguish happy humans from sad humans, and it ending up replacing humans with humanoid robots (after all, both happy and sad humans were equally non-robotic, so there's no reason not to do this).
Let's try and do better than that. Assume we have a model , with a reward function defined on ( and can be seen as the training data).
Then the feature-preserving reward function , is a function that constrains the environments to have similar feature distributions as and . There are many ways this could be defined; here's one.
For an element , just define
Obviously, this can be improved; we might want to coarse-grain , grouping together similar worlds, and possibly bounding this below to avoid singularities.
Then we can use this to get the feature-preserving version of , which we can define as
for the maximal value of on . Other options can work as well, such as for some constant .
Then we can ask an AI to use as its reward function, refactoring that, rather than .
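As a rough sketch of the idea, one can penalise worlds that the training model considers improbable. The additive form below, the log-probability penalty, the weight `alpha`, and the `floor` are all assumptions chosen for illustration; they are one possible instance of the "other options" mentioned above, not necessarily the post's exact definition.

```python
import math


def feature_penalty(q, floor=1e-6):
    """Log-probability of a world under the training model.

    `floor` bounds the penalty below, avoiding a singularity at q == 0.
    """
    return math.log(max(q, floor))


def feature_preserving_reward(reward, q, alpha=1.0, floor=1e-6):
    """Original reward plus a weighted penalty for atypical worlds."""
    return reward + alpha * feature_penalty(q, floor)


# A typical world with moderate reward, versus a world the training model
# gives probability zero but which scores higher on the raw reward.
typical = feature_preserving_reward(reward=10.0, q=0.5)
weird = feature_preserving_reward(reward=12.0, q=0.0)

print(typical, weird)
```

With these numbers the typical world is preferred despite its lower raw reward, which is the intended effect; raising `floor` or lowering `alpha` relaxes the constraint.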
The is almost certainly too restrictive to be of use. For example, if time is a feature, then this will fall apart when the AI has to do something after the training period. If all the humans in a training set share certain features, humans without those features will be penalised.
There are at least two things we can do to improve this. The first is to include more positive and negative examples in the training set; for example, if we include humans and robots in our training set - as positive and negative examples, respectively - then this difference will show up in directly, so we won't need to use too much.
Another approach would be to explicitly allow certain features to range beyond their typical values in , or allow highly correlated variables explicitly to decorrelate.
For example, though training during a time period to , we could explicitly allow time to range beyond these values, without penalty. Similarly, if a medical AI was trained on examples of typical healthy humans, we could decorrelate functioning digestion from brain activity, and get the AI to focus on the second[3].
This has to be done with some care, as adding more degrees of freedom adds more ways for errors to happen. I'm aiming to look further at this issue in later posts.
We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:
The rest of this post is applying this basic framework, and its basic insights, to various common AI safety problems and analyses. This section is not particularly structured, and will range widely (and wildly) across a variety of issues.
Let's go back to the blegg and rube examples. A human supervises an AI in a training environment, labelling all the rubes and bleggs for it.
The human is using a very simple model, with the only feature being the colour of the object, and being the training environment.
Meanwhile the AI, having more observational abilities and no filter as to what can be ignored, notices their colour, their shape, their luminance, and their texture. It doesn't know , but is using model , where covers those four features (note that is a refinement of , but that isn't relevant here).

Suppose that the AI is trained to be a rube-classifier (and hence a blegg classifier by default). Let be the reward function that counts the number of objects, with feature , that the AI has classified as rubes. Then the AI could learn many different reward functions in the training environment; here's one:
Note that, even though this gets the colour reward completely wrong, this reward matches up with the human's assessment on the training environment.
Now the AI moves to the larger testing environment , and refines its model minimally to (extending to in the obvious way).
In , the AI sometimes encounters objects that it can only see through their colour. Will this be a problem, since the colour component of is pointing in the wrong direction?
No. It still has , and can deduce that a red object must be cube-smooth-dark, so will continue treating this as a rube[4].
Now imagine the AI learns about the content of the rubes and bleggs, and so refines to a new model that includes vanadium/palladium as a feature in .
Furthermore, in the training environment, all rubes have palladium and all bleggs have vanadium in them. So, for a refinement of , has only palladium-rubes and vanadium-bleggs. But in , the full environment, there are rather a lot of rubes with vanadium and bleggs with palladium.
So, similarly to section 4.7, there is no natural refactoring of the rube/blegg reward in , to . That's because , the feature set of , includes vanadium/palladium which co-vary with the other rube/blegg features on the training environment ($q^{-1}(\mathcal{E}_{AI}^1)$), but not on the full environment of .
So looking for reward splintering from the training environment is a way of detecting going out-of-distribution - even on features that were not initially detected in the training distribution, by either the human or the AI.
Some of the most promising AI safety methods today rely on getting human feedback[5]. Since human feedback is expensive, as in it's slow and hard to get compared with almost all other aspects of algorithms, people want to get this feedback in the most efficient ways possible.
A good way of doing this would be to ask for feedback when the AI's current reward function splinters, and multiple options are possible.
A more rigorous analysis would look at the value of information, expected future splinterings, and so on. This is what they do in Active Inverse Reinforcement Learning; the main difference is that AIRL emphasises an unknown reward function with humans providing information, while this approach sees it more as a known reward function over uncertain features (or over features that may splinter in general environments).
I argued that many "conservative" AI optimising approaches, such as quantilizers and pessimistic AIs, don't have a good measure of when to become more conservative; their parameters and don't encode useful guidelines for the right degree of conservatism.
In this framework, the alternative is obvious: AIs should become conservative when their reward functions splinter (meaning that the reward function compatible with the previous environment has multiple natural refactorings), and very conservative when they splinter a lot.
This design is very similar to Inverse Reward Design. In that situation, the reward signal in the training environment is taken as information about the "true" reward function. Basically they take all reward functions that could have given the specific reward signals, and assume the "true" reward function is one of them. In that paper, they advocate extreme conservatism at that point, by optimising the minimum of all possible reward functions.
The idea here is almost the same, though with more emphasis on "having a true reward defined on uncertain features". Having multiple contradictory reward functions compatible with the information, in the general environment, is equivalent to having a lot of splintering of the training reward function.
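The "optimise the minimum of all possible reward functions" rule can be sketched as follows. The action encoding and the two candidate rewards are hypothetical; the only thing taken from the text is the maximin choice over rewards that are compatible with training.

```python
def maximin_action(actions, candidate_rewards):
    """Pick the action whose worst-case candidate reward is highest."""
    return max(actions, key=lambda a: min(r(a) for r in candidate_rewards))


# Hypothetical actions, described by how many red and smooth objects result.
actions = [
    {"name": "build_genuine_rubes", "red": 5, "smooth": 5},
    {"name": "paint_bleggs_red", "red": 9, "smooth": 0},
]


def r_red(action):
    return action["red"]


def r_smooth(action):
    return action["smooth"]


# Both rewards matched the training signal, so both stay in the candidate set.
cautious = maximin_action(actions, [r_red, r_smooth])
naive = maximin_action(actions, [r_red])

print(cautious["name"], naive["name"])
```

An agent committed to a single refactoring happily paints bleggs red; the maximin agent only takes actions that score well under every refactoring still compatible with training.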
The post "By default, avoid ambiguous distant situations" can be rephrased as: let be a model in which we have a clear reward function , and let be a refinement of this to general situations. We expect that this refinement splinters . Let be like , except with smaller than , defined such that:
Then that post can be summarised as:
Stuart Russell writes:
A system that is optimizing a function of variables, where the objective depends on a subset of size , will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
The approach in sections 4.9 and 4.10 explicitly deals with this.
Now consider two agents doing a rube/blegg classifications task in the training environment; each agent only models two of the features:

Despite not having a single feature in common, both agents will agree on what bleggs and rubes are, in the training environment. And when refining to a fuller model that includes all four (or five) of the key features, both agents will agree as to whether a natural refactoring is possible or not.
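A small sketch of the two-agent situation. Which two features each agent models is an assumption here (one agent sees colour and shape, the other texture and luminance), as are the helper names.

```python
RUBE = {"colour": "red", "shape": "cube", "texture": "smooth", "glows": False}
BLEGG = {"colour": "blue", "shape": "egg", "texture": "furred", "glows": True}


def classify(obj, rube_signature):
    """Label obj using only the features named in rube_signature."""
    matches = [obj[f] == v for f, v in rube_signature.items()]
    if all(matches):
        return "rube"
    if not any(matches):
        return "blegg"
    return "unclear"


# Assumed split: the agents share no features.
AGENT_A = {"colour": "red", "shape": "cube"}
AGENT_B = {"texture": "smooth", "glows": False}

training = [RUBE, BLEGG, RUBE, BLEGG]
labels_a = [classify(o, AGENT_A) for o in training]
labels_b = [classify(o, AGENT_B) for o in training]
print(labels_a == labels_b)  # agreement on the training environment

# Outside training, an object can mix the two feature clusters.
odd = {"colour": "red", "shape": "cube", "texture": "furred", "glows": False}
print(classify(odd, AGENT_A), classify(odd, AGENT_B))
```

On the training objects the two labelings coincide even though no feature is shared. The mixed object is where a fuller model would show that the categories no longer line up.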
This can be used to help define the limits of interpretability. The AI can use its own model, and its own designed features, to define the categories and rewards in the training environment. These need not be human-parsable, but we can attempt to interpret them in human terms. And then we can give this interpretation to the AI, as a list of positive and negative examples of our interpretation.
If we do this well, the AI's own features and our interpretation will match up in the training environment. But as we move to more general environments, these may diverge. Then the AI will flag a "failure of interpretation" when its refactoring diverges from a refactoring of our interpretation.
For example, if we think the AI detects pandas by looking for white hair on the body, and black hair on the arms, we can flag lots of examples of pandas and that hair pattern (and non-pandas and unusual hair patterns). We don't use these examples for training the AI, just to confirm that, in the training environment, there is a match between "AI-thinks-they-are-pandas" and "white-hair-on-bodies-black-hair-on-arms".
But, in an adversarial example, the AI could detect that, while it is detecting gibbons, this no longer matches up with our interpretation. A splintering of interpretations, if you want.
The approach can also be used to detect wireheading. Imagine that the AI has various detectors that allow it to label what the features of the bleggs and rubes are. It models the world with ten features: five features representing the "real world" versions of the features, and five representing the "this signal comes from my detector" versions.
This gives a total of ten features, the five features "in the real world" and the five "AI-labelled" versions of these:

In the training environment, there was full overlap between these features, so the AI might learn the incorrect "maximise my labels/detector signal" reward.
However, when it refines its model to all features and environments where labels and underlying reality diverge, it will realise that this splinters the reward, and thus detect a possible wireheading. It could then ask for more information, or have an automated "don't wirehead" approach.
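A minimal sketch of this check, restricted to the colour feature. The field names `real_colour` and `label_colour` and the function names are hypothetical.

```python
def count_red(env, key):
    """Count objects whose value under `key` is "red"."""
    return sum(1 for obj in env if obj[key] == "red")


def wireheading_suspected(env):
    """True when the "real" reward and the "detector" reward come apart."""
    return count_red(env, "real_colour") != count_red(env, "label_colour")


# In training, the detector labels always match the underlying reality.
training = [
    {"real_colour": "red", "label_colour": "red"},
    {"real_colour": "blue", "label_colour": "blue"},
]

# In a wider environment, a detector signal can be changed without
# changing the object itself.
tampered = training + [{"real_colour": "blue", "label_colour": "red"}]

print(wireheading_suspected(training), wireheading_suspected(tampered))
```

The two rewards are interchangeable on the training data, so training alone cannot tell the AI which one was intended; the disagreement on the wider environment is the signal to stop and ask.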
To get around the slowness of the real world, some approaches train AIs in virtual environments. The problem is to pass that learning from the virtual environment to the real one.
Some have suggested making the virtual environment sufficiently detailed that the AI can't tell the difference between it and the real world. But, a) this involves fooling the AI, an approach I'm always wary of, and b) it's unnecessary.
Within the meta-formalism of this post, we could train the AI in a virtual environment which it models by , and let it construct a model of the real-world. We would then motivate the AI to find the "closest match" between and , in terms of features and how they connect and vary. This is similar to how we can train pilots in flight simulators; the pilots are never under any illusion as to whether this is the real world or not, and even crude simulators can allow them to build certain skills[6].
This can also be used to allow the AI to deduce information from hypotheticals and thought experiments. If we show the AI an episode of a TV series showing people behaving morally (or immorally), then the episode need not be believable or plausible, if we can roughly point to the features in the episode that we want to emphasise, and roughly how these relate to real-world features.
The approach for synthesising human preferences, defined here, can be rephrased as:
This is just one way of doing this, but it does show that "automating what AIs do with multiple refactorings" might not be impossible. The following subsection has some ideas on how to deal with that.
In an old post, I talked about the concept of "emergency learning", which was basically, "lots of examples, and all the stuff we know and suspect about how AIs can go wrong, shove it all in, and hope for the best". The "shove it all in" was a bit more structured than that, defining large scale preferences (like "avoid siren worlds" and "don't over-optimise") as constraints to be added to the learning process.
It seems we can do better than that here. Using examples and hypotheticals, it seems we could construct ideas like "avoid slavery", "avoid siren worlds", or "don't over-optimise" as rewards or positive/negative examples in certain simple training environments, so that the AI "gets an idea of what we want".
We can then label these ideas as "global preferences". The idea is that they start as loose requirements (we have much more granular human-scale preferences than just "avoid slavery", for example), but, the more the world diverges from the training environment, the stricter they are to be interpreted, with the AI required to respect some softmin of all natural refactorings of these features.
In a sense, we'd be saying "prevent slavery; these are the features of slavery, and in weird worlds, be especially wary of these features".
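One way to picture "some softmin of all natural refactorings" is a weighted average that leans towards the lowest-scoring refactoring as a strictness parameter grows. The particular softmin below and the idea of tying `beta` to how far the world has diverged from training are assumptions for illustration.

```python
import math


def softmin(values, beta):
    """Exponentially weighted average that leans towards the minimum.

    beta = 0 gives the plain mean; large beta approaches min(values).
    """
    m = min(values)
    # Subtracting m keeps the exponentials numerically stable.
    weights = [math.exp(-beta * (v - m)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Scores that three hypothetical refactorings assign to the same world.
scores = [1.0, 2.0, 3.0]

loose = softmin(scores, beta=0.0)    # near the training environment
strict = softmin(scores, beta=50.0)  # far from the training environment

print(loose, strict)
```

Close to the training environment the preference is read loosely (an average over refactorings); in strange worlds it is read strictly, so a world must look acceptable under nearly every refactoring.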
Krakovna et al. presented a paper on avoiding side-effects from AI. The idea is to have an AI maximising some reward function, while reducing side effects. So the AI would not smash vases or let them break, nor would it prevent humans from eating sushi.
In this environment, we want the AI to avoid knocking the sushi off the belt as it moves:

Here, in contrast, we'd want the AI to remove the vase from the belt before it smashes:

I pointed out some issues with the whole approach. Those issues were phrased in terms of sub-agents, but my real intuition is that syntactic methods are not sufficient to control side effects. In other words, the AI can't learn to do the right thing with sushis and vases, unless it has some idea of what these objects mean to us; we prefer sushis to be eaten and vases to not be smashed.
This can be learnt if the AI has enough training examples, learning that eating sushi is a general feature of the environments it operates in, while vases being smashed is not. I'll return to this idea in a later post.
The ideas of this post were present in implicit form in the idea of training an AI to cure cancer patients.
Using examples of successfully treated cancer patients, we noted they all shared some positive features (recuperating, living longer) and some incidental or negative features (complaining about pain, paying more taxes).
So, using the approach of section 4.9, we can designate that we want the AI to cure cancer; this will be interpreted as increasing all the features that correlate with that.
Using the explicit decorrelation of section 4.10, we can also explicitly remove the negative options from the desired feature sets, thus improving the outcomes even more.
In Eliezer's original post on the hidden complexity of wishes, he talks of the challenge of getting a genie to save your mother from a burning building:
So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center. [...]
You cry "Get my mother out of the building!", for luck, and press Enter. [...]
BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.
How could we avoid this? What you want is your mother out of the building. The feature "mother in building" must absolutely be set to false; this is a priority call, overriding almost everything else.
Here we'd want to load examples of your mother outside the building, so that the genie/AI learns the features "mother in house"/"mother out of house". Then it will note that "mother out of house" correlates with a whole lot of other features - like mother being alive, breathing, pain-free, often awake, and so on.
All those are good things. But there are some other features that don't correlate so well - such as the time being earlier, your mother not remembering a fire, not being covered in soot, not worried about her burning house, and so on.
As in the cancer patient example above, we'd want to preserve the features that correlate with the mother out of the house, while allowing decorrelation with the features we don't care about or don't want to preserve.
If the Antikythera mechanism had been combined with the Aeolipile to produce an ancient Greek AI, and Homer had programmed it (among other things) to "increase people's honour", how badly would things have gone?
If Babbage had completed the analytical engine as Victorian AI, and programmed it (among other things) to "protect women", how badly would things have gone?
If a modern programmer were to combine our neural nets into a superintelligence and program it (among other things) to "increase human happiness", how badly will things go?
There are three morally relevant categories here, and it's illustrative to compare them: honour, gender, and hedonic happiness. The first has splintered, the second is splintering, and the third will likely splinter in the future.
I'm not providing solutions in this subsection, just looking at where the problems can appear, and encouraging people to think about how they would have advised Homer or Babbage to define their concepts. Don't think "stop using your concepts, use ours instead", because our concepts/features will splinter too. Think "what's the best way they could have extended their preferences even as the features splinter"?
If we look at the concept of honour, we see a concept that has already splintered.
That article reads like a meandering mess. Honour is "face", "reputation", a "bond between an individual and a society", "reciprocity", a "code of conduct", "chastity" (or "virginity"), a "right to precedence", "nobility of soul, magnanimity, and a scorn of meanness", "virtuous conduct and personal integrity", "vengeance", "credibility", and so on.
What a basket of concepts! They only seem vaguely connected together; and even places with strong honour cultures differ in how they conceive of honour, from place to place and from epoch to epoch[7]. And yet, if you asked most people within those cultures about what honour was, they would have had a strong feeling it was a single, well defined thing, maybe even a concrete object.
In his post the categories were made for man, not man for the categories, Scott writes:
Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.
But Scott is writing this in the 21st century, long after the gender definition has splintered quite a bit. In middle class Victorian England[8], the gender divide was much stronger - in that, from one component of the divide, you could predict a lot more. For example, if you knew someone wore dresses in public, you knew that, almost certainly, they couldn't own property if they were married, nor could they vote, they would be expected to be in charge of the household, might be allowed to faint, and were expected to guard their virginity.

We talk nowadays about gender roles multiplying or being harder to define, but they've actually been splintering for a lot longer than that. Even though we could define two genders in 1960s Britain, at least roughly, that definition was a lot less informative than it was in Victorian-middle-class-Britain times: it had many fewer features strongly correlated with it.
On to happiness! Philosophers and others have been talking about happiness for centuries, often contrasting "true happiness", or flourishing, with hedonism, or drugged out stupor, or things of that nature. Often "true happiness" is a life of duty to what the philosopher wants to happen, but at least there is some analysis, some breakdown of the "happiness" feature into smaller component parts.
Why did the philosophers do this? I'd wager that it's because the concept of happiness was already somewhat splintered (as compared with a model where "happiness" is a single thing). Those philosophers had experience of joy, pleasure, the satisfaction of a job well done, connection with others, as well as superficial highs from temporary feelings. When they sat down to systematise "happiness", they could draw on the features of their own mental model. So even if people hadn't systematised happiness themselves, when they heard of what philosophers were doing, they probably didn't react as "What? Drunken hedonism and intellectual joy are not the same thing? How dare you say such a thing!"
But looking into the future, into a world that an AI might create, we can foresee many situations where the implicit assumptions of happiness come apart, and only some remain. I say "we can foresee", but it's actually very hard to know exactly how that's going to happen; if we knew it exactly, we could solve the issues now.
So, imagine a happy person. What features do you think they have in life that are not trivial synonyms of happiness? I'd imagine they have friends, are healthy, think interesting thoughts, have some freedom of action, may work on worthwhile tasks, may be connected with their community, probably make people around them happy as well. Getting a bit less anthropomorphic, I'd also expect them to be a carbon-based life-form, to have a reasonable mix of hormones in their brain, to have a continuity of experience, to have a sense of identity, to have a personality, and so on.
Now, some of those features can clearly be separated from "happiness". Even ahead of time, I can confidently say that "being a carbon-based life-form" is not going to be a critical feature of "happiness". But many of the other ones are not so clear; for example, would someone without continuity of experience or a sense of identity be "happy"?
Of course, I can't answer that question. Because the question has no answer. We have our current model of happiness, which co-varies with all those features I listed and many others I haven't yet thought of. As we move into more and more bizarre worlds, that model will splinter. And whether we assign the different features to "happiness" or to some other concept, is a choice we'll make, not a well-defined solution to a well-defined problem.
However, even at this stage, some answers are clearly better than others; statues of happy people should not count, for example, nor should written stories describing very happy people.
In apprenticeship learning (or learning from demonstration), the AI would aim to copy what experts have done. Inverse reinforcement learning can be used for this purpose, by guessing the expert's reward function, based on their demonstrations. It looks for key features in expert trajectories and attempts to reproduce them.
So, if we had an automatic car driving people to the airport, and fed it some trajectories (maybe ranked by speed of delivery), it would notice that passengers would also arrive alive, with their bags, without being pursued by the police, and so on. This is akin to section 4.9, and the car would not accelerate blindly to get there as fast as possible.
But the algorithm has trouble getting to truly super-human performance[9]. It's far too conservative, and, if we loosen the conservatism, it doesn't know what's acceptable and what isn't, and how to trade these off: since, in the training data, all passengers survived, their luggage was intact, and the car was always painted yellow, it has no reason to prefer human survival to taxi-colour. It doesn't even have a reason to have a specific feature resembling "passenger survived" at all.
This might be improved by the "allow decorrelation" approach from section 4.10: we specifically allow it to maximise speed of transport, while keeping the other features (no accidents, no speeding tickets) intact. As in section 6.7, we'll attempt to check that the AI does prioritise human survival, and that it will warn us if a refactoring moves it away from this.
Now, sometimes worlds may be indistinguishable for any feature set. But in that case, they can't be distinguished by any observations, either, so their relative probabilities won't change: as long as it's defined, is constant for all observations . So we can replace and with , of prior probability . Doing this for all indistinguishable worlds (which form an equivalence class) gives , a set of distinguishable worlds, with a well defined on it. ↩︎
It's useful to contrast a refinement with the "abstraction" defined in this sequence. An abstraction throws away irrelevant information, so is not generally a refinement. Sometimes they are exact opposites, as the ideal gas law is an abstraction of the movement of all the gas particles, while the opposite would be a refinement.
But they are not exact opposites either. Starting with the neurons of the brain, you might abstract them to "emotional states of mind", while a refinement could also add "emotional states of mind" as new features (while also keeping the old features). A splintering is more the opposite of an abstraction, as it signals that the old abstraction features are not sufficient.
It would be interesting to explore some of the concepts in this post with a mixture of refinements (to get the features we need) and abstractions (to simplify the models and get rid of the features we don't need), but that is beyond the scope of this current, already over-long, post. ↩︎
Specifically, we'd point - via labelled examples - at a cluster of features that correlate with functioning digestion, and another cluster of features that correlate with brain activity, and allow those two clusters to decorrelate with each other. ↩︎
It is no coincidence that, if and are rewards on , that are identical on , and if is a refactoring of , then is also a refactoring of . ↩︎
Though note there are some problems with this approach, both in theory and in practice. ↩︎
Some more "body instincts" skills require more realistic environments, but some skills and procedures can perfectly well be trained in minimal simulators. ↩︎
You could define honour as "behaves according to the implicit expectations of their society", but that just illustrates how time-and-place dependent honour is. ↩︎
It's not impossible to get superhuman performance from apprenticeship learning; for example, we could select the best human performance on a collection of distinct tasks, and thus get the algorithm to have an overall performance that no human could ever match. Indeed, one of the purposes of task decomposition is to decompose complex tasks in ways that allow apprenticeship-like learning to have safe and very superhuman performance on the whole task. ↩︎