Here's a place where I want one of those disagree buttons separate from the downvote button :P
Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even more so), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don't get represented much at all, and which ways get left out can depend on how you're training this AI (and a bit more subtly, how you're interpreting its parameters as a world model).
Especially because of that second part, finding good goals in an AI's world model isn't satisfactory if you're just training a fixed, arbitrary AI. Your process for finding good goals needs to interact with how the AI learns its model of the world in the first place. In which case, world-model interpretability is not all we need.
I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do. I agree that with "human values" specifically, the problem may be just that muddled, but not with the other nice targets (moral philosophy, corrigibility, DWIM); they should be more concrete.
The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.
Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.
In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.
As an established case for tractability, we have the natural abstraction hypothesis. According to it, efficient abstractions are a feature of the territory, not the map (at least to a certain significant extent). Thus, we should expect different AI models to converge towards the same concepts, which also would make sense to us. Either because we're already using them (if the AI is trained on a domain we understand well), or because they'd be the same abstractions we'd arrive at ourselves (if it's a novel domain).
Even believing in a relatively strong version of the natural abstractions hypothesis doesn't (on its own) imply that we should be able to understand all concepts the AI uses. Just the ones which:
These three properties seem reasonably likely in practice for some common stuff like 'trees' or 'dogs'.
Reading this post I think it insufficiently addresses motivations, purpose, reward functions, etc. to make the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action. Two agents with the same ontology and very different purposes would behave in very different ways.
Perhaps I'm being unfair, but I'm not convinced that you're not making the same mistake as when people claim any sufficiently intelligent AI would be naturally good.
Two agents with the same ontology and very different purposes would behave in very different ways.
I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all.
The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI's aligned/corrigible behavior.
Isn't a special case of aiming at any target we want the goals we would want it to have? And whatever goals we'd want it to have would be informed by our ontology? So what I'm saying is I think there's a case where the generality of your claim breaks down.
Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer.
The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
The robust values hypothesis from DragonGod is worth looking at, too.
From the link below, I'll quote:
Consider the following hypothesis:
1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
   - The larger the basin the more robust values are
   - Example operationalisations[2] of "privileged subset" that gesture in the right direction:
     - Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
     - The "minimal latents" of "benevolent"/"universal" human values
   - Example operationalisations of "broad basin of attraction" that gesture in the right direction:
     - A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
       - Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
   - The more natural the abstraction, the more robust values are
   - Example operationalisations of "naturalish abstraction"
     - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
       - More privileged → more natural
     - Most efficient representations of our universe contain a simple embedding of the subset
       - Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
   - The stronger the optimisation pressure applied for which the target is still suitable, the more robust values are.
   - Example operationalisations of "suitable targets for optimisation":
     - Optimisation of this target is existentially safe[4]
     - More strongly, we would be "happy" (where we fully informed) for the system to optimise for these points.
This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets considerably easier. And given the shortening timelines, I think this is an important hypothesis to test.
Here's a link below for the robust values hypothesis:
https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust
Now this is admittedly very different from the thesis that value is complex and fragile.
I disagree. The fact that some concept is very complicated doesn't mean it won't necessarily be represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.
Not that I necessarily expect "human values" specifically to actually be a natural abstraction; an indirect pointer at "moral philosophy"/DWIM/corrigibility seems much more plausible and much less complex.
I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.
If this is the case, my concern seems yet more warranted, as this is hoping we won't suffer a false positive alignment scheme that looks like it could work but won't. Given the high cost of getting things wrong, we should minimize false positive risks, which means not pursuing some ideas because the risk if they are wrong is too high.
Summary, by sections:
1. Introduction
1A. Why Aim For This?
Imagine that we develop interpretability tools that allow us to flexibly understand and manipulate an AGI's world-model — but only its world-model. We would be able to see what the AGI knows, add or remove concepts from its mental ontology, and perhaps even use its world-model to run simulations/counterfactuals. But its thoughts and plans, and its hard-coded values and shards, would remain opaque to us. Would that be sufficient for robust alignment?
I argue it would be.
Primarily, this would solve the Pointers Problem. A central difficulty of alignment is that our values are functions of highly abstract variables, and that makes it hard to point an AI at them, instead of at easy-to-measure, shallow functions over sense-data. Cracking open a world-model would allow us to design metrics that have depth.
From there, we'd have several ways to proceed:
That leaves open the question of the "target metric". It primarily depends on what will be easy to specify — what concepts we'll find in the interpreted world-model. Some possibilities:
1B. Is It A Realistic Goal?
Is there reason to think we can achieve perfect interpretability into an AI's world-model? Why go for the world-model specifically, instead of trying to focus on understanding the AI's plans, thoughts, values, mesa-objectives, shards?
I don't expect it'd be easy in an absolute sense, no. But when choosing from the set of targets that'd suffice for robust alignment, I do expect it's the easiest one.
As an established case for tractability, we have the natural abstraction hypothesis. According to it, efficient abstractions are a feature of the territory, not the map (at least to a certain significant extent). Thus, we should expect different AI models to converge towards the same concepts, which also would make sense to us. Either because we're already using them (if the AI is trained on a domain we understand well), or because they'd be the same abstractions we'd arrive at ourselves (if it's a novel domain).
A different case is presented in my earlier post on internal interfaces. In short:
Crucially, if a world-model does follow consistent data formats, it should be possible to interpret it all at once. Instead of interpreting features one-by-one, we can figure out their encoding, and "crack" the world-model in one fell swoop.
By comparison, there's no reason (as far as I can currently tell) to expect the same consistent formatting for the AI's heuristics/shards/mesa-objectives. They'd have consistent inputs and outputs, but their internals? Totally incomprehensible and ad-hoc. Still interpretable in principle, but only one-by-one.
On this argument, there's another prospective target: an AGI's plans/thoughts. They'd also need to be consistently-formatted, since future instances of the AGI's planner process would need to access plans generated by its earlier instances. But there are fewer reasons to expect that, and their formats may be dramatically more complex and difficult to decode. In particular, the NAH may not apply to them as strongly — plan formats may not be convergent across humans and AGIs, or even across different AGI systems.
(Unduly optimistic possibility: Or it may be that plans would be formatted using the same formats the world-model uses. In which case "interpreting the world-model" and "learning to read the AGI's thoughts" solve each other. I wouldn't bet on that working out perfectly, though.)
In addition, consider that advanced AGI models would likely need to modify their world-models at runtime — perhaps via their native GPS algorithms. If so, we can expect advanced world-models to have built-in functionality for editing. They'd have functions like "add a new concept", "remove a concept", "chunk two abstractions together", "expand a given abstraction", "propagate an update", or "fetch all concepts connected to this one", which should likewise be possible to reverse-engineer.[3]
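To make the kind of editing interface named above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ConceptGraph` class, its method names, and the choice to represent concepts as nodes in an undirected graph are hypothetical, and nothing here claims that a real AGI's world-model would be structured this way.

```python
from collections import defaultdict


class ConceptGraph:
    """Toy world-model: named concepts joined by undirected links (hypothetical)."""

    def __init__(self):
        self.links = defaultdict(set)

    def add_concept(self, name, connected_to=()):
        """Add a new concept, optionally linking it to existing ones."""
        self.links[name]  # touching the key creates the node
        for other in connected_to:
            self.links[name].add(other)
            self.links[other].add(name)

    def remove_concept(self, name):
        """Remove a concept and every link pointing at it."""
        for other in self.links.pop(name, set()):
            self.links[other].discard(name)

    def neighbours(self, name):
        """Fetch all concepts connected to this one."""
        return set(self.links.get(name, set()))

    def chunk(self, a, b, merged):
        """Chunk two abstractions into one, keeping their outside links."""
        outside = (self.neighbours(a) | self.neighbours(b)) - {a, b}
        self.remove_concept(a)
        self.remove_concept(b)
        self.add_concept(merged, connected_to=outside)


wm = ConceptGraph()
wm.add_concept("tree", connected_to=["sunlight", "water"])
wm.add_concept("shrub", connected_to=["water", "soil"])
wm.chunk("tree", "shrub", merged="plant")
```

The point of the sketch is only that a small, regular set of operations like these is the sort of thing one could hope to locate and reverse-engineer, as opposed to ad-hoc per-feature machinery.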
As such, I think perfect world-model interpretability is a reasonable target to aim for.
2. Would World-Models Look Like We Imagine?
This section is a kitchen sink of arguments regarding whether world-models would satisfy a bunch of nice high-level desiderata.
Namely: whether they'd be learned at all, how they'd be used, whether "the world-model" would be a unified module (as opposed to a number of non-interacting specialized modules), and whether it'd have recognizable internal modules.
2A. Are World-Models Necessary?
That is, should we actually expect AIs to learn a module we can reasonably describe as a "world-model"? It seems intuitively obvious, but can we prove it?
The Gooder Regulator Theorem aims to do just that. Translating it into the framework of ML, it essentially says the following:
Suppose that we have some system S in which we want our AI to optimally perform a task Z, a training dataset X that contains some information on S, and some set of variables Y (which can be thought of as information about the "current" state of S) which must be taken into account when choosing the optimal action R at runtime. M is some minimal summary of X available to the AI at runtime — its parameters.
The theorem states that M would need to contain all information from X which impacts the optimal policy for choosing R given Y — i. e., all information about decision-relevant features of S contained in X.
(For example, if the AI is trained to drive a car, it probably doesn't need to pay attention to the presence of planes in the sky. Thus, any data about planes present in X (their frequency, the trajectories they follow...) would be discarded, as it's not relevant to any decisions the AI would need to make. On the other hand, the presence of heavy clouds is correlated with rain, which would impact visibility and maneuverability, so the AI would learn to notice them.)
More specifically, M would need to be isomorphic to the Bayesian posterior on S given X. That seems like a reasonable definition of a "world-model".
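As a toy illustration of "a summary that keeps exactly the decision-relevant information", consider a setup I'm assuming purely for the example: S is a coin of unknown bias, X is a list of observed flips, and the task is to bet on the next flip. Under a uniform prior, the posterior is Beta(1 + heads, 1 + tails), so the pair of counts is all M needs to hold. The function names are hypothetical, and this illustrates the idea only; it is not the theorem or its proof.

```python
from fractions import Fraction


def summarize(flips):
    """M: the (heads, tails) counts, all that the Beta posterior depends on."""
    heads = sum(flips)
    return heads, len(flips) - heads


def posterior_mean(summary):
    """Mean of Beta(1 + heads, 1 + tails), i.e. P(next flip is heads | X)."""
    heads, tails = summary
    return Fraction(heads + 1, heads + tails + 2)


def best_bet(summary):
    """Optimal action R, chosen from the summary alone."""
    return "heads" if posterior_mean(summary) > Fraction(1, 2) else "tails"


x_a = [1, 1, 0, 1, 0, 1]
x_b = [0, 1, 1, 1, 1, 0]  # same counts as x_a, different order

print(summarize(x_a), summarize(x_b))
print(posterior_mean(summarize(x_a)), best_bet(summarize(x_b)))
```

The order of the flips is present in X but never affects the bet, so it gets discarded, much like the planes in the driving example.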
2B. How Are World-Models Useful?
That is, why are world-models necessary? What practical purpose does this module serve?
As I've mentioned at the beginning, they provide "depth". Imagine the world as a causal graph. At the start, the AI can only read off the states of the nodes nearest to it (its immediate sensory inputs). Correspondingly, it can only act on their immediate states. The only policies available to it are reactions no more sophisticated than "if you see bright light, close your eyes".
By building a world-model, it reconstructs the unobserved parts of that causal graph, and starts being able to "see" nodes that are farther away. Seeing them allows it to respond to changes in them, to make its policy a function of their states.
There are a few points to be made here.
First, this solves the Credit Assignment Problem by providing a policy gradient. To improve, you need some way to distinguish whether you're performing better or worse according to whatever goal you have. But if your goal is a function of a far-away node, a node you don't see — well, you have no idea whether your actions improve or worsen matters, so you can't learn. Having a world-model directly addresses this concern, providing you fine-grained/deep feedback on your actions.
(Notably, "get good at modelling the world" itself is a very "shallow" goal. We get the gradient for it "for free": setting up self-supervised learning on our own sensory inputs, trying to get better at predicting them, naturally (somehow) lets us recover the world structure.)
Second, it dramatically increases the space of available reaction patterns, such as shards or heuristics. We can think of them as functions that take as input some subset of the nodes in our world-model, and steer the agent towards certain actions. If so, the set of available shards is defined over the power set of the nodes in the world-model.
A rich world-model, thus, allows the creation of rich economies of shards with complex functionality, while still keeping every shard relatively simple (and therefore easier to learn), since they can "build off" the world-model's complexity. A shard whose complexity is comparable to "if you see bright light, close your eyes" can cause some very complex behavior if it's attached to a highly-abstract node, instead of a shallow observable.
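A quick worked count shows how fast the space of possible shard inputs grows with the number of nodes. The node names and the `input_subsets` helper are made up for the example, and I count only non-empty subsets, so the total is 2^n - 1 rather than the full power set's 2^n.

```python
from itertools import combinations


def input_subsets(nodes):
    """Every non-empty subset of nodes that a shard could take as input."""
    return [
        frozenset(combo)
        for size in range(1, len(nodes) + 1)
        for combo in combinations(nodes, size)
    ]


# Shallow observables only vs. the same plus two more abstract nodes.
shallow = ["bright_light", "loud_noise"]
rich = shallow + ["predator_nearby", "friend_is_upset"]

print(len(input_subsets(shallow)))  # 2**2 - 1
print(len(input_subsets(rich)))     # 2**4 - 1
```

Adding two nodes takes the count from 3 to 15, and each further node roughly doubles it, while any individual shard can stay as simple as a one-node trigger.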
Third, note that world-models span not only space, but also time. Advanced world-models can be rolled forwards or backwards, to simulate the future or infer the past. This is useful both under the "goal-directedness" framework (allowing the agent to optimize across time) and the "richness of heuristics" one (we can view past or future states of nodes as "clones" of nodes, which expand the space of heuristics even more).
Fourth, they allow advanced in-context learning, or "virtual training". Suppose you want to learn how to do X. Instead of doing it via trial-and-error in reality (which may be dangerous), you can train your policy on your world-model instead.
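A minimal sketch of "virtual training", under heavy assumptions: the world-model is stood in for by a one-line `model_step` function with made-up dynamics, the policy family is "always move by a fixed step size", and the goal and horizon are arbitrary. The only thing it is meant to show is that the policy gets selected entirely from rollouts inside the model, with no action taken in reality.

```python
def model_step(position, action):
    """Stand-in for a learned world-model: predicts the next position."""
    return position + action


def simulated_return(step_size, goal=10, horizon=5):
    """Roll the policy 'always move by step_size' forward inside the model."""
    position = 0
    for _ in range(horizon):
        position = model_step(position, step_size)
    return -abs(goal - position)  # closer to the goal is better


# Pick the policy using imagined rollouts only.
candidates = [0, 1, 2, 3, 4]
best = max(candidates, key=simulated_return)
print(best)
```

If the model's dynamics are wrong, the chosen policy will be wrong too, which is one more reason the quality of the world-model matters so much.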
2C. Are World-Models Unitary?
That is, can we expect "a world-model" to be a proper module, which only interacts with the rest of the AI's mind via pre-specified interfaces/API channels (as we'd like)? Or will it be in pieces, mini-world-models scattered all across the agent, each of them specialized to serve the needs of particular shards? Can we actually expect all information and inferences about the world to be pooled in one place, consistently-formatted?
Well, the evidence is mixed. Empirically, it does not seem to be the case with modern ML models. Take the ROME paper: it describes a technique for editing factual associations in LLMs, but such edits don't generalize properly. For example, rewriting the "The Eiffel Tower→Paris" association with "The Eiffel Tower→Rome", and then prompting the LLM to talk about the Eiffel Tower, correctly leads to it acting as if the Tower is in Rome. But it's one-directional: talking about Rome doesn't make it mention the Eiffel Tower, as it should if it pools all "facts about Rome" it has in one place. Neither does it follow a hierarchical structure (editing facts about "cheese" does not propagate the update to all sub-categories of cheese).
However, there's strong theoretical support for unitarity, which I'll get to in a bit. There are three explanations for this contradiction:
I'm pretty sure it's a mix of (1) and (2). Let's get to the arguments.
a) Future states are a function of increasingly further-away nodes. Depending on whether you're planning for the next second, next day, or next century, the optimal action to take will be a function of increasingly more causally distant objects. Thus, your "planning horizon" is limited by your ability to correlate data across distant regions of your world-model.
b) The "Crud" factor. Everything is connected to everything else. While details can often be abstracted away, a significant change often needs to be propagated throughout the whole world-model, and a minor change may unexpectedly snowball.
If your life is just a sequence of causally-disconnected games or tasks, you can have specialized models for each of them. But if these tasks can bleed into each other, it's necessary to cross-reference all available data.
For example, consider an AI trained to play chess and argue with people. If these tasks are wholly separate, and have no effect on each other, the AI can learn wholly separate, non-interacting generative models for both of them. But if a human can talk to the AI concurrently with playing a chess match against it, that changes. The chess strategy a human is using may reveal useful information about how they think, a heated argument may be distracting them from the game, and winning or losing the match may impact the human's emotional state in ways relevant to the arguments you should make. Thus, the need for a unified world-model arises: changes in one place need to be propagated everywhere.
In a sense, it's just a generalization of point (a), from the space-time to all abstract domains. The broader the scale at which you reason (whether spatially, or by interacting at a high level with societies, philosophy, logic), the more "distant" nodes you need to take into account.
c) Centralized planning. Everything-is-connected has implications not only for making accurate predictions, but also for planning. At a basic level, we can propagate our updates throughout our entire world-model: our noticing that the interlocutor is distracted by the argument would change our distribution over the chess moves they'd make. But we can also act in one domain with the deliberate aim of causing a consequence in a different domain: distract the interlocutor on purpose so we can win more easily.
This implies a centralized planning capability: a module in the agent that can access every feature of the world-model, "understand" any of them, and incorporate them in plans together.
Importantly, it would need to be universal access, since it's often impossible to predict a priori which facts will end up relevant for any given goal. The behavior of market economies may end up relevant for formalizing reasoning under uncertainty; the success of a complex social deception may end up coupled with weather patterns on Mars.
Just like every component of the world-model would need to interact with each other, so would they need to be able to interact with the planner.
d) The Dehaene Model of Consciousness. I'll close with an argument from neuroscience. There's a very solid amount of evidence showing that "pool all information about the world together into one place and cross-correlate it" is a distinct neural event, and those are exactly the moments at which we seem to be conscious (and therefore capable of general reasoning). In-between those moments, on the other hand, the information remains in different compartments of the brain, and the brain only acts on it using fairly primitive heuristics.
That seems to fit fairly well with the picture painted by the other arguments.
2D. Are World-Models Modular?
Now it seems worth addressing the opposite failure mode: the possibility that world-models would be strangely opaque. Instead of having distinct internal "submodules" that we'd recognize as familiar concepts, perhaps they'd morph into a strange mess of heuristics and sub-simulations, such that it's downright impossible to tell what's going on inside?
Most of the arguing against this has already been done, back in section 1B. It's the premise of the Natural Abstraction Hypothesis, and it's what my interface-development model suggests.
I'll supply an additional argument: situational heuristics would remain useful even after the rise of centralized planning. Humans rely on instincts, cognitive shortcuts, and cached computations all the time; we don't manually reason about every little detail. It's much slower, for one.
Making the world-model "opaque" would cut off all that functionality, and presumably drastically reduce the capabilities.
3. World-Model Structure
I've made some arguments for why we should expect world-models to exist, and to convergently take the forms we'd expect of them. None of that, however, helps much when you're staring at trillion-entried matrices trying to figure out which subset of them spells out "niceness". The real meat of this approach is in constraining the possible data structures that all world-models must be converging towards: what specific NN features we should be looking for and what they'd look like.
There's been nonzero progress in this area, by which I mean the natural abstractions research agenda. Indeed: in essence, it aims to determine how efficient world-models are built, founded on the assumption that certain basic principles of doing so are universal.
Aside from 3B, this section mostly consists of my personal speculations.
3A. Major Sub-Modules
It seems that when we say "world-model", we're conflating (at least) two things:
The former is, essentially, declarative knowledge. Dry logical facts which we know we know and can flexibly manipulate and reason over. (Like mathematical equations, or the faces of your friends, or an understanding of what structures count as "trees", or the knowledge of how many people are in the room with you.)
The latter is an ability to run counterfactual scenarios while drawing on the declarative knowledge — an ability to maintain a specific world-state in one's imagination and modify it. (Like having a "feel" for how people in the room move, or picturing yourself walking through a forest, or mentally tracking the behavior of a mathematical system to better understand it.)
Using the two feels very different, and they have somewhat different functionality. I believe that "simulations" actually extend a bit beyond "world-models": they "spoof" the whole mental context (in shard theory's parlance). They access shards' inputs and feed them data from the simulated counterfactual, thereby allowing the agent to "test out" shard activations in advance of being in the simulated situation.
That is useful for several reasons:
Both of those effects, I think, are shard-based, and may be missed if you only thought about "walking through a dark forest" in the abstract. And they're useful: for predicting your own behavior, and for improving predictions in general (since there's presumably a reason you expect dangers in the dark, like an (evolutionary) history of the darkness actually hiding predators).
Convergence-wise, there are arguments for all of this functionality. Without the simulation engine, you don't get the ability to simulate future states, which means no ability to compute policy gradients, have "cross-temporal" heuristics, or do advanced in-context training. In turn, without the conceptual repository, your ability to improve the simulation engine at runtime is very limited, the available "richness of counterfactuals" you can run is limited, and you don't get advanced planning.
Interpretability-wise, however, most of the arguments primarily apply to the conceptual repository, not to the simulation engine. Our ability to perfectly "snoop" on any counterfactual an AI is running is less certain:
At the least, I think that we'd definitely be able to get some insight into what's happening in the counterfactuals by looking at what concepts are being activated, and how strongly.
... Or so my inside-view on that goes. On a broader view, this section (3A) is the part that most consists of informal speculations on my part.
3B. Abstractions As Basic Units
The basic units in which information is stored in the conceptual repository seem to be natural abstractions: high-level summaries of lower-level systems that consist only of the information that's relevant far-away/is redundantly represented/constitutes a minimal latent variable.
A relevant topic here is what data formats those abstractions follow, i. e. what their type signature is. John's current speculations are that they're probability distributions over deterministic constraints on the environment.
For example, imagine a species of trees growing in a savannah. These trees replicate, mutate, and spread; take in energy and matter, and grow. All of that is subject to some variance/entropy, yet its extent is constrained. Every replication is imperfect, but the rate or severity of mutation is bounded. Trees spread, but they can only do that in certain conditions or in certain directions. The shapes they take are constrained by their genome and the available resources. And so on.
Once you compute all of that, you're left with the knowledge of what information does get replicated perfectly, in the form of some constraints beyond which environmental entropy doesn't grow. You take in a set of trees X1,X2,...,Xn, and compute from them some abstraction P(Λ). P(Λ) contains information like "trees can have shapes that vary like this" and "trees are distributed across the landscape like this" — probability distributions over possible structures.
If this result is correct, that's what we should be looking for in ML models: vast repositories of conditional probability distributions of this form.
3C. Higher-Level Organization
Of course, abstractions are not just stored in one big pile. They're connected to each other, forming complex webs. When we think about trees, we can bring to mind their relation to sunlight and water and animals. They're also connected to the lower-level or higher-level abstractions — the abstract concept of a tree relates to the wood and the cells trees are made of, and the forests and ecosystems they make up, and the specific instances of trees we remember.
It seems natural to think that abstractions would have multi-level hierarchies. For example, we may compute a "first-order" abstraction of a particular tree species, Y1. Then we may encounter more tree species, and compute a set of tree-species abstractions, Y1,Y2,...,Yn. In the same manner, we may then go further up the ladder of taxonomic ranks, until we reach a fully general "tree" abstraction.
Here's a caveat: there are "abstraction hierarchies" that are mutually incompatible.
Consider the entire set of humans on Earth. Suppose that you're encountering subsets of them in some order, and you're abstracting over these subsets each time.
(Here's a bit more formalization on this angle, including how we may "split" a category like "all humans" into several individually-well-abstracting subcategories.)
Each choice of order would result in a different "abstraction hierarchy": you can't recover the abstraction of a mathematician from first-order abstractions of ethnicities. Yet all of these hierarchies would be made of meaningful, useful abstractions!
A slightly different view presents itself if instead of going top-down, you go bottom-up. I. e., rather than picking out all objects of the type "human" and trying to summarize them, you can look at the world and try to locate stable, well-abstracting high-level structures from scratch. In this case, looking at the macro-level and focusing on different constraints might yield you "homes → cities → countries", or "businesses → industries → economies", or "local bureaucracies → bureaucratic systems → geopolitical entities". Similarly, those would be useful yet incompatible abstraction hierarchies.
That's not exactly a problem, in my view. We're still ending up with ground-truth-determined natural abstractions that can be computed by looking at the deterministic constraints in the environment. But there's some sense in which choosing to look at a particular constraint (like "how is the spread of philosophical beliefs constrained?") locks us out of others (like "how is the geographical spread of these people constrained?").
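The incompatibility can be made concrete with a tiny example. The four people, their attributes, and the helper names are all invented for illustration, and I use "region" as the second axis (matching the geographical-spread question above) rather than ethnicity. Grouping the same set by two different attributes gives two partitions, and neither is a refinement of the other, so neither hierarchy can be rebuilt from the other's first-order groups.

```python
from collections import defaultdict

people = [
    {"name": "Ada", "job": "mathematician", "region": "Europe"},
    {"name": "Bo", "job": "mathematician", "region": "Asia"},
    {"name": "Cy", "job": "farmer", "region": "Europe"},
    {"name": "Di", "job": "farmer", "region": "Asia"},
]


def partition(items, key):
    """Group names by one attribute: one first-order 'abstraction' per group."""
    groups = defaultdict(set)
    for item in items:
        groups[item[key]].add(item["name"])
    return dict(groups)


def refines(fine, coarse):
    """True if every group of `fine` sits wholly inside some group of `coarse`."""
    return all(any(g <= c for c in coarse.values()) for g in fine.values())


by_job = partition(people, "job")
by_region = partition(people, "region")

print(refines(by_job, by_region), refines(by_region, by_job))
```

Both calls come out false: the "mathematician" group straddles both regions, and each region straddles both jobs, even though each partition is a perfectly sensible way to carve up the same four people.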
Based on this, it seems that the set of all abstractions over some underlying system may be organized along the following dimensions:
3D. Laziness
World-models are lazy, in a programmer's sense. That is, we don't keep the entire state of the world in our heads all at once. We keep it in a compressed form, and only compute the specifics about particular aspects of it on demand. It's as much a consequence of embedded agency as the natural abstractions themselves — the peril of having to reason about a world that you're part of.
Let's consider how people do it. I see three primary methods:
Thus, laziness seems to be possible due to a combination of the "simulation engine" from 3A, and the fact that our environment is well-abstracting.
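For readers less familiar with the programming sense of "lazy", here is a small sketch of compute-on-demand with caching. The `detail` function and the region names are hypothetical stand-ins: the cheap list plays the role of the compressed world-state, and the expensive expansion only runs for the parts that are actually queried.

```python
from functools import lru_cache

computed = []  # records which details were actually worked out


@lru_cache(maxsize=None)
def detail(region):
    """Expand a compressed summary into specifics only when asked."""
    computed.append(region)
    return f"detailed model of {region}"


world = ["kitchen", "street", "forest"]  # cheap, compressed index of the world

first = detail("kitchen")
again = detail("kitchen")  # served from the cache, not recomputed

print(computed)
```

Only "kitchen" is ever expanded, and only once; "street" and "forest" stay in compressed form until something asks about them.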
4. Research Directions
I believe it's a highly tractable avenue of research, and the shortest path to robust alignment in worlds where alignment is really hard[1]. It's the one part of agent foundations that seems necessary to get right.
Ways to contribute:
That is, in worlds that mostly agree with the models of Eliezer Yudkowsky/Nate Soares/John Wentworth, on which we need to get the AGI exactly right to survive.
In theory, there should be a "buffer zone" of capability, between an AGI smart enough to model itself, and an AGI smart enough to hack through interpretability tools (e. g., humans are self-reflective, but not smart enough to do that).
But "is self-reflective" is also not a binary. An AGI's self-model can be more or less right. On the lower capability level, it'll probably be very flawed, therefore not very useful. On the flipside, if it's very close to reality, the AGI is likely to be smart enough that reading its mind is dangerous.
We may easily misjudge that, too. An AGI that achieved self-awareness is likely already on the cusp of its sharp left turn, past which it'd be unsafe to interpret. Depending on how sharp the turn is, that "buffer zone" may be passed in the blink of an eye, easily missed.
Runtime-editability also addresses another concern: that the inferences the AGI makes at runtime would be encoded differently from the knowledge hard-coded into its parameters. But since both types of knowledge would be used as inputs into the same algorithms (the planner, the shards), there's probably no reason to expect much mutation by default, due to the need for backwards compatibility.
(To be clear, once the AGI undergoes the sharp left turn/goes FOOM, and starts designing successor agents or directly modifying itself or just becomes incomprehensibly superintelligent, then this'll obviously stop applying. But if we haven't aligned it by then, we're dead either way, so that's irrelevant.)