In the strategy stealing assumption I describe a policy we might want our AI to follow:
Intuitively this is basically what I expect out of a corrigible AI, but I agree with Eliezer that this seems more realistic as a goal if we can see how it arises from a reasonable utility function.
So what does that utility function look like?
A first pass answer is pretty similar to my proposal from A Formalization of Indirect Normativity: we imagine some humans who actually have the opportunity to deliberate however they want and are able to review all of our AI's inputs and outputs. After a very long time, they evaluate the AI's behavior on a scale from -1 to 1, where 0 is the point corresponding to "nothing morally relevant happens," and that evaluation is the AI's utility.
The big difference is that I'm now thinking about what would actually happen in the real world if the humans had the space and security to deliberate, rather than formally defining a hypothetical process. I think that is going to end up being both safer and easier to implement, though it introduces its own set of complications.
Our hope is that the policy "keep the humans safe, then listen to them about what to do" is a good strategy for getting a high utility in this game, even if our AI is very unsure about what the humans would ultimately want. Then if our AI is sufficiently competent we can expect it to find a strategy at least this good.
The most important complication is that the AI is no longer isolated from the deliberating humans. We don't care about what the humans "would have done" if the AI hadn't been there---we need our AI to keep us safe (e.g. from other AI-empowered actors), we will be trusting our AI not to mess with the process of deliberation, and we will likely be relying on our AI to provide "amenities" to the deliberating humans (filling the same role as the hypercomputer in the old proposal).
Going even further, I'd like to avoid defining values in terms of any kind of counterfactual like "what the humans would have said if they'd stayed safe" because I think those will run into many of the original proposal's problems.
Instead we're going to define values in terms of what the humans actually conclude here in the real world. Of course we can't just say "Values are whatever the human actually concludes" because that will lead our agent to deliberately compromise human deliberation rather than protecting it.
Instead, we are going to add in something like narrow value learning. Assume the human has some narrow preferences over what happens to them over the next hour. These aren't necessarily that wise. They don't understand what's happening in the "outside world" (e.g. "am I going to be safe five hours from now?" or "is my AI-run company acquiring a lot of money I can use when I figure out what I want?"). But they do assign low value to the human getting hurt, and assign high value to the human feeling safe and succeeding at their local tasks; they assign low value to the human tripping and breaking their neck, and high value to having the AI make them a hamburger if they ask for a hamburger; and so on. These preferences are basically dual to the actual process of deliberation that the human undergoes. There is a lot of subtlety about defining or extracting these local values, but for now I'm going to brush that aside and just ask how to extract the utility function from this whole process.
It's no good to simply use the local values, because we need our AI to do some lookahead (both to future timesteps when the human wants to remain safe, and to the far future when the human will evaluate how much option value the AI actually secured for them). It's no good to naively integrate local values over time, because a very low score during a brief period (where the human is killed and replaced by a robot accomplice) cannot be offset by any number of high scores in the future.
I think there are a lot of problems with this method of quantitative aggregation. But I think this direction is promising and I currently expect something along these lines will work.
Here's my starting proposal:
- We quantify the human's local preferences by asking "Look at the person you actually became. How happy are you with that person? Quantitatively, how much of your value was lost by replacing yourself with that person?" This gives us a loss on a scale from 0% (perfect idealization, losing nothing) to 100% (where all of the value is gone). Most of the values will be exceptionally small, especially if we look at a short period like an hour.
- Eventually, once the human becomes wise enough to totally epistemically dominate the original AI, they can assign a score to the AI's actions. To make life simple for now let's ignore negative outcomes and just describe value as a scalar from 0% (barren universe) to 100% (all of the universe is used in an optimal way). Or we might use this "final scale" in a different way (e.g. to evaluate the AI's actions rather than actually assessing the outcomes, assigning high scores to corrigible and efficient behavior and somehow quantifying deviations from that ideal).
- The utility is the product of all of these numbers.
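To make the aggregation concrete, here is a minimal sketch of the product rule, assuming each local judgment is turned into a retained fraction (1 minus the reported loss) and the final evaluation is a number in [0, 1]. The function name `utility` and the example numbers are hypothetical, chosen only for illustration.

```python
from math import prod

def utility(local_losses, final_score):
    """Multiply the retained fraction from each period by the final score.

    local_losses: per-period losses in [0, 1] (0 = nothing lost, 1 = all value gone).
    final_score: the eventual evaluation in [0, 1].
    """
    losses = list(local_losses)
    if not all(0.0 <= x <= 1.0 for x in losses) or not 0.0 <= final_score <= 1.0:
        raise ValueError("losses and final score must lie in [0, 1]")
    return final_score * prod(1.0 - x for x in losses)

# Small hourly losses compound slowly.
ordinary = utility([0.001] * 24, final_score=0.9)

# One catastrophic hour (e.g. the human is replaced) zeroes the product,
# no matter how good every other period and the final score look.
tampered = utility([0.0] * 23 + [1.0], final_score=1.0)
```

Unlike a sum over time, the product cannot be rescued by later high scores once a single factor hits zero, which is the property the aggregation is meant to have.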
If I follow correctly, the first step requires the humans to evaluate the output of narrow value learning, until this output becomes good enough to become universal with regard to the original AI and supervise it? I'm not sure I get why the AI wouldn't be incentivized to tamper with the narrow value learning, à la Predict-o-matic? Depending on certain details (like maybe the indescribable hellworld hypothesis), maybe the AI can introduce changes to the partial imitations/deliberations that end up hidden and compounding until the imitations epistemically dominate the AI, and then ask it to do simple stuff.
The hope is that a tampering large enough to corrupt the human's final judgment would get a score of ~0 in the local value learning. 0 is the "right" score since the tampered human by hypothesis has lost all of the actual correlation with value. (Note that at the end you don't need to "ask it to do simple stuff" you can just directly assign a score of 1.)
This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that's what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don't have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn't have to be so complex.)
(I'm not sure if this made too much sense, I have a draft of a related comment that I'll probably post soon but overall expect to just leave this as not-making-much-sense for now.)
So you want a sort of partial universality sufficient to bootstrap the process locally (while not requiring the understanding of our values in fine details), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)?
If that's about right, then I agree that having this would make your proposal work, but I still don't know how to get it. I need to read your previous posts on answering questions honestly.
You basically just need full universality / epistemic competitiveness locally. This is just getting around "what are values?" not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for since we don't have any solution better than "actually figure everything out 'ourselves'").
Almost all of the time I'm thinking about how to get epistemic competitiveness for the local interaction. I think that's the meat of the safety problem.
The upside of humans in reality is that there is no need to figure out how to make efficient imitations that function correctly (as in X-and-only-X). To be useful, imitations should be efficient, which exact imitations are not. Yet for the role of building blocks of alignment machinery, imitations shouldn't have important systematic tendencies not found in the originals, and their absence is only clear for exact imitations (if not put in very unusual environments).
Suppose you already have an AI that interacts with the world, protects it from dangerous AIs, and doesn't misalign people living in it. Then there's time to figure out how to perform X-and-only-X efficient imitation, which drastically expands the design space and makes it more plausible that the imitation-based systems you have written about a lot actually work as intended. In particular, this might include the kind of long reflection that has all the advantages of happening in reality without wasting time and resources on straightforwardly happening in reality, or letting the bad things that would happen in reality actually happen.
So figuring out object level values doesn't seem like a priority if you somehow got to the point of having an opportunity to figure out efficient imitation. (While getting to that point without figuring out object level values doesn't seem plausible, maybe there's a suggestion of a process that gets us there in the limit in here somewhere.)
I think the biggest difference is between actual and hypothetical processes of reflection. I agree that an "actual" process of reflection would likely ultimately involve most humans migrating to emulations for the speed and other advantages. (I am not sure that a hypothetical process necessarily needs efficient imitations, rather than AI reasoning about what actual humans---or hypothetical slow-but-faithful imitations---might do.)
I see getting safe and useful reasoning about exact imitations as a weird special case or maybe a reformulation of X-and-only-X efficient imitation. Anchoring to exact imitations in particular makes accurate prediction more difficult than it needs to be, as it's not the thing we care about, there are many irrelevant details that influence outcomes that accurate predictions would need to take into account. So a good "prediction" is going to be value-laden, with concrete facts about actual outcomes of setups built out of exact imitations being unimportant, which is about the same as the problem statement of X-and-only-X efficient imitation.
If such "predictions" are not good enough by themselves, underlying actual process of reflection (people living in the world) won't save/survive this if there's too much agency guided by the predictions. Using an underlying hypothetical process of reflection (by which I understand running a specific program) is more robust, as AI might go very wrong initially, but will correct itself once it gets around to computing the outcomes of the hypothetical reflection with more precision, provided the hypothetical process of reflection is defined as isolated from the AI.
I'm not sure what difference between hypothetical and actual processes of reflection you are emphasizing (if I understood what the terms mean correctly), since the actual civilization might plausibly move into a substrate that is more like ML reasoning than concrete computation (let alone concrete physical incarnation), and thus become the same kind of thing as hypothetical reflection. The most striking distinction (for AI safety) seems to be the implication that an actual process of reflection can't be isolated from decisions of the AI taken based on insufficient reflection.
There's also the need to at least define exact imitations or better yet X-and-only-X efficient imitation in order to define a hypothetical process of reflection, which is not as absolutely necessary for actual reflection, so getting hypothetical reflection at all might be more difficult than some sort of temporary stability with actual reflection, which can then be used to define hypothetical reflection and thereby guard from consequences of overly agentic use of bad predictions of (on) actual reflection.
It seems to me like "Reason about a perfect emulation of a human" is an extremely similar task to "reason about a human," to me it does not feel closely related to X-and-only-X efficient imitation. For example, you can make calibrated predictions about what a human would do using vastly less computing power than a human (even using existing techniques), whereas perfect imitation likely requires vastly more computing power.
The point is that in order to be useful, a prediction/reasoning process should contain mesa-optimizers that perform decision making similar in a value-laden way to what the original humans would do. The results of the predictions should be determined by decisions of the people being predicted (or of people sufficiently similar to them), in the free-will-requires-determinism/you-are-part-of-physics sense. The actual cognitive labor of decision making needs to in some way be an aspect of the process of prediction/reasoning, or it's not going to be good enough. And in order to be safe, these mesa-optimizers shouldn't be systematically warped into something different (from a value-laden point of view), and there should be no other mesa-optimizers with meaningful influence in there. This just says that prediction/reasoning needs to be X-and-only-X in order to be safe. Thus the equivalence. Prediction of exact imitation in particular is weird because in that case the similarity measure between prediction and exact imitation is hinted to not be value-laden, which it might have to be in order for the prediction to be both X-and-only-X and efficient.
This is only unimportant if X-and-only-X is the likely default outcome of predictive generalization, so that not paying attention to this won't result in failure, but nobody understands if this is the case.
The mesa-optimizers in the prediction/reasoning similar to the original humans is what I mean by efficient imitations (whether X-and-only-X or not). They are not themselves the predictions of original humans (or of exact imitations), which might well not be present as explicit parts of the design of reasoning about the process of reflection as a whole, instead they are the implicit decision makers that determine what the conclusions of the reasoning say, and they are much more computationally efficient (as aspects of cheaper reasoning) than exact imitations. At the same time, if they are similar enough in a value-laden way to the originals, there is no need for better predictions, much less for exact imitation, the prediction/reasoning is itself the imitation we'd want to use, without any reference to an underlying exact process. (In a story simulation, there are no concrete states of the world, only references to states of knowledge, yet there are mesa-optimizers who are the people inhabiting it.)
If prediction is to be value-laden, with value defined by reflection built out of that same prediction, the only sensible way to set this up seems to be as a fixpoint of an operator that maps (states of knowledge about) values to (states of knowledge about) values-on-reflection computed by making use of the argument values to do value-laden efficient imitation. But if this setup is not performed correctly, then even if it's set up at all, we are probably going to get bad fixpoints, as it happens with things like bad Nash equilibria etc. And if it is performed correctly, then it might be much more sensible to allow an AI to influence what happens within the process of reflection more directly than merely by making systematic distortions in predicting/reasoning about it, thus hypothetical processes of reflection wouldn't need the isolation from AI's agency that normally makes them safer than the actual process of reflection.
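As a loose analogy for the "bad fixpoints" worry, here is a toy sketch in which a state of knowledge about values is collapsed to a single number in [0, 1] and the reflection operator is an arbitrary stand-in (`reflect`, hypothetical). Nothing here models actual value-laden imitation; it only shows that an operator with several fixpoints can send nearby starting points to very different ones.

```python
def reflect(v):
    """Toy 'values-on-reflection' operator on a scalar belief in [0, 1].

    Purely illustrative: this smoothstep map has fixpoints at 0, 0.5 and 1.
    """
    return 3 * v**2 - 2 * v**3

def iterate_to_fixpoint(v, steps=200):
    """Apply the operator repeatedly, starting from the initial belief v."""
    for _ in range(steps):
        v = reflect(v)
    return v

# Two nearby starting points end up at opposite fixpoints.
low = iterate_to_fixpoint(0.45)   # drifts toward the fixpoint at 0
high = iterate_to_fixpoint(0.55)  # drifts toward the fixpoint at 1
```

In this toy, which fixpoint you reach is decided entirely by where the iteration starts, so merely having a fixpoint says nothing about whether it is a good one.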
Suppose that someone has trained a model to predict given , and I want to extend it to a question-answering model that answers arbitrary questions in a way that reflects all of 's knowledge.
Two prototypical examples I am thinking of are:
Here's an approach that seems kind of silly to me but is probably worth exploring:
This feels appealing for the same reason that you might hope minimal circuits are not deceptive. For example, it seems like would never bother re-using its model of a human because it would be faster to just hard-code all the facts about how humans answer questions.
In addition to feeling a bit silly, there are some obvious problems:
Some random thoughts:
The most fundamental reason that I don't expect this to work is that it gives up on "sharing parameters" between the extractor and the human model. But in many cases it seems possible to do so, and giving up on that feels extremely unstable since it's trying to push against competitiveness (i.e. the model will want to find some way to save those parameters, and you don't want your intended solution to involve subverting that natural pressure).
Intuitively, I can imagine three kinds of approaches to doing this parameter sharing:
(You could imagine having slightly more general diagrams corresponding to any sort of d-connection between and .)
Approach 1 is the most intuitive, and it seems appealing because we can basically leave it up to the model to introduce the factorization (and it feels like there is a good chance that it will happen completely automatically). There are basically two challenges with this approach:
I've been thinking about approach 2 over the last 2 months. My biggest concern is that it feels like you have to pay the bits of H back "as you learn them" with SGD, but you may learn them in such a way that you don't really get a useful consistency update until you've basically specified all of H. (E.g. suppose you are exposed to brain scans of humans for a long time before you learn to answer questions in a human-like way. Then at the end you want to use that to pay back the bits of the brain scans, but in order to do so you need to imagine lots of different ways the brain scans could have looked. But there's no tractable way to do that, because you have to fill in the full brain scan before it really tells you about whether your consistency condition holds.)
Approach 3 is in some sense most direct. I think this naturally looks like imitative generalization, where you use a richer set of human answers to basically build on top of your model. I don't see how to make this kind of thing work totally on its own, but I'm probably going to spend a bit of time thinking about how to combine it with approaches 1 and 2.
I think the biggest problem is that can compute the instrumental policy (or a different policy that works well, or a fragment of it). Some possible reasons:
I don't know if any of those particular failures are too likely. But overall it seems really bad to rely on never computing something inconvenient, and it definitely doesn't look like it's going to work in the worst case.
What are some possible outs, if in fact computes something adversarial to try to make it easy for to learn something bad?
Overall this kind of approach feels like it's probably doomed, but it does capture part of the intuition for why we should "just" be able to learn a simple correspondence rather than getting some crazy instrumental policy. So I'm not quite ready to let it go yet. I'm particularly interested to push a bit on the third of these approaches.
Here's another approach to "shortest circuit" that is designed to avoid this problem:
The "intended" circuit just follows along with the computation done by and then translates its internal state into natural language.
What about the problem case where computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that could just read off? I'll imagine those being written down somewhere on a slip of paper inside of 's model of the world.
Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition---the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.
We could try to exploit some further structural facts about the parts of that are used by . For example, it feels like the intended model is going to be leveraging facts that are further "upstream." For example, suppose an attacker observes that there is a cat in the room, and so writes out "There is a cat in the room" as part of a natural-language description of what it's going on that it hopes that will eventually learn to copy. If predicts the adversary's output, it must first predict that there is actually a cat in the room, which then ultimately flows downstream into predictions of the adversary's behavior. And so we might hope to prefer the "intended" by having it preferentially read from the earlier activations (with shorter computational histories).
The natural way to implement this is to penalize not for the computation it does, but for all the computation needed to compute its output (including within .). The basic problem with this approach is that it incentivizes to do all of the computation of from scratch in a way optimized for speed rather than complexity. I'd set this approach aside for a while because of this difficulty and the unnaturalness mentioned in the sibling (where we've given up on what seems to be an important form of parameter-sharing).
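Here is a minimal sketch of the "charge for the full computational history" idea, assuming the predictor's computation can be drawn as a small DAG with a known cost per node. The graph, node names, and costs are all hypothetical, loosely following the cat example above.

```python
# Hypothetical toy computation graph for the predictor: node -> (own cost, parents).
GRAPH = {
    "raw_observation": (1, []),
    "cat_in_room": (5, ["raw_observation"]),
    "adversary_model": (20, ["raw_observation"]),
    "adversary_note": (10, ["cat_in_room", "adversary_model"]),
}

def history_cost(node, graph):
    """Total cost of everything needed to compute `node`, counting each ancestor once."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current][1])
    return sum(graph[n][0] for n in seen)

def reader_cost(reads_from, own_cost, graph):
    """Cost charged to a reader: its own work plus the history of the node it reads."""
    return own_cost + history_cost(reads_from, graph)

# The intended reader translates the upstream fact; the copier reads the
# adversary's written note, which sits further downstream.
intended = reader_cost("cat_in_room", own_cost=3, graph=GRAPH)
copier = reader_cost("adversary_note", own_cost=1, graph=GRAPH)
```

Under this accounting the copier pays for the adversary's whole computation even though its own work is trivial. The sketch does not address the recompute problem described above, where the reader redoes the upstream computation itself in a faster form.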
Today I was thinking about some apparently-totally-different angles of attack for the ontology identification problem, and this idea seems to have emerged again, with a potential strategy for fixing the "recompute problem". (In the context of ontology identification, the parameter-sharing objection no longer applies.)
Here's the idea:
This still feels a bit weird, but you could imagine it handling a bunch of cases in a promising way:
But right now it's a pretty vague proposal, because it's unclear what the nature of these facts or justifications are. If you set that up in a naive way, then the justification effectively just needs to simulate all of . That's a problem because it reintroduces the failure mode where you need to simulate the human, and therefore there's no extra cost to just simulating and then listening to whatever they say.
Overall I think that probably nothing like this works, but I'm still feeling a lot more optimistic than I was last week and want to explore it further. (This is partially for reasons not discussed in this comment, that several other approaches/motivations seem to converge on something similar.)
Here's a slightly more formal algorithm along these lines:
Reviewing how this behaves in each of the bad cases from the parent:
There are a lot of problems and missing details in this proposal:
Overall I'm becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:
My next step would probably be looking at cases where these high-level ingredients aren't sufficient (e.g. are there cases where "generate obs then do inference in the human model" is actually cheaper?). If they look pretty good, then I'll spend some more time trying to fill in the details in a more plausible way.
We might be able to get similar advantages with a more general proposal like:
Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.
It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.
It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.
It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.
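To illustrate the Pareto-frontier point, here is a small sketch that filters candidate question-answerers by two scores, QA loss and verification cost, without picking any exchange rate between them. The candidate names and numbers are made up for illustration.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated on (qa_loss, verify_cost).

    Lower is better on both axes. A candidate is dominated if some other
    candidate is at least as good on both and strictly better on one.
    """
    frontier = []
    for name, loss, cost in candidates:
        dominated = any(
            (l2 <= loss and c2 <= cost) and (l2 < loss or c2 < cost)
            for _, l2, c2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical candidates: (name, QA loss, cost of verifying consistency).
candidates = [
    ("direct_translator", 0.10, 5.0),
    ("slow_but_accurate", 0.08, 50.0),
    ("cheap_but_wrong", 0.40, 1.0),
    ("dominated", 0.12, 60.0),
]
frontier = pareto_frontier(candidates)
```

Only the last candidate is excluded here, since another option beats it on both axes; choosing among the remaining three still requires some way of trading off the two objectives.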
Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.
One aspect of this proposal which I don't know how to do is evaluating the answers of the question-answerer. That looks to me very related to the deconfusion of universality that we discussed a few months ago, and without an answer to this, I feel like I don't even know how to run this silly approach.
You could imitate human answers, or you could ask a human "Is answer much better than answer ?" Both of these only work for questions that humans can evaluate (in hindsight), and then the point of the scheme is to get an adequate generalization to (some) questions that humans can't answer.
Ok, so you optimize the circuit both for speed and for small loss on human answers/comparisons, hoping that it generalizes to more questions while not being complex enough to be deceptive. Is that what you mean?
I'm mostly worried about parameter sharing between the human models in the environment and the QA procedure (which leads the QA to generalize like a human instead of correctly). You could call that deception but I think it's a somewhat simpler phenomenon.
Recently I've been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.
My initial focus was on preventing re-using predictive models of humans. But I'm feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.
I'm currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like "lining up" the two models and seeing what parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don't necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.
If this approach succeeds, then it is also directly applicable to avoiding re-using human models---we want to be lining up the internal computation of our model with concepts like "There is a cat in the room" rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the "re-using human models" problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or "locality" seems like a really central constraint for matching up world models, and doing that approach naively can greatly exacerbate the problems with copying the training process.)
So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails then I hope to at least know a bit more about the world model mismatch problem.
I think the best analogy in existing practice is probably doing interpretability work---matching up the AI's model with my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a "simplicity prior" is very natural when doing interpretability, instead using other considerations to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture of what interpretability should do (including in the regime where the correspondence is quite complex, and when the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)
(To restate the obvious, all of the stuff here is extremely WIP and rambling.)
I've often talked about the case where an unaligned model learns a description of the world + the procedure for reading out "what the camera sees" from the world. In this case, I've imagined an aligned model starting from the unaligned model and then extracting additional structure.
It now seems to me that the ideal aligned behavior is to learn only the "description of the world" and then have imitative generalization take it from there, identifying the correspondence between the world we know and the learned model. That correspondence includes in particular "what the camera sees."
The major technical benefit of doing it this way is that we end up with a higher prior probability on the aligned model than the unaligned model---the aligned one doesn't have to specify how to read out observations. And specifying how to read out observations doesn't really make it easier to find that correspondence.
We still need to specify how the "human" in imitative generalization actually finds this correspondence. So this doesn't fundamentally change any of the stuff I've recently been thinking about, but I think that the framing is becoming clearer and it's more likely we can find our way to the actually-right way to do it.
It now seems to me that a core feature of the situation that lets us pull out a correspondence is that you can't generally have two equally-valid correspondences for a given model---the standards for being a "good correspondence" are such that it would require crazy logical coincidence, and in fact this seems to be the core feature of "goodness." For example, you could have multiple "correspondences" that effectively just recompute everything from scratch, but by exactly the same token those are bad correspondences.
(This obviously only happens once the space and causal structure is sufficiently rich. There may be multiple ways of seeing faces in clouds, but once your correspondence involves people and dogs and the people talking about how the dogs are running around, it seems much more constrained because you need to reproduce all of that causal structure, and the very fact that humans can make good judgments about whether there are dogs implies that everything is incredibly constrained.)
There can certainly be legitimate ambiguity or uncertainty. For example, there may be a big world with multiple places that you could find a given pattern of dogs barking at cats. Or there might be parts of the world model that are just clearly underdetermined (e.g. there are two identical twins and we actually can't tell which is which). In these cases the space of possible correspondences still seems effectively discrete, rather than being a massive space parameterized as neural networks or something. We'd be totally happy surfacing all of the options in these cases.
There can also be a bunch of inconsequential uncertainty, things that feel more like small deformations of the correspondence than moving to a new connected component in correspondence-space. Things like slightly adjusting the boundaries of objects or of categories.
I'm currently thinking about this in terms of: given two different correspondences, why is it that they manage to both fit the data? Options:
I don't know where all of this ends up, but I feel some pretty strong common-sense intuition like "If you had some humans looking at the model, they could recognize a good correspondence when they saw it" and for now I'm going to be following that to see where it goes.
I tentatively think the whole situation is basically the same for "intuition module outputs a set of premises and then a deduction engine takes it from there" as for a model of physics. That is, it's still the case that (assuming enough richness) the translation between the "intuition module"'s language and human language is going to be more or less pinned down uniquely, and we'll have the same kind of taxonomy over cases where two translations would work equally well.
Here's an example I've been thinking about today to investigate the phenomenon of re-using human models.
Suppose that the "right" way to answer questions is . And suppose that a human is a learned model trained by gradient descent to approximate (subject to architectural and computational constraints). This model is very good on distribution, but we expect it to fail off distribution. We want to train a new neural network to approximate , without inheriting the human's off-distribution failures (though the new network may have off-distribution failures of its own).
The problem is that our model needs to learn the exact parameters for the human model in order to predict other aspects of human behavior. The simplest case is that we sometimes directly open human brains to observe directly.
Once we've learned it is very easy to learn the question-answering policy . So we're worried that our model will do that rather than learning the additional parameters to implement .
Intuitively there is a strong connection between and . After all, is optimized to make them nearly equal on the training distribution. If you understood the dynamics of neural network training it would likely be possible to essentially reconstruct from , i.e. the complexity of specifying both and is essentially the same as the complexity of specifying only .
But it's completely unclear how to jointly represent and using some parameters of similar size to . So prima facie there is a strong temptation to just reuse .
How much hope is there for jointly representing and ?
The most obvious representation in this case is to first specify , and then actually model the process of gradient descent that produces . This runs into a few problems:
My current take is that even in the case where was actually produced by something like SGD, we can't actually exploit that fact to produce a direct, causally-accurate representation .
That's kind of similar to what happens in my current proposal though: instead we use the learning process embedded inside the broader world-model learning. (Or a new learning process that we create from scratch to estimate the specialness of , as remarked in the sibling comment.)
So then the critical question is not "Do we have enough time to reproduce the learning process that led to ?" but rather "Can we directly learn as an approximation to ?" If we are able to do this in any way, then we can use that to help compress . In the other proposal, we can use it to help estimate the specialness of in order to determine how many bits we get back---it's starting to feel like these things aren't so different anyway.
Fully learning the whole human-model seems impossible---after all, humans may have learned things that are more sophisticated than what we can learn with SGD (even if SGD learned a policy with "enough bits" to represent , so that it could memorize them one by one if it saw the brain scans or whatever).
So we could try to do something like "learning just the part of the human policy that is about answering questions." But it's not clear to me how you could disentangle this from all the rest of the complicated stuff going on for the human.
Overall this seems like a pretty tricky case. The high-level summary is something like: "The model is able to learn to imitate humans by making detailed observations about humans, but we are not able to learn a similarly-good human model from scratch given data about what the human is 'trying' to do or how they interpret language." Under these conditions it seems particularly challenging to either jointly represent and , or to compute how many bits you should "get back" based on a consistency condition between them. I expect it's going to be reasonably obvious what to do in this case (likely exploiting the presumed limitation of our learning process), which is what I'll be thinking about now.
The difficulty of jointly representing and motivates my recent proposal, which avoids any such explicit representation. Instead it separately specifies and , and then "gets back" bits by imposing a consistency condition that would have been satisfied only for a very small fraction of possible 's (roughly of them).
But thinking about this neural network case also makes it easy to talk about why my recent proposal could run into severe computational problems:
So in order to salvage a proposal like this, it seems like (at a minimum) the "specialness evaluation" needs to take place separately from the main learning of the human model, using a very different process (where we consider lots of different human models and see that it's actually quite hard to find one that is similarly-consistent with ). This would take place at the point where the outer model started actually using its human model in order to answer questions.
I don't really know what that would look like or if it's possible to make anything like that work.
Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.
Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program "Search for the simplest program running in time T' that has property P, then run that." (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.)
I may be happy enough with this outcome, but there's some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.
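As a toy illustration of the "search for the simplest program with property P" setup, here is a minimal sketch. It makes strong simplifying assumptions: bitstrings stand in for programs, their length stands in for description complexity, and `toy_property` is a hypothetical cheap check standing in for P. The thing to notice is that the searcher is written entirely in terms of the check and never mentions the object it ends up finding.

```python
from itertools import product

def simplest_with_property(has_property, max_bits):
    """Return the shortest bitstring (shortest first, then lexicographic)
    that satisfies has_property, or None if none exists up to max_bits."""
    for n in range(max_bits + 1):
        for bits in product("01", repeat=n):
            candidate = "".join(bits)
            if has_property(candidate):
                return candidate
    return None

# Hypothetical stand-in for P: a cheap check on the candidate.
def toy_property(candidate):
    return (
        len(candidate) >= 3
        and candidate.count("1") == 2
        and candidate.endswith("1")
    )

found = simplest_with_property(toy_property, max_bits=8)
print(found)  # the searcher only ever referenced toy_property, not this string
```

If the check is much shorter to write down than the object it selects, then "run this search" is a shorter description than the object itself, which is the situation described above.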
The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.
That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)
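The calibration claim can be checked with a little arithmetic. The sketch below assumes a simplified cost model (not taken from any particular formal definition): a program's cost is its description length in bits plus a weight times the log of its runtime, and a brute-force wrapper that enumerates all n-bit programs multiplies runtime by 2^n. The cost of evaluating P is ignored, and the specific numbers are hypothetical.

```python
def cost(description_bits, log2_runtime, bits_per_doubling=1.0):
    """Prior cost in bits: description length plus a penalty per runtime doubling."""
    return description_bits + bits_per_doubling * log2_runtime

def brute_force_wrapper_cost(wrapper_bits, searched_bits, log2_runtime,
                             bits_per_doubling=1.0):
    """Cost of 'enumerate every program of searched_bits bits, then run the winner'.
    Enumeration multiplies runtime by 2**searched_bits."""
    return cost(wrapper_bits, searched_bits + log2_runtime, bits_per_doubling)

# Hypothetical numbers: a 1000-bit object-level program running in 2^50 steps,
# and a 20-bit wrapper that encodes only the property P.
object_speed = cost(1000, 50)                           # 1050.0
wrapper_speed = brute_force_wrapper_cost(20, 1000, 50)  # 1070.0

# Caring less about speed (half a bit per doubling) flips the comparison.
object_weak = cost(1000, 50, bits_per_doubling=0.5)                           # 1025.0
wrapper_weak = brute_force_wrapper_cost(20, 1000, 50, bits_per_doubling=0.5)  # 545.0

print(object_speed, wrapper_speed, object_weak, wrapper_weak)
```

Under the speed prior the wrapper can never come out ahead by brute force alone, because every bit it avoids describing is paid back as a runtime doubling. Under any weaker weighting, the wrapper wins as soon as its own description is shorter than the savings.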
To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.
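The numbers in the last two paragraphs can be reproduced directly. This is only a sketch of the arithmetic, under the assumption that the description length of the search procedure itself is negligible (treated as 0 bits here) because P is very simple; the helper name `cost` is illustrative.

```python
def cost(description_bits, log2_runtime, bits_per_doubling):
    """Prior cost in bits: description length plus a penalty per runtime doubling."""
    return description_bits + bits_per_doubling * log2_runtime

SPEED = 1.0          # 2x longer = 1 extra bit
KIND_OF_SPEED = 0.5  # 4x longer = 1 extra bit

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_kind_of_speed = cost(1000, 2000, KIND_OF_SPEED)  # 2000.0
m_speed = cost(1000, 2000, SPEED)                  # 3000.0

# A search using the speed prior finds M in about 2^3000 time.
# Assumption: the search's own description costs ~0 bits.
search_kind_of_speed = cost(0, m_speed, KIND_OF_SPEED)  # 1500.0

# Local search fills in a 1000-bit program in 2^500 steps.
local_search_speed = cost(0, 500, SPEED)  # 500.0
object_level_lower_bound = 1000           # description bits alone

# Effective exchange rate after delegating: bits found per runtime doubling.
effective_bits_per_doubling = 1000 / 500  # 2.0

print(m_kind_of_speed, search_kind_of_speed, local_search_speed,
      effective_bits_per_doubling)
```

The search beats the object-level program under the kind-of-speed prior (1500 versus 2000 bits), and the local search beats it under the speed prior (500 versus at least 1000 bits), matching the claims above.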
The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.
This is interesting to me for two reasons:
In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)
But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.
Some thoughts:
Causal structure is an intuitively appealing way to pick out the "intended" translation between an AI's model of the world and a human's model. For example, intuitively "There is a dog" causes "There is a barking sound." If we ask our neural net questions like "Is there a dog?" and it computes its answer by checking "Does a human labeler think there is a dog?" then its answers won't match the expected causal structure---so maybe we can avoid these kinds of answers.
What does that mean if we apply typical definitions of causality to ML training?
Here's an abstract example to think about these proposals, just a special case of the example from this post.
This is also a way to think about the proposals in this post and the reply:
So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.
We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken about B'=B, then we will introduce extra correlations between the human's beliefs about A' and C'.
I think it's pretty plausible that there are necessarily some "new" correlations in any case where the human's inference is imperfect, but I'd like to understand that better.
So I think the biggest problem is that none of the human's believed conditional independencies actually hold---they are both imprecise, and (more problematically) they may themselves only hold "on distribution" in some appropriate sense.
This problem seems pretty approachable though and so I'm excited to spend some time thinking about it.
Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
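Here is a small exact calculation of the A+C example. It is a sketch under assumptions that are mine rather than part of the setup above: A, B, C are binary, each link of the chain copies its parent with a hypothetical flip probability of 1/4, and conditioning on the true value of B stands in for "my beliefs about B."

```python
from fractions import Fraction
from itertools import product

FLIP = Fraction(1, 4)  # hypothetical noise level on each link of the chain

def joint():
    """Exact joint distribution over (a, b, c) for the chain A -> B -> C."""
    dist = {}
    for a, b, c in product((0, 1), repeat=3):
        p_a = Fraction(1, 2)
        p_b = FLIP if b != a else 1 - FLIP
        p_c = FLIP if c != b else 1 - FLIP
        dist[(a, b, c)] = p_a * p_b * p_c
    return dist

def prob(dist, event, given):
    """P(event | given), where both are predicates on (a, b, c)."""
    denom = sum(p for x, p in dist.items() if given(x))
    numer = sum(p for x, p in dist.items() if given(x) and event(x))
    return numer / denom

dist = joint()
a_is_1 = lambda x: x[0] == 1

# In the chain itself, learning C tells us nothing more about A once B is known.
p_a_given_b = prob(dist, a_is_1, lambda x: x[1] == 1)
p_a_given_bc = prob(dist, a_is_1, lambda x: x[1] == 1 and x[2] == 1)

# After also observing A + C = 1, learning C pins A down completely.
p_a_given_bs = prob(dist, a_is_1, lambda x: x[1] == 1 and x[0] + x[2] == 1)
p_a_given_bsc = prob(
    dist, a_is_1, lambda x: x[1] == 1 and x[0] + x[2] == 1 and x[2] == 1
)

print(p_a_given_b, p_a_given_bc)    # equal: conditional independence holds
print(p_a_given_bs, p_a_given_bsc)  # differ: it fails after observing A + C
```

So even though the underlying chain has the expected independence, the posterior after observing a function like A+C does not.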