We discuss the definition of "knowledge" a bit in this appendix; compared to your definitions, we want to only say that the model "knows" the value of X when it is actually behaving differently based on the value of X in order to obtain a lower loss. I think this is strictly weaker than your level 2 (since the model needs to actually be using that knowledge) and incomparable to your level 1 (since the model's behavior might depend on an estimate of X without the model having any introspective knowledge about that dependence), though I might be misunderstanding.
This definitely isn't well-defined, and this is the main way in which ELK itself is not well-defined and something I'd love to fix. That said, for now I feel like we can just focus on cases where the counterexamples obviously involve the model knowing things (according to this informal definition). Someday in the future we'll need to argue about complicated border cases, because our solutions work in every obvious case. But I think we'll have to make a lot of progress before we run into those problems (and I suspect that progress will mostly resolve the ambiguity).
(The situation is reversed if someone tries to make impossibility arguments about ELK---those arguments very often involve murky cases where it's not really clear if the model knows.)
First in this list is the case where the correct translation is arbitrarily computationally complex.
However, this assumption implies that ELK is intractable.
This isn't clear to me---we argue that direct translation can be arbitrarily complex, and that we need to solve ELK anyway, but we don't think the translator can be arbitrarily complex relative to the predictor. So we can still hope that jointly learning the (predictor, translator) is not much harder than learning the predictor alone.
The answer (at least, as I see it) is by arguing that this case is impossible.
If we find a case that's impossible, I definitely want to try to refine the ELK problem statement, rather than implicitly narrowing the statement to something like "solve ELK in all the cases where it's possibly possible" (not sure if that's what you are suggesting here). And right now I don't know of any cases that seem impossible.
This definitely isn't well-defined, and this is the main way in which ELK itself is not well-defined and something I'd love to fix. That said, for now I feel like we can just focus on cases where the counterexamples obviously involve the model knowing things (according to this informal definition). Someday in the future we'll need to argue about complicated border cases, because our solutions work in every obvious case. But I think we'll have to make a lot of progress before we run into those problems (and I suspect that progress will mostly resolve the ambiguity).
Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment.
This still leaves the question of borderline cases, since the definition of knowledge may be imperfect. So it's not necessarily that I'm trying to solve the borderline cases.
We discuss the definition of "knowledge" a bit in this appendix;
Ah, yep, I missed that!
This isn't clear to me---we argue that direct translation can be arbitrarily complex, and that we need to solve ELK anyway, but we don't think the translator can be arbitrarily complex relative to the predictor. So we can still hope that jointly learning the (predictor, translator) is not much harder than learning the predictor alone.
Ahh, I see. I had 100% interpreted the computational complexity of the Reporter to be 'relative to the predictor' already. I'm not sure how else it could be interpreted, since the reporter is given the predictor's state as input, or at least given some form of query access.
What's the intended mathematical content of the statement "the direct translation can be arbitrarily complex", then?
Also, why don't you think the direct translator can be arbitrarily complex relative to the predictor?
> The answer (at least, as I see it) is by arguing that this case is impossible.
If we find a case that's impossible, I definitely want to try to refine the ELK problem statement, rather than implicitly narrowing the statement to something like "solve ELK in all the cases where it's possibly possible" (not sure if that's what you are suggesting here). And right now I don't know of any cases that seem impossible.
Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.
In any case, what you say makes sense.
Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.
Ah, I just totally misunderstood the sentence, the intended reading makes sense.
Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment.
I agree that's possible, and it does seem like a good reason to try to clarify a definition of knowledge.
Ahh, I see. I had 100% interpreted the computational complexity of the Reporter to be 'relative to the predictor' already. I'm not sure how else it could be interpreted, since the reporter is given the predictor's state as input, or at least given some form of query access.
What's the intended mathematical content of the statement "the direct translation can be arbitrarily complex", then?
Sorry, what I mean is:
Also, why don't you think the direct translator can be arbitrarily complex relative to the predictor?
I assume this was based on my confusing use of "relative to." But answering just in case: if we are defining "knowledge" in terms of what the predictor actually uses in order to get a low loss, then there's some hope that the reporter can't really be more complex than the predictor (for the part that is actually playing a role in the predictor's computation) plus a term that depends only on the complexity of the human's model.
Your definition of L-knowledge implies there can 'only' be total possible latent variables in the universe that are L-knowable for any given L, I believe.
This isn't strictly a problem, as you can just increase L... but your upper bound on L before the answer is trivially 'yes' is the inverse Kolmogorov complexity of the program trace + o(1). This grows slower than any computable function.
I'd be concerned that for programs of 'realistic' (read: 'fits within the universe') sizes there is no such L.
So how are we supposed to solve ELK, if we are to assume that it's intractable?
A different answer to this could be that a "solution" to ELK is one that is computable, even if intractable. By analogy, algorithmic "solutions" to probabilistic inference on Bayes nets are still solutions even though the problem is provably NP-hard. It's up to the authors of ELK to disambiguate what they're looking for in a "solution," and I like the ideas here (especially in Level 2), but just wanted to point out this alternative to the premise.
Sidenote: there seems to be some connection between this line of thought and my ideas about antitraining (how to design a learning algorithm for task D that doesn't develop latent knowledge about task E).
ELK is described as a problem we wish to solve in the "worst case"; IE, for any potential failure case we can describe, we want to have a strong argument about why we won't run into that particular problem. The ELK document lists several cases we want to handle in that way. First in this list is the case where the correct translation is arbitrarily computationally complex.
However, this assumption implies that ELK is intractable. At best, we might hope to define some notion of approximate solution, and prove that there are efficient approximate solutions. So how are we supposed to solve ELK, if we are to assume that it's intractable?
The answer (at least, as I see it) is by arguing that the correct translation cannot be too computationally complex. (Or, failing that, by providing a strong argument that it is possible, at which point we would have to narrow the scope of ELK ambitions.)
A proposed algorithm for solving ELK which (1) has a strong argument for its correctness, and (2) is itself a tractable algorithm, is usually already an argument that correct translations are tractable. For example, a search procedure usually has to evaluate candidate translations by running them; so if we could prove that the search terminates in a reasonable amount of time, we've probably already proven that the solution it finds is tractable (at least in certain cases, EG, on the training data).
But this doesn't really answer the question. Since we don't already have such a solution, we may need to explicitly think about tractability in order to produce one.
So, in this post, I'm going to think about how we might possibly bound the computational complexity of ELK.
Level 3: Probabilistic Knowledge
The third level of difficulty for ELK (IE, the most computationally difficult that I'll discuss here) is the most conceptually simple to understand: we define the "correct translation" of a pattern of activity in the Predictor to be any and all things correlated with such activity.
In other words, we define the "meaning" of Predictor activity as just whatever we can predictably infer from it. In my previous post I formalized this by imagining probability distributions called "third person perspectives" which can be conditioned on the full execution trace of the Predictor, and infer as much as possible about the world from this information.
This seems intractable for two main reasons:
By trying to detect any correlation, we impose an arbitrarily high requirement of detailed simulation. As a consequence, this Reporter would know far more than the Predictor can realistically be said to know. If the Predictor knows how the dominoes have been set up, the Reporter is required to know how they fall.
From a safety perspective, we're fine with the reporter knowing too much -- we're only concerned with cases where the Predictor knows something which the humans don't know. However, this does seem like a sign that we don't have to solve such a hard version of the problem.
So, can we specify a more tractable version?
Level 2: Computationally Accessible Knowledge
The intuition I'm trying to capture with 'level 2' is: to "know" something means that a (computationally) relatively close version of you could report the information. This rules out computationally intractable "knowledge" which follows probabilistically from the Predictor's state, but which cannot be easily computed from it.
To put it a different way: knowledge is only knowledge if it can feasibly be used to do something. For example, if I have a book telling me how to use the SuperSort algorithm (a hypothetical sorting algorithm that’s better than linear time), then there’s a pragmatic sense in which I know how to write SuperSort -- if you ask me to optimize some code which includes a sorting step, I’ll look up the SuperSort algorithm and implement it, to see if it improves the performance of your code.
However, if I instead had an encrypted copy of the book, then I wouldn’t possess the knowledge in the pragmatic sense. If you asked me to optimize the same code, I would try a different approach.
This could be formalized in many different ways. Here is a provisional attempt:
Definition: The L-knowledge of a latent variable X based on the existence of a program, as follows. The program takes in an execution trace of the Predictor in a given situation. It outputs a pair of booleans. The first boolean indicates whether the program is able to decide the latent variable in this case. (Think of this as “whether the predictor has the latent knowledge”.) If so, the second boolean indicates the state of the latent variable X. (This could be easily generalized to non-boolean variables.) Call the length of the program K, and the time to run the program T. L-knowledge exists (in a specific situation) when K+log(T)<L, and the first output bit is true.
The first bit is needed to allow programs to opt out of telling us the status of the latent variable in cases where it is impossible for them to do so. Otherwise, ‘knowledge’ would only exist when a reporter could infallibly tell us the status of the latent variable in every case. We can’t just define L-knowledge as the existence of a program which tells us the status of the latent variable in this specific case (and satisfies K+log(T)<L), because if we defined it in this way, L-knowledge would be too easy. There’s always a short program that outputs 1 and a short program which outputs 0, so whichever value the variable actually has, a program certifying L-knowledge for very low L would exist.
Possibly, the above definition is still too easy to achieve, because we can still do that exact same trick by making the first bit of the program recognize only one target situation. (This allows us to create a bound above which L-knowledge always exists; but, at least the bound is not constant. It depends on the Levin complexity of recognizing the specific scenario.)
However, as I've already argued for Level 3, it’s fine (for safety purposes) if our definition of knowledge includes too many cases. If we can construct a reporter who can tell us L-knowledge whenever it exists, and we have confidence that knowledge implies L-knowledge, then we can create a reporter at least as informative as the direct reporter. (It might sometimes tell us the truth when the predictor doesn’t actually know the truth, but that’s fine!)
However, unfortunately, it's not obvious that meaningful "knowledge" implies L-knowledge. My Level 3 definition seemed to safely over-estimate the Predictor's knowledge; but that is not nearly so obvious with L-knowledge.
Knowledge Without L-Knowledge?
L-knowledge allows the translator to self-select cases when knowledge exists and when it doesn't; but, within those cases, knowledge is required to be perfect. A translation which predicts the target variable 95% of the time does not count for anything (because it is wrong infinitely often). We could also define some sort of relaxed p,L-knowledge which only required the knowledge to be p% correct, or something along those lines. I won't discuss this possibility any further in the current essay, however.
Another potential failure case I'm worried about relates to using knowledge for planning.
There are roughly two types of concerning cases for ELK. In one case, the SafeVault AI uses latent knowledge which the human lacks, to enact a plan which drives apart the proxy reward from what humans would like to optimize. Call this “malign use of latent knowledge.” Example: the SafeVault AI executes an undetectable exploit to jettison itself (with the diamond) into space. Everything looks fine on the sensors, and from the outside of the vault, but if someone tries to go inside (eg, to retrieve the diamond) they’ll discover that the building is now totally empty.
In another case, the SafeVault AI fails to use latent knowledge. For example, a burglar executes an undetectable sensor exploit in order to take the diamond. The SafeVault understands what’s going on but doesn’t care, because its reward is not impacted. Call this “malign negligence of latent knowledge.”
One way to understand malign negligence is, “AI systems should not ignore mission-critical information that they could have taken into account”. But how do we judge when they “could have” taken information into account? L-knowledge seems like a reasonable way to cash out this “could”. Where L-knowledge exists, there is a “close” version of the system which takes this knowledge into account. (This close version is the policy we want to incentivise, using the reporter’s information.) Where L-knowledge doesn’t exist, such a variant is either too costly to run or too hard to find.
However, malign use of latent knowledge is in some sense more concerning. Can we provide any guarantees against it?
Informal definition: planning-knowledge of X is the ability to make plans that importantly rely on X. This means something like: in worlds where X is different, the actions that the SmartVault AI takes are different (while, presumably, achieving similar ends).
Conjecture: Planning-knowledge implies L-knowledge (for some fairly low L).
If something like this can be established, it would provide significant reassurance that uncovering the Predictor's L-knowledge protects us from malign use of latent knowledge.
Efficiency of L-knowledge-based algorithms?
L-knowledge, or some suitable variation, could formalize the term "knowledge" in ELK without requiring the Reporter to be logically omniscient. However, might it still imply severe computational problems for ELK?
Although the concept of L-knowledge implies that it is relatively easy to run the correct translation once found, it could still be computationally intractable to find the correct translation.
Can these concerns help us to further refine our definition of knowledge?
Level 1: Knowledge of Knowledge
The intuition behind level 1 is: for computational knowledge ("level 2") to really count, the Predictor has to know what it knows. Consider belief-in-belief. If I'm deluded into thinking that water is made of tiny water-demons, but all of my predictions line up with the standard hydrogen hydroxide hypothesis, then at some level you might say I "know" the hydrogen hydroxide hypothesis. There's probably a computationally close version of me who would answer questions in line with the hydrogen hydroxide hypothesis, instead of the water-demon hypothesis. However, I may not know that this is the case.
I guess another way to put it is: deception requires knowledge of deception. To deceive you, I need to know that there's a computationally close version of me who could tell the truth. We may be able to eliminate deception (which seems good!) without solving harder versions of ELK.
The main advantage of this idea is that it could allow us to bound the difficulty of finding the correct translation, rather than just bounding the difficulty of executing the correct translation. This could solve the computational complexity problem entirely (perhaps at the cost of delivering a weaker version of ELK, with weaker safety guarantees).
I don't have any idea how to formalize level 1 at the moment, so I'll leave it at this for now.