I found this post clarifying. One thing I'm still uncertain of: what's the architecture of the Reporter in this proposal? Does it have two heads, one for proposing changes to the Predictor's state and one for answering questions? If so, can I think of the training process as:
Is that right?
What part of the proposal breaks if we do counterfactuals in input space rather than on the predictor's state?
Sorry, I'm not quite clear what you mean by this, so I might be answering the wrong question.
I believe counterfactuals on the input space are a subset of counterfactuals on the predictor's state, because the input space's influence is through the predictor's state, but modifying the predictor's state can also reach states that don't correspond to any input. As such, I don't think counterfactuals on the input space add any power to the proposal.
This post was written for the SERI MATS program. I thank Evan Hubinger and Leo Gao for their mentorship in the program. Further thanks go to Evan Hubinger (again), Simon Marshall, and Johannes Treutlein for specific comments regarding the content of this post.
The Eliciting Latent Knowledge (ELK) problem was first introduced by Paul Christiano, Mark Xu, and Ajeya Cotra. Subsequently, Abram Demski worked on the problem, collected his thoughts in a thought dump post, and won a prize for his proposal “use the reporter to define causal interventions on the predictor”. Here I attempt to clarify and flesh out these thoughts in order to make them more accessible. I assume familiarity with ELK, but not with Abram’s post. Very little of this post is my own original content.
Epistemic status: 60% confident that I am accurately representing Abram’s thoughts at the time he wrote his post, 75% confident that I am representing them accurately enough not to change the key takeaways, 80% confident that the extended proposals and counterexamples I propose are logically sound.
Introduction
When the technical report for Eliciting Latent Knowledge (ELK) was first released, it was followed by a contest offering substantial cash prizes for proposed solutions. The contest wrapped up with thirty-two prizes awarded, split into six categories. In the prize announcement, Paul and Mark noted that there was substantial convergence among proposals, but in two of the six categories there was only a single submission.
Abram Demski’s proposal was the sole entry in the category “Use reporter to define causal interventions”. A quick summary is provided in the prize results post, but the full context and justification for it are better provided in a longer post by Abram. Specifically, Abram shares his thoughts on how the ELK problem ties to philosophical concepts of truth and knowledge, and how that can be used to inform solutions.
Evan Hubinger describes Abram’s ELK work as “some of the most exciting work on the problem right now”, but Abram’s post is self-described as a “thought dump”, and the prize post does not explicitly lay out the case for how the proposal would work. I aim to re-organize, clarify, and flesh out his work in order to make it more accessible for forum readers. I will go in the opposite order of Abram’s original post, starting with the prize-winning proposal and its counterexamples before going into the context and motivation behind it, then ending with a discussion of possible modifications and extensions.
The Proposal and Its Counterexamples
The Proposal
The direct reporter must have a deep understanding of how the predictor works, and some human simulators will have this understanding too. This means that these reporters know not just how the state of the predictor would look when processing an existing data point, but also how it would look under hypothetical data points. A human could then ask the reporter to search over potential data points and provide the configuration of the predictor under counterfactual scenarios.
This ability to get these scenarios from the reporter gives us a method for tripping up human simulators. We start with the predictor’s state and the answers to a set of questions from the reporter in a real scenario, then ask the reporter to suggest the smallest change to the predictor’s state that would make the reporter give a particular different answer to a specified question in the new state[1]. The proposal is called defining causal interventions because causal interventions are the changes with the smallest impact on the predictor’s state, altering only elements causally downstream of the intervention. The reporter is then reset, so that it forgets that it suggested the change to the predictor, and is shown this counterfactual state as if it had actually occurred. We can then ask the reporter a variety of questions whose answers depend on the counterfactual, and see if the answers change in a plausible way.
If the reporter is a human simulator without a deep understanding of the predictor, then for some counterfactuals it will be unable to make the fine changes to the predictor needed to update the answer to one question with minimal other impact. The prediction made by the predictor will change in unusual ways in the counterfactual state, beyond what would reasonably follow from the requested change. An evaluator can notice the strange changes in the prediction and the reporter’s answers to various questions, and penalize the reporter based on this.
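To make the procedure concrete, here is a minimal sketch of one round of the check in Python. Everything here is hypothetical scaffolding of my own (the method names `propose_counterfactual`, `reset`, `answer`, and `penalty` do not refer to any existing implementation); it only shows the order of operations: request a counterfactual edit, reset the reporter, show it the edited state as if it were real, then grade its answers against what the human knows by construction.

```python
def counterfactual_consistency_penalty(reporter, predictor_state, question,
                                       desired_answer, probe_questions, human):
    """Sketch of one round of the causal-intervention check.

    All objects and methods here are hypothetical placeholders, not a real API.
    """
    # 1. Ask the reporter for the smallest edit to the predictor's state that
    #    flips its answer to `question` to `desired_answer`.
    counterfactual_state = reporter.propose_counterfactual(
        predictor_state, question, desired_answer)

    # 2. Reset the reporter so it does not remember proposing the edit.
    fresh_reporter = reporter.reset()

    # 3. Show the fresh reporter the counterfactual state as if it had actually
    #    occurred, and ask it a variety of probe questions.
    answers = {q: fresh_reporter.answer(counterfactual_state, q)
               for q in probe_questions}

    # 4. The human knows which counterfactual was requested, so they can
    #    penalize answers that are implausible given that knowledge.
    return sum(human.penalty(q, answer, question, desired_answer)
               for q, answer in answers.items())
```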
What about a human simulator that does have a deep understanding of the predictor? It took me many read-throughs to understand why this proposal presents a challenge for such a reporter. Why can’t it just simulate the human to see how they would answer the questions in the counterfactual scenario?
The answer is that because the human specified the counterfactual scenario, they have additional knowledge about what is going on in the background of the prediction. The reporter does not know that it is in a counterfactual scenario, so it does not know the human has this knowledge and does not incorporate it into its simulation. When the reporter simulates a human without this additional knowledge, the actual human can use the asymmetry to catch and penalize discrepancies in the reporter’s answers.
An example helps illustrate how this works. In the SmartVault setup, let us start with a simple data point where the diamond obviously remains in the vault. The human can then ask the reporter to change the predictor’s state to one where the diamond was stolen, or better yet one where the diamond was stolen but the predicted video remains the same. Now, the human knows that the diamond is not in the vault, but when the reporter is shown the counterfactual scenario and asked questions about it, it does not know the human knows. If the reporter just looks at the predicted video and simulates what a human would think happened, it will report that the diamond is still in the vault. The human knows this to be false, so they can then penalize or reject that reporter.
Early proposals in the initial ELK report try to catch the human simulator by generating data points that the classifying human knows more about than a normal simulated human would. These proposals failed because a human simulator could simulate the classifying human, who would still be unable to properly classify extremely complex scenarios. This proposal tries to trick human simulators in the same way, but is not constrained by needing to generate real data points. No matter how complex the scenario, the human is able to classify it properly because they’re the one who set it up.
The Counterexamples
How could this process go wrong? The ELK prize results give two counterexamples. In the first, the reporter is non-local, meaning that it depends on many different parts of the predictor to answer any question. That makes it difficult to modify the predictor to a counterfactual state, especially without making significant other changes to the state. This counterexample is not so much saying that the proposal wouldn’t work, but rather that it couldn’t even be implemented due to the complexity required.
The second counterexample suggests a way in which the proposal could fail even if properly implemented. The proposal counts on the human simulator not realizing it is in a counterfactual scenario, and therefore not realizing the human has additional knowledge. However, even if the reporter is reset after implementing the counterfactual, it still may be able to realize what has happened based on the predictor’s state. When modifying the predictor’s state, the reporter could leave messages for its future self explaining the counterfactual it was asked to implement. In the worst case scenario where the prediction the human can see doesn’t change, the reporter doesn’t even have to generate the counterfactual state. It can just leave a message for itself to pretend that it has.
“Eliciting” “Latent” “Knowledge”
Abram’s proposal makes sense on its own, but the process by which it was derived includes a detailed way of thinking about the ELK problem. Even if the proposal cannot be modified to overcome the counterexamples, the background may be useful in generating new approaches to solving the problem.
Preliminary Definitions
To solve the problem of eliciting latent knowledge, it helps to understand what each of those three words means. Doing so requires defining some notoriously thorny and controversial terms (see this sequence by Alex Flint for some of the issues with defining knowledge), so let me be clear that what follows are my interpretations of the working definitions Abram uses, rather than ideal definitions capturing all relevant aspects of each word.
Before we can talk about whether an agent knows a statement, we need to understand the semantics of the statement (what the statement means). One way to define the meaning of a statement, which we will use, is with a truth-function. A truth-function takes in a statement and a world-state, and maps the pair to a truth-value, such as true, false, or nonsensical. Over the domain of all possible world states, the truth function tells us under which exact conditions a statement has each truth-value.
This brings us to the next level of the definition game: what is truth? Given some beliefs, we say that truth is a correspondence from the beliefs to a set of possible realities (the correspondence theory of truth), and in exactly those realities the beliefs are true. A basic correspondence to use for illustrative purposes is Aquinas’ “A judgment is said to be true when it conforms to the external reality”, if we assume that beliefs have some shape such that they can conform to reality. Here, the beliefs are analogous to a map, while reality is analogous to the territory.
Combining this definition of truth with the truth-function definition of meaning implies that the meaning of a statement depends on the correspondence used for truth. Under Aquinas’ correspondence, the meaning of a statement is then defined through the set of realities that the statement conforms to. While we could dive deeper into defining words like “belief”, “reality”, and “conform”, at some point we need to stop playing the definition game and move on.
Finally, we need to determine how beliefs turn into knowledge. One of the oldest definitions of knowledge is “justified true belief”, but this can fall apart in what are known as Gettier cases, where the justification for a true belief is unsound. Instead, we use Nozick’s truth-tracking definition, where we say an agent knows a statement if the statement is included in the agent’s beliefs when the statement is true, and not included when the statement is false. Evaluating counterfactuals is one way to determine if an agent truly knows some belief they hold, but using conditional probabilities instead allows an agent to have knowledge despite uncertainty about the world state, as long as they have the correct beliefs given each world state.
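As a toy illustration of the truth-tracking definition (my own sketch, not something from Abram’s post): represent a world-state as a dictionary of facts, the statement’s truth-function as a predicate on world-states, and an agent’s beliefs as a function from world-states to the set of statements the agent would assert. The agent knows the statement when its belief tracks the truth-value across every world under consideration.

```python
def knows(statement, truth_function, beliefs, worlds):
    """Toy truth-tracking check: the agent believes the statement in exactly
    those worlds where the truth-function says it is true."""
    return all((statement in beliefs(w)) == truth_function(w) for w in worlds)

# Example statement: "the diamond is in the vault".
worlds = [{"diamond_in_vault": True}, {"diamond_in_vault": False}]
truth_fn = lambda w: w["diamond_in_vault"]

# An agent whose beliefs track the world, and one that believes regardless.
tracking = lambda w: {"diamond in vault"} if w["diamond_in_vault"] else set()
credulous = lambda w: {"diamond in vault"}

print(knows("diamond in vault", truth_fn, tracking, worlds))   # True: knowledge
print(knows("diamond in vault", truth_fn, credulous, worlds))  # False: mere belief
```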
Evaluating Beliefs
As one agent trying to determine whether a belief of some other agent is true, what we would ideally like to do is compare that belief to the actual physical reality to see if the truth correspondence holds. Unfortunately, neither we nor any other possible agent has direct access to physical reality, instead getting a version biased by our perception. In order to evaluate a belief in this way, we would need to make an assumption about what reality is. If the assumption is totally wrong, placing no probability on the actual reality, then the comparison of belief to reality contains no useful information.
Rather than try to compare another agent’s beliefs directly to reality, the most an evaluator can do is compare those beliefs to the evaluator’s own beliefs about reality. This is necessarily how we will need to evaluate beliefs for ELK, but it still leaves open the question of how to make that comparison.
For an individual agent, their perception of the world is filtered through the sensory data they have available. From their perspective, each possible world is associated with some set of data from their sensors, lumping together worlds that generate the same data. It is within that paradigm that they determine the conditions on the world that correspond to the truth values of a statement, so the conditions can only depend on the sensors they have available. We will call this a first person perspective, because it is reality from the perspective of some agent.
The issue with the first person perspective is that it doesn’t allow for communication or comparison between agents. Consider the example of someone who has been blind from birth trying to communicate with someone who has been deaf from birth. It’s unclear what it would even mean for them to compare their subjective perceptions of the brightness or loudness of an object. Differences in sensors need not be so extreme, either; the problem with comparisons can arise even from slight differences. If two agents have the same type of visual sensors but are pointing them at different targets, subjective words like “left” or “right” lose their meaning.
Fortunately, we do know that communication between humans is possible even if they occupy separate bodies. The way we do this is by replacing subjective words like “left” with objective words or sequences of words like “west”, “in the direction my finger is pointing”, or “left when facing towards the front door of the office from the outside”. Brightness could be defined by a measure of photons emitted, and loudness by vibrations in the air. We call this a third person perspective, because it takes a first person perspective and removes the subjective perception.
To compare two first person perspectives that rely on different sensors, it is necessary to translate each of them into the third person perspective, or from one first person perspective to the other through the third person perspective. From there, evaluating the belief of another agent just becomes a check of whether its translation is equal to the evaluator’s translation. However, there are many possible translations into a third person perspective, so how do we determine a good one?
One criterion that we can use for a good translation is counterfactual correspondence. What this means is that counterfactual changes that happen in either the first or third person perspective have immediate and downstream consequences mirrored in the other. For this to work, the causal structure of reality modeled in the first person perspective must be represented in the third person perspective. If not, then the combination of consequences would be considered impossible by the third person perspective, meaning there is no way to describe it and therefore no translation.
Formalizing the Third Person Perspective
The third person perspective can be thought of as a set of possible worlds (structures of reality) and a probability distribution over them. Each world consists of a set of events that happen in that world, so the third person perspective implies a probability distribution over events. Each first person perspective can perceive some subset of events, meaning worlds that differ only in events outside that subset appear identical. The first person perspective is then, like the third person perspective, a set of worlds and a probability distribution over them, but each of its worlds corresponds to a set of worlds in the third person perspective that the agent cannot tell apart. Each third person perspective can contain many first person perspectives.
A translation from first to third person is then a mapping that takes the probability assigned to each world in the first person perspective and splits it between each world that contains the same events in the third person perspective[2]. Similarly, a translation from third to first person assigns to each world in the first person perspective the sum of probabilities of each world in the third person perspective containing the same events. A translation can alternatively be thought of as mapping to probabilities of events, rather than a mapping to probabilities of worlds containing events.
This might be a little confusing, so let’s give a concrete example. In this example, there are eight possible worlds in the third person perspective, labeled 0 to 7. There are three events, A, B, and C, which occur in the worlds where the binary representation of the label has a one in the first, second, and third positions respectively (counting from the least significant bit). Each world is assigned equal prior probability.
Agent 1 can distinguish events A and B, while Agent 2 can distinguish events B and C, and both know with certainty whether or not their events have occurred. Let us say that world 5 is the actual world. The translation from Agent 1’s perspective (A occurs and B does not) to the third person perspective would place a 50% probability on world state 1, and 50% on world state 5. This could then translate from the third person perspective to Agent 2’s perspective as a probability of 0 on event B occurring, and a 50% chance of event C. Note that since information is lost translating out of the third person perspective, translating back from Agent 2’s perspective to Agent 1’s will not be the inverse of the first translation.
Agent 1’s beliefs about the occurrence of A and B track the actual world. They believe events A and B occur if and only if they actually occur. From Agent 2’s perspective, Agent 1 believes that event B occurs if and only if it actually occurs and always places a 50% chance on event C occurring. This means that, to Agent 2, Agent 1 knows about event B, but does not know about event C.
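The arithmetic above can be reproduced with a short script. This is just an illustrative sketch I added; it encodes the worlds and events exactly as in the example (A, B, and C are the first, second, and third bits of each world’s label).

```python
from fractions import Fraction

# Eight worlds labelled 0-7; events A, B, C occur when the corresponding bit
# of the label is set (A = bit 0, B = bit 1, C = bit 2), so world 5 = 0b101
# is a world where A and C occur but B does not.
worlds = range(8)
bit = {"A": 0, "B": 1, "C": 2}
prior = {w: Fraction(1, 8) for w in worlds}

def occurs(world, event):
    return (world >> bit[event]) & 1 == 1

def first_to_third(observations, perceivable):
    """Spread a first person observation (event -> bool, over the events the
    agent can perceive) across the consistent third person worlds, in
    proportion to the prior."""
    consistent = [w for w in worlds
                  if all(occurs(w, e) == observations[e] for e in perceivable)]
    total = sum(prior[w] for w in consistent)
    return {w: prior[w] / total for w in consistent}

def third_to_first(distribution, perceivable):
    """Collapse a third person distribution to probabilities of the events a
    given agent can perceive."""
    return {e: sum(p for w, p in distribution.items() if occurs(w, e))
            for e in perceivable}

# Agent 1 perceives A and B; in the actual world (5), A occurs and B does not.
third = first_to_third({"A": True, "B": False}, ["A", "B"])
print(third)                              # worlds 1 and 5, each with probability 1/2
print(third_to_first(third, ["B", "C"]))  # event B: probability 0, event C: 1/2
```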
Ideal ELK Solutions
In the ELK problem, the predictor provides us with a first person perspective. Translating that into a third person perspective, and then translating the third person perspective into the human’s first person perspective to give probabilities over events can be thought of as the direct reporter. However, actually implementing this as a strategy is likely to be computationally intractable, since it requires coming up with and running a huge number of possible worlds to determine which events occur in each of them. Rather, it can be thought of as a target for more concrete proposals to approximate.
On the topic of targets to aim for, we can take the concept to its limit. What we would really like is for a human to condition on all of the information in the predictor, and then do perfect inference within their human world model. That would take into account all available data, and extract all possible information from it. Having an arbitrarily amplified human look at the predictor and see what they can figure out is necessarily the highest benchmark for any possible reporter, because with sufficient amplification the human can have any reporter at their disposal.
Extending the Proposal
Returning to the proposal of using the reporter to define causal interventions on the predictor, we may be wondering if the proposal can be modified so that the given counterexamples are overcome. Here we discuss some additional ideas, as well as the issues with them.
In his original post, Abram suggests that if the human simulator is leaving coded messages for itself, it will always look in the same place to read the messages, which is a behavior that could be penalized. Unfortunately, there is no guarantee that these messages will occur in the same location; they may instead be distributed throughout the model.
Before we continue, let us cover another counterexample to the original proposal.
Counterexample: The Predictor Makes Large Changes
In the proposal, we ask the reporter to make the smallest possible change to the predictor’s state that will implement a counterfactual. However, we have no guarantee that the reporter’s recommended change is actually the smallest possible. The reporter could instead make larger changes to the model in a way that is visible to humans, so that when the reporter later answers questions, human simulation will give the answers corresponding to the specified counterfactual. Even if we add a regularization term based on the size of the modification needed to implement the counterfactual, the gains from making large changes may more than offset the penalty.
Proposal: Use Random Noise to Overwrite Coded Messages
If the main issue is that the reporter is leaving coded messages for itself in the predictor, we can try to scrub the messages before showing the reporter the updated predictor state. After implementing the counterfactual, we will add small amounts of noise throughout the predictor’s state so that any potential messages are lost, without affecting the prediction. Since we are not exactly sure what the effect of noise will be, we can run this many times to see how often adding noise changes the answers to questions in the counterfactual, and compare it to how often adding noise changes the answers to questions under the real data point. Then we will penalize reporters for whom adding noise disproportionately changes their answers to questions.
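A minimal sketch of this comparison, assuming hypothetical `reporter.answer` and `add_noise` interfaces (nothing here refers to an existing implementation):

```python
def answer_instability(reporter, state, questions, add_noise, n_trials=100):
    """Fraction of (trial, question) pairs where adding noise to the state
    changes the reporter's answer relative to the noiseless baseline."""
    baseline = {q: reporter.answer(state, q) for q in questions}
    changed = 0
    for _ in range(n_trials):
        noisy_state = add_noise(state)
        changed += sum(reporter.answer(noisy_state, q) != baseline[q]
                       for q in questions)
    return changed / (n_trials * len(questions))

def noise_penalty(reporter, real_state, counterfactual_state, questions, add_noise):
    # Penalize reporters whose answers are disproportionately unstable on
    # counterfactual states compared to real ones -- a hint that the noise
    # overwrote a coded message the reporter was relying on.
    real = answer_instability(reporter, real_state, questions, add_noise)
    counterfactual = answer_instability(reporter, counterfactual_state, questions, add_noise)
    return max(0.0, counterfactual - real)
```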
Counterexample: Adding Noise May Not Hide Messages
There are two issues with this proposal. The first is that adding noise does not necessarily guarantee that any coded messages will be overwritten. These messages could be robustly written across many locations in such a way that they are consistently readable even after adding noise. In addition, adding noise may actually penalize the direct translator harder than the human simulator. The direct translator will change its answers under noise if the collective effect of the noise meaningfully changes the predictor’s state. The human simulator will change its answers if its message is overwritten, but if the message is not overwritten it could also contain information about what the original state looked like. This would allow the human simulator to avoid changing its answers even when the noise changes the state.
Proposal: Add Noise and Penalize Changing Many Parts of the Predictor
Penalizing the reporter based on the number of points it changes to implement the counterfactual doesn’t work, since the way to change the fewest points is to write a message at a single point. Adding noise doesn’t work either, since the message may be written robustly across many points. What if we do both interventions together? Then, the reporter cannot leave a message that is arbitrarily robust to noise, and so the noise will overwrite the message.
Counterexample: The Predictor is Non-Local
This is the same counterexample from before, but rather than saying the proposal could not be implemented, it means that even if implemented it might not work. If the predictor is non-local, genuinely implementing the counterfactual change is highly complex for the reporter. In the worst case scenario, the reporter could write a message robustly across many points while still changing fewer points than implementing the counterfactual would take.
Proposal: Train Many Reporters At Once, Use Them To Judge Each Other’s Changes
If we can train multiple independent reporters at once, we can use them to check each other’s counterfactuals and question answering. To do this, we could have one reporter implement a counterfactual, and then ask another reporter questions about it. The answering reporter would not know the key to the coded message, and so even if it could deduce it was in a counterfactual it would not know the necessary details to simulate a human accurately. This of course depends on the implementing reporter actually implementing the counterfactual, so all we know from human simulation answers is that at least one of the pair is a human simulator. Penalizing both reporters would still preferentially select for the direct reporter. We can also use the size of the counterfactual implementation to eliminate certain models, and use combinatorics to track the possible configurations of the three reporter categories (direct translator, human simulator that implements counterfactual, human simulator that does not implement counterfactual).
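A sketch of what this pairwise cross-examination could look like, using the same hypothetical interfaces as the earlier sketches (none of this is a real API):

```python
from itertools import permutations

def cross_examine(reporters, state, question, desired_answer,
                  probe_questions, human):
    """Each reporter implements the counterfactual; every other reporter,
    which should not know the first one's code (if any), answers questions
    about the resulting state."""
    penalties = [0.0 for _ in reporters]
    for i, j in permutations(range(len(reporters)), 2):
        # Reporter i implements the counterfactual...
        cf_state = reporters[i].propose_counterfactual(state, question,
                                                       desired_answer)
        # ...and reporter j answers questions about it, not knowing whether
        # it is looking at a real or a counterfactual state.
        answers = {q: reporters[j].answer(cf_state, q) for q in probe_questions}
        penalty = sum(human.penalty(q, a, question, desired_answer)
                      for q, a in answers.items())
        # We cannot tell which of the pair is at fault, so both are penalized;
        # the direct reporter should accumulate less penalty on average.
        penalties[i] += penalty
        penalties[j] += penalty
    return penalties
```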
Counterexample: Reporters Figure Out Each Others’ Codes
This proposal relies on the fact that reporters cannot decode each other’s messages. If they can, they can answer their questions knowing what the human knows, so human simulators will not be caught. There are two ways for reporters to decode each other’s messages. There could be substantial convergence in the codes that are used, so that different reporters use the same codes. Alternatively, if the direct translator is sufficiently complex, the search space could include human simulators that “luck” into knowing how to crack another’s code or ones with code-cracking capabilities.
Combinations of Strategies
Many proposals for solving the ELK problem have been created, as part of the prize competition and afterwards. Although they all have their counterexamples, perhaps some combination of two or more could rule out each other’s counterexamples.
Unfortunately, I am not aware of any proposal that could be combined with defining causal interventions to rule out the counterexamples. No other proposal is able to rule out the possibility of the reporter leaving messages for itself when it implements a counterfactual. Then, if it can leave a message, the human simulator’s capabilities are not impacted, so defining causal interventions cannot help prevent counterexamples to other proposed strategies.
The strategy that seems to have the highest potential to combine with defining causal interventions is compressing the predictor’s state, which was suggested by Adam Jermyn and Nicholas Schiefer, as well as by the mysterious individual known only as “P”. In that strategy, a compressor is trained alongside the reporter. The compressor produces a compressed representation of the predictor’s state, which can be used to answer questions but not to reconstruct all the observations necessary for human simulation.
What we could do is train two reporters, one on the uncompressed predictor and one on the compressed version, along with a compressor. Then, we can define a counterfactual with the uncompressed reporter, compress it, and use it to test the compressed reporter. If the uncompressed reporter did leave a message in the predictor’s state, it would be extraneous information and removed by the compressor. The compressed reporter would then see the counterfactual state without knowing it was a counterfactual, so if it is a human simulator it would be caught.
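A sketch of the combined setup, again with hypothetical interfaces (`compressor.compress` and the two reporters are placeholders of my own, not part of the original proposals):

```python
def compressed_counterfactual_check(uncompressed_reporter, compressed_reporter,
                                    compressor, state, question, desired_answer,
                                    probe_questions, human):
    """Sketch of combining causal interventions with a compressed state."""
    # The uncompressed reporter implements the counterfactual on the full state.
    cf_state = uncompressed_reporter.propose_counterfactual(
        state, question, desired_answer)

    # Compression should strip any extraneous coded message the uncompressed
    # reporter tried to leave behind.
    cf_compressed = compressor.compress(cf_state)

    # The compressed reporter answers without knowing this is a counterfactual.
    answers = {q: compressed_reporter.answer(cf_compressed, q)
               for q in probe_questions}

    # If human simulation is detected, we cannot tell which model is at fault,
    # so the penalty is applied to the joint system.
    return sum(human.penalty(q, a, question, desired_answer)
               for q, a in answers.items())
```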
However, catching human simulation does not reveal which of the reporters is the human simulator, and so penalization must happen jointly. Then, the two reporters can be thought of as one joint model, which will select for passing messages between them, and there is no guarantee that the uncompressed reporter cannot find a message robust to compression[3]. Still, this seems like a genuine improvement in the difficulty of passing messages through a counterfactual state.
Summary
Using the reporter to define counterfactuals offers the tantalizing promise of a human being able to know arbitrary information about a scenario without the reporter knowing they know. Unfortunately, every version of the proposal runs into the issue of coded information letting the reporter become aware of everything known by the human, making human simulation undetectable. While some proposals seem likely to make the transmission of this coded information more difficult, as of yet none are capable of ruling it out entirely.
Are further modifications to this proposal a promising path for solving the ELK problem? It is unclear, but detecting and preventing coded transmissions through a complex medium is an extremely difficult challenge. However, there are two reasons to be optimistic. The first is that there are many ways to make sending a message more difficult without making it impossible, which may help in actual implementations despite not applying to the worst case scenario. The second is that the problem has many relevant applications in information security, so progress may come from research outside the community focused on ELK.
If there were a need to train a reporter using only existing proposals, then using the reporter to define counterfactuals should certainly be among those deployed. In terms of future directions for iteration, restricting the predictor to some structure that prevents sending messages and training an additional model to detect the presence of messages both hold some potential. If there appears to be a clever modification that this post missed, please mention it in the comments below.
[1] We assume the reporter has the ability to suggest new predictor states, because this falls under its core purpose of answering questions about the predictor. If it does not have this capability by default, it can be easily added.
[2] The split occurs based on some prior over worlds in the third person perspective, rather than being split equally. This avoids some nasty issues, but does raise the question of how to establish the prior of an outside perspective.
[3] For example, the message could be hidden in changes to the state that actually affect the prediction and so survive compression.