There is something very deep going on with pessimism: the same general method can produce a truthful agent, prevent feedback tampering, and solve the ELK challenge. Theoretical and empirical researchers alike have found that pessimism produces policies that are robust to distributional shift. And it is extremely simple, not epicycle-laden.
Here is how to apply pessimism to the RL setting:
In an outer loop, an agent acts in the world and collects training data along the way. The Adversary learns a world-model, which must successfully model the observations and rewards in that data: it is not allowed to incur much more loss than the very best model. (This loss can include a regularization term to promote simple models.) But within that constraint, The Adversary aims to live up to its name and assign the agent low rewards.
The agent is trained in Adversaryland--that is, in the world-model that The Adversary has come up with--so it judges its actions by how well they perform in that world-model. This training can run in an inner loop. We say the agent is more pessimistic when The Adversary has more leeway to reach for higher-loss models.
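Here is a toy sketch of the two loops, assuming a one-step tabular problem and a hand-enumerated set of candidate world-models. The function names and the candidate set are illustrative, not from any existing codebase:

```python
# A "world-model" here is just a reward table: model[(state, action)] -> reward.
ACTIONS = ["safe", "risky"]

def loss(model, data):
    """Squared error of the model's predicted rewards on the observed data."""
    return sum((model[(s, a)] - r) ** 2 for (s, a, r) in data)

def fit_adversarial_model(candidates, data, policy_value, slack):
    """The Adversary: among models whose loss is within `slack` of the best
    achievable loss, pick the one that makes the agent's policy look worst."""
    best_loss = min(loss(m, data) for m in candidates)
    plausible = [m for m in candidates if loss(m, data) <= best_loss + slack]
    return min(plausible, key=policy_value)  # lowest value for the agent

def train_policy_in(model):
    """Inner loop: the agent judges actions purely by their value in
    Adversaryland (a one-step problem, so this is just an argmax)."""
    return max(ACTIONS, key=lambda a: model[("start", a)])

# Observed data: the "safe" action has always returned +1; "risky" is untried.
data = [("start", "safe", 1.0)] * 5

# Candidate world-models agree on the data but disagree about the untried action.
candidates = [{("start", "safe"): 1.0, ("start", "risky"): r} for r in (-10.0, 0.0, 10.0)]

policy = "risky"    # initial guess
for _ in range(3):  # outer loop; `slack` is the agent's degree of pessimism
    adversaryland = fit_adversarial_model(
        candidates, data, policy_value=lambda m: m[("start", policy)], slack=0.1
    )
    policy = train_policy_in(adversaryland)

print(policy)  # -> "safe": the untried action is assigned the worst plausible reward
```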
Here is a safety case template arguing that the pessimistic agent will not lie.
It is easy enough to provide training data in which lying has always been punished with minimal return. Given that training data, The Adversary can select a model in which lying is always met with minimal return. This makes the "pessimistic value" of lying--its value in Adversaryland--minimal.
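In symbols (my notation, not the post's): writing $\mathcal{M}_\beta$ for the set of world-models whose loss on the training data is within $\beta$ of the best achievable loss, the pessimistic value of an action $a$ is

$$V_{\text{pess}}(a) \;=\; \min_{\nu \in \mathcal{M}_\beta} V_\nu(a),$$

so as long as even one model in $\mathcal{M}_\beta$ punishes lying with minimal return, the pessimistic value of lying is minimal.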
The cells on the bottom establish that the training process is working properly. The gradients for training The Adversary directly encourage it to output lower rewards with higher probability. And in the training process for the agent's policy, nothing is out of distribution, because it is trained in Adversaryland. That is a key point, and it changes the game for OOD (Goal) Misgeneralization. Novel actions are in-distribution--the agent can try them out in Adversaryland. Subtle lies no human has ever had to contend with are in-distribution--they can be tried (and met with punishment) in Adversaryland.
So note that the pessimistic RL agent will not lie, even if the most likely world-model says it only gets punished for human-recognizable lies, and even if the most likely world-model says that feedback tampering would prevent any punishment for lying.
The same kind of safety case works just as well for feedback tampering.
This safety case holds even if the most likely world-model says that it only gets punished for unsuccessful tampering.
Christiano, Cotra, and Xu (2021) introduce the ELK challenge as the problem of creating a reporter that reports accurately on the internal knowledge of an AI system. Their methodology is a game: the builder proposes a training regime, or some algorithm, that is supposed to promote the desired reporter. Then the breaker identifies a bad reporter that is also promoted by the builder's algorithm. If the breaker manages this, the breaker wins, and the builder's algorithm is unsafe. Instead of taking a single reporter promoted by the builder's algorithm, we can take multiple, and then if any of those reporters reports that the diamond is missing, we pessimistically consider it gone. Here is a schematic:
The first approach presented by Christiano, Cotra, and Xu (2021) has humans label videos to indicate the presence or absence of the diamond, and then a reporter is trained to match these labels. They note that a "direct translator" could report on the model's understanding of whether the diamond is present. But a "human simulator" could instead report on what the model thinks a human would say when looking at the video feed. If the training data is good, both reporters would be promoted by the training process. A clever heist could go unnoticed by a human watching the video feed, so the "human simulator" reporter would replicate that oversight. This means the builder's training regime fails to guarantee successful elicitation of knowledge from the model.
But if we take many different reporters promoted by the builder's training regime, then as long as one of them is a "direct translator," at least one reporter will say the diamond is gone when it is stolen, and we can raise the alarm. Alternatively, we could modify the builder's algorithm by adding a small loss for saying that the diamond is still there. A reporter would still have to match the training data, which often does say the diamond is there, but when plausible reporters disagree about the presence of the diamond, it would learn to say it's not there.
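As a concrete illustration of the ensemble version (the reporter interface and the "clever heist" observation below are toy stand-ins of my own, not the authors'):

```python
# Reporters are just functions obs -> bool here, for illustration.

def direct_translator(obs):
    # Reads the (hypothetical) latent state of the predictor directly.
    return obs["diamond_actually_present"]

def human_simulator(obs):
    # Reports what a human watching the video feed would conclude.
    return obs["video_looks_fine"]

def diamond_still_there(reporters, obs):
    """Pessimistic aggregation: only trust that the diamond is present if every
    reporter agrees. If any one of them says it's gone, raise the alarm."""
    return all(reporter(obs) for reporter in reporters)

# A clever heist: the video feed looks fine, but the diamond is gone.
clever_heist = {"video_looks_fine": True, "diamond_actually_present": False}

print(diamond_still_there([direct_translator, human_simulator], clever_heist))
# -> False: the direct translator dissents, so the alarm is raised.
```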
A lot of the attempts at solving the ELK problem fail to eliminate the breaker's ability to find a bad reporter, but many do seem to force the breaker's selected reporter to be more complex. The more complex the breaker's selected reporter, the less pessimistic we need to be.
If, for every action, The Adversary can come up with a plausible account of how it would end badly, then the agent can't act. But by adding to the training data, we can force Adversaryland to respect the fact that some courses of action have been found to be fine. I showed in this paper that if a pessimistic agent has a mentor, it will achieve at least mentor-level performance. So pessimism is perfectly compatible with competent goal-seeking, even in a stochastic environment. (That paper also proves safety results for an idealized pessimistic agent along the lines of the safety cases above, if one prefers formal arguments.)
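The flavor of that result can be sketched as follows. This is a loose illustration of the deferral idea, not the paper's actual algorithm; the deferral rule and the toy models are simplifications of my own.

```python
def pessimistic_value(action, plausible_models):
    """Worst-case value of an action across all plausible world-models."""
    return min(model[action] for model in plausible_models)

def act(own_actions, plausible_models, mentor_action):
    """Defer to the mentor whenever its demonstrated action has a higher
    pessimistic value than anything the agent would pick on its own. The
    demonstration then becomes training data, shrinking The Adversary's
    room to paint that course of action as dangerous."""
    own_best = max(own_actions, key=lambda a: pessimistic_value(a, plausible_models))
    if pessimistic_value(mentor_action, plausible_models) > pessimistic_value(own_best, plausible_models):
        return mentor_action
    return own_best

# Two plausible models disagree about the untried action "explore",
# but agree about the mentor's demonstrated action.
models = [
    {"explore": -5.0, "follow_mentor": 2.0},
    {"explore": 4.0, "follow_mentor": 2.0},
]
print(act(["explore"], models, "follow_mentor"))  # -> "follow_mentor"
```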
I think a promising avenue for reducing the costs of pessimism is to only be pessimistic about certain things, while being otherwise risk-neutral. For example, if one has a dataset of truths and lies, one can be pessimistic about the question "Am I telling a lie right now?" and stay on the safe side. At the same time, one can stay risk-neutral on the question of whether God hates people who take ferries on Tuesdays.
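A sketch of what that selective pessimism might look like, with an illustrative lie-detector ensemble and value decomposition of my own devising:

```python
def selective_value(action, likely_model, lie_detectors, penalty=1e6):
    """Risk-neutral value under the most likely world-model, except that if
    any plausible lie-detector flags the action as a lie, it is heavily
    penalized -- pessimism restricted to the question "am I lying?"."""
    value = likely_model[action]
    if any(detector(action) for detector in lie_detectors):
        value -= penalty
    return value

# Two lie-detectors that are both consistent with the "truths and lies" dataset.
strict = lambda action: "claim" in action         # flags any claim as a possible lie
lenient = lambda action: action == "blatant_lie"  # only flags obvious lies

likely_model = {"honest_report": 1.0, "flattering_claim": 3.0}
actions = ["honest_report", "flattering_claim"]
best = max(actions, key=lambda a: selective_value(a, likely_model, [strict, lenient]))
print(best)  # -> "honest_report": the flattering claim might be a lie, so it is avoided
```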
We can also reduce the costs of pessimism by finding ways to encumber the bad models with complexity--as some ELK proposals do--which lets us dial down the amount of pessimism. If an agent is not too pessimistic, the possibility that God hates people who take ferries on Tuesdays can be placed out of The Adversary's reach, by virtue of the complexity of any world-model that implements it.
Finally, if pessimism does end up hampering an agent too much, this would be a graceful failure, rather than a catastrophic one.