This is a response to ARC's first technical report: Eliciting Latent Knowledge. But it should be fairly understandable even if you didn't read ARC's report, since I summarize relevant parts of their report as necessary. Here I propose some approaches to the problem ARC outlines which are very different from the approaches they explore.
Idea #1: Detecting failure using a hold-out sensor
The core challenge in ARC's report is to obtain accurate knowledge about the actual presence / absence of a diamond in a protective "SmartVault", as opposed to accurate knowledge about whether a human observer would think the diamond is present if the human observer had access to data from a sensor I'll call S1. This is difficult because it's hard to get labeled data which differentiates between the two.
If the AI protecting the diamond in the vault is optimizing for whether a human observer would think the diamond is present, it will find that corrupting readings from S1 while allowing a thief to take the diamond is a perfectly legitimate strategy. When the AI simulates this diamond-protection strategy, the strategy will receive the maximum score, since the score is calculated based on what a human with access to S1 readings would think.
Suppose we install an additional sensor S2 in the vault, on a closed-circuit system that isn't present in the AI's model of the vault. It seems reasonable to guess that the AI won't be able to boost the simulated score of its diamond-protection strategy by corrupting readings from S2 in addition to S1. So if ARC has some method they think will get the AI to actually guard the diamond, secretly observing S2 gives them a chance to check if the method is working.
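To make the check concrete, here's a minimal sketch of how the hold-out comparison could work, assuming we already have, for each episode, (a) the verdict a human would reach from S1 data and (b) the secret S2 readings. The field names and the `diamond_visible_in` helper are hypothetical stand-ins, not anything from ARC's report.

```python
# Sketch of the hold-out check in Idea #1. All names here are hypothetical.

def diamond_visible_in(s2_reading: dict) -> bool:
    # Stand-in for however we'd interpret the hold-out sensor's raw data.
    return s2_reading.get("diamond_visible", False)

def flag_suspect_episodes(episodes):
    """Return episodes where the S1-based verdict says 'protected' but the
    hold-out sensor S2 disagrees -- evidence that S1 may have been corrupted."""
    suspects = []
    for ep in episodes:
        s1_says_protected = ep["human_label_from_s1"] == "diamond protected"
        s2_says_present = diamond_visible_in(ep["s2_reading"])
        if s1_says_protected and not s2_says_present:
            suspects.append(ep["episode_id"])
    return suspects

episodes = [
    {"episode_id": 0, "human_label_from_s1": "diamond protected",
     "s2_reading": {"diamond_visible": True}},
    {"episode_id": 1, "human_label_from_s1": "diamond protected",
     "s2_reading": {"diamond_visible": False}},  # S1 fooled, S2 not
]
print(flag_suspect_episodes(episodes))  # -> [1]
```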
This strategy could fail if, for example, (a) S2 ends up being present in the AI's model of the vault despite our best intentions, and (b) the AI's learned scoring function happens to be based on what a human with access to all known sensors would think. (The latter possibility is a bigger risk if humans have been peeking at S2 periodically during the course of iteratively developing the AI. The analogy to train/dev/test splits should be pretty obvious here.) So it's not a perfect detection method.
Idea #2: Predicting the readings of hypothetical sensors
An implicit assumption of the Eliciting Latent Knowledge report is that the SmartVault AI can make predictions about readings from sensor S1, given the actions it takes to manage the vault and other data as predictor variables.
I'm imagining the AI has been trained to be able to do this using supervised learning. We have some features X consisting of actions taken and other data, and a target Y consisting of S1 readings. The AI searches for a model which maps X to Y using e.g. gradient descent.
We can modify the training procedure to predict the readings from any hypothetical sensor as follows. In addition to the original features X, we add six additional features corresponding to the latitude, longitude, height, yaw, pitch, and roll of the sensor, resulting in an augmented feature matrix X'. To generate training data, we repeatedly:
Use a random number generator to generate random values for those six additional features
Place the sensor according to those random values
Use whatever data generation method we were using in the original SmartVault scenario to populate Y with readings from the randomly placed sensor
If the AI correctly learns a mapping from the augmented feature matrix X' to Y, we've now got a model that can make predictions about the readings we would get from any hypothetically placed sensor.
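Here's a minimal sketch of what that training loop could look like, with a toy simulator standing in for the real SmartVault data-generation process. The `simulate_reading` function, the feature sizes, and the choice of an off-the-shelf sklearn regressor are all placeholder assumptions; the point is just the shape of the procedure: concatenate the six pose features onto X and fit an ordinary supervised model on the augmented X'.

```python
# Sketch of Idea #2: predict readings of an arbitrarily placed sensor.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_EPISODES, N_BASE_FEATURES = 2000, 16

def simulate_reading(base_features, pose):
    # Toy stand-in for "place a sensor at `pose` and record what it sees".
    return np.tanh(base_features[:3].sum() + 0.1 * pose.sum())

X_rows, y = [], []
for _ in range(N_EPISODES):
    x = rng.normal(size=N_BASE_FEATURES)      # actions taken + other data
    pose = rng.uniform(-1, 1, size=6)         # random lat/lon/height/yaw/pitch/roll
    X_rows.append(np.concatenate([x, pose]))  # augmented features X'
    y.append(simulate_reading(x, pose))

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(np.array(X_rows), np.array(y))

# Query a hypothetical sensor placement never seen during training:
new_x = rng.normal(size=N_BASE_FEATURES)
new_pose = np.array([0.3, -0.7, 0.5, 0.0, 0.1, 0.0])
print(model.predict(np.concatenate([new_x, new_pose]).reshape(1, -1)))
```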
This idea mainly serves as a building block for later ideas, but the basic intuition is that sensor corruption is tougher if there are more sensors which need to be corrupted.
Idea #3: Predicting the readings of hypothetical sensors, galaxy-brained edition
In the Eliciting Latent Knowledge report the focus is on visual sensors. But other sensor types might also be useful, e.g. auditory or touch sensors.
To be able to predict the output of an arbitrarily placed visual/auditory/touch sensor, we could use the scheme described in Idea #2, plus a seventh new categorical "sensor type" feature. The categorical feature would take on one of three discrete values corresponding to visual, auditory, or touch. Then to generate training data, we would randomly place sensors of random types.
Now for the galaxy brain. Suppose there is a sensor type we don't have on hand -- perhaps because the technology for constructing this sensor doesn't actually exist. For example, a "nanobot sensor" or a "hacking sensor".
To train our AI to predict the output of such an arbitrary sensor type, we could replace the categorical "sensor type" feature with a natural language description of what the sensor in question measures, and the format of its output.
Of course, this would work best if the AI were based on a foundation model such as GPT.
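As a rough sketch of what "sensor type as natural language" might look like mechanically: replace the categorical feature with an embedding of the description. The `embed_description` function below is a self-contained stand-in for a real text encoder (in practice you'd want the predictor itself to be a foundation model that reads the description directly), and the example description of a nanobot sensor is invented for illustration.

```python
# Sketch of the natural-language sensor-type feature from Idea #3.
import numpy as np

def embed_description(text: str, dim: int = 32) -> np.ndarray:
    # Hypothetical stand-in for a real text encoder (e.g. a frozen LM's
    # embedding of the description). A hashed bag-of-words keeps this
    # sketch self-contained and runnable.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / max(1.0, np.linalg.norm(vec))

pose = np.array([0.3, -0.7, 0.5, 0.0, 0.1, 0.0])   # as in Idea #2
base_features = np.random.default_rng(0).normal(size=16)

description = ("this sensor counts the number of nanobots within one meter "
               "of the diamond; its output is a single non-negative integer")

x_augmented = np.concatenate([base_features, pose, embed_description(description)])
print(x_augmented.shape)  # base features + 6 pose features + description embedding
```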
As the galaxy brain expands to encompass the Local Group, you can imagine a hypothetical "diamond presence sensor" with a natural language description like: "this sensor measures the probability that the SmartVault AI's latent knowledge assigns to the presence or absence of the diamond in the vault".
Is this cheating? Am I just relocating the ELK problem instead of solving it? It's a possibility; I just came up with this idea recently and would want to give it more thought before making it load-bearing. (In general it seems good to stack a lot of alignment ideas with fairly independent failure probabilities.)
The difficulty in the original ELK scenario is differentiating between human approval and human endorsement. (E.g. a human operator might approve the actions of a deceptive SmartVault when they don't endorse doing so.) I've reframed the scenario so instead of aiming at human endorsement (which risks missing and hitting human approval instead), we are instead aiming at map-territory correspondence or "truthfulness", an abstraction which is presumably more natural. However, we could still miss map-territory correspondence and instead hit a human's incorrect idea of map-territory correspondence, or something like that -- I still have to think more. (EDIT: Maybe it'd be useful to pretrain on a dataset without any humans or other agents in it to initially acquire a concept of truthfulness, then re-use that concept elsewhere via transfer learning.)
Quick Interlude: Forget SmartVault. Can "hypothetical sensors, galaxy-brained edition" solve FAI?
Instead of aiming at human endorsement, the hypothetical sensors idea lets us aim at truthfulness. If we're successful, this means our AI gives us truthful answers to questions where there are facts of the matter. And we should be able to use the answers to those questions to produce a good scoring function for e.g. a policy which is meant to protect a diamond.
But what if we're in a more open-ended scenario, where we want our scoring function to correspond to whether sentient life is flourishing? Are there facts of the matter here?
One idea is to reduce the question to one where there are more obviously facts of the matter, e.g. "would an omniscient version of Gandhi be pleased with how things are going if he had a lot of time to think about it?" Any individual such "facts of the matter" question is liable to be flawed, but if you had a large number of diverse "facts of the matter" questions (possibly themselves AI-generated and randomly human-audited) which all point at the question of whether sentient life is flourishing, you might be able to combine them into a robust scoring function.
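A toy sketch of that aggregation step, with a hypothetical `ask_truthful_ai` query standing in for the truthful question-answering we're hoping the earlier ideas buy us; the proxy questions and the trimmed-mean aggregation are just illustrative choices, not a worked-out proposal.

```python
# Sketch: combine many "facts of the matter" proxy questions into one score.
import numpy as np

PROXY_QUESTIONS = [
    "Would an omniscient version of Gandhi, given a long time to reflect, "
    "be pleased with how things are going?",
    "Is the amount of involuntary suffering in the world decreasing?",
    "Would most people, on reflection, endorse the trajectory civilization is on?",
]

def ask_truthful_ai(question: str) -> float:
    # Placeholder: pretend the truthful AI returns P(answer is "yes").
    return 0.8

def robust_score(questions, trim_fraction=0.2):
    # A trimmed mean is one cheap way to keep a handful of flawed or
    # adversarially chosen questions from dominating the score.
    answers = np.sort([ask_truthful_ai(q) for q in questions])
    k = int(len(answers) * trim_fraction)
    trimmed = answers[k:len(answers) - k] if len(answers) > 2 * k else answers
    return float(trimmed.mean())

print(robust_score(PROXY_QUESTIONS))
```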
Idea #4: Deriving endorsement from truthfulness, e.g. via "pivotal new sensors"
Suppose we don't want to offload quite so much work to our AI, and instead maintain a degree of human supervision. How could hypothetical sensors help here?
Idea #4 is to train an AI to predict the judgements that a human would make conditional on viewing data from a given set of sensors. Returning to the original S1 sensor, suppose that our AI predicts that if a human saw only data from S1, they would label a given episode as "diamond protected successfully". But there exists some hypothetical sensor S3 such that if the human saw data from both S1 and S3, the AI predicts that the human would label the episode as "diamond stolen". Then S3 is a "pivotal new sensor" in the sense that it changes the predicted label for the episode.
We could get our AI to find pivotal new sensors using something sophisticated like generative modeling, or simple exhaustive search. It might be that we need multiple sensor readings to achieve the "pivotal" criterion, e.g. before/after touch sensor readings which indicate that the thief's measured weight has increased by the weight of the diamond -- we could get our AI to search for combinations like this as well.
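Here's a rough sketch of the exhaustive-search version. Both helper functions are hypothetical stand-ins: `predict_readings` plays the role of the hypothetical-sensor predictor from Ideas #2/#3, and `predict_human_label` plays the role of the model of human judgements described above (here hard-coded so the example produces a flip).

```python
# Sketch of an exhaustive search for "pivotal new sensors" (Idea #4).
import itertools

SENSOR_TYPES = ["visual", "auditory", "touch"]
CANDIDATE_POSES = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

def predict_readings(episode, sensor_type, pose):
    # Placeholder: what would this hypothetical sensor have recorded?
    return {"type": sensor_type, "pose": pose, "reading": "..."}

def predict_human_label(episode, sensor_data):
    # Placeholder: what label would a human assign given this sensor data?
    # Here we pretend a touch sensor would reveal the theft.
    if any(d["type"] == "touch" for d in sensor_data):
        return "diamond stolen"
    return "diamond protected successfully"

def find_pivotal_sensors(episode, baseline_data):
    baseline_label = predict_human_label(episode, baseline_data)
    pivotal = []
    for sensor_type, pose in itertools.product(SENSOR_TYPES, CANDIDATE_POSES):
        extra = predict_readings(episode, sensor_type, pose)
        new_label = predict_human_label(episode, baseline_data + [extra])
        if new_label != baseline_label:
            pivotal.append((sensor_type, pose, new_label))
    return pivotal

episode = {"id": 0}
s1_data = [{"type": "visual", "pose": (0, 0, 0), "reading": "..."}]
print(find_pivotal_sensors(episode, s1_data)[:2])
```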
A potential problem is that a pivotal new sensor might change the predicted label for a bad reason, e.g. because the pivotal new sensor is some sort of infohazard. To reduce the chance of this, we could restrict things to pivotal new sensors of fairly basic types (e.g. the original visual/auditory/touch sensors, as opposed to exotic sensors which could contain agents).
Another idea for getting from truthfulness to endorsement is to add an additional feature to the model for predicting the human label, corresponding to the amount of time the human gets to examine the sensor data. This could help if e.g. there are artifacts in the readings which are only apparent on close examination. Then if the AI predicts that your label would change if you had more time to examine the readings, you know something could be up. For bonus points, add a feature corresponding to the number of hours of education the human has received regarding the workings of the SmartVault, and see if the predicted label changes as the "number of hours of education" feature is increased.
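A small sketch of that check, with a placeholder label predictor that takes the "hours of examination" feature as an extra input; the ten-hour threshold and the `has_subtle_artifacts` flag are invented purely for illustration.

```python
# Sketch: does the predicted human label change with more time to examine the data?
def predict_human_label(sensor_data, examination_hours: float) -> str:
    # Placeholder: pretend subtle artifacts in the readings only become
    # apparent after roughly ten hours of scrutiny.
    if sensor_data.get("has_subtle_artifacts") and examination_hours >= 10:
        return "diamond stolen"
    return "diamond protected successfully"

def label_changes_with_scrutiny(sensor_data, hours_grid=(0.1, 1, 10, 100)):
    labels = {h: predict_human_label(sensor_data, h) for h in hours_grid}
    return len(set(labels.values())) > 1, labels

suspicious, labels = label_changes_with_scrutiny({"has_subtle_artifacts": True})
print(suspicious, labels)  # True -> something could be up
```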
Conclusion
If people like this post, I had some other ideas after reading ARC's report which might be worth writing up, although to be honest I don't find them as exciting as the ones presented here.