This is in the vein of the approach I call "epsitemic boxing". The probability distribution over questions corresponds to the external inputs part of the stochastic model of the box.
Thanks Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world world with the transistors in it, as a corrigibility strategy. I haven't yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.
You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano's ideas.
Now, this isn't safe, but it's at least something you might be able to play with.
Math oracles are a special type of AI which answers questions about math - proving theorems, or finding mathematical objects with desired properties and proving they have them. While a superintelligent math oracle could easily be used to construct a dangerous agent-like AI (such as by asking it to find an object which is a utility maximizer), it could also be used to empower an FAI research team by solving mathematically-well-specified parts of the problem and proving that the AI has the properties it is supposed to.
There is one glaring problem with a superintelligent math oracle, which is safely constructing one. To be effective enough, a math oracle would need to identify and solve subproblems, and would enormously benefit from improving its own problem-solving capability. If constructed carelessly, this would lead to an AI which anted to take over computers and other resources and turn them to studying the problem it's interested in.
I propose a structure for a math oracle which I claim can recursively self-improve but which will not try to take over resources, trick its human operators, or improve itself at faster than the intended rate, and which will switch itself off at each self-improvement step and at regular intervals. The overall structure is that you start with a math oracle and repeatedly ask "what would be a better math oracle", where "better" is defined in terms of what would happen when run on an idealized mathematical model of a computer (rather than on the physics-embedded computer that will actually be used).
M(p,q,t) is the state of an idealized computer running program p on question q after t steps
We say that a program shuts down at t if M(p,q,t)=0 (the machine-state is halted).
PV(thm,pf) is a proof verifier which returns true if pf is a valid proof of thm, false otherwise
Question are properties for which you want to find an object which exhibits the property, and a proof that it does. So an answer (obj,pf) is an answer to a question q if PV(q(obj),pf).
An oracle is a program which attempts to answer questions. We say that an oracle program p answers a question in time t if M(p,q,t)=(obj,pf) and PV(q(obj),pf). An oracle program has a strength with respect to a probability distribution over questions Q and an amount of time, str(p,Q,t) which is the probability that it will answer a question drawn from q in time t.
With these definitions in place, we can ask an oracle p with strength s to make a better oracle with strength s′, and to guarantee that it shuts down when finished, by inputting q(x)=str(x,Q,t)>=s′∧∀q∈Q.M(x,q,t+1)=0.
This oracle, when asked to produce successors in this way, will not try to produce successors that probe for weaknesses to escape a box, because instead of optimizing for how well the successor would do in an actual physically embedded system, it is optimizing for how well the successor would do in a counterfactual idealized system. This is not a guarantee, however; it might be that it is easier to produce a system which does probe for such weaknesses, because this behavior arises out of some general principles. This can be somewhat improved by tweaking the strength metric by making it favor proofs arrived at earlier; this would incentivize it to make an oracle which does not waste time and which does not make use of extra time in between when it finds a proof of something and when it reaches the deadline.