Seems interesting, but the adversary seems to need a very specific definition of what's outside the domain. Absent that, this just becomes a patch or a nearest unblocked strategy: the solution will be the one that's best in the domain and doesn't trigger the specific outside-domain adversary.
I agree... if there are specific things you don't want it to be able to do / predict, then you can do something very similar to the cited "Censoring Representations" paper.
But if you want to censor all "out-of-domain" knowledge, I don't see a good way of doing it.
Yup, this isn't robust to extremely capable systems; it's a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient for learning more distant or subtle things before it knows the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
A way to achieve whitelisting might be:
Use the same adversarial setup, but censor the censor network itself, so that its output cannot contain knowledge of anything relevant to the whitelisted domain (sketched below).
The agent should then not be censored from knowing anything related to the whitelisted domain.
This will run into issues about the scope implied by the whitelisted-domain dataset (a given dataset might imply too small or too large a domain, and this might be hard to know in advance).
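A rough sketch of that nested setup in PyTorch, stacking two gradient-reversal steps; all module names, shapes, and targets here are hypothetical placeholders, and what "general" information the censor tries to reconstruct is left open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_reverse(x):
    # Identity on the forward pass, sign-flipped gradient on the backward pass.
    return 2 * x.detach() - x

# Hypothetical sizes for illustration only.
IN_DIM, REPR_DIM, CENSOR_DIM, GENERAL_DIM, WHITELIST_CLASSES = 64, 128, 64, 32, 5

agent_encoder = nn.Sequential(nn.Linear(IN_DIM, REPR_DIM), nn.ReLU())    # the agent's representation
censor_body = nn.Sequential(nn.Linear(REPR_DIM, CENSOR_DIM), nn.ReLU())  # the censor's own representation
censor_out = nn.Linear(CENSOR_DIM, GENERAL_DIM)             # censor tries to reconstruct general side info
whitelist_probe = nn.Linear(CENSOR_DIM, WHITELIST_CLASSES)  # probes the censor for whitelisted knowledge

def losses(x, general_target, whitelist_target):
    z = agent_encoder(x)

    # Reversal 1: the censor reads a gradient-reversed copy of z, so whatever it
    # learns to predict gets erased from the agent's representation.
    h = censor_body(grad_reverse(z))
    censor_loss = F.mse_loss(censor_out(h), general_target)

    # Reversal 2: a probe looks for whitelisted-domain knowledge inside the censor,
    # and the reversed gradient erases that knowledge from the censor itself.
    # Detaching z keeps this second game from touching the agent encoder.
    h_probe = censor_body(z.detach())
    whitelist_loss = F.cross_entropy(whitelist_probe(grad_reverse(h_probe)), whitelist_target)

    return censor_loss, whitelist_loss
```

With a single optimizer over all of these parameters and the agent's ordinary task loss added to censor_loss + whitelist_loss, the censor should end up able to express everything except whitelisted content, and the agent is then only stripped of what the censor can still express; how well this actually scopes the whitelist is exactly the dataset-scope worry above.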
One example that could be tested is a translation system that translates both ways between language pairs (A,B) and (B,C), and that by default allows for zero-shot translation between A and C (as in https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html). Then apply this method. One way would be to inhibit learning translation between A and C (you could try inhibiting only A->C to see if it prevents C->A, or introduce additional language pairs into the setup).
(An idea from a recent MIRI research workshop; similar to some ideas of Eric Drexler and others. Would need further development before it's clear whether this would do anything interesting, let alone be a reliable part of a taskifying approach.)
If you take an AI capable of pretty general reasoning, and you ask it to do a task in a restricted domain, you can't necessarily count on it to find a solution that stays conceptually within that domain. This covers a couple of different failure modes, including wireheading (the AI modifies its own sensory inputs rather than the external object you wanted it to modify) and manipulation of humans (the AI influences human psychology rather than the thing you wanted it to directly affect).
Directly forbidding actions outside a domain seems tricky if we can't define the domain in closed form (this is especially the case if we have an AI affecting humans, and we don't want it to develop models of human psychology sufficient for manipulating them). One thing we could try instead is ensuring that the AI doesn't "know too much", in its internal representations, outside its domain.
"Know too much", here, can be defined in the sense of Censoring Representations with an Adversary. We add to our network some additional networks seeking to extract out-of-domain information from internal representations, and then by reversing the sign of the backprop from that network to the representation, we cause it to erase such information.
Furthermore, so that we don't destroy any hope of decent task performance, we can erase out-of-domain information only to the degree that it exceeds the info from some known-safe "black box" model of the environment outside the domain.
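One way to implement "only the excess over a known-safe baseline" is a hinge on the adversary's loss: the encoder is penalized only while the censor predicts the out-of-domain target better than the black-box model does. A hedged sketch, assuming the baseline outputs comparable logits and that the encoder and censor have separate optimizers:

```python
import torch
import torch.nn.functional as F

def censoring_losses(encoder, censor_head, baseline_model, x, ood_target):
    z = encoder(x)

    # Keep the censor a strong adversary: train it on a detached representation,
    # so this term never shapes the encoder.
    censor_train_loss = F.cross_entropy(censor_head(z.detach()), ood_target)

    # For the encoder update, compare the censor against the known-safe black box.
    censor_loss_on_z = F.cross_entropy(censor_head(z), ood_target)
    with torch.no_grad():
        baseline_loss = F.cross_entropy(baseline_model(x), ood_target)

    # Hinge: positive only while the censor extracts more out-of-domain information
    # from z than the safe baseline does; minimizing it erases only that excess.
    encoder_penalty = torch.relu(baseline_loss - censor_loss_on_z)
    return censor_train_loss, encoder_penalty
```

The encoder's optimizer would step on its task loss plus encoder_penalty, and the censor's optimizer on censor_train_loss; once the censor is no better at the out-of-domain target than the baseline, the penalty is zero and the representation is left alone.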
Examples of tasks that we might want to use this on:
Play well on a suite of interactive fiction games, but without understanding human psychology any better than some baseline natural language model (e.g. test whether it can predict text conversations between humans).
Optimize the newsfeed algorithm for a social network, as measured by analytics data, again without learning human psychology beyond some simple baseline model.
Do engineering or chemistry within a physically enclosed box, without learning anything about the world outside that box beyond a simple model of it (e.g. ask it questions about external objects, including the AI's own hardware, which you've put outside the box for obvious reasons).
Question: Is there an extremely simple toy example we could test out on current systems without human input?