Learning human preferences: black-box, white-box, and structured white-box access

[-]Steven Byrnes5y50

I'm not so sure about the "labeled white box" framing. It presupposes that the thing we care about (e.g. preferences) is part of the model. An alternative possibility is that the model has parameters a,b,c,d,... and there's a function f with

preferences = f(a,b,c,d,...),

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

[-]Stuart_Armstrong5y10

but the function f is not part of the algorithm, it's only implemented by us onlookers. Right?

Then isn't that just a model at another level, a (labelled) model in the heads of the onlookers?

[-]Gordon Seidoh Worley5y10

Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.

[-]Steven Byrnes5y20

My understanding of the OP was that there is a robot, and the robot has source code, and "black box" means we don't see the source code but get an impenetrable binary and can do tests of what its input-output behavior is, and "white box" means we get the source code and run it step-by-step in debugging mode but the names of variables, functions, modules, etc. are replaced by random strings. We can still see the structure of the code, like "module A calls module B". And "labeled white box" means we get the source code along with well-chosen names of variables, functions, etc.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

But now this conversation is suggesting that I'm not quite understanding it right. "Black box" is what I thought, but "white box" is any source code that produces the same input-output behavior—not necessarily the robot's actual source code—and that includes source code that does extra pointless calculations internally. And then my question doesn't really make sense, because whatever "preferences" is, I can come up a white-box model wherein "preferences" is calculated and then immediately deleted, such that it's not part of the input-output behavior.

Something like that?

[-]Stuart_Armstrong5y20

My understanding of the OP was that there is a robot [...]

That understanding is correct.

Then my question was: what if none of the variables, functions, etc. corresponds to "preferences"? What if "preferences" is a way that we try to interpret the robot, but not a natural subsystem or abstraction or function or anything else that would be useful for the robot's programmer?

I agree that preferences is a way we try to interpret the robot (and how we humans try to interpret each other). The programmer themselves could label the variables; but its also possible that another labelling would be clearer or more useful for our purposes. It might be a "natural" abstraction, once we've put some effort into defining what preferences "naturally" are.

but "white box" is any source code that produces the same input-output behavior

What that section is saying is that there are multiple white boxes that produce the same black box behaviour (hence we cannot read the white box simply from the black box).

[-]John_Maxwell5y10

Note that this decomposition is simpler than a "reasonable" version of figure 4, since the boundaries between the three modules don't need to be specified.

Consider two versions of the same program. One makes use of a bunch of copy/pasted code. The other makes use of a nice set of re-usable abstractions. The second program will be shorter/simpler.

Boundaries between modules don't cost very much, and modularization is super helpful for simplifying things.

[-]Stuart_Armstrong5y10

modularization is super helpful for simplifying things.

The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

[-]John_Maxwell5y10

Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.

The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.

[-]Stuart_Armstrong5y10

Humans have a theory of mind, that makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind.

Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios; in the optimistic scenario, preferences, human theory of mind, and all the other elements, are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are).

So we need to figure out if we're in the optimistic or the pessimistic scenario.

We might object to the arrow from observations to "preferences": preferences are not supposed to change, at least for ideal agents. But many agents are far from ideal (including humans); we don't want the whole method to fail because there was a stray bit of code or neuron going in one direction, or because two modules reused the same code or the same memory space. ↩︎
Note that I don't give a rigid distinction between syntax and semantics/meaning/"ground truth". As we accumulate more and more syntactical restrictions, the number of plausible semantic structures plunges. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

12

Learning human preferences: black-box, white-box, and structured white-box access

12

Knowing an agent

Black-box

White-box

Structured white-box

Levels of access

Access and knowledge

Multiple white boxes for a single black box

Multiple structures and tags for one white-box

Normative assumptions, tags, and structural assumptions

Almost trivial structural assumptions

Adding semantics or "thick" concepts