If I build an expected utility maximizer with a preference for the presence of some physical quantity, that surely is not a free agent. If I build some agent with the capacity to modify a program which is responsible for its conversion from states of the world to scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes' position on the notion of 'randomization', which I will summarize as "a term to describe a process we consider too hard to model, which we then consider a 'thing' because we named it."

How is this agent any more free than the expected utility maximizer, other than for the reason that I can't conveniently extrapolate the outcome of its modification of its utility function?

It seems to me that this only shifts the problem from "how do we find a safe utility function to maximize" to "how do we find a process by which a safe utility function is learned", and I would argue the consideration of the latter is already a mainstream position in alignment.

If I have missed a key distinguishing property, I would be very interested to know.

Reply