This post is on a very important topic: how could we scale ideas about value extrapolation or avoiding goal misgeneralisation... all the way up to superintelligence? As such, its ideas are well worth exploring and getting to grips with.
However, the post itself is not brilliantly written, and is more of an "idea for a potential approach" than a well-crafted theory post. I hope to be able to revisit it at some point soon, but haven't yet been able to find or make the time.
A long time ago, Scott introduced the blue-minimising robot:
Scott then considers holographic projectors and colour-reversing glasses, scenarios in which the robot does not act in a way that actually reduces the amount of blue, and concludes:
That's one characterisation, but what if the robot were a reinforcement-learning agent, trained in various scenarios in which it got rewards for blasting blue objects? Then it would seem that it was designed as a blue-minimising utility maximiser, just not designed particularly well.
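To make the "not designed particularly well" point concrete, here is a minimal sketch (toy code, with hypothetical names throughout) of the gap between the proxy reward such training would instil and the objective the designers presumably had in mind:

```python
import numpy as np

def training_reward(camera_frame: np.ndarray, blasted: bool) -> float:
    """The proxy reward the robot was actually trained on:
    +1 whenever it blasts something that looks blue to its camera."""
    looks_blue = camera_frame[..., 2].mean() > 0.5  # crude check on the blue channel
    return 1.0 if (blasted and looks_blue) else 0.0

def intended_objective(total_blue_in_world: float) -> float:
    """What the designers presumably wanted: less blue in the actual world.
    A hologram or colour-reversing glasses fool the proxy above, not this."""
    return -total_blue_in_world
```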
One approach would be "well, just design it better". But that's akin to saying "well, just perfectly program a friendly AI". In the spirit of model-splintering, we could instead ask the algorithm to improve its own reward function as it learns more.
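What "improve its own reward function as it learns more" could look like is very much open; the following is only a toy sketch under assumed scaffolding (all class, method, and parameter names here are invented for illustration). The idea it illustrates: the agent carries several candidate extrapolations of its original reward, and when its improved world-model makes them come apart (model splintering), it flags the ambiguity instead of quietly maximising the original proxy.

```python
from typing import Callable, List

class SplinteringAwareAgent:
    """Toy sketch of an agent that treats its reward function as provisional.

    reward_candidates are alternative extrapolations of the original training
    reward (e.g. "minimise blue camera pixels" vs "minimise blue objects").
    When a new situation makes the candidates disagree strongly, that is a
    sign that the old concept has splintered, and the agent defers to a
    conservative policy instead of silently maximising the old proxy.
    """

    def __init__(self, reward_candidates: List[Callable], disagreement_threshold: float):
        self.reward_candidates = reward_candidates
        self.threshold = disagreement_threshold

    def reward_disagreement(self, situation) -> float:
        # How far apart do the candidate rewards pull in this situation?
        values = [r(situation) for r in self.reward_candidates]
        return max(values) - min(values)

    def act(self, situation, default_policy: Callable, conservative_policy: Callable):
        # If the refined world-model makes the reward extrapolations diverge,
        # flag the ambiguity (here: fall back to a conservative policy)
        # rather than optimising whichever candidate the robot started with.
        if self.reward_disagreement(situation) > self.threshold:
            return conservative_policy(situation)
        return default_policy(situation)
```

Falling back to a conservative policy is just one placeholder for "flag the ambiguity"; asking humans, or pausing, would fit the same slot.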
The improving robot
Here is a story of how that could go. Obviously, this sort of behaviour would not happen naturally with a reinforcement-learning agent; it has to be designed in. The key elements are in bold.
Seven key stages
There are seven key stages to this algorithm:
The question is, can all these stages be programmed or learnt by the AI? I feel that they might be, since we humans can achieve them ourselves, at least imperfectly. So with a mix of explicit programming, examples of humans doing these tasks, learning on those examples, and examples of humans finding errors in that learning, it might be possible to design such an agent.
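As a very rough illustration of that mix (every function and method name below is hypothetical; this is a sketch of the shape of the training pipeline, not a proposal), teaching the agent any one of the stages might interleave hand-coded components, human demonstrations, and human corrections:

```python
def train_stage(model, hand_coded_rules, human_demonstrations, human_corrections):
    """Toy sketch: teach the agent to perform one of the seven stages.

    hand_coded_rules:      the explicitly programmed part of the stage
    human_demonstrations:  examples of humans performing the task
    human_corrections:     examples of humans spotting errors in the
                           model's earlier attempts at the task
    """
    # Start from whatever we can specify directly.
    model.initialise_from_rules(hand_coded_rules)

    # Learn to imitate human examples of the task.
    model.fit(human_demonstrations)

    # Then fine-tune on cases where humans caught the model getting it wrong.
    model.fit(human_corrections)

    return model
```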