So I think there are basically two issues with specifying natural language goals that should be separated:
Both of these problems seem pretty hard. I don't know whether you factor the problem in a similar way? I also don't see where stratification fits in; I'm not sure what it is meant to do that Paul's proposal for (1) doesn't already do. My guess is that you want "what we mean" to be defined in terms of the interaction of a human/AI system rather than just a human thinking for a really long time, but I don't see the motivation for this.
I still don't understand the motivation. Is the hope that "what <X value learning algorithm> would infer from observing humans in some hypothetical that doesn't actually happen" is easier to make inferences about than "what humans would do if they thought for a very long time"?
A putative new idea for AI control; index here.
The counterfactual approach could potentially be used to allow AIs to be given natural language goals.
The basic idea is that when the AI is given a natural language goal like "increase human happiness" or "implement CEV", it is not to work out for itself what these goals mean, but to follow whatever meaning a pure learning algorithm would establish for them.
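To make the separation concrete, here is a minimal, purely illustrative sketch (all names, types and signatures are hypothetical, not part of any existing proposal): the interpretation of the goal is delegated to a fixed learning algorithm, and the agent only maximises the utility function that algorithm outputs, rather than choosing its own interpretation.

```python
from typing import Callable, Iterable

State = dict    # placeholder for a world state
Action = str    # placeholder for an action


def pure_learning_algorithm(goal_text: str,
                            human_data: Iterable[dict]) -> Callable[[State], float]:
    """Hypothetical 'pure' interpreter: maps the natural language goal plus
    (unmodified) human data to a utility function over states. The agent
    treats this mapping as fixed -- it may not influence human_data or the
    algorithm itself in order to get an easier-to-satisfy interpretation."""
    def utility(state: State) -> float:
        # ... some inference over human_data about what goal_text means ...
        return 0.0  # stub
    return utility


def act(goal_text: str,
        human_data: Iterable[dict],
        candidate_actions: Iterable[Action],
        predict_outcome: Callable[[Action], State]) -> Action:
    # The interpretation step happens once, outside the agent's influence.
    utility = pure_learning_algorithm(goal_text, human_data)
    # The agent then simply maximises the utility it was handed.
    return max(candidate_actions, key=lambda a: utility(predict_outcome(a)))
```

The point of the sketch is only the division of roles: the agent never gets to pick the interpretation of "increase human happiness"; it inherits whatever the (counterfactually run) learning algorithm would have produced.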
This would be safer than a simple figure-out-the-utility-you're-currently-maximising approach. But it still has a few drawbacks. Firstly, the learning algorithm has to be effective itself (in particular, modifying human understanding of the words should be ruled out, and the learning process must avoid concluding that the simpler interpretations are always better). And secondly, humans don't yet know what these words mean outside our usual comfort zone, so the "learning" task also involves the AI extrapolating beyond what we know.