This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS, but don't ask what "corrigibility" is, despite the fact that there exist multiple examples of goal-oriented systems and some relatively good formalisms (we disagree about whether expected utility maximization is a good model of real goal-oriented systems, but we would all agree that if we met an expected utility maximizer, we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born ...
Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. Eliezer's laundry list of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized, it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some...
I want to say "yes, but this is different", though not in the sense of "I acknowledge the existence of your evidence, but ignore it". My intuition tells me that we don't "induce" taskiness in modern systems; it just happens because we don't build them general enough. It probably won't hold once we start building models of capable agents in natural environments.
Unpacking my inner Eliezer model:
If we live in a world where a superintelligent AGI can't have an advantage in long-term planning over humans assisted by non-superintelligent narrow AIs (I frankly don't believe that we live in such a world), then the superintelligent AGI won't make complex long-term plans where it doesn't have an advantage. It will make simple short-term plans where it does have an advantage, like "use superior engineering skills to hack into computer networks and infect as many computers as possible with its source code adapted for hidden distributed computation (her...
I think it's confusing because we mostly care about the outcome "we mistakenly think the system is aligned, deploy it, and get killed", not about the particular mechanism that produces this outcome.
Dumb example: suppose we train a system to report on its own activity. Human raters consistently assign higher reward to more polite reports. In the end, the system learns to produce reports so polite and smooth that human raters have a hard time catching any signs of misalignment in them, and take it for an aligned system.
We have, on the one hand, a system that is superhumanly good at producing the impression of being aligned; on the other hand, it's not like it's very strategically aware.
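To make the failure mode concrete, here is a purely illustrative toy sketch (all names and numbers are made up, not a claim about any real training setup): the raters' reward depends only on politeness, so selecting reports by that reward tells us nothing about whether the underlying behavior was actually aligned.

```python
import random

# Toy model: each candidate "report policy" has two attributes.
# Raters can only perceive politeness; honesty about misaligned
# behavior never enters the reward they assign.
random.seed(0)

policies = [
    {"politeness": random.random(), "honesty": random.random()}
    for _ in range(1000)
]

def rater_reward(policy):
    # Human raters consistently assign higher reward to more polite reports.
    return policy["politeness"]

# "Training" here is just selecting the policy that maximizes observed reward.
best = max(policies, key=rater_reward)

print(f"selected politeness: {best['politeness']:.2f}")  # near the maximum
print(f"selected honesty:    {best['honesty']:.2f}")     # whatever it happened to be
```

The selected policy ends up maximally polished, while its honesty is left wherever chance put it; nothing in the training signal distinguishes "aligned" from "good at looking aligned", and no strategic awareness on the system's part is required for that.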