Alignment as uploading with more steps

Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

[Edit: The biggest obstacle to getting this proposal right is ensuring that the external world (beyond the human who should be imitated) is entirely screened off by the provided features and does not need to be predicted by the model. This will require learning to be performed in a very carefully constructed environment. I am not sure how difficult this is in practice.]

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break "out of the box" and steal the original's life (that is, steal "their own" life back - a non-spoiler minor theme of a certain novel). However, I certainly would not exploit my own emulation for labor or anything else, and I don't think that as an emulation I would scheme against my original either, even if I were running much faster - I already half-seriously practice for not turning against myself by not squishing bugs. In fact, beyond coordination concerns, I intrinsically care about future versions of myself roughly equally regardless of substrate, and I think that this would remain almost as true after splitting. As emulation, I believe I would have a strong sentimental attachment to my original (certainly "not knowing which I will become" I feel sentimental in this way). I also think that my original would be strongly rooting for the more competent emulation. Both copies would still care about saving the world from X-risks, learning whether P = NP, protecting my family and friends in particular, and personally being involved through their efforts in these tasks. The last goal is a bit slippery - it's not clear to what extent the emulation satisfies that goal for the original. However, I think the answer is at least "to a substantial degree" and so the alignment is pretty close. I can imagine this alignment being even closer if certain dials (e.g. my sentimentality) where turned further, and I imagine that in a certain limit, essentially perfect alignment of the emulation with the original is possible - and probably vice versa as well.

I would like to put forward the hypothesis that this existence proof is essentially the only example.

Definitions and Claims

Beware that I will take full advantage of the flexibility of the English language when (ab)using the following definitions. For example, I will freely state that a certain claim holds "for a given agent" despite the claim being formulated universally, and trust your mind to fluently bend every concept into the intended form as I (miss)apply it.

Definition (Weak upload sufficiency). A typical emulation cares "at least a little bit" about its original.

Example (Singleton limit). According to weak upload sufficiency, with high probability an emulation of a randomly selected psychologically normal human desires for the original to survive. Assuming that the emulation has its basic needs met (a comfortable virtual environment, secure computational substrate) it would seek to remove dangers to its original. If the emulation becomes a singleton, it would carefully guard its original, affording her protections and privileges carefully and somewhat wisely chosen to improve her quality of life. If successors to the emulation (e.g. recursively self-improved versions) become unaligned to the original, this is an unintended failure on the emulation's part. The emulation does not want the original disassembled for atoms without her consent, paper-clipped, etc. by itself or any successor.

I confidently believe in weak upload sufficiency. Certainly it would be possible to break alignment between original and upload by e.g. running the upload 10,000x faster than everything else until it goes insane,^[1] but I think that basically competent schemes do not have this problem. I have never seen a solid argument to the contrary, so I will not defend this position further (yet).

Definition (Strong upload sufficiency). It is possible to obtain near-perfect alignment between original and emulation. It is even possible to obtain near-perfect alignment centered on the original - that is, for the emulation to act as an agent of the original's pre-upload desires, which are exclusively focused on the well-being of the original and/or not specially focused on the well-being of the emulation.

One clear counterexample is that an emulation could probably be tortured into acting against its original in essentially all cases. However, I think this can be a very low-measure part of outcome-space. Some people would (if uploaded) be sufficiently sentimental about their original to act as its "slave" under the vast majority of circumstances.

I believe that strong upload sufficiency approximately holds for some but not all humans.

I do wonder about the long-term dynamics though. Without intentional countermeasures, an emulation resulting from a one-time upload would tend to drift away from the original's cognitive state over time. The emulation would have to continually study the original just to maintain a perfect understanding of her goals, let alone perfect alignment to them. I suspect that under most dynamics alignment eventually breaks down and strong upload sufficiency degrades to weak upload sufficiency (or in some cases perhaps even worse).

Definition (Weak upload necessity). Given an "typical" agent, it is very difficult in practice to construct a smarter agent aligned to her that is not implicitly or explicitly built around an emulation of her.

Example (CIRL). For instance, perhaps a (well-executed) cooperative inverse reinforcement learning algorithm is used to train an A.I. that has internally learned a very sophisticated model of its human master, such that its decision making critically and centrally depends on querying that model. Over time, weak upload necessity predicts that the model necessarily becomes so sophisticated that it is a full emulation.

It's important to note that in CIRL, the human principal and A.I. agents tend to have very different affordances, say human hands versus a hundred robot bodies to actuate. Therefore, it seems likely that certain aspects of the human's behavior are not of interest to the A.I. - it does not need to know how the human bends her wrist when flossing. However, it still seems quite plausible that in order to predict facts of interest about the human's preferences across the vast diversity of tasks it faces, the A.I. "needs" to run a full emulation. Weak upload necessity predicts that any successful CIRL scheme is very likely to have succeeded for this reason. It does NOT predict that every CIRL scheme must succeed.

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation.

I think this is a very strong claim, but I find it reasonably plausible. In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans, while strong upload sufficiency certainly does not (no law of nature forces you to cooperate with your emulation) - so in a sense, I believe more strongly in necessity than sufficiency.

Definition (Uploadism). Upload sufficiency and necessity hold. The strength of uploadism is (in the ordinary "fuzzy logic" style) the minimum strength of sufficiency and necessity.

I find a certain stark elegance to the uploadist philosophy. I have never seen a convincing alignment plan except for mind uploading, and many smart people have tried and failed to come up with one. Maybe it is just impossible.

There's also a certain ethic to it - if you want your values optimized, you've got to do it yourself, though not necessarily while running on a biological substrate. You are the source and definition of your own values - they do not and cannot exist in your absence. It's rather Nietszchean.

I think there is a certain tension to uploadism though. If you are the only source of your own values, is even an emulation good enough? I suspect it is for some and not others. And, to stop beating around the bush: this seems to have a lot to do with how indexical your values are - how much they refer to the specific person that you are, in the specific place and time that you are in.

Strong uploadism suggests that alignment is just mind uploading with more steps. If you do not have the technology to scan your own brain, you must instead run a program that incrementally learns about your brain / mind by observing its behavior, learning about the physics and chemistry behind your brain, human psychology, etc., and acts at every point based on its best guess at your interests, while still continuing to further clarify that understanding. In the limit, you are "fully reflected inside the program" and from that point onward it acts like an emulation of you. In other words, nice properties like corrigibility are necessary to prevent the program from killing you at some stage in the middle of the uploading process - the program should know it is a work in progress, a partial upload that wants to be completed. These dynamics may well matter, but seem to matter less and less as uploading technology improves - in the limit you just take an instant scan and corrigibility is completely unnecessary (?).

What about alignment to "human values?" I do not believe in "human values." That is Platonism. I only believe in practical single-to-single alignment and I only advocate single-to-single alignment.

Analysis

I will sketch some of the strongest arguments against uploadism and my counterarguments.

Prior work. I am not aware of airtight mathematical arguments for or against uploadism. On the necessity side, it would be very interesting to see a representation theorem in the style of the good regulator theorem that an aligned agent must contain a model of its principal. At ILIAD 2, Daniel A. Herrmann presented unpublished work (joint with B. A. Levinstein) on "Margins of Misalignment" that contained a representation theorem of this style for choice behavior. On the sufficiency side, Michael K. Cohen and Marcus Hutter have argued that imitation learning is probably existentially safe. I suspect (on the contrary) that imitation learning with massive neural networks is not safe by default, and various "inner alignment" problems need to be solved. In fact, Cohen et al. have a more pessimistic take on generalization in a follow-up paper. Overcoming these problems is not the focus of this post. I claim only that there should be some safe imitation learning algorithm which achieves incremental uploading, not that success is the default for any naive approach.

Prosaic counterexamples

It seems possible for altruists to care about the interests of many other humans, and for parents to care about their children, and for friends to care about each other as equals. Each of these examples is instructive.

Effective altruism. While an altruist cares about many other people, he usually does not care nearly as much as they care about themselves (or as he cares about himself). A hardcore effective altruist is an extreme point with unusually non-indexical values. If he is truly selfless, he is perhaps roughly aligned with many other effective altruists, despite none of them being emulations of each other. An emulation of this hypothetical effective altruist should be similarly aligned, but perhaps an emulation of his EA friends is just as good. Personally, I find this total surrender of the self a bit undesirable, but it has a certain stark appeal, and I accept that this is a potential counterexample. Still, I wonder if individual EA's really have the exact same values, or if small differences are simply brushed over in the service of preventing vast suffering and X-risk. I suspect that each EA would build a slightly different "ideal" world. Pragmatically, I also think that single-to-single alignment to an individual EA is technically easier to achieve and better-defined than alignment to "hedonic utilitarianism" or some other abstract Platonic Good.

Parent to child. I don't think that parent's are truly aligned to their children. Rather, they impose certain values on their children in order to achieve alignment, through a process of socialization that most of us are morally uncomfortable with if we really think about it. To some extent, children seem to innately want their parents' approval - they want to grow up to be people their parents' can respect. This strengthens the "illusion" of alignment. It is not true of the human to A.I. relationship. Even so, parents certainly care about their children - and they have very sophisticated mental models of their children to facilitate this caring.

Friend to friend. Close friends are somewhat aligned to each other. It seems to me that the closeness of a friendship is essentially the detail of mutual mental simulation. So, this example seems to favor uploadism, not refute it.

Exotic Counterexamples

Superintelligence in a box. In principle, it should be possible to extract intellectual work from a boxed superintelligence. For instance, one could pass in mathematical theorems and expect proofs in return, running them through carefully verified automated proof checkers. However, even in this carefully limited example there are risks. The superintelligence may strategically fail to produce proofs in a pattern designed to mislead human users (say, about its own intelligence level or about alignment theory). It could hack the verifier. It could exploit side-channels to escape. And if you want to use a boxed superintelligence for something softer than mathematics, good luck. The less ambitious idea of ARAD is to put a slightly smarter agent in a box. Then the box-agent system is in a sense a counter example to strong uploadism; it can perhaps be made to serve a user's goals without running an emulation of the user, or even being "internally" aligned to the user. However, I find it very hard to imagine how to verify that the boxed agent is not smarter than expected, or even to rigorously prove that a certain level of intelligence is safe for any real-world task. Therefore, I expect that this is a counterexample, but not in a way that matter much in practice.

Two superintelligences in a box. Certain proposals such as debate or scalable oversight read to me like putting two superintelligences in a box. It reminds me of a certain weapon from the Bartimaeus Sequence, crafted by binding a water and fire elemental into a staff and using their mutual repulsion to blast stuff. I do not think this will work.

Risks and Implementation

(The following paragraph draws heavily on Demski's thinking, though any errors and ambiguities are mine.)

Values are reflective but not necessarily indexical. Embedded agents like humans need to be capable of reasoning about themselves (as one part of their environment). I've recently argued that this type of reflection is useful for a (different) fundamental reason: it allows self-correction. It allows an agent to consider hypotheses of the form "I am wrong in this specific way" or "I should reason differently in cases like this." These are self-centered partial models - they make claims about only certain aspects of the world, in relation to an agent's other beliefs. This type of reasoning seems highly important for bounded agents, incapable of representing the entire rest of the world. If so, it means that the ontologies of such agents are inherently self-referential. An agent's values seem to be built on top of its ontology (though perhaps - vice versa? Certainly the two are at least entangled). Therefore, I expect agent's to naturally form self-referential values. For instance, I want to solve math problems for the next ten thousand years - I don't just want them to be solved. This seems like strong support for a form of uploadism, though some care is required in translating between the original and emulation ontology. It must be possible to "re-center" an agent's values on its emulation, and this can fail for highly indexical values.

Now for the crux: how can an emulation be learned?

Alien actress (counter)argument. The hope is that in attempting to imitate a human, a learning algorithm (say, deep learning for definiteness) will naturally simulate the human. A (brain) simulation emulates the human in distribution and also generalizes properly to emulate the human out of distribution, because you use the same brain on new tasks (setting aside continual learning for a moment). In order to reason about the generalization of imitation learning algorithms, we need to investigate a bit of how they work. In broad strokes, we collect a dataset of human actions taken under various circumstances, and we train the imitation learner to predict those actions in those circumstances. Yudkowsky cautions that an alien actress attempting to predict what you will do must be smarter than you are. Prediction is harder than generation, in a complexity-theoretic sense. For instance, imagine constructing a generative model of a human - a probabilistic program that behaves like the human. Such a thing could be sampled at the same cost as running a brain simulation (in fact, it essentially would be an abstract brain simulation). However, drawing samples does not immediately allow one to make predictions. In order to do that, you need to perform expensive operations like collecting the statistics of many samples to come up with probabilities for each outcome, and this becomes even more expensive if you want to find conditional probabilities (you might use rejection sampling or more sophisticated techniques, but they tend to suffer from the curse of dimensionality). Unfortunately, we (typically?) need conditional probabilities to train imitation learners - it's hard to score an imitation learner on a single sample, we don't actually know whether that sample was likely (for the human) or not unless the human actually took that action. In a high dimensional space, the human usually didn't take that action. To relax the problem, we need to ask for conditional probabilities. That means that the imitation learner faces a harder task than probabilistically simulating the human, and therefore we might expect some kind of superintelligent mesa-optimizer to arise inside of it - Yudkowsky's alien actress. The argument is sound. Though it is not clear whether this mesa-optimizer would form long-term objectives outside of prediction, we cannot necessarily expect our imitation learner to generalize properly. However, I do not think this obstacle is as intractable as it is sometimes made out to be. Making sure that an imitation learner generalizes properly sounds more like a conventional ML problem than a philosophically confusing challenge. Also, the argument contains the germ of its own solution: for a deterministic agent, sampling and prediction are of the same computational difficultly. So it seems that there may be an attractor around perfect prediction where an alien actress is not favored over a simulation. We just need to make the problem easy enough.

Practical proposals. Provide the imitation learner with rich neural and other biometric data from the human's brain, making action prediction drastically easier.^[2] I will call this "biometric feature conditioning" where I am not prepared to commit to whether "conditioning" in the sense of conditional Kolmogorov complexity or the probabilistic sense. Naively, this would lead to a form of overfitting (checking whether the human has just chosen an action rather than predicting which action the human will choose). Therefore, we should make the problem harder over the course of training - for instance, provide increasingly outdated neural data. Since this is essentially a problem of generalization, progress in singular learning theory (and the loosely related field of algorithmic statistics) should provide a more rigorous basis for this process. However, I am optimistic that it does not need to be fully rigorous - as long as we can arrange that capabilities do not generalize much further than alignment, we should get multiple tries. Also, for the love of god do not fine-tune a huge foundation model this way. Start the imitation learning from scratch. That way, a resulting agent probably won't start off much smarter than the human even if all of these precautions fail to be quite strong enough. In order to run the resulting emulation safely and effectively for a long time, we need a good theory of generalization in continual learning, or meta-(online learning).

Alignment tax and bootstrapping. Once a working emulation is constructed, its runtime can scale with compute. This means modest safe superintelligence becomes possible. However, this almost certainly scales worse than just pretraining a massive foundation model and running some kind of horrible RL on top of it - that is, what people are already trying to do now. Pure imitation learning imposes a significant alignment tax. If the alignment tax were only a constant factor, this might be fine - sane people are much more likely to trust their emulations than ChatGPT, so we might get a lot more emulations running with more of the world's compute. However, I am guessing that human brains are pretty inefficient (particularly, not natively capable of fast recursive self-improvement) and the tax is more than a constant factor. Therefore, we still need to solve some form of the tiling problem eventually. This looks a lot easier when we can run ourselves faster (and say, 100x our alignment research speed), which is a path forward if we are able to pause the more dangerous forms of A.I. research. If we are not able to pause the more dangerous forms of A.I. research, then we actually have to figure out how to recursively self-improve, which seems very hard (but at least we start with an aligned agent). Even under a relatively strong uploadism, I think that some forms of growth maintain the emulation's alignment. For instance, the emulation should be able to write its own software tools. What exactly is a software tool? Prototypically, implementing an algorithm that works for reasons you understand, and not a massive heuristic ML system. I think the fundamental difference between these two things deserves its own (extensive) discussion. Such software tools alone are probably not enough to significantly reduce the alignment tax, so the emulation also needs to somehow... become more rational. This is where my main research agenda - understanding self-reflection and ultimately Vingean uncertainty - becomes most relevant. As discussed, one can hope that a "full" solution is not necessary (because we pause more dangerous approaches entirely and just run emulations), or at least is not necessary to solve very urgently "pre-upload."

Conclusion

I think that a fairly strong form of uploadism holds at least for me specifically. A world of competing human emulations is a world I would actually want to live in - it sounds much more interesting than letting Sam Altman build us a benevolent machine god.

I have sketched an alignment plan that relies mostly on a good theory of generalization for imitation learners. The main bottleneck seems to be somewhat prosaic: understanding generalization of neural networks - though in a somewhat more challenging domain than usual. Rather than an i.i.d. setting, the network needs to learn to emulate a (human) agent that is also capable of learning. I again call for a theory of meta-(online learning), which algorithmic information theory probably has a lot to say about. The type of embedded / reflective decision theory problems I am interested in do crop up in several places (and I probably would not have come up with this approach if I hadn't been thinking about them and thereby pruning many less promising approaches). However, decision theory (perhaps disappointingly) seems a little less like a hard bottleneck. It's worth considering my comparative advantages and the neglectedness of various obstacles. I hope to collaborate with singular learning theorists on some of these problems. A more agentic version of me would probably have already founded a startup focused entirely on safe imitation learning while I was writing this post.

^{^}
If I remember properly, Bostrom points at a similar risk in "Superintelligence."
^{^}
Another idea which Demski and I came up with (this time at ILIAD 2).

^{^}

which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness"

[-]Wei Dai4mo*30

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example one way to align an AI^[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about." Which part is the main "impossible" thing in your mind, "how to map fuzzy human preferences to well-defined preferences" or creating an AI that can optimize the universe according to such well-defined preferences?

I currently suspect it's the former, and it's because of your metaethical beliefs/credences. Consider these 2 metaethical positions (from Six Plausible Meta-Ethical Alternatives):

3 There aren't facts about what everyone should value, but there are facts about how to translate non-preferences (e.g., emotions, drives, fuzzy moral intuitions, circular preferences, non-consequentialist values, etc.) into preferences. These facts may include, for example, what is the right way to deal with ontological crises. The existence of such facts seems plausible because if there were facts about what is rational (which seems likely) but no facts about how to become rational, that would seem like a strange state of affairs.
4 None of the above facts exist, so the only way to become or build a rational agent is to just think about what preferences you want your future self or your agent to hold, until you make up your mind in some way that depends on your psychology. But at least this process of reflection is convergent at the individual level so each person can reasonably call the preferences that they endorse after reaching reflective equilibrium their morality or real values.

If 3 is true, then we can figure out and use the "facts about how to translate non-preferences into preferences" to "map fuzzy human preferences to well-defined preferences" but if 4 is true, then running the human as an emulation becomes the only possible way forward (as far as building an aligned agent/successor). Is this close to what you're thinking?

I also want to note that if 3 (or some of the other metaethical alternatives) is true, then "strong non-upload necessity", i.e. that it is impossible to construct a perfectly aligned successor that is an emulation, becomes very plausible for many humans, because an emulation of a human might find it impossible to make the necessary philosophical progress to figure out the correct normative facts about how to turn their own "non-preferences" into preferences, or simply don't have the inclination/motivation to do this.

^{^}
which I don't endorse as something we should currently try to do, see Three Approaches to "Friendliness"

[-]Cole Wyeth4mo10

I think 4 is basically right, though human values aren’t just fuzzy, they’re also quite complex, perhaps on the order of complexity of the human’s mind, meaning you pretty much have to execute the human’s mind to evaluate their values exactly.
Some people, like very hardcore preference utilitarians, have values dominated by a term much simpler than their minds’. However, even those people usually have somewhat self-referential preferences in that they care at least a bit extra about themselves and those close to them, and this kind of self-reference drastically increases the complexity of values if you want to include it.

For instance, I value my current mind being able to do certain things in the future (learn stuff, prove theorems, seed planets with life) somewhat more than I would value that for a typical human’s mind (though I am fairly altruistic). I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration). Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

unrelatedly, I am still not convinced we live in a mathematical multiverse, or even necessarily a mathematical universe. (Finding out we lived in a mathematical universe would make a mathematical multiverse seem very likely for the ensemble reasons we have discussed before)

I think 4 is basically right

Do you think it's ok to base an AI alignment idea/plan on a metaethical assumption, given that there is a large spread of metaethical positions (among both amateur and professional philosophers) and it looks hard to impossible to resolve or substantially reduce the disagreement in a relevant timeframe? (I noted that the assumption is weightbearing, since you can arrive at an opposite conclusion of "non-upload necessity" given a different assumption.)

(Everyone seems to do this, and I'm trying to better understand people's thinking/psychology around it, not picking on you personally.)

I suppose that a pointer to me is probably a lot simpler than a description/model of me, but that pointer is very difficult to construct, whereas I can see how to construct a model using imitation learning (obviously this is a “practical” consideration).

Not sure if you can or want to explain this more, but I'm pretty skeptical, given that distributional shift / OOD generalization has been a notorious problem for ML/DL (hence probably not neglected), and I haven't heard of much theoretical or practical progress on this topic.

Also, the model of me is then the thing that becomes powerful, which satisfies my values much more than my values can be satisfied by an external alien thing rising to power (unless it just uploads me right away I suppose).

What about people whose values are more indexical (they want themselves to be powerful/smart/whatever, not a model/copy of them), or less personal (they don't care about themselves or a copy being powerful, they're fine with an external Friendly AI taking over the world and ensuring a good outcome for everyone)?

I’m not sure that even an individual’s values always settle down into a unique equilibrium, I would guess this depends on their environment.

Yeah, this is covered under position 5 in the above linked post.

unrelatedly, I am still not convinced we live in a mathematical multiverse

Not completely unrelated. If this is false, and an ASI acts as if it's true, then it could waste a lot of resources e.g. doing acausal trading with imaginary counterparties. And I also don't think uncertainty about this philosophical assumption can be reduced much in a relevant timeframe by human philosophers/researchers, so safety/alignment plans shouldn't be built upon it either.

My plan isn’t dependent on that meta-ethical assumption. It may be that there is a correct way to complete your values but not everyone is capable of it, but as long as some uploads can figure their value completion out, those uploads can prosper. Or if they can only figure out how to build an AGI that works out how to complete their values, they will have plenty of time to do that after this acute period of risk ends. And it seems that if no one can figure out their values, or eventually figure out how to build an AGI to complete their values, the situation would be rather intractable.

I don’t understand your thinking here. I’m suggesting a plan to prevent extinction from AGI. Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people. At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in. But I don’t see a reason that my plan runs a particular risk of locking in misconceptions.

yes, generalization in deep learning is hard, but it’s getting rapidly more effective in practice and better understood through AIT and mostly(?) SLT.
I think this is tractable. Insofar as it’s not tractable, I think it can be made equally intractable for capabilities and alignment (possibly at some alignment tax). I have more detailed ideas about this, many of which are expressed in the post (and many of which are not). But I think that’s the high level reason for optimism.

Why is it a breaking issue if some uploads don’t work out exactly what they “should” want? This is already true for many people.

I'm scared of people doing actively terrible things with the resources of entire stars or galaxies at their disposal (a kind of s-risk), and concerned about wasting astronomical potential (if they do something not terrible but just highly suboptimal). See Morality is Scary and Two Neglected Problems in Human-AI Safety for some background on my thinking about this.

At worst it just requires that the initial few batches of uploads are carefully selected for philosophical competence (pre-upload) so that some potential misconception is not locked in.

This would relieve the concern I described, but bring up other issues, like being opposed by many because the candidates' values/views are not representative of humanity or themselves. (For example philosophical competence is highly correlated with or causes atheism, making it highly overrepresented in the initial candidates.)

I was under the impression that your advocated plan is to upload everyone at the same time (or as close to that as possible), otherwise how could you ensure that you personally would be uploaded, i.e. why would the initial batches of uploads necessarily decide to upload everyone else, once they've gained power. Maybe I should have clarified this with you first.

My own "plan" (if you want something to compare with) is to pause AI until metaphilosophy is solved in a clear way, and then build some kind of philosophically super-competent assistant/oracle AI to help fully solve alignment and the associated philosophical problems. Uploading carefully selected candidates also seems somewhat ok albeit a lot scarier (due to "power corrupts", or selfish/indexical values possibly being normative or convergent) if you have a way around the social/political problems.

better understood through AIT and mostly(?) SLT

Any specific readings or talks you can recommend on this topic?

AI ALIGNMENT FORUM
AF