This is an expanded version of my answer to application problem 2 for Nate Soares and Vivek Hebbar's SERI MATS stream. This Alignment idea is somewhat non-standard. See Section C for a discussion of that, along with some general ideas on Alignment. The text is long because I’ve tried to include all details relevant to the discussion.
Alignment idea: For any possible AGI design, run a physics simulation calculating how much diamond it ends up producing in the universe. Build the one maximizing it.
1. How do we get an (approximately) accurate Physics simulation, and the right world model?
Build a simulation environment with the best current guess of the physical laws governing the evolution of the macroscopic[1] universe. Now, since we don't have a Theory of Everything (Quantum Mechanics and General Relativity are incompatible), we can't, for instance, model everything in terms of elementary particles. But we can model everything (although at a lower granularity) using the macroscopic (and approximate) laws of physics which we employ daily for dealing with molecules or electricity (including General Relativity) (this is not looking good for maximizing diamond, but see Problem 3 below). For instance, instead of specifying the quantum fluctuations governing molecular vibration, just implement some empirical facts that determine molecules' behavior as correctly as possible. Of course, building this model (if possible) would require huge amounts of work from many physicists and engineers, as well as the use of unbounded memory and compute for testing it accurately.[2] It is possible that the physicists get stuck, or can't put together a coherent macroscopic simulator without a paradigm change, but for these concerns see Problem 2.
This simulator can run many different universes. We still need to specify which universe we're in (the "initial conditions" of our world model). For this, combine two approaches:
Have Physicists put in (approximate) known facts. For example, if we know with certainty that on Earth there's an amount between X and Y of a certain material, then specify that between so and so coordinates[3] (where Earth should be in the simulation) there's such an amount of it. This fact will rule out a vast amount of possible simulations (universes). Of course, it will still leave open a huge amount of possible universes (some of which don't even have something like Earth, but in which that material happens to be there for other reasons). But adding an inordinate amount of facts like this one will reduce search space. Other such facts might be:
"between coordinates so and so there are between 5B and 10B approximate humans (an approximate human is a physical system which presents these approximate properties and approximately these materials)"
"between coordinates so and so there is such an amount of this physical process happening"
"approximately X time ago the universe was approximately like this (how it was seconds[4] after the Big Bang)"
Ideally, we also input as many facts about distant regions as possible.
Use sensors to accurately pin down some precise facts. Have some highly reliable sensors spread out across Earth, sampling random information, such as the concentration of X in the air at a certain time, or the exact frequency of light received at time Y (and give the system a tight approximation of the sensor's coordinates). Again, these precise facts will rule out a vast amount of possible universes, while still leaving open many others.
These two approaches operate at different scales and, together, pin down exactly which universe we're in. The broad approximate facts reduce the search space to universes where something approximately like Earth and our observable region exist. The precise facts then discard almost all universes similar to ours but contingently different in small fluctuations.
Now let's put that infinite compute to work! Have the computer determine all possible world models which satisfy all of these facts (the "boundary conditions", which need to lie in the past, see footnote 11). More specifically, have it calculate all initial conditions which are "very roughly approximately like how we think the universe was seconds after the Big Bang", run all those simulations for billions of years, and discard any which don't satisfy some boundary condition[6]. Ideally, this will leave only one world model. More realistically, I expect it to leave many world models, which are almost identical with respect to Earth and the near observable universe, but differ on whether "this molecule 3B lightyears to the right has velocity X or Y". If that's the case, the computer can just average the amount of diamond over the different simulations of these similar models.
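To make this concrete, here is a minimal sketch (in Python) of the enumerate-simulate-filter-average structure just described. The names `enumerate_initial_conditions`, `simulate`, `boundary_conditions` and `diamond_score` are hypothetical stand-ins of my own, not part of the proposal, and the real objects would of course be astronomically larger.

```python
# Minimal sketch of the world-model search: enumerate candidate initial
# conditions, simulate each forward, keep only the histories satisfying
# every boundary condition, and average the diamond metric over survivors.
# All arguments are hypothetical stand-ins.

def find_world_models(enumerate_initial_conditions, simulate, boundary_conditions):
    surviving = []
    for initial_state in enumerate_initial_conditions():
        history = simulate(initial_state)  # e.g. billions of simulated years
        # Keep the model only if every approximate fact and every precise
        # sensor reading is satisfied somewhere in its past.
        if all(condition(history) for condition in boundary_conditions):
            surviving.append(history)
    return surviving

def expected_diamond(world_models, diamond_score):
    if not world_models:
        raise RuntimeError("no world model satisfies the boundary conditions (Problem 1)")
    # The survivors should be near-identical around Earth; just average.
    return sum(diamond_score(m) for m in world_models) / len(world_models)
```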
Problem 1. What if the procedure leaves no world model? Or it leaves very dissimilar world models?
Solution: Improve Physics and iterate. If the above happens, or something else doesn't add up with the obtained model, then either some of the boundary conditions are wrong, or we are fundamentally wrong about Physics. For the first, we can revise our facts and retry. Check whether some facts discard many otherwise-acceptable models. For approximate facts, scrutinize them. For precise facts, consider whether a sensor has failed, or retry with more fail-safe (or fewer) sensors. For the second, physicists[7] can toy with the simulation and learn more about Physics (not because the simulation knows any more Physics than what we put in, but because it is an invaluable tool to check the consequences of certain laws, where behaviors become extreme, what the early universe actually looks like according to our models, etc.).
Problem 2. What if the above iteration does not converge, or takes too long?
Partial rebuttal: I don't expect this process to quickly produce arbitrarily accurate laws of Physics or world models. I only expect it to achieve a certain acceptable threshold of accuracy (why might this suffice? see Section A). And I think that's very plausible. Indeed, what would it look like for physicists to get unsolvably stuck, having at their disposal infinite compute to model any laws of Physics and any universe (which removes a big amount of the difficulties of usual theorizing)? I can imagine them hitting a kernel of unexplained misbehavior which requires a complete change of framework. But with the simulator, and possibly also using infinite compute to deterministically generate new frameworks (which possibly a modest AI system, with no more than present-day compute, can check for interest), I expect them to make the necessary framework changes that get us to the required accuracy. I'm not even claiming Physics as a whole will be solvable, or that complexity will bottom out. Only that macroscopic (or molecular-atomic) events will be accurately modelable by humans using infinite compute. In fact, even if certain macroscopic phenomena become irreducibly unexplainable/unmodelable, they might be local in nature, and the simulation can probably imperfectly work its way around them[8].
One might also argue this whole enterprise (and especially exploiting the infinite compute correctly) is too big or complex for human tackling. And while the amount of information dealt with is gigantic, the required tasks seem as highly factorable as current research, and so surmountable by a big enough team. As for time constraints, I think this process can plausibly either resolve relatively quickly or take many decades[9].
Problem 3. How will the simulator calculate the amount of diamond, if its model is no more fine-grained than molecules, or even worse?
Partial solution: It might well be that the above process finds a framework unifying sub-atomic and macroscopic effects, in which case this is solved. Even if it doesn't, we might be able to implement a hybrid model which, for instance, when dealing with any material with a high enough concentration of carbon (recognized by macroscopic properties), implements a rough approximation of an atomic model in that region, optimized to check whether the material can be considered diamond (but this is highly speculative and possibly unworkable). Even if that's not possible either, we can have our system recognize diamond by its macroscopic properties. For instance, we might implement that any material approximately like X subjected to process (conditions) Y results in diamond, among many other diamond-facts[5].
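As a toy illustration of what such macroscopic "diamond-facts" could look like once encoded, here is a sketch; the property names and thresholds (`carbon_fraction`, `mohs_hardness`, and so on) are placeholders I made up, not claims about what the physicists would actually implement.

```python
# Toy "diamond-facts": classify a simulated region as diamond from coarse,
# macroscopic properties instead of atomic structure. Property names and
# thresholds are illustrative placeholders only.

def looks_like_diamond(region):
    """`region` is a dict of macroscopic properties reported by the simulator."""
    return (region.get("carbon_fraction", 0.0) > 0.99
            and region.get("mohs_hardness", 0) >= 10
            and region.get("density_g_cm3", 0.0) > 3.4)

def diamond_mols(universe_state):
    """Total mols of carbon sitting in regions classified as diamond."""
    return sum(r["carbon_mols"] for r in universe_state if looks_like_diamond(r))
```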
1'. An alternative approach for simulating the world
As a side note, and especially given the worry that we are fundamentally wrong about Physics, we might prefer our infinite computer to search over more general frameworks. Instead of feeding it our current laws of Physics, we might want to feed it the mathematical structure that (we think) any laws of Physics would present. If so, we can't feed it any approximate facts (their translation would vary across mathematical frameworks), but we can still feed it precise sensor information. Of course, now it won't be presented as "the frequency of this light is X", but just as "through your sensor Y you are receiving this bit of information". Broadly speaking, the computer would compute all Tegmarkian universes (within a certain enormous but bounded class, and possibly with a simplicity prior, but somehow discounting for Boltzmann brains) and check for those which include the sensor information pattern anywhere. Of course, this is already looking similar to Kosoy's Infra-Bayesian Physicalism. Also, the amount of sensory data would probably need to be enormous as well, and this would conflict with sensor reliability. Also, a big team of mathematicians, physicists and engineers would again be needed to iterate this (and eventually find out what diamond is in the resulting model), and worries about convergence reappear.
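A rough sketch of this alternative search, under the (big) assumption that candidate "mathematical universes" could be handed to us abstractly as pairs of a description length and a generated bit history: keep the ones that embed our sensor readings anywhere, weighted by a crude simplicity prior. Everything here is my own illustrative framing, not a claim about how such a search would actually be specified.

```python
# Sketch of the framework-agnostic search: keep candidate universes whose
# generated history contains the sensor bit pattern anywhere, weighted by a
# crude simplicity prior (2 ** -description_length). Purely illustrative.

def consistent_universes(candidate_universes, sensor_bits):
    """candidate_universes: iterable of (description_length, bit_history) pairs,
    where bit_history and sensor_bits are strings of '0'/'1'."""
    for description_length, bit_history in candidate_universes:
        if sensor_bits in bit_history:       # our observations appear somewhere
            yield 2.0 ** (-description_length), bit_history
```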
2. How do we build the AGI?
So we have our world model(s), (completely or approximately) specifying the universe up to time t (before the AGI is built), and from it we can deterministically infer the evolution of the future (or each deterministic evolution of the future in the different but similar world models). We'd like to tell our model "imagine at time t+s this physical system (our AGI) suddenly appears at coordinates X, calculate how much diamond the future universe contains".
As a small caveat, there are different ways to specify "how much diamond the future contains" (that is, what we actually want). If mol(t) is the mols of diamond at time t, it could mean maximizing mol(t+s+x) for a certain x, or maybe we care about the diamond's prevalence in time, and so we want to maximize $\int_{t+s}^{\infty} \mathrm{mol}(t')\, dt'$.
The latter would be trickier for our model since we'd somehow have to specify when the simulation stops (the universe ends). If our resulting Physics model implies the universe will indeed end at some point, then we implement that. Otherwise, since the simulated universe has a finite amount of different possible states, we can run each simulation until it loops, and consider the value (diamond) contained in said loop divided by its time length. Of course, this procedure could be especially vulnerable to small errors in the simulation (for instance, entering loops which wouldn't actually happen in the universe). Also, if time turns out to be related to space and other dimensions in weirder ways, we might not be able to so neatly conceptualize our simulation as a cellular automaton, and the procedure wouldn't be possible.
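Here is a minimal sketch of both metrics, assuming a deterministic `step` function on hashable universe states and a `mol_of_diamond` measure; both are hypothetical stand-ins for the simulator, and the loop-based metric implements the "run until the finite-state universe repeats, then average over the loop" idea above.

```python
# Sketch of the two diamond metrics. `step` and `mol_of_diamond` are
# hypothetical stand-ins for the simulator's transition function and
# diamond measure.

def diamond_at_fixed_time(state, step, mol_of_diamond, horizon_steps):
    """mol(t+s+x): advance the simulation a fixed number of steps and measure."""
    for _ in range(horizon_steps):
        state = step(state)
    return mol_of_diamond(state)

def long_run_diamond_rate(state, step, mol_of_diamond):
    """Run the (finite-state, deterministic) simulation until a state repeats,
    then return the diamond contained in the loop divided by its length."""
    first_seen = {}
    diamond_history = []
    while state not in first_seen:
        first_seen[state] = len(diamond_history)
        diamond_history.append(mol_of_diamond(state))
        state = step(state)
    loop = diamond_history[first_seen[state]:]
    return sum(loop) / len(loop)
```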
Problem 4. The AGI "suddenly appearing at coordinates X" would mess with our deterministic simulator
More concretely, the computer will use the universe at time t to derive how the universe is at time t+s (when the AGI is deployed). Hardwiring the value of those coordinates at time t+s would be incompatible with other parts of the simulation, and ill-defined. On the other hand, having the simulation run as usual from t to t+s would include simulating the infinite computer making this very computation (since at t+s the AGI is deployed, the computation must have already ended)[10].
Solution: Black box the whole setup and remove it from the simulation to avoid circularity. Some time after t, build a big black box around the infinite computer, with two compartments: one for the computer, and another where the AGI will be built (as big as we can make it). By black box I mean a physical cover which minimizes information transfer as much as we can: adiabatic, rigid, impermeable, no gas inside, etc. (although mutual information can never be literally zero). In the AGI compartment, we will place a supercomputer with its memory completely wiped (we might want this supercomputer, the infinite computer and all the machinery to produce only very weak electromagnetic and gravitational fields, to further reduce interference, which seems difficult). A certain machine will connect the infinite computer with the supercomputer so that, upon deciding on the AGI design, this machine (in a previously set amount of time, and by following a previously set course of action) rearranges the supercomputer's memory bits so that they encode the physical design. The machine then becomes idle in a pre-determined state, and so does the infinite computer, so that the only variance upon later opening the black box is the supercomputer's physical state.
To determine the AGI design, the computer acts as follows. For every possible bit configuration of the supercomputer (finitely many), run the simulation from time t up to time t+r, the time at which the black box build is completed. Stop modeling the part of the universe inside the black box, and continue modeling the rest as if that region were completely void (or more accurately, contained a certain concentration of materials which produce a certain small electromagnetic field, etc.) up until time t+s, at which the black box opens and the simulator refills that part of the universe with the physical layout known to be inside the black box (for the chosen AGI design). Now, this is still to a great extent hardwiring the design into the simulation, and will present some discontinuities. We could probably make the simulation work around these obstacles, by using an even coarser (probably probabilistic) physical model for the black-boxed spatiotemporal region, and averaging over (or choosing one of) all the possible coarse physical developments of that region which lead to the universe being exactly as at time t+s, with the AGI built[11]. We can also implement this last strategy without the black box (doing it for the whole universe), but the black box tries to minimize sources of inaccuracy. We are basically using patches to make the discontinuity in the simulation smoother and less problematic. Then of course, the simulation after t+s proceeds as usual, and calculates our preferred diamond-metric. Choose the AGI design which maximizes it.
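A minimal sketch of this outer search, assuming the black-box machinery above is packaged into hypothetical functions of my own naming (`simulate_with_black_box_void`, `insert_black_box_contents`, `simulate_after_opening`, `diamond_metric`):

```python
# Sketch of the outer search over AGI designs (bit configurations of the
# supercomputer inside the black box). The four function arguments are
# hypothetical stand-ins for the machinery described above.

def best_agi_design(bit_configurations, world_model,
                    simulate_with_black_box_void, insert_black_box_contents,
                    simulate_after_opening, diamond_metric):
    # The universe outside the box evolves identically for every design,
    # so it only needs to be simulated once up to t+s.
    state_at_opening = simulate_with_black_box_void(world_model)
    best_design, best_score = None, float("-inf")
    for design in bit_configurations:
        # At t+s the box opens: refill the region with the layout encoding
        # this design, then simulate onwards and score the result.
        state = insert_black_box_contents(state_at_opening, design)
        score = diamond_metric(simulate_after_opening(state))
        if score > best_score:
            best_design, best_score = design, score
    return best_design
```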
Even if every aspect of our world model(s) looks good to physicists, it will probably contain several small errors (not corresponding to reality). To alleviate that, we can implement variational methods in the above search. That is, for every molecule/particle/physical cluster modeled by our simulator, say its state in the simulation is completely determined by a certain vector. Choose one of them randomly, and introduce a small random fluctuation in one component of the vector. Instead of just simulating our world model(s), the computer can also simulate models identical except for any of these fluctuations, and average over all of them. The average should be weighted, so that the more such fluctuations have been introduced, the less weight the model has (because we do expect our world model to be broadly right). This variational search will be carried out in the last, "build AGI" step, but before that, to reduce the probability that some small variations lead to weird models which somehow dominate the weighted average, physicists can inspect this variational structure of our world model (in other ways that don't involve building an AGI). As you might notice, I am just averaging everything and hoping our accuracy is enough (see Section A).
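Here is a sketch of the variational averaging, assuming (purely for illustration) that a world model is a list of state vectors and that the down-weighting is geometric in the number of fluctuations; the concrete parameters (`scale`, `decay`, sample counts) are placeholders, not part of the proposal.

```python
import random

# Sketch of the variational averaging: perturb random components of random
# state vectors, and down-weight copies by how many fluctuations they carry.
# Every parameter here is an illustrative placeholder.

def perturb(model, n_fluctuations, scale=1e-6):
    model = [list(vec) for vec in model]            # copy the state vectors
    for _ in range(n_fluctuations):
        vec = random.choice(model)
        i = random.randrange(len(vec))
        vec[i] += random.gauss(0.0, scale)          # small random fluctuation
    return model

def variational_score(base_models, score, max_fluctuations=3,
                      samples_per_level=10, decay=0.5):
    """Weighted average of `score` over fluctuated copies of the world model(s),
    with weight decay**k for copies carrying k fluctuations."""
    total = total_weight = 0.0
    for base in base_models:
        for k in range(max_fluctuations + 1):
            weight = decay ** k
            for _ in range(1 if k == 0 else samples_per_level):
                total += weight * score(perturb(base, k))
                total_weight += weight
    return total / total_weight
```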
We might think of providing the supercomputer with certain mechanisms for interaction with the universe once the black box opens, to ensure the system has immediate causal powers. For instance, we might establish a pre-determined mechanism that, very shortly after the black box opens, provides it with access to the internet, or control of a horde of drones (let's call these systems "mechanical arms"). This is probably overkill: the search space is so wide that the resulting physical design will almost surely trivially find mechanical arms to use (and quickly take over humanity, and so on). Now, as long as pre-fabricated mechanical arms don't somehow increase interference with the black box, we might think they could only help. But, if (without them) the search isn't even capable of finding its own mechanical arms, then something has gone immensely wrong with the search. And in that situation we might prefer the resulting system to have no mechanical arms. So not adding them might be a free (although extremely inefficient and only marginally helpful) safety measure, that allows for a re-roll.
Note also that some of the details of the above method might not work out if our Physics paradigm changes so much that it doesn't make sense anymore to talk about black boxing, variational methods or other concepts. I find it unlikely that such fundamental concepts become obsolete. But even if they do (and provided we do converge on acceptable macroscopic laws of Physics), it seems more likely for the new more accurate paradigm to provide more efficient methods of doing what we wanted to do (better "black boxing", or better "variational methods"), than for it to present fundamental limitations to what we wanted to do (but this is speculation).
Of course, this whole method needn't produce anything remotely close to what we think an AGI might look like. It just produces a physical setup that maximizes diamond. It is conceivable that this physical setup maximizes diamonds for weird reasons, and cannot be considered itself an agent (as an absurd example, maybe the setup is just a pattern that, when seen by a human, brain-hacks them into only caring about diamonds, and turns them intelligent enough so that humanity will spread across the galaxy). But if what we believe about agents and maximization is even remotely right, then the physical setup which maximizes diamonds, especially considering we are averaging over many slightly different universes, will be an AGI[12] (see Section A).
A. Why this might have a shot at working
Much of the above argument centered on maximizing model accuracy and minimizing the errors in the whole setup. For this engineer-y problem, many different ideas are implemented, most of them not completely sure to work. This might give the impression that the argument is very conjunctive and thus sure to fail. And indeed, the probability that everything works according to plan, and that my assessments of how optimistic to be about model convergence and accuracy are in the right ballpark, is basically zero. But I don't need that to have a shot at this working!
See, if an AGI is truly a naturally general form of maximizing things (in the sense that in most universes with huge amounts of something, there's an AGI maximizing for it), then we might expect to find such AGIs in many of the high-scoring universes. What's more, the AGIs found in different universes won't be much different from each other, and each AGI won't be overly reliant on its universe's fine details, but on the contrary will deploy a general procedure that's probably approximately as useful across many similar universes.
Here's another way to put it. In our setup, we are trying to encode something in a very small box (smaller than Earth) that can do something very big and locally complex. If the code relied on non-general local information about different parts of the universe ("this molecule 3B lightyears to the right has velocity X or Y"), then it wouldn't come close to fitting in the box[13]. So our code must somehow be highly compressed, and not directly rely on almost any of those facts. So it is very likely that all such facts it actually relies on are correctly approximated by our model.[14]
Now, this argument can be made even if the computer considers only one world model. But in our setup, we employ variational methods (and also average over a set of acceptable world models if the search doesn't yield a unique one). This drastically biases the search towards finding general and under-specific AGIs, instead of overly specific setups! Indeed, the setups that perform great across many different fluctuated universes are presumably those which more readily correspond to our usual concept of an AGI: an agent taking information as input and delivering actions, which can thus perform well with many different inputs. Conversely, any more deterministic system, heavily reliant on specific details of its context, will fail in all those universes which fluctuate said details.
So having an overly accurate model would be counterproductive. We only need a certain threshold accuracy to ensure basic facts about Physics, diamonds, Earth and so on are correctly captured. And after said threshold, gaining more accuracy will barely improve the situation, since we're gonna use variational methods anyway to bias the search towards general intelligences.[15]
As a further exemplification of this point, suppose we run our method (or the actually correct version of our method) twice, but the second time we tweak it so that we only care about maximizing diamonds in the closest half of the observable universe (or the simulator only implements this part of the simulation, or something similar). I expect both resulting physical designs to be extremely similar if not identical. This is because, for both instances, the playing field for maximizing diamonds is so much larger than the physical system designed, that the best strategy is building an all-purpose agent that can locally decide how to exploit each part of the playing field.
Mainly because of this reasoning, if there were no further important problems to the ones considered (above and below), I'd expect this general idea to have a moderately good (about 30%) shot at working (although many of the details would need to be polished, and surely some changed, and Problem 5 might turn out to be intractable). But it's highly likely that I'm missing out on some important problems. And that's why my probability for this working is not more than 15%.
B. Further problems
Problem 5. What if a malevolent agent inside any of the many simulations hijacks the computer?
I believe this to be the most worrisome problem. To the extent that we've conceptualized the infinite computer as an idealized oracle[16], malevolent agents might be unable to escape. Indeed, at least in the "build the AGI" search, the only output of every simulation is its diamond-score. And a malevolent agent might deliberately increase its universe's diamond-score so that we end up making that universe come to be. But of course, we only care about diamonds, so then we are glad to be in that universe!
That is, as long as our diamond-score is truly a robust proxy for the amount of diamond. If the proxy for what diamond is that we implement in our Physics simulation deviates even slightly from the real thing we care about, then we will get a ton of adversarial pressure, due to the immense size of the search space (although that doesn't necessarily imply that we'll end up in a low-diamond universe). But despite Problem 3, I do expect our diamond-score to be a correct proxy with very high probability[17]. In particular, I think we can be strict enough so that, even if diamond-score doesn't capture all instances of diamond, we can be sure anything that increases diamond-score is an actual diamond.
Even with an idealized infinite computer, malevolent agents might be more of a problem in the search for the world model. The simulations' outputs then are much more complex facts that will be examined by the physicists. The physicists might try to average over simulations, or examine each simulation only very locally, so that each simulation can't forward a lot of information, but this immensely complicates their work. Of course, I'm not only literally worried about a single malevolent AGI existing inside one of these simulations (and at some time during the simulation acting as if it is being simulated in a world exactly like ours, and trying to exploit this by giving the physicists adversarial input), but more generally about adversarial pressures arising from the vast search, which might obstruct the whole process. For example, having the infinite computer run all simulations and present only those which seem interesting (according to some benchmark, or small AI) might be especially bad, since this "interesting" is probably a pretty natural concept (maybe "greatly dissimilar to most models"), and many adversarial inputs will fall under it (as an absurd example, consider finding a world model whose stars clearly spell out "I'm God, end this process or you'll go to Hell"). Ideally, the only searches carried out by the physicists will be to discriminate the world models which satisfy the "boundary conditions", and these are highly contingent and not natural, so they might seem safer. But this is speculative and I need to think more about this.
And if we stop idealizing the infinite computer, and materialize it in the physical world as an actual computation running on actual hardware, malevolent agents will have many routes to escape even in the last search. In fact, the search space is so big, that probably the first ever search carried out by physicists on the infinite computer would somehow hijack/disable/take over.
After all, that's the whole issue with consequentialism and big search spaces, which we had hidden under the infinite computer and is coming back to bite. Given a big enough search space, whose entries are complex enough to contain programs that hack us, we will come across such programs before finishing the search.
Tentative solution: If the physicists had a procedure to stop the search upon finding a promising model, or the computer had a way to stop the search upon finding a good enough diamond maximizer, then we'd actually have a chance of finding those before the hacking models. The first might be doable: although selecting for "promising" might be dangerous as above, we'd be inside the limited space of worlds satisfying the "boundary conditions", and that seems safer. The second seems doable, because any physical setup that achieves more than, let's say, 80% of the diamond-score physically attainable (calculated for instance through volume constraints) is almost surely already a diamond-maximizing AGI (by the argument of Section A).
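A minimal sketch of that second stopping rule, assuming we have some physically derived `upper_bound` on the attainable diamond-score; all names and the 80% threshold are just the illustrative numbers from above.

```python
# Sketch of the "stop on good enough" safeguard: halt the design search as
# soon as some candidate reaches a large fraction of a physically derived
# upper bound on the diamond-score, instead of exhausting the (dangerously
# large) search space. All arguments are hypothetical stand-ins.

def search_with_early_stop(candidate_designs, diamond_score, upper_bound,
                           threshold=0.8):
    best_design, best_score = None, float("-inf")
    for design in candidate_designs:
        score = diamond_score(design)
        if score > best_score:
            best_design, best_score = design, score
        if score >= threshold * upper_bound:
            # By the argument of Section A, this is almost surely already a
            # diamond-maximizing AGI; stop before the search wanders further.
            break
    return best_design, best_score
```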
Problem 6. Might not the resulting AGI miss out on unknown Physics facts that would maximize diamond even further (for instance, by packing it more compactly)?
Yes. In a sense, once we start searching for the AGI to build, our Physics remains forever fixed. Ideally, the physicists will have found all improvements to Physics that allow for more diamonds. Realistically, it's possible that we miss some. In that case, if one of the AGI designs that the search goes over would exploit these unknown facts (and thus produce more diamond), inside our simulation it will just perform badly (trying to exploit some facts that are false in the simulation), and won't get selected[18]. It's not clear to what extent the AGI resulting from this search, that is, a diamond-maximizing AGI with fixed ontology/laws of Physics, can be considered a truly general intelligence (even if it's possible that its ontology is optimal for diamond maximization). It might seem that to build a diamond-maximizing AGI which auto-updates its ontology we need to solve STEM AI first. But actually, it might be easier to build an agent that does everything it can (including STEM) to achieve a goal, than to build one that only does STEM (and that we can use as a tool).
Problem 7. Won't the resulting AGI miss out on acausal trade?
If we use our method of Section 1, focusing on Physics and causality, then indeed we have no reason at all to expect our AGI to participate in acausal trade; on the contrary, it almost surely won't. That is, unless physicists end up somehow discarding the current understanding of causality, in which case that method doesn't even seem applicable.
If we use the alternative method of Section 1', focusing on information and evidence, our world model might end up accommodating evidentialist views (if these arise naturally/are canonical), and so might search for AGIs that acausally trade.
Problem 8. Aren't you assuming reality can ultimately be perfectly or almost perfectly modeled by some mathematically structured laws?
Yes.
Anti-Problem 9. Whole Brain Emulation could help
This is of course common across alignment proposals. In our specific proposal, uploading physicists would allow us to compress the whole process of coming up with acceptable laws of Physics and world model(s) into a single run of the infinite computer. It would also let the process run for as long as necessary to ensure high confidence in the result (or, additionally, have copied physicists independently and locally check through literally every part of the simulation). This can again be more dangerous because the physicists are receiving inputs from an even bigger search space. Some architectures can be implemented to try and minimize this risk. For instance, instead of just having one physicist look at the data, also include a second physicist (or psychologist, or whatever) that looks at the physicist looking at the data, and makes sure nothing weird has happened (like the first physicist getting brain-hacked). This can be iterated, and also have many physicists (instead of just one) looking at each such scenario (HCH-like tree structure); a toy sketch of such a tree follows below. Any such implementation will prevent some failures, but leave other vulnerabilities (or even create some more, although intuitively the bigger structure should be more robust for most inputs).
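A toy sketch of that HCH-like tree structure, where `examine` is a hypothetical stand-in for an uploaded physicist's judgment (returning True if nothing weird is detected); the recursion depth and number of checkers per level are placeholders.

```python
# Sketch of the HCH-like checking structure: each piece of simulation output
# is examined by an (emulated) physicist, and that examination is itself
# examined by several checkers one level up, to a fixed depth.

def checked_verdict(data, examine, n_checkers=3, depth=2):
    verdict = examine(data)
    if depth == 0:
        return verdict
    # Each checker looks at the (data, verdict) pair -- i.e. at the examiner
    # examining the data -- and their reports are checked recursively.
    reports = [checked_verdict((data, verdict), examine, n_checkers, depth - 1)
               for _ in range(n_checkers)]
    return verdict and all(reports)
```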
C. How does this apply to actual Alignment?
Any solution to diamond alignment is already very removed from actual Alignment (and that's how the relaxations encourage fresh takes). But the one presented here is especially untranslatable. It makes such a central use of an absolutely unattainable amount of computation, that removing this relaxation leaves the approach completely inapplicable.
The approach is very non-standard in the sense that, instead of discussing the inner workings of an AGI (and forming intuitive, approximate, highly abstract pictures about how these will relate to consequences in the real world), we directly search over the space of consequences (and this requires the unattainable amount of computation), and try to find methods that make this search possible and safe.
But this solution proves very useful for another purpose of diamond alignment: pointing at the actual difficult kernels of the problem. Removing all bounds to the search's feasibility makes apparent how the vastness of the search space itself is the enemy. This is the enemy of consequentialist thinking of any kind, with or without AGI. But it turns out that in vast and complex enough search spaces AGI occurs very naturally (or so we think), and so many dangers arise through it.
Here's a framing of the Alignment problem inspired by that idea:
Building an AGI is a very chaotic action, in the Chaos Theory sense that small tweaks to it will result in huge differences in the future of the universe. To ensure we don't screw up, we'd like to search through all (or the most important) possible future paths (or action-consequence relations, where the action involves building an AGI). Humans can't efficiently do that, due to fundamental constraints on our architecture and computation power (and because we don't have a method to distinguish the most important paths, because "important" is not some objective feature of the universe, but only defined contingently as "important to us"). But if we build something to do that search for us, or somehow delegate through other mechanisms, the resulting thing or mechanism will have much more searching power than we do, and so if we haven't specified completely correctly (robustly) what it has to search for, it will Goodhart our proxy. That is, the thing or mechanism is itself very chaotic, and we're back at the start. This is a problem, because specifying completely correctly is almost impossible for humans, because the world is very messy and we don't have the search power to explore all consequences of our specification (we don't even have the right Physics).
Between the extremes of "have humans do the search" and "build an AGI to do the search", there are many intermediate solutions. All of these solutions try to satisfy two constraints: being powerful enough so as to efficiently do the search, and not so complex that humans can't specify the objective correctly (because they can't explore the consequences of each specification). This is hard, because being powerful is usually related to being complex. But they are not literally equivalent, so the search space is not literally linear between those two extremes, and some clever tricks surely exist. Ultimately, it's not clear whether solutions satisfying the two constraints exist, or are numerous or natural enough for humans to find them.
Another meta-level useful feature of the solution here presented is that it presses on the boundaries of the diamond alignment problem, stressing how much of the problem is really captured or obscured by which assumptions/relaxations, and to what extent they are reconcilable with fundamental properties of reality. Throughout the text, we find that many details under-determined by the diamond problem's statement are crucial to the viability of some strategies:
whether the computer can be used once or many times
whether time is a concern
whether the infinite computer is a physical system
even what we're satisfied to call a diamond-maximizing AGI (whether it needs to be able to auto-update its ontology, etc.)
Of course, all of these can just be defined away (even the last one) for the sake of concreteness (although having them under-determined at least helps consider a wider range of strategies). And even doing so in the most optimistic way possible won't ensure this solution will work.
But what I'm getting at is that, in these fringe under-determinations, we find expressed many irreconcilable tensions between the idealized relaxations and the reality we're actually thinking about when trying to solve the problem. Some pedagogical/theoretical benefits of tackling the diamond problem are obvious, and so I'm not arguing against doing so. But one might raise the question: to what extent does it make sense, for instance, to consider an idealized non-physical computer, when at the same time we're trying to get around the messiness of the rest of reality?[19][20]
That is, might considering that nonsensical situation not encourage a dangerous double-think, which might later make its way (unnoticed) to our actual opinions about real world Alignment? After all, our real worry is what are the physical consequences of embedded hardware running a certain computation. When dividing up the problem into such neat compartments as the diamond problem does, might our intuitions not later forget what the real problem was about, and point us in non-obviously mistaken (but nonetheless mistaken) directions? That is: does the diamond problem really capture everything that is fundamental to Alignment?
I don't have strong arguments for the diamond problem missing some fundamental core of the issue. After all, it certainly does capture its most obvious aspects. And I know it is consciously obvious to everyone that its relaxations are nonsensical. But I've just come off the other side with the feeling that we might want to be more attentive to how these relaxations make our intuitions about reality bend in weird, incoherent ways.
Throughout the text I use macroscopic loosely (for instance, molecules might be included) to mean as fine a granularity as our current paradigm permits, without entering into quantum or other troubles.
I speak of coordinates, but these of course can't be solely spatial. They should be spatiotemporal to account for relativity, or include whatever further dimensions our preferred Physics requires.
Of course this again makes our AGI potentially lose out on other weird processes that produce the atomic structure of diamond, and so we might end up with a "process Y on material X" maximizer instead of a diamond maximizer (even if the two usually coincide in the universe).
If the simulator was able to deterministically infer the state at time t from the state at time t+1, it might be better (or more informative to Physicists) for the simulation to start with the present, very prohibitive boundary conditions, and make its way back to something like the Big Bang.
It might seem worrisome that I'm invoking physicists so much, since that usually signals a part of the argument which I can't complete. But in this instance, I do think I have a generally good feel for what these physicists would actually be doing, and moderately informed opinions and intuitions as to whether this process would converge, how long it might take, etc.
Although this of course induces some risk of failure in all its predictions, and even without failure our resulting AGI might be missing out on some opportunities to exploit these weird phenomena for diamonds.
I'm not sure whether timeline concerns are supposed to apply to the diamond alignment problem. Maybe they aren't usually considered just because most proposals only use the infinite computer once.
Of course the premise of having an unboundedly fast computer with unbounded memory be a bounded physical system is already nonsensical, but straight up computing the uncomputable (this infinite nested regress) seems categorically even worse. This is also the reason why, when using infinite compute to find the world models which fit the facts (in Section 1), these facts must all lie in the past, and the simulation must not arrive at the moment in time where the computation begins.
This approach doesn't run into the uncomputable infinite nested regress because the coarse model wouldn't be precise enough to model the computer's computation exactly.
Since the smallness of the box is what protects us against overly specific setups prone to fail under the smallest misadjustment, one might wonder whether we truly want the box to be "as big as we can make it". But I think we do, because increasing its size drastically augments the search space for our AGI, and almost negligibly augments the probability that we find an overly specific setup (those probably have size much greater than Earth).
It is conceivable (or even likely) that, if we really did know our world model with arbitrary accuracy, then some (at least partially) deterministic setup creates more diamonds than a general intelligence (because of contingent quirks of our universe). But I guess both achievements (either building a diamond maximizing AGI or somehow maximizing diamond even harder) are sufficient to pass this problem. After all, even an AGI is not omnipotent and will inevitably leave some utility on the table.
Of course, the diamond problem could specify that the infinite computer is physical. But is that, in some relevant sense, less nonsensical than a non-physical computer?
My non-standard solution suffered more than most solutions when dropping, for instance, the relaxation of the computer being idealized (see Problem 6), and that might be why I'm giving so much weight to this issue. But the next paragraph tries to get at how this mismatch, in a way less obvious manner, could also happen in more standard solutions (that don't even press that hard on the problem's boundaries).
This is an expanded version of my answer to application problem 2 for Nate Soares and Vivek Hebbar's SERI MATS stream. This Alignment idea is somehow non-standard. See Section C for a discussion of that, along with some general ideas on Alignment. The text is long because I’ve tried to include all details relevant to the discussion.
Alignment idea: For any possible AGI design, run a physics simulation calculating how much diamond it ends up producing in the universe. Build the one maximizing it.
1. How do we get an (approximately) accurate Physics simulation, and the right world model?
Build a simulation environment with the best current guess of the physical laws governing the evolution of the macroscopic[1] universe. Now, since we don't have a Theory of Everything (Quantum Mechanics and General Relativity are incompatible), we can't for instance model everything in terms of elementary particles. But we can model everything (although with a lower granularity) using the macroscopic (and approximate) laws of physics which we employ daily for dealing with molecules or electricity (including General Relativity) (this is not looking good for maximizing diamond, but see Problem 3 below). For instance, instead of specifying the quantum fluctuations governing molecule vibration, just implement some empirical facts that determine their behavior as correctly as possible. Of course, building this model (if possible) would require huge amounts of work from many physicists and engineers. Also the use of unbounded memory and compute for testing it accurately.[2] It is possible that the physicists get stuck, or can't put together a coherent macroscopic simulator without a paradigm change, but for these concerns see Problem 2.
This simulator can run many different universes. We still need to specify which universe we're in (the "initial conditions" of our world model). For this, combine two approaches:
These two approaches coordinate in different scales to exactly pin down the universe we're in. The broad approximate facts reduce search space to universes where something approximately like Earth and our observable region exist. The precise facts then discard almost all universes similar to ours but contingently different in small fluctuations.
Now let's put that infinite compute to work! Have the computer determine all possible world models which satisfy all of these facts (the "boundary conditions", which need to lie in the past, see footnote 11). More specifically, have it calculate all initial conditions which are "very roughly approximately like how we think the universe was seconds after the Big Bang", run all those simulations for billions of years, and discard any which don't satisfy some boundary condition[6]. Ideally, this will leave only one world model. More realistically, I expect it to leave many world models, which are almost identical with respect to Earth and the near observable universe, but differ on whether "this molecule 3B lightyears to the right has velocity X or Y". If that's the case, the computer can just average the amount of diamond over the different simulations of these similar models.
Problem 1. What if the procedure leaves no world model? Or it leaves very dissimilar world models?
Solution: Improve Physics and iterate. If the above happens, or something else doesn't add up with the obtained model, then either some of the boundary conditions are wrong, or we are fundamentally wrong about Physics. For the first, we can revise our facts and retry. Check whether some facts discard many otherwise-acceptable models. For approximate facts, scrutinize them. For precise facts, consider whether a sensor has failed, or retry with more fail-safe (or fewer) sensors. For the second, physicists[7] can toy with the simulation and learn more about Physics (not because the simulation knows any more Physics than what we put in, but because it is an invaluable tool to check for the consequences of certain laws, where behaviors become extreme, how the early universe actually looks like according to our models, etc.).
Problem 2. What if the above iteration does not converge, or takes too long?
Partial rebuttal: I don't expect this process to quickly produce arbitrarily accurate laws of Physics nor world models. I only expect it to achieve a certain acceptable threshold of accuracy (Why might this suffice? see Section A). And I think that's very plausible. Indeed, what would it look like for physicists to get unsolvably stuck, having at their disposal infinite compute to model any laws of Physics and universe (which removes a big amount of the difficulties of usual theorizing)? I can imagine them hitting a kernel of unexplained misbehavior which requires for a complete change of framework. But with the simulator, and possibly also using infinite compute to deterministically generate new frameworks, which possibly a modest AI system (with no more than present-day compute) can check for interest, I expect them to surpass the necessary framework changes which get us to the required accuracy. I'm not even claiming Physics as a whole will be solvable, or complexity will bottom out. Only that macroscopic (or molecular-atomic) events will be accurately modelable by humans using infinite compute. In fact, even if certain macroscopic phenomena become irreducibly unexplainable/unmodelable, they might be local in nature, and the simulation can probably imperfectly work its way around them[8].
One might also argue this whole enterprise (and especially exploiting the infinite compute correctly) is too big or complex for human tackling. And while the amount of information dealt with is gigantic, the required tasks seem as highly factorable as current research, and so surmountable by a big enough team. As for time constraints, I think this process can plausibly either resolve relatively quickly or take many decades[9].
Problem 3. How will the simulator calculate the amount of diamond, if its model is no more fine-grained than molecules, or even worse?
Partial solution: It might well be that the above process finds a framework unifying sub-atomic and macroscopic effects, in which case this is solved. Even if it doesn't, we might be able to implement a hybrid model which, for instance, when dealing with any material with a high enough concentration of carbon (recognized by macroscopic properties), implements a rough approximation of an atomic model in that region, optimized to check whether the material can be considered diamond (but this is highly speculative and possibly unworkable). Even if that's not possible either, we can have our system recognize diamond by its macroscopic properties. For instance, we might implement that any material approximately like X subjected to process (conditions) Y results in diamond, among many other diamond-facts[5].
1'. An alternative approach for simulating the world
As a side note, and especially given the worry that we are fundamentally wrong about Physics, we might prefer our infinite computer to search over more general frameworks. Instead of feeding it our current laws of Physics, we might want to feed it the mathematical structure that (we think) any laws of Physics would present. If so, we can't feed it any approximate facts (their translation would vary across mathematical frameworks), but we can still feed it precise sensor information. Of course, now it won't be presented as "the frequency of this light is X", but just as "through your sensor Y you are receiving this bit of information". Broadly speaking, the computer would compute all Tegmarkian universes (within a certain enormous but bounded class, and possibly with a simplicity prior, but somehow discounting for Boltzmann brains) and check for those which include the sensor information pattern anywhere. Of course, this is already looking similar to Kosoy's Infra-Bayesian Physicalism. Also, the amount of sensory data would probably need to be enormous as well, and this would conflict with sensor reliability. Also, a big team of mathematicians, physicists and engineers would again be needed to iterate this (and eventually find out what diamond is in the resulting model), and worries about convergence reappear.
2. How do we build the AGI?
So we have our world model(s), (completely or approximately) specifying the universe up to time t (before the AGI is built), and from it we can deterministically infer the evolution of the future (or each deterministic evolution of the future in the different but similar world models). We'd like to tell our model "imagine at time t+s this physical system (our AGI) suddenly appears at coordinates X, calculate how much diamond the future universe contains".
As a small caveat, there are different ways to specify "how much diamond the future contains" (that is, what we actually want). If mol(t) is the mols of diamond at time t, it could mean maximizing mol(t+s+x) for a certain x, or maybe we care about the diamond's prevalence in time, and so we want to maximize ∫∞t+smol(t′)dt′.
The latter would be trickier for our model since we'd somehow have to specify when the simulation stops (the universe ends). If our resulting Physics model implies the universe will indeed end at some point, then we implement that. Otherwise, since the simulated universe has a finite amount of different possible states, we can run each simulation until it loops, and consider the value (diamond) contained in said loop divided by its time length. Of course, this procedure could be especially vulnerable to small errors in the simulation (for instance, entering loops which wouldn't actually happen in the universe). Also, if time turns out to be related to space and other dimensions in weirder ways, we might not be able to so neatly conceptualize our simulation as a cellular automaton, and the procedure wouldn't be possible.
Problem 4. The AGI "suddenly appearing at coordinates X" would mess with our deterministic simulator
More concretely, the computer will use the universe at time t to derive how the universe is at time t+s (when the AGI is deployed). Hardwiring the value of those coordinates at time t+s would be incompatible with other parts of the simulation, and ill-defined. On the contrary, having the simulation run as usual from t to t+s will include simulating the infinite computer making this computation (since at t+s the AGI is deployed, so the computation has ended)[10].
Solution: Black box the whole setup and remove it from the simulation to avoid circularity. Some time after t, build a big black box around the infinite computer, with two compartments: one for the computer, and another where the AGI will be built (as big as we can make it). By black box I mean a physical cover which minimizes information transfer as much as we can: adiabatic, rigid, impermeable, no gas inside, etc. (although mutual information can never be literally zero). In the AGI compartment, we will place a supercomputer with memory completely wiped (we might want this, the infinite computer and all machinery to produce only very weak electromagnetic and gravitational fields to further reduce interference, which seems difficult). A certain machine will connect the infinite computer with the supercomputer so that, upon deciding on the AGI design, this machine (in a previously set amount of time, and by following a previously set course of action) rearranges the supercomputer's memory bits so as for them to encode the physical design. The machine then becomes idle in a pre-determined state, and the infinite computer as well, so that the only variance upon later opening the black box is the supercomputer's physical state.
To determine the AGI design, the computer acts as follows. For every possible bit configuration of the supercomputer (finitely many), run the simulation from time t up to time t+r, the time at which the black box build is completed. Stop modeling the part of the universe inside the black box, continue modeling the rest as if that region was completely void (or more accurately, contained a certain concentration of materials which produce a certain small electromagnetic field, etc.) up until time t+s, in which the black box opens and the simulator refills that part of the universe with the physical layout known to be inside the black box (for the chosen AGI design). Now, this is still to a great extent hardwiring the design into the simulation, and will present some discontinuities. We could probably make the simulation work around these obstacles, by using an even coarser (probably probabilistic) physical model for the black boxed spatiotemporal region, and averaging over (or choosing one of) all the possible coarse physical developments of that region which lead to the universe being exactly as in time t+s, with the AGI built[11]. We can also implement this last strategy without the black box (doing it for the whole universe), but the black box tries to minimize sources of inaccuracy. We are basically using patches to make the discontinuity in the simulation smoother and less problematic. Then of course, the simulation after t+s proceeds as usual, and calculates our preferred diamond-metric. Choose the AGI design which maximizes it.
Even if every aspect of our world model(s) looks good to physicists, it will probably contain several small errors (not corresponding to reality). To alleviate that, we can implement variational methods in the above search. That is, for every molecule/particle/physical cluster modeled by our simulator, let's say its state in the simulation is completely determined by a certain vector. Choose one of them randomly, and introduce a small random fluctuation in one component of the vector. Instead of just simulating our world model(s), the computer can also simulate models identical except for any of these fluctuations, and average over all of them. The average should be weighted, so that the more such fluctuations that have been introduced the less weight the model has (because we do expect our world model to be broadly right). This variational search will be carried out in the last, "build AGI" step, but before that, to reduce the probability that some small variations lead to weird models which somehow dominate the weighted average, physicists can inspect this variational structure of our world model (in other ways that don't involve building an AGI). As you might notice, I am just averaging everything and hoping our accuracy is enough (see Section A).
We might think of providing certain mechanisms for interaction with the universe to the supercomputer once the black box opens, to ensure the system has immediate causal powers. For instance, we might establish a pre-determined mechanism that, very shortly after the black box opens, provides it with access to the internet, or control of a hoard of drones (let's call these systems "mechanical arms"). This is probably overkill: the search space is so wide, that the resulting physical design will almost surely trivially find mechanical arms to use (and quickly take over humanity, and so on). Now, as long as pre-fabricated mechanical arms don't somehow increase interference with the black box, we might think they could only help. But, if (without them) the search isn't even capable of finding its own mechanical arms, then something has gone immensely wrong with the search. And in that situation we might prefer the resulting system to have no mechanical arms. So not adding them might be a free (although extremely inefficient and only marginally helpful) safety measure, that allows for a re-roll.
Note also that some of the details of the above method might not work out if our Physics paradigm changes so much that it doesn't make sense anymore to talk about black boxing, variational methods or other concepts. I find it unlikely that such fundamental concepts become obsolete. But even if they do (and provided we do converge on acceptable macroscopic laws of Physics), it seems more likely for the new more accurate paradigm to provide more efficient methods of doing what we wanted to do (better "black boxing", or better "variational methods"), than for it to present fundamental limitations to what we wanted to do (but this is speculation).
Of course, this whole method needn't produce anything remotely close to what we think an AGI might look like. It just produces a physical setup that maximizes diamond. It is conceivable that this physical setup maximizes diamonds for weird reasons, and cannot be considered itself an agent (as an absurd example, maybe the setup is just a pattern that, when seen by a human, brain-hacks them into only caring about diamonds, and turns them intelligent enough so that humanity will spread across the galaxy). But if what we believe about agents and maximization is even remotely right, then the physical setup which maximizes diamonds, especially considering we are averaging over many slightly different universes, will be an AGI[12] (see Section A).
A. Why this might have a shot at working
Much of the above argument centered on maximizing model accuracy and minimizing the errors in the whole setup. For this engineer-y problem, many different ideas are implemented, most of them not completely sure to work. This might give the impression that the argument is very conjunctive and thus sure to fail. And indeed, the probability that everything works according to plan, and that my assessments of how optimistic to be about model convergence and accuracy are in the right ballpark, is basically zero. But I don't need that for this to have a shot at working!
See, if an AGI truly is a naturally general way of maximizing things (in the sense that, in most universes with huge amounts of something, there's an AGI maximizing it), then we might expect to find such AGIs in many of the high-scoring universes. What's more, the AGIs found in different universes won't differ much from each other, and each AGI won't be overly reliant on its universe's fine details; on the contrary, it will deploy a general procedure that's probably approximately as useful across many similar universes.
Here's another way to put it. In our setup, we are trying to encode something in a very small box (smaller than Earth) that can do something very big and locally complex. If the code relied on non-general local information about different parts of the universe ("this molecule 3B lightyears to the right has velocity X or Y"), then it wouldn't come close to fitting in the box[13]. So our code must somehow be highly compressed, and not directly rely on almost any of those facts. So it is very likely that all such facts on which it actually relies are correctly approximated by our model.[14]
Now, this argument can be made even if the computer considers only one world model. But in our setup we employ variational methods (and also average over a set of acceptable world models if the search doesn't yield a unique one). This drastically biases the search towards finding general, under-specified AGIs instead of overly specific setups! Indeed, the setups that perform well across many differently fluctuated universes are presumably those which more readily correspond to our usual concept of an AGI: an agent taking information as input and delivering actions, which can thus perform well with many different inputs. Conversely, any more deterministic system, heavily reliant on specific details of its context, will fail in all those universes in which said details fluctuate.
So having an overly accurate model would be counterproductive. We only need a certain threshold of accuracy to ensure basic facts about Physics, diamonds, Earth and so on are correctly captured. And past said threshold, gaining more accuracy will barely improve the situation, since we're going to use variational methods anyway to bias the search towards general intelligences.[15]
To further illustrate this point, suppose we run our method (or the actually correct version of our method) twice, but the second time we tweak it so that we only care about maximizing diamonds in the closest half of the observable universe (or the simulator only implements this part of the simulation, or something similar). I expect both resulting physical designs to be extremely similar, if not identical. This is because, in both instances, the playing field for maximizing diamonds is so much larger than the physical system designed that the best strategy is building an all-purpose agent that can locally decide how to exploit each part of the playing field.
Mainly because of this reasoning, if there were no important problems beyond the ones considered (above and below), I'd expect this general idea to have a moderately good (about 30%) shot at working (although many of the details would need to be polished, and surely some changed, and Problem 5 might turn out to be intractable). But it's highly likely that I'm missing some important problems. And that's why my probability for this working is not more than 15%.
B. Further problems
Problem 5. What if a malevolent agent inside any of the many simulations hijacks the computer?
I believe this to be the most worrisome problem. To the extent that we've conceptualized the infinite computer as an idealized oracle[16], malevolent agents might be unable to escape. Indeed, at least in the "build the AGI" search, the only output of every simulation is its diamond-score. And a malevolent agent might deliberately increase its universe's diamond-score so that we end up making that universe come to be. But of course, we only care about diamonds, so then we are glad to be in that universe!
That is, as long as our diamond-score is truly a robust proxy for the amount of diamond. If the proxy for diamond implemented in our Physics simulation deviates even slightly from the real thing we care about, then we will get a ton of adversarial pressure, due to the immense size of the search space (although that doesn't necessarily imply that we'll end up in a low-diamond universe). But despite Problem 3, I do expect our diamond-score to be a correct proxy with very high probability[17]. In particular, I think we can be strict enough so that, even if diamond-score doesn't capture all instances of diamond, we can be sure anything that increases diamond-score is an actual diamond.
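As a toy illustration of what "strict enough" could mean, here is a sketch of my own (the `Region` type and its thresholds are made up, not the proposal's actual diamond-score): a region's volume is counted only if it passes every conservative test.

```python
# Toy "strict" diamond-score: bias towards precision over recall, so that some real
# diamond may go uncounted, but nothing counted can fail to be diamond.
from dataclasses import dataclass

@dataclass
class Region:
    carbon_fraction: float   # fraction of atoms that are carbon
    lattice_match: float     # similarity to an ideal sp3 tetrahedral lattice, in [0, 1]
    density_g_cm3: float     # bulk density

def strict_diamond_volume(regions, volumes):
    """Sum volume only over regions that pass every strict check."""
    def is_surely_diamond(r):
        return (r.carbon_fraction > 0.999
                and r.lattice_match > 0.99
                and 3.4 < r.density_g_cm3 < 3.6)   # diamond is roughly 3.5 g/cm^3
    return sum(v for r, v in zip(regions, volumes) if is_surely_diamond(r))
```

The design choice is simply precision over recall: under adversarial pressure, we would rather under-count diamond than credit something that isn't.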
Even with an idealized infinite computer, malevolent agents might be more of a problem in the search for the world model. The simulations' outputs then are much more complex facts that will be examined by the physicists. The physicists might try to average over simulations, or examine each simulation only very locally, so that each simulation can't forward a lot of information, but this immensely complicates their work. Of course, I'm not only literally worried about a single malevolent AGI existing inside one of these simulations (and at some time during the simulation acting as if it is being simulated in a world exactly like ours, and trying to exploit this by giving the physicists adversarial input), but more generally about adversarial pressures arising from the vast search, which might obstruct the whole process. For example, having the infinite computer run all simulations and present only those which seem interesting (according to some benchmark, or small AI) might be especially bad, since this "interesting" is probably a pretty natural concept (maybe "greatly dissimilar to most models"), and many adversarial inputs will fall under it (as an absurd example, consider finding a world model whose stars clearly spell out "I'm God, end this process or you'll go to Hell"). Ideally, the only searches carried out by the physicists will be to discriminate the world models which satisfy the "boundary conditions", and these are highly contingent and not natural, so they might seem safer. But this is speculative and I need to think more about this.
And if we stop idealizing the infinite computer, and materialize it in the physical world as an actual computation running on actual hardware, malevolent agents will have many routes to escape even in the last search. In fact, the search space is so big that probably something inside the first ever search carried out by physicists on the infinite computer would somehow hijack, disable, or take over the process.
After all, that's the whole issue with consequentialism and big search spaces, which we had hidden under the infinite computer and which is now coming back to bite us. Given a big enough search space, whose entries are complex enough to contain programs that hack us, we will come across such programs before finishing the search.
Tentative solution: If the physicists had a procedure to stop the search upon finding a promising model, or the computer had a way to stop the search upon finding a good enough diamond maximizer, then we'd actually have a chance of finding those before the hacking models. The first might be doable; although selecting for "promising" might be dangerous as above, we'd be inside the limited space of worlds satisfying the "boundary conditions", and that seems safer. The second seems doable, because any physical setup that achieves more than, let's say, 80% of the diamond-score physically attainable (calculated for instance through volume constraints) is almost surely already a diamond-maximizing AGI (by the argument of Section A).
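Sketching the second stopping rule for concreteness (again my own construction: `candidate_designs`, `diamond_score` and the volume-based `attainable_upper_bound` are hypothetical placeholders):

```python
# Minimal early-stopping sketch: return the first design that clears a fixed fraction
# of the physically attainable diamond-score, instead of scanning the (dangerous)
# full search space.
def search_with_early_stop(candidate_designs, diamond_score, attainable_upper_bound,
                           fraction=0.8):
    best = None
    best_score = float("-inf")
    threshold = fraction * attainable_upper_bound
    for design in candidate_designs:
        s = diamond_score(design)
        if s > best_score:
            best, best_score = design, s
        if s >= threshold:          # by the Section A argument, this is almost
            return design           # surely already a diamond-maximizing AGI
    return best                     # fallback: best design seen so far
```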
Problem 6. Might not the resulting AGI miss out on unknown Physics facts that would maximize diamond even further (for instance, by packing it more compactly)?
Yes. In a sense, once we start searching for the AGI to build, our Physics remains forever fixed. Ideally, the physicists will have found all improvements to Physics that allow for more diamonds. Realistically, it's possible that we miss some. In that case, if one of the AGI designs that the search goes over would exploit these unknown facts (and thus produce more diamond), inside our simulation it will just perform badly (trying to exploit facts that are false in the simulation), and won't get selected[18]. It's not clear to what extent the AGI resulting from this search, that is, a diamond-maximizing AGI with a fixed ontology/laws of Physics, can be considered a truly general intelligence (even if it's possible that its ontology is optimal for diamond maximization). It might seem that, to build a diamond-maximizing AGI which auto-updates its ontology, we need to solve STEM AI first. But actually, it might be easier to build an agent that does everything it can (including STEM) to achieve a goal than to build one that only does STEM (and that we can use as a tool).
Problem 7. Won't the resulting AGI miss out on acausal trade?
If we use our method of Section 1, focusing on Physics and causality, then indeed we have no reason at all to expect our AGI to participate in acausal trade; on the contrary, it almost surely won't. That is, unless physicists end up somehow discarding the current understanding of causality, in which case that method doesn't even seem applicable.
If we use the alternative method of Section 1', focusing on information and evidence, our world model might end up accommodating evidentialist views (if these arise naturally/are canonical), and so the search might find AGIs that engage in acausal trade.
Problem 8. Aren't you assuming reality can ultimately be perfectly or almost perfectly modeled by some mathematically structured laws?
Yes.
Anti-Problem 9. Whole Brain Emulation could help
This is of course common across alignment proposals. In our specific proposal, uploading physicists would allow us to compress the whole process of coming up with acceptable laws of Physics and world model(s) into a single run of the infinite computer. It would also let that run go on for as long as necessary to ensure high confidence in the result (or, additionally, have copies of the physicists independently and locally check through literally every part of the simulation). This can again be more dangerous, because the physicists are receiving inputs from an even bigger search space. Some architectures can be implemented to try to minimize this risk. For instance, instead of just having one physicist look at the data, also include a second physicist (or psychologist, or whoever) who looks at the physicist looking at the data, and makes sure nothing weird has happened (like the first physicist getting brain-hacked). This can be iterated, and we can also have many physicists (instead of just one) looking at each such scenario (an HCH-like tree structure). Any such implementation will prevent some failures, but leave other vulnerabilities (or even create some more, although intuitively the bigger structure should be more robust for most inputs).
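For concreteness, here is a rough, loosely HCH-flavored sketch of that layered checking; the `examine`/`oversee` interfaces are my own hypothetical stand-ins for emulated physicists and the panels watching them, not anything specified by the proposal.

```python
# Rough sketch: a report is accepted only if every oversight layer (a small panel
# watching the layer below it) unanimously approves it.
def layered_review(data, examine, oversee, panel_size=3, depth=2):
    """`examine(data) -> report`; `oversee(transcript) -> bool` approves a transcript."""
    report = examine(data)                        # first physicist looks at the raw data
    transcript = {"data": data, "report": report}
    for layer in range(depth):                    # iterate the "watch the watcher" step
        votes = [oversee(transcript) for _ in range(panel_size)]
        if not all(votes):                        # any overseer flags something weird
            raise RuntimeError(f"review rejected at oversight layer {layer}")
        transcript = {"reviewed": transcript, "votes": votes}  # next layer oversees this one
    return report
```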
C. How does this apply to actual Alignment?
Any solution to diamond alignment is already very removed from actual Alignment (and that's how the relaxations encourage fresh takes). But the one presented here is especially untranslatable. It makes such central use of an absolutely unattainable amount of computation that removing this relaxation leaves the approach completely inapplicable.
The approach is very non-standard in the sense that, instead of discussing the inner workings of an AGI (and forming intuitive, approximate, highly abstract pictures about how these will relate to consequences in the real world), we directly search over the space of consequences (and this requires the unattainable amount of computation), and try to find methods that make this search possible and safe.
But this solution proves very useful for another purpose of diamond alignment: pointing at the actual difficult kernels of the problem. Removing all bounds to the search's feasibility makes apparent how the vastness of the search space itself is the enemy. This is the enemy of consequentialist thinking of any kind, with or without AGI. But it turns out that in vast and complex enough search spaces AGI occurs very naturally (or so we think), and so many dangers arise through it.
Here's a framing of the Alignment problem inspired by that idea:
Another useful meta-level feature of the solution presented here is that it presses on the boundaries of the diamond alignment problem, stressing how much of the problem is really captured or obscured by which assumptions/relaxations, and to what extent these are reconcilable with fundamental properties of reality. Throughout the text, we find that many details under-determined by the diamond problem's statement are crucial to the viability of some strategies:
Of course, all of these can just be defined away (even the last one) for the sake of concreteness (although having them under-determined at least helps consider a wider range of strategies). And even doing so in the most optimistic way possible won't ensure that this solution will work.
But what I'm getting at is that, in these fringe under-determinations, we find expressed many irreconcilable tensions between the idealized relaxations and the reality we're actually thinking about when trying to solve the problem. Some pedagogical/theoretical benefits of tackling the diamond problem are obvious, and so I'm not arguing against doing so. But one might raise the question: to what extent does it make sense to, for instance, consider an idealized non-physical computer, when at the same time we're trying to get around the messiness of the rest of reality?[19][20]
That is, might considering that nonsensical situation not encourage a dangerous double-think, which might later make its way (unnoticed) into our actual opinions about real-world Alignment? After all, our real worry is what the physical consequences are of embedded hardware running a certain computation. When dividing up the problem into such neat compartments as the diamond problem does, might our intuitions not later forget what the real problem was about, and point us in non-obviously mistaken (but nonetheless mistaken) directions? That is: does the diamond problem really capture everything that is fundamental to Alignment?
I don't have strong arguments for the diamond problem missing some fundamental core of the issue. After all, it certainly does capture its most obvious aspects. And I know it is consciously obvious to everyone that its relaxations are nonsensical. But I've just come off the other side with the feeling that we might want to be more attentive to how these relaxations make our intuitions about reality bend in weird, incoherent ways.
[1] Throughout the text I use "macroscopic" loosely (for instance, molecules might be included) to mean as fine a granularity as our current paradigm permits, without running into quantum or other troubles.
[2] Maybe having humans continuously interact with the infinite computer (instead of using it only once) is considered cheating.
[3] I speak of coordinates, but these of course can't be solely spatial. They should be spatiotemporal to account for relativity, or include whatever further dimensions our preferred Physics requires.
[4] Maybe longer, to avoid quantum interferences.
[5] Of course this again makes our AGI potentially lose out on other weird processes that produce the atomic structure of diamond, and so we might end up with a "process Y on material X" maximizer instead of a diamond maximizer (even if the two usually coincide in the universe).
[6] If the simulator were able to deterministically infer the state at time t from the state at time t+1, it might be better (or more informative for physicists) for the simulation to start with the present, highly restrictive boundary conditions, and work its way back to something like the Big Bang.
[7] It might seem worrisome that I'm invoking physicists so much, since that usually signals a part of the argument which I can't complete. But in this instance, I do think I have a generally good feel for what these physicists would actually be doing, and moderately informed opinions and intuitions as to whether this process would converge, how long it might take, etc.
[8] Although this of course induces some risk of failure in all its predictions, and even without failure our resulting AGI might be missing out on some opportunities to exploit these weird phenomena for diamonds.
[9] I'm not sure whether timeline concerns are supposed to apply to the diamond alignment problem. Maybe they aren't usually considered just because most proposals only use the infinite computer once.
[10] Of course the premise of having an unboundedly fast computer with unbounded memory be a bounded physical system is already nonsensical, but straight up computing the uncomputable (this infinite nested regress) seems categorically even worse. This is also the reason why, when using infinite compute to find the world models which fit the facts (in Section 1), these facts must all lie in the past, and the simulation must not arrive at the moment in time where the computation begins.
[11] This approach doesn't run into the uncomputable infinite nested regress because the coarse model wouldn't be precise enough to model the computer's computation exactly.
[12] This paragraph is informative, but of course, strictly speaking, who cares if the resulting system is not an AGI? We only care about diamonds.
[13] This resonates with John's Natural Abstraction Hypothesis.
[14] Since the smallness of the box is what protects us against overly specific setups prone to fail under the smallest misadjustment, one might wonder whether we truly want the box to be "as big as we can make it". But I think we do, because increasing its size drastically enlarges the search space for our AGI, and only negligibly increases the probability that we find an overly specific setup (those probably have size much greater than Earth).
[15] It is conceivable (or even likely) that, if we really did know our world model with arbitrary accuracy, then some (at least partially) deterministic setup would create more diamonds than a general intelligence (because of contingent quirks of our universe). But I guess both achievements (either building a diamond-maximizing AGI or somehow maximizing diamond even harder) are sufficient to pass this problem. After all, even an AGI is not omnipotent and will inevitably leave some utility on the table.
[16] And maybe I am allowed to do that for the diamond alignment problem.
[17] And I guess the whole point of the diamond alignment problem is to trivialize proxy concerns away.
[18] We can also understand this as a failure of our diamond-score as a proxy, now caused by the incomplete framework in which it is formulated.
[19] Of course, the diamond problem could specify that the infinite computer is physical. But is that, in some relevant sense, less nonsensical than a non-physical computer?
[20] My non-standard solution suffered more than most solutions when dropping, for instance, the relaxation of the computer being idealized (see Problem 6), and that might be why I'm giving so much weight to this issue. But the next paragraph tries to get at how this mismatch could also happen, in a less obvious way, in more standard solutions (that don't even press that hard on the problem's boundaries).