Fundamental question: What determines a mind's effects? — AI Alignment Forum

[Metadata: crossposted from https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html. First completed April 9, 2023.]

A mind has some effects on the world. What determines which effects a mind has? To eventually create minds that have large effects that we specify, this question has to first be answered.

Slippery questions

A slippery question is one that slips out of your grasp when you approach it looking for an answer. It runs away or withdraws; it cloaks itself in false expressions. It leads you astray, replacing itself with a similar-sounding question, maybe one that's easier to answer. It's hard to hug a slippery question. The question withdraws.

Or worse, it pretends to be made out of ungrounded abstract concepts, so that you're not sure anymore that there's even a real question there. This slipperiness is especially likely for questions that involve "big" things, such as minds.

The question might seem obvious and even tautological. Or it might seem simplistic, failing to carve reality at the joints. These appearances usually reflect the reality somewhat accurately, but they do not necessarily imply that there is no real question behind the appearances. A question that can be pointed at, but only imperfectly, inchoately, ostensively, or preliminarily, is at risk of being question-substituted or ignored.

The discomfort of not having an answer or even a clear question pushes people to run ahead of the question. When they make an error that would have been prevented by meditating on the question, even if they notice the error, they don't notice that they are in general not focusing on the question. If the question asks: There is something, it has to do with concrete things A, B, and C, and it is important——what is it? Then someone will analyze A and B in detail, gain some understanding, and declare the whole question——including C, and the inexplicit core question——solved.

This essay will ask a slippery question over and over in different words, hoping to arrive at a firmer grasp on the real questions. The starting point:

What determines the effects of a mind?

The word "effects"

Suppose there is a mind M. As a result of M existing, the cosmos (the entirety of the world, in any aspect or abstraction) is different from how it would have been if M hadn't existed. Those differences are the effects of M.

What determines the difference in the cosmos that results from the presence of the mind?

The effects of a mind depend on how strong the mind is. But the question wants to ask about the sort of effects, not their size:

What determines the direction of a mind's effects?

Minds pursue instrumental subgoals, which affect the world. These effects might matter, but they are subordinate to their supergoals. Any reversible effect might be reversed. The question asks about the final effects of the mind:

What determines the direction of a mind's ultimate effects on the cosmos?

"Direction" is only a metaphor. It suggests that there's something compactly understandable about the effects of the mind. The metaphor may not hold well. To nod in that direction, the question can at least be rephrased to suggest multi-dimensionality:

What determines the directions of a mind's ultimate effects on the cosmos?

The word "mind"

What does "mind" mean here? The choice of the word "mind" reflects the guess that things that have large effects will be integrated——will speak in internal languages, pass information, correct errors, and in general become coherent.

Rather than "mind", the question could refer to [things that have large effects], reading:

What determines the effects of a large-effect-haver?

But this is too broad. The cosmos as a whole is a large-effect-haver, in the sense that whatever happens, happens as a result of the cosmos being however it is. So the question wants to point at [things that have especially large effects given their size]:

What determines the effects of a dense-effect-haver?

But a powerful bomb is a dense-effect-haver, and it seems off-topic. How to be more specific? Maybe it's just not dense enough:

What determines the effects of a very-dense-effect-haver?

What about a hypothetical very dense bomb, or a device that could cause vacuum collapse? Maybe we could say, "the sort of very-dense-effect-haver that we're likely to encounter"? That is a reasonable question. But, "the sort of very-dense-effect-haver that we're likely to encounter" is not a natural category; it looks like {nations, AGIs, CAIS, superviruses, vacuum-collapse bombs, ...}.

So the question can be factored as, for any sort of thing X:

How likely is it that a given very-dense-effect-haver that we'll encounter will be an X?

and

What determines the effects of an X?

With X = "mind", "mind" is still ambiguous. It might indicate some specific assumptions, such as integratedness or probabilities or search. Or, it might be an ostensive term, defined for example by pointing at humanity and saying "whatever makes it possible for humanity to reshape the face of the Earth, that thing but more so, is what the question is about". If "mind" is defined ostensively like that, then reasoning about minds as though they, e.g., do search would rely on an additional proposition (that humanity reshapes the Earth by some route that involves search, or something). Or it could be ostensively defined as "where all this stuff that people call AI research is going". The question at hand is:

What determines the effects of something that has large effects by the same route that humanity reshapes the Earth?

The word "determines"

Causal determination

¿Is the answer to the question just: The entirety of the mind's building blocks (its neurons or transistors and so on) and its environment (the data it sees, the tasks it undertakes) is what determines the mind's effects.

Of course, that is a sufficient cause of the mind's effects. Maybe the question meant to ask about a minimal causal set? But if the mind, like most objects in our world, has the property that everything affects everything else, sometimes a lot, then the entirety of the mind and its world might have to be included in the minimal sufficient cause of the mind's effects.

In any case, this causal sort of answer is not what the question is asking for. The answer doesn't give any hint about how to specify the mind's effects, beyond "somehow change something about the mind's building blocks and environment". Being shown a complicated squiggly subset of the material substrate of a mind doesn't tell you what's going on with that squiggly subset and how it interfaces with the rest of the mind and how modifying it would alter the mind's effects.

What sort of answer would address the question? Maybe the question should be reexpressed like this:

How can a mind's effects be specified by an external specifier?

The trivial self-determination

¿What about the tautological answer: The mind's effects are determined by the mind's effects.

This answer actually does answer the question, in the case where it's possible to run the mind, observe its effects, evaluate whether those effects were the effects-to-be-specified, tweak the mind, run it again, and so on, until the mind's effects are the effects-to-be-specified. Or to put it another way, the question didn't need an answer in this case.

When full direct verification can't be done, the question needs an answer. And indeed full direct verification can't be done; it can't even be done for many pretty simple computer programs, such as a function that sorts lists, because there are too many cases to check. What to do in this case?
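The scale of the problem for the list-sorting example can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the essay: the function names `count_inputs`, `exhaustively_verify_sort`, and `buggy_sort` are hypothetical, and Python's built-in `sorted` stands in for a trusted reference.

```python
from itertools import product


def count_inputs(alphabet_size: int, max_len: int) -> int:
    """Count the distinct lists of length 0..max_len over an alphabet of the given size."""
    return sum(alphabet_size ** n for n in range(max_len + 1))


def exhaustively_verify_sort(sort_fn, alphabet_size: int, max_len: int) -> bool:
    """Compare sort_fn with the reference on every input in a (tiny) domain."""
    alphabet = range(alphabet_size)
    for n in range(max_len + 1):
        for xs in product(alphabet, repeat=n):
            if sort_fn(list(xs)) != sorted(xs):
                return False
    return True


def buggy_sort(xs):
    # Hypothetical bug: correct on short lists, wrong on anything longer than 4.
    return sorted(xs) if len(xs) <= 4 else list(xs)


# A tiny domain can be checked completely...
print(count_inputs(3, 4))                       # 121 cases
print(exhaustively_verify_sort(buggy_sort, 3, 4))  # passes: the bug is out of range
print(exhaustively_verify_sort(buggy_sort, 3, 5))  # fails once length-5 lists are included

# ...but lists of up to ten 64-bit integers already give more than 2**640 cases.
print(count_inputs(2 ** 64, 10) > 2 ** 640)
```

Passing every check on the small domain says nothing about the inputs that were never run, which is the situation the external specifier is in when full direct verification is unavailable.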

What the specifier has to do

The question asks what determines the mind's effects. Or, it asks what an external specifier has to do, to specify the mind's effects. The external specifier is brought into the question. If the external specifier can run unboundedly many experiments, the question doesn't need answering. If the external specifier cannot do so, then the question needs answering. The question depends on the external specifier.

How can we, ourselves, specify a mind's effects?

This is now a much more subjective or indexical question (that is: the answer depends on who's asking). Have we abandoned the asker-independence of

What determines a mind's effects?

The question asks:

given who we are——that is, given what we can see and recognize in the mind, given what we can understand, given what problems we can solve, given what we can manipulate and design, and so on;
and furthermore, given what the mind is——that is, given how the mind is structured, how the mind makes decisions and plans, how optimization power flows in the mind, what structures the mind exposes, what is explicit in the mind, and so on;
how can we specify the mind's effects?

This question depends on the asker, but it might depend on the asker less than it seems. The specifier is bidden to see structure in the mind, to understand and manipulate and design and problem-solve about structure in the mind. But these are instrumental goals, and they don't seem to be specific to the specifier's ultimate goals (or effects), or specific to the specifier's structure apart from the specifier's capabilities. Almost any specifier, trying to specify almost any effects for the mind to have, has many of the same challenges as almost any other specifier. Each specifier has its own deficiencies in understanding, but the task bids almost any specifier to have the same body of understanding to finally be able to specify the mind's effects.

(Or maybe not. It's at least a little not right: some effects are harder to specify than others. Also, it may be that there are many really distinct ways to exogenously specify the effects of a mind: one specifier alters the training data, one alters the prior over minds, another reaches into the mind and tweaks some machinery, another bolts on some other machinery... Each one of them solves different problems using different understanding.)

To the extent that the task of exogenously specifying a mind's effects doesn't depend on the specifier, it can be viewed as a property of the mind. So we can reformulate the question:

How can a mind's effects be specified?

But it's confusing to fold the activity of the specifier into the mind as a property. Better: the question can be again factored as:

What within a mind can specify the mind's effects?

and for any X within a mind that can specify the mind's effects,

How can an external specifier specify the effects of the mind via X?

Counterfactual determination

Having brought in an external specifier and then cordoned it off again, are we any closer to saying what "determines" is, i.e. what we want to know about the origins of a mind's effects? If causal explanations of a mind's effects are rejected, what else could there be?

Causal explanations aren't false; they're just (probably) irrelevant. They're at the wrong level of description. E.g. acausal bargaining isn't causality-violating; it just doesn't require causal interaction. It's something that happens at a different level of description from causal sensory-action loops. If someone asks "Why did you move your hand like that?", no amount of detailing which neurons were firing when will give them what they want. They want to hear something like "I was trying to make a snapping noise by whip-flopping my pointer finger against my middle finger." Detailing the neurons does tell them something (if they were superintelligent, they could extract the answer to their question from the neuron details), but the neuron details still wouldn't themselves constitute the answer.

Facts or properties (or in general any elements) of interest in a mind don't have to correspond to a nicely demarcated region of spacetime. Where, exactly, is "the addition" in your computer? The ALU? What if you're adding really big numbers——now the addition happens in RAM too. And we have to include the code as well. What if you're adding big arrays? So we include the GPU sometimes. Do we include the whole GPU, or just the metal parts? And so on. If the query "Here's a computer; please point to where or how the addition happens in it." has any point to it, responding by handing back the whole computer and saying "It happens here." does not address the point.

The question asks:

What within a mind can specify the mind's effects?

How is "specify" different from "cause"? It gestures at counterfactuals in general:

For what element X in the mind is it the case that, were X different, the mind would have different effects?

There are counterfactuals that aren't causal counterfactuals. The world has to proceed according to causality, but it isn't always best understood in terms of causality. To design something, minds deploy counterfactual reasoning that isn't especially causal, even if the design will be implemented through causal channels.

What within a mind can counterfactually determine the mind's effects?

The sort of counterfactual mentioned here needs more explication. It would include, for example, "what if the mind lets the specifier veto actions" or "what if the mind tried to manufacture a strawberry" or "what if the mind converges towards what such-and-such decision procedure would do".

Comprehensive determination

If everything at least a little bit influences everything else, then everything determines everything at least a little bit. ¿So is the answer just: Anything within a mind can counterfactually determine the mind's effects.

In general, mental elements tend to be noncomprehended: their implications aren't easily circumscribed in the understanding, aren't easily understood as cordoned off from everything else. An element that strongly determines the mind-external effects of a mind would be comprehensive in a sense, because it would screen off the detailed elements of the mind from the mind-external effects.
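One possible way to make "screen off" precise is the probabilistic sense of conditional independence. This is an assumed reading, not something the essay commits to, and the symbols are introduced here only for illustration: E stands for the mind-external effects, X for the candidate determining element, and D for the remaining detailed elements of the mind.

```latex
P(E \mid X, D) = P(E \mid X)
```

Read counterfactually rather than probabilistically, the analogous condition would be that, holding X fixed, varying D leaves E unchanged.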

Such an element would induce a convergence, where the mind's structure is put to the purpose of bringing about those effects, and not put to the purpose of bringing about other contradictory effects. It would act as a "locus of control". It would be a fixed element, stable under reflective self-modification——whereas something "hardwired" in the mind tends to eventually be modified. It would provide the ultimate criterion for the mind's intercourse with the world. Its influence wouldn't be overridden or eroded by other elements——whereas something that directly specifies some actions of the mind will tend to be overridden by a deeper organizing force in the mind, not being competent to defend itself in full generality against general strategicness.

So the question can be put as:

What within a mind can comprehensively counterfactually determine the mind's effects?

¿Can't the answer then just be: The whole of the mind can comprehensively counterfactually determine the mind's effects.

This trivial answer does not make anything available in the mind for an external specifier to specify the mind's effects.

Some elements determine the mind's effects a lot more than others. Those elements are a smaller, less unwieldy channel for the specifier to specify the mind's effects. They carry determination more densely, more comprehensively determining effects while having a smaller footprint in the mind.
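A toy model may make "determine a lot more than others" concrete. Everything below is a hypothetical illustration, not a claim about how real minds are structured: the "mind" has one goal element and three detail elements, the details set its starting state, and it then gets a few corrective steps toward the goal. The names `DOMAINS`, `STEPS`, `effect`, and `sensitivity` are assumptions made for this sketch.

```python
from itertools import product

# Hypothetical toy "mind": one goal element plus three detail elements.
DOMAINS = [
    (-2, 2),     # element 0: the goal the mind steers toward
    (-1, 0, 1),  # elements 1..3: low-level details that set the starting state
    (-1, 0, 1),
    (-1, 0, 1),
]
STEPS = 4  # number of corrective steps the mind gets


def effect(config):
    """Run the toy mind and return its final state, standing in for its effect."""
    goal, *details = config
    state = sum(details)
    for _ in range(STEPS):
        if state < goal:
            state += 1
        elif state > goal:
            state -= 1
    return state


def sensitivity(index):
    """Fraction of single-element counterfactual swaps that change the effect."""
    changed = total = 0
    for config in product(*DOMAINS):
        for alt in DOMAINS[index]:
            if alt == config[index]:
                continue
            counterfactual = config[:index] + (alt,) + config[index + 1:]
            total += 1
            changed += effect(config) != effect(counterfactual)
    return changed / total


for i in range(len(DOMAINS)):
    print(i, round(sensitivity(i), 3))
```

In this toy, swapping the goal changes the effect under every counterfactual, while swapping any single detail changes it only rarely. The sketch measures only how strongly each element determines the effect; it says nothing about whether such an element exists in a real mind or how large its footprint would be.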

What within a mind can densely counterfactually determine the mind's effects?

It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind's effects. Intuitions about "values" or "intents", and also studies of "decision theory", seem to point at an intuition that dense comprehensiveness is feasible or natural.

"In kind" determination

Has the question now been refined so that the specifier is appropriately held at arm's length, but included in the criterion for an effect-determiner, so that effect-determiners in a mind would provide a channel for the specifier to specify the mind's effects?

The question says that a determiner should be dense——that is, small and comprehensively determinative. What does "small" mean? "Wieldy" is a better term.

A determiner should be wieldable by the specifier. A determiner should speak or be spoken in a language that the specifier will come to speak and hear, a language that can be translated into the future specifier's language, maybe a universal language, maybe a language that demands that the specifier learn new ideas. A determiner should be of a type complementary to how the specifier can exert control, picking up the slack where the specifier doesn't know how to interface with the sprawling whole of the mind. Maybe a determiner should open up a broadband channel with the specifier, so that the specifier doesn't have to truncate the specifier's own struggle to decide between effects of different kinds——the determiner should be "in kind", it should open a channel that lets the specifier's own determiners determine the mind's effects using their own language, or a translation of their own language. A determiner should be the kind of thing a specifier wants to specify, that determines something about effects that a specifier wants to determine. (For example, {weights, neurons, transistors, probability distributions, control systems, ...} probably can't be determiners: they aren't the sort of thing that a human's caring-about-caring is about.)

So:

What wieldy-to-human-caring element of a mind can counterfactually determine the mind's effects?

Knowable determination

The mind should be such that the specifier can know that some element specifies the mind's effects.

What can knowably determine a mind's effects?

The word "a"

Manufacturable minds with specifiable effects

The question asks about a mind. Which mind? All minds? A given one? Really what's asked is to make a mind (a thing that greatly reshapes the world by the same route as humanity) that exposes ways to specify its effects. So nothing about the mind has to be fixed, except that it has large effects:

What is any manufacturable mind M, and possible way of exogenously specifying M's effects, such that the effects of M are specifiable in that way?

Any mind and any determination

Without first understanding how the effects of minds in general are determined, designing a mind to have specifiable effects is out of order. If the designer understands how to design the mind so that its effects are specifiable, then the designer also understands how the specification channel determines the mind's effects. So the question asks to understand:

What determines the effects of minds in general?

It may be that some minds have effects that are more clearly, densely determined than the effects of other minds. Less broadly, a relaxed version of the question asks:

What is any mind M, and possible way that M's effects might be densely determined, such that the effects of M are determined in that way?

Natural minds

The question about minds in general wants to ask about natural minds. Minds that are likely to exist, minds that are easy to design, minds of a species that takes up a large portion of mindspace, minds that have properties that are the default properties that minds have, minds that are shaped by the default forces present in minds, minds that don't have properties that are strongly in tension with each other, minds that aren't balanced on a razor's edge in some dimensions.

What determines the effects of natural minds in general?

Growable minds

If the specifier has already been brought into the question, ¿why not ask:

What specifies a mind's effects?

"Specifies" implies an external specifier. If the mind doesn't have an external specifier, if it's entirely autonomous, then this question doesn't have an answer. (Edit: this is incorrect: the mind can specify its own effects, and maybe even has to. Values have to be chosen: a mind starts off incoherent, so it will undergo conceptual revisions, and will have to choose how to care in the new context.)

If a mind is designed and manufactured and ongoingly governed——through and through, in every detail and aspect, from low-level to high-level——by an external specifier, then nothing within the mind ultimately determines the mind's effects. Such a mind is also probably nearly useless: it has no creativity, and the point of making a mind is to wield its understanding, planning, and in general its capacity to have effects. The easiest way to make a creative mind is to make processes that find novelty on their own.

If the need for autonomy is out of view, then the earlier question of manufacturable minds with specifiable effects gets a false answer: "Since the mind is manufactured, its effects are specified by the manufacturer." This is a shell game. If there's no method given for specifying the effects of an autonomous mind, then either the mind is not autonomous and is nearly useless, or the mind is autonomous and what ultimately determines its effects hasn't been specified. The shell game hides that at no point has creativity been comprehensively interfaced with effects.

So the question asks about the sort of mind that can grow, mostly under its own power, from a feasibly-manufactured core or seed. Such a mind, growing under its own power, is autonomous in that way. But despite the partial autonomy, which implies that much about the mind is not exogenously specified, the question still asks:

What determines the effects of a mostly-autonomously growing mind?

The word "What"

Elements

The word "What" seems to ask for a thing, an object. Answers to the question arguably should be Things in a general sense, but that doesn't imply that they are especially object-like, such as a box containing the utility function, or a homunculus. What determines the effects of a mind might be some other sort of element, such as a property or a tendency or a dynamic equilibrium or a supervening thing. For example: a mind might have a concentrated locus of control that continually hands off control to another concentrated locus; the mind might hold to a virtue; the mind might have self-reinforcing tendencies; the mind might act according to some criterion; the mind might follow parliamentary deliberation.

What about a mind, if different, would render different the mind's effects?

In general, the "what" asks for some mental element, without further preclusionary restriction:

What elements determine a mind's effects?

The sort of element

The question seems to ask for a specific determiner——a particular mind, and particular elements of that mind. But, that question can't be answered without having some preliminary pointer to what sort of element is asked for. So the question also asks:

What sort of element can determine a mind's effects?

Presumption of an answer

The word "What" seems to say: There is something which determines a mind's effects; what is it?

There may be many answers. There also may be no answers. Some minds may not have any elements that densely determine the mind's effects.

Revision

Several questions have been asked. A central instantiation of the fundamental question:

What sort of element can knowably, wieldily-to-human-caring, densely [make available for exogenous comprehensive specification] the [directions of the ultimate effects on the cosmos] had by a natural, autonomously-growing [entity which we will encounter that has large effects by the same route that humanity reshapes the Earth]?