formalizing the QACI alignment formal-goal

Tamsin Leake; JuliaHP

this work was done by Tamsin Leake and Julia Persson at Orthogonal.
thanks to mesaoptimizer for his help putting together this post.

what does the QACI plan for formal-goal alignment actually look like when formalized as math? in this post, we'll be presenting our current formalization, which we believe has most critical details filled in.

1. math constructs

in this first part, we'll be defining a collection of mathematical constructs which we'll be using in the rest of the post.

1.1. basic set theory

we'll be assuming basic set theory notation; in particular, $A \times B \times C$ is the set of tuples whose elements are respectively members of the sets $A$ , $B$ , and $C$ , and for $n \in N$ , $S^{n}$ is the set of tuples of $n$ elements, all members of $S$ .

$B = \{⊤, ⊥\}$ is the set of booleans and $N$ is the set of natural numbers including $0$ .

given a set $X$ , $P (X)$ will be the set of subsets of $X$ .

$\# S$ is the cardinality (number of distinct elements) of the set $S$ .

for some set $X$ and some complete ordering $< \in X^{2} \to B$ , ${min}_{<}$ and ${max}_{<}$ are two functions of type $P (X) ∖ \{\emptyset\} \to X$ finding the respective minimum and maximum element of non-empty sets when they exist, using $<$ as an ordering.

1.2. functions and programs

if $n \in N$ , then we'll denote $f \circ^{n}$ as repeated composition of $f$ : $f \circ \dots \circ f$ ( $n$ times), with $\circ$ being the composition operator: $(f \circ g) (x) = f (g (x))$ .
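as a minimal sketch of repeated composition, here is a small python helper. the name `compose_n` is ours, and treating $n = 0$ as the identity function is an assumption the text above does not state explicitly.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def compose_n(f: Callable[[T], T], n: int) -> Callable[[T], T]:
    """Return f composed with itself n times (assumption: n = 0 is the identity)."""
    def repeated(x: T) -> T:
        for _ in range(n):
            x = f(x)
        return x
    return repeated


def double(v: int) -> int:
    return 2 * v


print(compose_n(double, 3)(1))  # 8
```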

$λ x : X . B$ is an anonymous function defined over set $X$ , whose parameter $x$ is bound to its argument in its body $B$ when it is called.

$A \to B$ is the set of functions from $A$ to $B$ , with $\to$ being right-associative ( $A \to B \to C$ is $A \to (B \to C)$ ). if $f \in A \to B \to C$ , then $f (x) (y)$ is simply $f$ applied once to $x \in A$ , and then the resulting function of type $B \to C$ being applied to $y \in B$ . $A \to B$ is sometimes denoted $B^{A}$ in set theory.

$A \overset{H}{\to} B$ is the set of always-halting, always-succeeding, deterministic programs taking as input an $A$ and returning a $B$ .

given $f \in A \overset{H}{\to} B$ and $x \in A$ , $R (f, x) \in N ∖ \{0\}$ is the runtime duration of executing $f$ with input $x$ , measured in compute steps doing a constant amount of work each, such as turing machine updates.

1.3. sum notation

i'll be using a syntax for sums $\sum$ in which the sum iterates over all possible values for the variables listed above it, given that the constraints below it hold.

$\sum\limits_{\substack{y = x \bmod 2 \\ x \in \{1, 2, 3, 4\} \\ x \leq 2}}^{x, y} y = 1$

says "for any value of $x$ and $y$ where these three constraints hold, sum $y$ ".
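the example sum can be checked mechanically. in this sketch we let $y$ range over $0$ and $1$ , which is an assumption on our part (it is simply the set of values $x \bmod 2$ can take); the three constraints then do the filtering.

```python
# iterate over candidate values of x and y, keep those satisfying the constraints, sum y
total = sum(
    y
    for x in (1, 2, 3, 4)      # constraint: x in {1, 2, 3, 4}
    for y in (0, 1)            # assumption: candidate values for y
    if y == x % 2 and x <= 2   # constraints: y = x mod 2, x <= 2
)
print(total)  # 1
```

only $(x, y) = (1, 1)$ and $(2, 0)$ pass the constraints, hence the total of 1.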

1.4. distributions

for any countable set $X$ , the set of distributions over $X$ is defined as:

$Δ_{X} ≔ \{ f \mid f \in X \to [0; 1], \sum\limits_{x \in X}^{x} f (x) \leq 1 \}$

a function $f \in X \to [0; 1]$ is a distribution over $X$ (a member of $Δ_{X}$ ) if and only if its values summed over all of $X$ total at most 1. we call "mass" the scalar in $[0; 1]$ which a distribution assigns to any value. note that we do not require the masses of all elements in the domain to sum to 1, merely to at most 1; this means that different distributions can have different "total mass".

we define $Δ_{X}^{0} \in Δ_{X}$ as the empty distribution: $Δ_{X}^{0} (x) = 0$ .

we define $Δ_{X}^{1} \in X \to Δ_{X}$ as the distribution entirely concentrated on one element: $Δ_{X}^{1} (x) (y) = \begin{cases} 1 & \text{if } y = x \\ 0 & \text{if } y \neq x \end{cases}$

we define ${Normalize}_{X} \in Δ_{X} \to Δ_{X}$ which modifies a distribution to make it sum to 1 over all of its elements, except for empty distributions:

${Normalize}_{X} (δ) (x) ≔ \begin{cases} \frac{δ (x)}{\sum\limits_{y \in X}^{y} δ (y)} & \text{if } δ \neq Δ_{X}^{0} \\ 0 & \text{if } δ = Δ_{X}^{0} \end{cases}$

we define ${Uniform}_{X}$ as a distribution attributing equal value to every different element in a finite set $X$ , or the empty distribution if $X$ is infinite.

${Uniform}_{X} (x) ≔ \begin{cases} \frac{1}{\# X} & \text{if } \# X \in N \\ 0 & \text{if } \# X \notin N \end{cases}$

we define ${max}_{X}^{Δ} \in Δ_{X} \to P (X)$ as the function finding the elements of a distribution with the highest value:

${max}_{X}^{Δ} (δ) ≔ \{ x \mid x \in X, \forall x^{'} \in X : δ (x^{'}) \leq δ (x) \}$
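to make these definitions concrete, here is a minimal sketch over finite sets only, representing a distribution as a python dict from elements to exact fractions. the function names are ours, and the restriction to finite domains is an assumption made for the sake of having something runnable.

```python
from fractions import Fraction


def total_mass(dist):
    """Sum of the masses a distribution assigns; may be less than 1."""
    return sum(dist.values(), Fraction(0))


def is_distribution(dist):
    """Every mass lies in [0; 1] and the total mass is at most 1."""
    return all(0 <= m <= 1 for m in dist.values()) and total_mass(dist) <= 1


def point(domain, x):
    """Distribution entirely concentrated on x."""
    return {y: Fraction(int(y == x)) for y in domain}


def normalize(dist):
    """Rescale to total mass 1, leaving the empty distribution unchanged."""
    mass = total_mass(dist)
    if mass == 0:
        return dict(dist)
    return {x: m / mass for x, m in dist.items()}


def uniform(domain):
    """Equal mass on every element of a finite domain."""
    return {x: Fraction(1, len(domain)) for x in domain}


def argmax_set(dist):
    """Set of elements carrying the highest mass."""
    top = max(dist.values())
    return {x for x, m in dist.items() if m == top}


d = {"a": Fraction(1, 10), "b": Fraction(3, 10), "c": Fraction(0)}
print(total_mass(d))   # 2/5
print(normalize(d))
print(argmax_set(d))   # {'b'}
```

note that `d` is a valid distribution even though its total mass is only 2/5; normalizing it gives masses 1/4, 3/4, and 0.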

1.5. constrained mass

given distributions, we will define a notation which i'll call "constrained mass".

it is defined as a syntactic structure that turns into a sum:

$\mathop{M}\limits_{\substack{x_{1} : X_{1} \\ \vdots \\ x_{n} : X_{n} \\ C_{1} \\ \vdots \\ C_{m}}}^{v_{1}, \dots, v_{p}} [V] ≔ \sum\limits_{\substack{x_{1} \in domain (X_{1}) \\ \vdots \\ x_{n} \in domain (X_{n}) \\ C_{1} \\ \vdots \\ C_{m}}}^{v_{1}, \dots, v_{p}} X_{1} (x_{1}) \cdot \dots \cdot X_{n} (x_{n}) \cdot V$

in which variables $x$ are sampled from their respective distributions $X$ , such that each instance of $V$ is multiplied by $X (x)$ for each $x$ . constraints $C$ and iterated variables $v$ are kept as-is.

it is intended to weigh its expression body $V$ by various sets of assignments of values to the variables $v$ , weighed by how much mass the $X$ distributions return and filtered for when the $C$ constraints hold.

to take a fairly abstract but fully calculable example,

$\begin{aligned}
\mathop{M}\limits_{\substack{x : λ n : \{1, 2, 3\} . \frac{n}{10} \\ f : {Uniform}_{\{min, max\}} \\ x \bmod 2 \neq 0}}^{x, f} [f (x, 2)]
& ≔ \sum\limits_{\substack{x \in domain (λ n : \{1, 2, 3\} . \frac{n}{10}) \\ f \in domain ({Uniform}_{\{min, max\}}) \\ x \bmod 2 \neq 0}}^{x, f} (λ n : \{1, 2, 3\} . \frac{n}{10}) (x) \cdot {Uniform}_{\{min, max\}} (f) \cdot f (x, 2) \\
& = \sum\limits_{\substack{x \in \{1, 2, 3\} \\ f \in \{min, max\} \\ x \bmod 2 \neq 0}}^{x, f} \frac{x}{10} \cdot \frac{1}{2} \cdot f (x, 2) \\
& = \frac{1 \cdot min (1, 2)}{10 \cdot 2} + \frac{3 \cdot min (3, 2)}{10 \cdot 2} + \frac{1 \cdot max (1, 2)}{10 \cdot 2} + \frac{3 \cdot max (3, 2)}{10 \cdot 2} \\
& = \frac{1 \cdot 1 + 3 \cdot 2 + 1 \cdot 2 + 3 \cdot 3}{20} = \frac{1 + 6 + 2 + 9}{20} = \frac{18}{20} = \frac{9}{10}
\end{aligned}$
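the same worked example can be reproduced in a few lines of python. this is only a sketch: the distributions are written out as dicts (our representation, not part of the formalism), with python's built-in `min` and `max` standing in for the two sampled functions.

```python
from fractions import Fraction

# x is sampled from (lambda n : {1, 2, 3} . n / 10)
x_dist = {n: Fraction(n, 10) for n in (1, 2, 3)}
# f is sampled from Uniform over {min, max}
f_dist = {min: Fraction(1, 2), max: Fraction(1, 2)}

# each term is weighed by the mass of every sampled variable,
# and only assignments satisfying the constraint x mod 2 != 0 are kept
mass = sum(
    x_dist[x] * f_dist[f] * f(x, 2)
    for x in x_dist
    for f in f_dist
    if x % 2 != 0
)
print(mass)  # 9/10
```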

in this syntax, the iterated variables may be bound below $M$ by an arbitrary number of logical constraints or new variable bindings, in addition to the variables being sampled from distributions.

1.6. bitstrings

$B^{*}$ is the set of finite bitstrings.

bitstrings can be compared using the lexicographic order $<_{B^{*}}$ , and concatenated using the $∥$ operator. for a bitstring $x \in B^{*}$ , $| x | \in N$ is its length in number of bits.

for any countable set $X$ , ${Encode}_{X} \in X \to B^{*}$ will be some reasonable function to convert values to bitstrings, such that $\forall (x, y) \in X^{2} : {Encode}_{X} (x) = {Encode}_{X} (y) \Leftrightarrow x = y$ . "reasonable" entails constraints such as:

it can be computed efficiently.
it can be inverted efficiently and unambiguously.
its output's size is somewhat proportional to the actual amount of information. for example, integers are encoded in binary, not unary.

1.7. cryptography

we posit $σ ≔ B^{\bar{σ}}$ , the set of "signatures", sufficiently large bitstrings for cryptographic and uniqueness purposes, with their length defined as $\bar{σ} = 2^{31}$ for now. this feels to me like it should be enough, and if it isn't then something is fundamentally wrong with the whole scheme, such that no manageable larger size would do either.

we posit a function $ExpensiveHash \in B^{*} \overset{H}{\to} σ$ , to generate fixed-sized strings from seed bitstrings, which must satisfy the following:

it must be too expensive for the AI to compute in any way (including through superintelligently clever tricks), but cheap enough that we can compute it outside of the AI — for example, it could require quantum computation, and the AI could be restricted to classical computers
it should take longer to compute (again, in any way) than the expected correct versions of $Loc$ 's $f, g$ functions (as will be defined later) could afford to run
it should tend to be collision-resistant

at some point, we might come up with more formal ways to define $ExpensiveHash$ in a way that checks that it isn't being computed inside $Loc$ 's $f, g$ functions, nor inside the AI.

1.8. text and math evaluation

for any countable set $X$ , we'll be assuming ${EvalMath}_{X} \in B^{*} \to \{ \{x\} \mid x \in X \} \cup \{\emptyset\}$ to interpret a piece of text as a piece of math in some formal language, evaluating to either:

a set of just one element of $X$ , if the math parses and evaluates properly to an element of $X$
an empty set otherwise

for example,

$\begin{matrix} {EvalMath}_{N} ("1+2") = \{3\} \\ {EvalMath}_{N} ("hello") = \emptyset \end{matrix}$
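to illustrate the "singleton or empty set" return convention, here is a toy stand-in for ${EvalMath}_{N}$ . it is a sketch under strong assumptions: the "formal language" is just non-negative integer literals combined with `+` and `*` , parsed with python's `ast` module, which is far narrower than whatever language the real $EvalMath$ would use. the name `eval_math_nat` is ours.

```python
import ast
import operator

# assumption: the toy language only knows addition and multiplication
_OPS = {ast.Add: operator.add, ast.Mult: operator.mul}


def _eval(node):
    """Evaluate a parsed expression, rejecting anything outside the toy language."""
    if isinstance(node, ast.Constant) and type(node.value) is int and node.value >= 0:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def eval_math_nat(text: str) -> set:
    """Return {value} if text parses and evaluates to a natural number, else set()."""
    try:
        tree = ast.parse(text, mode="eval")
        return {_eval(tree.body)}
    except (SyntaxError, ValueError):
        return set()


print(eval_math_nat("1+2"))    # {3}
print(eval_math_nat("hello"))  # set()
```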

1.9. kolmogorov simplicity

for any countable sets $X$ and $P$ :

$K_{X}^{-} \in Δ_{X}$ is some "kolmogorov simplicity" distribution over set $X$ which has the properties of never assigning 0, and summing/converging to 1 over all of $X$ . it must satisfy $\forall x \in X : K_{X}^{-} (x) > 0$ and $\sum\limits_{x \in X}^{x} K_{X}^{-} (x) = 1$ .

$K^{-}$ is expected to give more mass to simpler elements, in an information-theoretic sense.

notably, it is expected to "deduplicate" information that appears in multiple parts of a same mathematical object, such that even if $x \in B^{*}$ holds lots of information, $K_{B^{*}}^{-} (x)$ is not much higher (higher simplicity, i.e. lower complexity) than $K_{B^{*} \times B^{*}}^{-} (x, x)$ .

we could define $K_{X}^{-}$ similarly to cross-entropy, with some universal turing machine $UTM \in B^{*} \times N \to B^{*}$ returning the state of its tape after a certain number of compute steps:

$K_{X}^{-} ≔ {Normalize}_{X} (λ x : X . \sum\limits_{\substack{i \in B^{*} \\ n \in N \\ UTM (i, n) = {Encode}_{X} (x)}}^{i, n} \frac{1}{(2^{| i |} \cdot (n + 1))^{2}})$

kolmogorov simplicity over $X$ with a prior from $P$ , of type $K_{P, X}^{- \sim} : P \to Δ_{X}$ , allows elements it samples over to share information with a prior piece of information in $P$ . it is defined as $K_{P, X}^{- \sim} (p) ≔ {Normalize}_{X} (λ x : X . K_{P \times X}^{-} (p, x))$ .

2. physics

in this section we posit some formalisms for modeling world-states, and sketch out an implementation for them.

2.1. general physics

we will posit some countable set $Ω$ of world-states, and a distribution $Ω_{α} \in Δ_{Ω}$ of possible initial world-states.

we'll also posit a function $Ω_{α}^{\to} \in Ω \to Δ_{Ω}$ which produces a distribution of future world-states for any specific world-state in the universe starting at $α$ .

given an initial world-state $α \in Ω$ , we'll call $Ω_{α}^{\to} (α)$ the "universe" that it gives rise to. it must be the case that $\sum\limits_{ω \in Ω}^{ω} Ω_{α}^{\to} (α) (ω) = 1$ .

when $α$ describes the start of a quantum universe, individual world-states $Ω$ following it by $Ω_{α}^{\to}$ would be expected to correspond to many-worlds everett branches.

for concreteness's sake, we could posit $Ω \subset B^{*}$ , though note that $α$ is expected to not just hold information about the initial state of the universe, but also about how it is computed forwards.

given a particular $α \in Ω$ , we define ${SimilarPasts}_{α} \in Ω \times Ω \to [0; 1]$ , which measures how much two world-states have past world-states $ω_{past}$ in common:

${SimilarPasts}_{α} (ω_{2}, ω_{2}^{'}) ≔ \mathop{M}\limits_{ω_{1} : Ω_{α}^{\to} (α)}^{ω_{1}} [Ω_{α}^{\to} (ω_{1}) (ω_{2}) \cdot Ω_{α}^{\to} (ω_{1}) (ω_{2}^{'})]$
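as a sketch of how this sum behaves, consider a tiny hand-made universe. everything in the `FUTURE` table below is hypothetical: it stands in for $Ω_{α}^{\to}$ on a handful of named world-states, with numbers picked only so the arithmetic is easy to follow.

```python
from fractions import Fraction as F

# hypothetical toy universe: FUTURE[w][w2] stands in for the mass
# that the future-distribution of w assigns to w2
FUTURE = {
    "alpha": {"a": F(1, 2), "b": F(1, 2)},
    "a": {"a1": F(1, 2), "a2": F(1, 2)},
    "b": {"b1": F(1)},
}


def future_mass(w1, w2):
    """Mass assigned to w2 by the future-distribution of w1 (0 if unlisted)."""
    return FUTURE.get(w1, {}).get(w2, F(0))


def similar_pasts(alpha, w2, w2_prime):
    """Sum, over candidate pasts w1 weighed by their mass in alpha's universe,
    of how much both w2 and w2_prime lie in the future of w1."""
    return sum(
        (mass * future_mass(w1, w2) * future_mass(w1, w2_prime)
         for w1, mass in FUTURE[alpha].items()),
        F(0),
    )


print(similar_pasts("alpha", "a1", "a2"))  # 1/8
print(similar_pasts("alpha", "a1", "b1"))  # 0
```

the two world-states descending from the same branch `a` share a past and get non-zero similarity, while world-states on different branches get none.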

2.2. quantum turing machines

we will sketch out here a proposal for $Ω$ , $Ω_{α}$ , and $Ω_{α}^{\to}$ such that our world-state $ω$ has hopefully non-exponentially-small $Ω_{α}^{\to} (α) (ω)$ .

the basis for this will be a universal quantum turing machine. we will posit:

$Tape ≔ \{ s \mid s \in P (Z), \# s \in N \}$ the set of turing machine tapes, as finite (thanks to $\# s \in N$ ) sets of integers representing positions in the tape holding a 1 rather than a 0.
$State$ some finite ( $\# State \in N$ ) set of states, and some ${state}_{0} \in State$ .
$Ω ≔ Tape \times State \times Z$ : world-states consist of a tape, state, and machine head index.
$Δ_{Ω}^{q} ≔ \{ f \mid f \in Ω \to C, \sum\limits_{ω \in Ω}^{ω} {∥ f (ω) ∥}^{2} = 1 \}$ the set of "quantum distributions" over world-states
$Step \in Δ_{Ω}^{q} \to Δ_{Ω}^{q}$ the "time step" operator running some universal turing machine's transition matrix to turn one quantum distribution of world-states into another

we'll also define $Δ_{N}^{2} \in Δ_{N}$ as the "quadratic realityfluid distribution" which assigns diminishing quantities to natural numbers, but only quadratically diminishing: $Δ_{N}^{2} (n) ≔ {Normalize}_{N} (\frac{1}{(n + 1)^{2}})$

we can then define $Ω^{\to}$ as repeated applications of $Step$ , with quadratically diminishing realityfluid:

$Ω_{α}^{\to} (ω_{1}) (ω_{2}) ≔ c \cdot \mathop{M}\limits_{\substack{n_{1} : Δ_{N}^{2} \\ n_{2} : Δ_{N}^{2} \\ s (n, ω) = {∥ Step \circ^{n} (Δ_{Ω}^{1} (α)) (ω) ∥}^{2}}}^{n_{1}, n_{2}, s} [s (n_{1}, ω_{1}) \cdot s (n_{1} + n_{2}, ω_{2})]$

where the constant $c$ is whatever scalar it needs to be for $\sum\limits_{ω \in Ω}^{ω} Ω_{α}^{\to} (α) (ω) = 1$ to be satisfied.

this implementation of $Ω_{α}^{\to}$ measures how much $ω_{2}$ is in the future of $ω_{1}$ by finding paths from $α$ to $ω_{1}$ , and then longer paths from $α$ to $ω_{2}$ .

and finally, we define $Ω_{α}$ as a distribution giving non-zero value to world-states $(t, {state}_{0}, 0)$ where $t$ is a tape where no negative-index cells are set to 1.

$Ω_{α} (t, s, i) ≔ \begin{cases} Δ_{N}^{2} (\sum\limits_{n \in t}^{n} 2^{n}) & \text{if } s = {state}_{0}, i = 0, t \subset N \\ 0 & \text{otherwise} \end{cases}$
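here is a small numeric sketch of $Δ_{N}^{2}$ and $Ω_{α}$ . it relies on the standard fact that $\sum_{k \geq 1} \frac{1}{k^{2}} = \frac{π^{2}}{6}$ , so normalizing $\frac{1}{(n + 1)^{2}}$ over the naturals gives $\frac{6}{π^{2} (n + 1)^{2}}$ . the function names, the use of floats, and the `"s0"` label standing in for ${state}_{0}$ are our assumptions.

```python
import math


def quadratic_mass(n: int) -> float:
    """Normalized quadratic mass of n: 6 / (pi^2 * (n + 1)^2)."""
    return 6.0 / (math.pi ** 2 * (n + 1) ** 2)


def tape_index(tape: frozenset) -> int:
    """Read a tape (set of positions holding a 1) as a natural number: sum of 2^n."""
    return sum(2 ** n for n in tape)


def initial_mass(tape: frozenset, state: str, head: int, state0: str = "s0") -> float:
    """Mass of an initial world-state: non-zero only when the machine is in its
    start state, the head is at 0, and no negative-index cell holds a 1."""
    if state == state0 and head == 0 and all(n >= 0 for n in tape):
        return quadratic_mass(tape_index(tape))
    return 0.0


print(tape_index(frozenset({0, 2})))             # 5
print(initial_mass(frozenset({0, 2}), "s0", 0))
print(initial_mass(frozenset({-1}), "s0", 0))    # 0.0
```

since every natural number corresponds to exactly one finite tape with non-negative positions, the masses of valid initial world-states sum to 1.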

because we selected a universal (quantum) turing machine, there is at least one input tape implementing any single quantum algorithm, including the quantum algorithm implementing our physics.

3. implementing QACI

finally, we get into the core mechanisms of QACI.

the core idea of QACI is "blob location": mathematically formalizing the idea of locating our world and locating bitstrings (which i'll call "blobs") stored on computers within that world, out of the space of all possible computational universes, by sampling over functions which extract those blobs from world-states in $Ω$ and functions which can produce a counterfactual world where that blob has been replaced with another blob of the same length (in number of bits).

3.1. question blob and observation

throughout these functions, we will posit the following constants:

the initial factual question blob $q \in B^{*}$
two "observation" blobs $μ_{1} \in B^{*}$ and $μ_{2} \in B^{*}$

$μ_{1}, μ_{2}$ are variables which will be passed around, called "observations". in normal AI agent framings, an AI would have a history of actions and observations, and decide on its next action based on that; but, in the one-shot framing we use, there is only a single action and a fixed set of observations. the observations, in practice, will be very large pieces of data helping the AI locate itself in the multiverse of all possible computations, as well as get a better idea of how and where it is being run. we will likely include in them things like:

a full explanation of the QACI alignment plan, including the math
the AI's code
a dump of wikipedia and other large parts of the internet
a copy of some LLM

$μ_{1}$ will be produced before the question blob is generated, and $μ_{2}$ will be produced after the question blob is generated but before the AI is launched.

3.2. overview

the overall shape of what we're doing can be seen on the illustration below: we begin at the start of the universe $α$ , and use four blob locations and a counterfactual blob function call to locate five other world-states. the illustration shows distributions of future and past world-states, as well as a particular sampling for all four blob locations.

we sample $ω_{μ_{1}}$ using $Loc (α, Ω_{α}^{\to} (α), μ_{1}, ξ)$ , world-states containing the first observation $μ_{1}$
we sample $ω_{μ_{2}}$ using $Loc (α, Ω_{α}^{\to} (ω_{μ_{1}}), μ_{2}, ξ)$ , world-states containing the second observation $μ_{2}$
we sample $ω_{q}$ using $Loc (α, Ω_{α}^{\to} (ω_{μ_{1}}), q, ξ)$ , world-states containing the question blob $q$ , but requiring that its world-state $ω_{q}$ precede the world-state $ω_{μ_{2}}$
we get $ω_{q}^{'}$ , the world-state with a counterfactual question blob, using blob location $γ_{q}$ found by sampling $ω_{q}$
we sample $ω_{r}$ using $Loc (α, Ω_{α}^{\to} (ω_{q}^{'}), r, ξ)$ , possible world-states containing an answer to a given counterfactual question $q^{'}$

the location path from $ω_{q}'$ to $ω_{r}$ is used to run QACI intervals, where counterfactual questions $q^{'}$ are inserted and answers $r$ are located in their future.

(we could also build fancier schemes where we locate the AI's returned action, or its code running over time, in order to "tie more tightly" the blob locations to the AI — but it is not clear that this helps much with blob location failure modes i'm concerned about.)

for the moment, we merely rely on $μ_{1}$ and $μ_{2}$ being uniquely identifying enough. implementing them as static bitstrings might suffice, but perhaps they could instead be implemented as lazily evaluated associative maps: when the AI tries to access members of those maps, code which computes or fetches information from the world (such as from the internet) would be executed to determine the contents of that part of the observation object. this way, the observation would be conceptualized as a static object to the AI (and indeed it wouldn't be able to observe any mutations), but it'd be able to observe arbitrary amounts of the world, not just amounts we'd have previously downloaded.

we could make the QACI return not a scoring over actions but a proper utility function, but this only constrains the AI's action space and doesn't look like it helps in any way, including making QACI easier for the AI to make good guesses about. perhaps with utility functions we find a way to make the AI go "ah, well i'm not able to steer much future in world-states where i'm in hijacked sims", but it's not clear how or even that this helps much. so for now, the math focuses on this simple case of returning an action-scoring function.

3.3. blob location

for any blob length (in bits) $n \in N$ :

first, we'll posit $Γ_{n} ≔ B^{n} \to Ω$ the set of blob locations; they're identified by a counterfactual blob location function, which takes any counterfactual blob and returns the world-state in which a factual blob has been replaced with that counterfactual blob.

${Loc}_{n} \in Ω \times Δ_{Ω} \times B^{n} \times Ξ \to Δ_{Γ_{n}}$ tries to locate an individual blob $b$ (as a bitstring of length $n$ ) in a particular world-state sampled from the time-distribution (past or future) $δ$ (which will usually be a distribution returned by $Ω_{α}^{\to}$ ) within the universe starting at $α$ .

it returns a distribution over counterfactual insertion functions of type $B^{n} \to Ω$ which take a counterfactual blob and return the matching counterfactual world-state. the elements in that distribution typically sum up to much less than 1; the total amount they sum up to corresponds to how much $Loc$ finds the given blob in the given world-state to begin with; thus, sampling from a distribution returned by $Loc$ in a constrained mass calculation $M$ is useful even if said result is not used, because of its multiplying factor.

note that the returned counterfactual insertion function can be used to locate the factual world-state — simply give it the factual blob as input.

$Ξ$ is some countably infinite set of arbitrary pieces of information which each call to $Loc$ can use internally — the goal of this is for multiple different calls to $Loc$ to be able to share some prior information, while only being penalized by $K^{-}$ for it once. for example, an element of $Ξ$ might describe how to extract the contents of a specific laptop's memory from physics, and individual $Loc$ calls only need to specify the date and the memory range. for concreteness, we can posit $Ξ ≔ B^{*}$ , the set of finite bitstrings.

${Loc}_{n} (α, δ, b, ξ) (γ) ≔ \mathop{M}\limits_{\substack{(f, g) : K_{Ξ, (Ω \overset{H}{\to} B^{n} \times B^{*}) \times (B^{n} \times B^{*} \overset{H}{\to} Ω)}^{- \sim} (ξ) \\ ω : λ ω : {max}_{X}^{Δ} (λ ω : Ω . \begin{cases} δ (ω) & \text{if } f (ω) = (b, τ) \\ 0 & \text{otherwise} \end{cases}) . δ (ω) \\ b^{'} : {Uniform}_{B^{n}} \\ \forall b^{''} \in B^{n} : \begin{cases} γ (b^{''}) = g (b^{''}, τ) \\ f (γ (b^{''})) = (b^{''}, τ) \end{cases}}}^{f, g, ω, b^{'}, τ} \left[ \frac{{SimilarPasts}_{α} (ω, g (b^{'}, τ))}{R (g, (b^{'}, τ)) + R (f, g (b^{'}, τ))} \right]$

$Loc$ works by sampling a pair of functions $f, g$ , which convert world-states forth and back into {pairs whose first element is the blob and whose second element represents everything in the world-state except the blob}.

that latter piece of information is called $τ$ (tau), and rather than being sampled $τ$ is defined by the return value of $f$ on the original world-state — notably, $τ$ is not penalized for being arbitrarily large, though $f$ and $g$ are penalized for their compute time.

for a given fixed pair of $f$ and $g$ , $Loc$ finds the set of hypothesis world-states $ω$ with the highest value within the time-distribution $δ$ for which $f, g$ work as intended. this is intended to select the "closest in time" world-states in $δ$ , to avoid adversarial attackers generating their own factual blobs and capturing our location.

it then weighs locations using, for every counterfactual blob $b^{'} \in B^{n}$ :

the degree to which counterfactual world-states tend to share pasts with the original factual world-state, for $b^{'}$ .
the compute time of $g$ and $f$ on counterfactual blobs and world-states respectively.

note that ${Loc}_{n}$ , by design, only supports counterfactual blobs whose length $n$ is equal to the length of the initial factual blob $b$ ; it wouldn't really make sense to talk about "replacing bits" if the number of bits is different.

in effect, $Loc$ takes random $f, g$ decoding and re-encoding programs, measures how complex and expensive they are and how far from our desired distributions are world-states in which they work, and how close to the factual world-state their counterfactual world-states are.

3.4. blob signing

we'll define $Π ≔ B^{| q | - \bar{σ}}$ , the set of possible answer bitstring payloads.

counterfactual questions will not be signed, and thus will be the set of bitstrings of the same length as the factual question — $B^{| q |}$ .

we'll define $Sign \in Π \times B^{*} \to B^{| q |}$ as $Sign (π, k) ≔ ExpensiveHash (π ∥ k) ∥ π$ . this function tags blob payloads using a "signature" generated from a seed bitstring, concatenating it to the blob payload.
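the shape of $Sign$ can be sketched as follows, with bitstrings represented as python strings of `0` and `1` characters. loudly: SHA-256 is used here purely as a placeholder so the example runs. it is cheap to compute and therefore does not satisfy the requirements placed on $ExpensiveHash$ , and its 256-bit output is far shorter than the signature length posited earlier. the names `expensive_hash_stub` and `sign` are ours.

```python
import hashlib

SIG_BITS = 256  # placeholder signature length, much smaller than the one posited in the text


def expensive_hash_stub(seed_bits: str) -> str:
    """Placeholder for ExpensiveHash: maps a bitstring to a fixed-size bitstring.
    Assumption: SHA-256 over the ASCII text of the bitstring, NOT an expensive hash."""
    digest = hashlib.sha256(seed_bits.encode("ascii")).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def sign(payload: str, key: str) -> str:
    """Sign(payload, key) = hash(payload || key) || payload."""
    return expensive_hash_stub(payload + key) + payload


signed = sign("1011", "01")
print(len(signed))       # 260
print(signed[SIG_BITS:])  # 1011
```

the signed blob is always exactly the signature length longer than the payload, which is why payloads are taken to be shorter than the question blob by that amount.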

3.5. action-scoring functions

we will posit $A \subset B^{*}$ as the finite set of actions the AI can take, as a finite set of bitstrings.

we'll call $U ≔ A \to [0; 1]$ the set of "scoring functions" over actions — functions which "have an opinion" about various actions. this is similar to utility functions, except it's over actions rather than over worlds or world-histories.

they can be composed using $Compose \in Δ_{U} \to U$ , which could be simple scoring-function averaging:

$Compose (δ) (a) ? ≔ \mathop{M}\limits_{u : δ}^{u} [u (a)]$

but alternatively, we could use something like Diffractor's Rose bargaining to reduce the ability for scoring/utility functions to threaten each other — and notably ours.

$Compose ? ≔ Rose$

(where i'm using $? ≔$ to mean "maybe define this way, but i'm not sure")
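the simple averaging version of $Compose$ can be sketched like this, over a finite list of scoring functions paired with their masses. the representation of the distribution as a list of pairs, and the two example scoring functions, are hypothetical; the Rose bargaining alternative is not sketched here.

```python
from fractions import Fraction as F


def compose(dist):
    """dist: list of (scoring function, mass) pairs with total mass at most 1.
    Returns the mass-weighted sum of the scoring functions."""
    def composed(action):
        return sum((mass * u(action) for u, mass in dist), F(0))
    return composed


# two hypothetical scoring functions over a toy action set
def cautious(action):
    return F(1) if action == "wait" else F(1, 4)


def bold(action):
    return F(1) if action == "act" else F(0)


score = compose([(cautious, F(1, 2)), (bold, F(1, 4))])
print(score("wait"))  # 1/2
print(score("act"))   # 3/8
```

because the total mass is at most 1 and each score lies in $[0; 1]$ , the composed score also stays within $[0; 1]$ .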

3.6. QACI query

using those, we define $QACI \in Ω \times Γ_{| q |} \times Ξ \times B^{| q |} \to Δ_{Π}$ which, given a physics hypothesis $α$ , a question blob location $γ_{q}$ , and a blob location prior $ξ$ , returns a distribution over guesses for the answer payload $π_{r}$ returned for a given counterfactual question $q^{'}$ .

$QACI (α, γ_{q}, ξ, q^{'}) (π_{r}) ≔ \mathop{M}\limits_{γ_{r} : {Loc}_{| q |} (α, Ω_{α}^{\to} (γ_{q} (q^{'})), Sign (π_{r}, q^{'}), ξ)}^{γ_{r}} [1]$

$QACI$ works by sampling answer blob locations $γ_{r}$ , from world-states in the future of the counterfactual question world-state $γ_{q} (q^{'})$ , signed using $q^{'}$ .

with its first three parameters fixed, $QACI$ becomes the straightforward counterfactual query function $B^{| q |} \to Δ_{Π}$ : one can call it with arbitrary counterfactual text inputs (within the size limitation), and get a distribution over possible answers, which can easily be collapsed using ${max}_{Π}^{Δ}$ .

3.7. top-level QACI call

the top-level call to the $QACI$ query function, ${QACI}_{0} \in Ω \times Γ_{| q |} \times Ξ \to Δ_{U}$ interprets its output as a piece of math and executes it with, as parameters, various global and contextual values it might need access to, and returns a distribution over action-scoring functions:

${QACI}_{0} (α, γ_{q}, ξ) (u) ≔ \mathop{M}\limits_{\substack{π_{r} : QACI (α, γ_{q}, ξ, q_{0}^{'}) \\ f \in {EvalMath}_{\{q\} \times \{μ_{1}\} \times \{μ_{2}\} \times Ω \times Γ_{| q |} \times Ξ \to U} (π_{r}) \\ f (q, μ_{1}, μ_{2}, α, γ_{q}, ξ) = u}}^{π_{r}, f} [1]$

where $q_{0}^{'} \in B^{| q |}$ is the initial counterfactual blob, such as "return a good scoring of actions" encoded in ASCII, padded with zeros to be of the right length.

${QACI}_{0}$ 's distribution over answers demands that the answer payload $π_{r}$ , when interpreted as math and with all required contextual variables passed as input ( $q, μ_{1}, μ_{2}, α, γ_{q}, ξ$ ), returns an action-scoring function equal to $u$ — this is how it measures the weight of any action-scoring function $u$ .

$M [1]$ makes it so that ${QACI}_{0}$ 's distributions are only determined by the sampled variables and logical requirements.

$EvalMath$ 's $f$ function having access to $QACI$ 's distribution over output texts rather than best candidate allows it to discard as many invalid candidates as it needs and stick to ones that match whatever constraints it has.

3.8. action scoring

we'll posit the AI as $AI \in U \to A$ — a program which tries to satisfy a scoring over actions, by making a high-expected-score guess.

we define $Score \in U$ , the action-scoring function which the AI will be making guesses about as a scoring function over actions, which happens to be one that is, hopefully, good. this is the scoring function for which the AI will be trying to produce an action that is as favorable as possible, within its limited capabilities.

$Score ≔ Compose \left( λ u : U . \mathop{M}\limits_{\substack{α : Ω_{α} \\ ξ : K_{Ξ}^{-} \\ γ_{μ_{1}} : {Loc}_{| μ_{1} |} (α, Ω_{α}^{\to} (α), μ_{1}, ξ) \\ γ_{μ_{2}} : {Loc}_{| μ_{2} |} (α, Ω_{α}^{\to} (γ_{μ_{1}} (μ_{1})), μ_{2}, ξ) \\ γ_{q} : {Loc}_{| q |} (α, Ω_{α}^{\to} (γ_{μ_{1}} (μ_{1})), q, ξ) \\ Ω_{α}^{\to} (γ_{q} (q)) (γ_{μ_{2}} (μ_{2})) > Ω_{α}^{\to} (γ_{μ_{2}} (μ_{2})) (γ_{q} (q))}}^{α, ξ, γ_{μ_{1}}, γ_{μ_{2}}, γ_{q}} [{Normalize}_{U} ({QACI}_{0} (α, γ_{q}, ξ)) (u)] \right)$

where the following variables are sampled:

an initial state of the universe $α$ .
a blob location prior $ξ$ , sampled for simplicity.
the blob locations of $μ_{1}$ , $μ_{2}$ , and $q$ .

and the world-state containing the second observation $μ_{2}$ is required to be in the future of the world-state containing the question $q$ .

then, we rate possible actions $a$ by composing the scoring functions produced by ${QACI}_{0}$ for all of those blob location hypotheses.

for any question location, the set of action-scoring functions sampled by ${QACI}_{0}$ is normalized. this is because pairs of AI-action location and question location should not be penalized for having a "harder to find" answer — once the observations and question have been located, we want to obtain the answer no matter what.

finally, we'll just execute the action returned by $AI (Score)$ .
