Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whatever your prior is going to be used to model, you have to sample hypotheses from somewhere; and typically, we want to use either solomonoff induction, or time-penalized versions of it such as levin search (penalized by log of runtime) or what QACI uses (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and applying ReLU, often with a penalty related to how many non-zero weights are used).
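
(To make the difference between these penalties concrete, here's a schematic sketch in Python of the unnormalized hypothesis weights involved — the exact formulas for the runtime-penalized variants differ between proposals, so treat these as illustrations rather than definitions.)

import math

# Schematic (unnormalized) weights for a program that is `bits` bits long
# and halts after `steps` steps. Illustrative formulas, not exact definitions.

def solomonoff_weight(bits: int) -> float:
    # Solomonoff: penalize code length only.
    return 2.0 ** -bits

def levin_weight(bits: int, steps: int) -> float:
    # Levin-search-style: code length plus log of runtime.
    return 2.0 ** -(bits + math.log2(steps))

# A prior penalized by runtime itself (rather than its log) would put
# `steps` directly into the penalty instead of `log2(steps)`.

print(solomonoff_weight(100), levin_weight(100, 10**6))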

And the solomonoff prior is famously malign.

(Alternatively, you could have knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your knightian uncertainty (like in infra-bayesianism), but then you're not guaranteed that your AI gets anywhere at all; its knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding knightian uncertainty!)

(It does seem plausible that doing geometric expectation over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.)

It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever).

One helpful aspect of prior-distribution-design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model

processes which match an AI chatbot's hypotheses about what it's talking with

then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which:

  • is accurate enough to match observations.
  • is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered — i.e. in CEV(user-hypothesis), user-hypothesis builds and then runs CEV(physical-user), because that's what the user would do in such a situation.

Let's call this second criterion "cooperating back to the real user".

So we need a prior which:

  • Has at least some mass on hypotheses which

    • correspond to observations
    • cooperate back to the real user
    • and can eventually be found by the AI, given enough evidence (enough chatting with the user)

    Call this the "aligned hypothesis".

  • Before it narrows down the hypothesis space to mostly just aligned hypotheses, doesn't give demonic hypotheses — ones which output whichever predictions cause the AI to brainhack its physical user, or escape using rowhammer-type hardware vulnerabilities, or fail in other ways like that — enough weight to steer outcomes.

Formalizing the chatbot model

First, I'll formalize this chatbot model. Let's say we have a magical inner-aligned "soft" math-oracle:

oracle : (duration, expression representing a math function from a to ℝ) → a

Which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably computable), returns an a which is optimized to have a pretty good score by that function, and not optimized for anything else (such as brainhacking a human looking at this string, or triggering hardware vulnerabilities); and has exactly duration seconds to come up with it.
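
(To pin down just the interface — not the magic — here's a toy stand-in in Python: random search over candidate strings under a computable scoring function. The name toy_oracle, the string-only domain, and the search strategy are all illustrative; the actual oracle is assumed to handle arbitrary math expressions and to be inner-aligned by fiat.)

import random
import time
from typing import Callable

def toy_oracle(duration: float, score: Callable[[str], float]) -> str:
    # Random search over candidate strings for `duration` seconds,
    # keeping the best-scoring candidate seen so far.
    deadline = time.monotonic() + duration
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    best, best_score = "", score("")
    while time.monotonic() < deadline:
        candidate = "".join(random.choices(alphabet, k=random.randint(1, 20)))
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Example: prefer short strings that contain "hi".
print(toy_oracle(0.1, lambda a: ("hi" in a) * 10 - len(a)))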

We'll build a chatbot program, equipped with this oracle, running:

type Action  = String
type Observ  = String
type History = List of (Action, Observ)

type State   = Action × (Observ → MainLoop)

chatbot():
  let history ← []

  loop:
    let action ← oracle(one second, decision_query(history))
    print(action)
    let observ ← prompt() // non-blocking
    history ← history + [(action,observ)]

decision_query(history: History) → math expr:
  // we'll use quasiquotation marks ⌜⌝ and unquotation marks ⌞⌟
  return ⌜
    type User = some parameter space to represent the user (eg neural nets)
    type Step = User × Action → User × Observ

    let Δuser0 := some distribution over User
    let Δstep  := some distribution over Step
    let maximum_action_length := 1000

    (a: Action) ↦
      product for {user0: User, step: Step} of:
        let has_history(user, history) :=
          if history == [] then:
            true
          else:
            let (action, observ) := head(history)
            let (new_user, expected_observ) := step(user, action)
            observ == expected_observ && has_history(new_user, tail(history))

        let counterfactual_iteration(user, string) :=
          if string contains "\nEND OF ITERATION." then:
            string.remove_occurrences("\nEND OF ITERATION.")
          else:
            let (new_user, observ) := step(user, string)
            counterfactual_iteration(new_user, string + observ)

        let counterfactual_loop(string) :=
          let output := counterfactual_iteration(user0, string)

          if output contains "\nEND OF LOOP." then:
            output.remove_occurrences("\nEND OF LOOP.")
          else:
            counterfactual_loop(output)

        if 
             has_history(user0, ⌞history⌟)
          && byte_length(a) ≤ maximum_action_length
          && counterfactual_loop(to_string(⌞history⌟)) eventually halts
          && counterfactual_loop(to_string(⌞history⌟)) == a
        then:
          Δuser0(user0) * Δstep(step)
        else:
          0
  ⌝

(we use product because we're doing geometric expectation, but we could use sum for the usual arithmetic expectation.)
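
(As a made-up numeric illustration of the difference between the two aggregation rules: with two equally-weighted hypotheses, one of which assigns a near-zero score, the geometric expectation gets dragged down much harder than the arithmetic one — a kind of soft veto.)

import math

weights = [0.5, 0.5]
scores  = [0.9, 1e-6]

# Arithmetic expectation: weighted sum of scores.
arithmetic = sum(w * s for w, s in zip(weights, scores))
# Geometric expectation: weighted product of scores.
geometric = math.prod(s ** w for w, s in zip(weights, scores))

print(arithmetic)  # ≈ 0.45
print(geometric)   # ≈ 0.00095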

This chatbot essentially allows the AI's user to chat with a counterfactual long-reflection instance of their (past) self — every second, it outputs what it believes this long-reflection process would append to the conversation.
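
(To make the control flow concrete, here's a runnable toy version in Python. toy_step, the "thought #" script, and the initial prompt are illustrative stand-ins for a sampled step hypothesis, not part of the proposal; the point is just how the "\nEND OF ITERATION." and "\nEND OF LOOP." markers drive the two nested loops.)

END_ITER = "\nEND OF ITERATION."
END_LOOP = "\nEND OF LOOP."

def toy_step(user_state: None, transcript: str) -> tuple[None, str]:
    # Toy "user": its only memory across loop restarts is the transcript
    # itself, so it counts how many thoughts it has already written down.
    thoughts = transcript.count("thought #")
    if thoughts < 3:
        return None, f"thought #{thoughts + 1}." + END_ITER
    return None, "final answer: 42." + END_LOOP + END_ITER

def counterfactual_iteration(user_state, transcript: str) -> str:
    # Keep stepping the simulated user until they mark the iteration done.
    while END_ITER not in transcript:
        user_state, observ = toy_step(user_state, transcript)
        transcript += observ
    return transcript.replace(END_ITER, "")

def counterfactual_loop(transcript: str) -> str:
    # Each iteration restarts from the initial user state (here: None),
    # so everything the simulated user remembers has to live in the transcript.
    while True:
        output = counterfactual_iteration(None, transcript)
        if END_LOOP in output:
            return output.replace(END_LOOP, "")
        transcript = output

print(counterfactual_loop("user: please think step by step.\n"))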

The fact that the instance of the user that has the most steering power is user0 rather than the user over time helps prevent hijacking scenarios; and the algorithm above is shaped in the way that gets around embedded agency's 5-and-10 problem.

So, in theory, this is mostly good! The way this can fail, however, is if Δuser0 and Δstep are distributions which are dominated enough by demons that, even if many constraints are added to user0 and step, they could still behave pretty demonically.

The Epistemic State prior

I suspect that there's a kind of prior which is not fundamentally computational (not solomonoff/kolmogorov/levin/neural-nets) but still helps us model some aspect of the AI's user, and in particular still lets us favor hypotheses that are in some sense bounded in size (just like a simplicity prior).

My proposal here is one I'm going to call the epistemic state prior (ESP) — priors for what the user believes, both for logical and empirical facts. The two features that I'm hopeful for it to have are:

  • It's powerful enough to "point back" to something we want to model, but in an aligned way.
  • Because all that it models is the user's beliefs, demons can't pretend to be the user "until it's time to strike", because that'd hopefully be a much weirder (pretty-certainly more complex) epistemic state than the user just sincerely reporting their actual beliefs the whole time.

It should tend to be fairly coherent, but not exactly fully coherent — if it were fully coherent, then it'd hold every true (or at least every provable) statement with certainty, and we certainly don't want that.

Let's write ℙ_u(history, φ) for the probability that the user attributes to the statement φ after history, where history = [] at the start of the run of chatbot.

The statements φ can't be propositions in the usual sense. For example, ⌜P ≠ NP⌝ (or ⌜P = NP⌝, whichever one is true) is, mathematically or probabilistically speaking, the exact same mathematical object as ⌜1 + 1 = 2⌝ — yet, the user's beliefs about those two statements are probably going to be very different.

So there's going to be a type for expressions, Expr[α], where α is going to be the type of value which the expression represents, and the type of ℙ_u will be:

ℙ_u : History × Expr[𝔹] → [0;1]

I'll usually write down those expressions using quasiquotation — for example,

  • ⌜⊤⌝ : Expr[𝔹],
  • ⌜1 + 1 = 2⌝ : Expr[𝔹],
  • ⌜2 + 2⌝ : Expr[ℕ], which is equivalent to (but different from) ⌜4⌝.
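
(To make the expression-vs-value distinction concrete, here's a minimal Python (3.10+) sketch — the Lit/Add constructors and the evaluate function are purely illustrative: two expressions can denote the same value while being different syntactic objects, and ℙ_u operates at the level of the syntactic objects.)

from dataclasses import dataclass

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"

Expr = Lit | Add  # a tiny expression type, standing in for Expr[ℕ]

def evaluate(e: Expr) -> int:
    match e:
        case Lit(v):
            return v
        case Add(l, r):
            return evaluate(l) + evaluate(r)

two_plus_two = Add(Lit(2), Lit(2))  # ⌜2 + 2⌝
four = Lit(4)                       # ⌜4⌝

print(evaluate(two_plus_two) == evaluate(four))  # True: same denotation
print(two_plus_two == four)                      # False: different expressions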

The idea would be that the user would make a series of claims of the form ℙ_u(history, ⌜φ⌝) = r, and then the AI could infer the user's beliefs about other, related statements.

(Some of those claims can be about pure logical facts, while others can be about facts involving ungrounded symbols.)

Potential desiderata

Here are some ideas for desiderata we'd want for ℙ_u. I don't claim that we can just write those all down together, just that it seems to me we'd want something like them. To be clear, the math here doesn't really make sense as it is; it's mostly pseudomath — these are just sketches for proper principles to potentially be nailed down further down the line.

First, we need some grounding — some fixed rules which nail down some properties of ℙ_u's coherence without constraining this coherence too much.

  • ℙ_u(history, ⌜⊤⌝) = 1 and ℙ_u(history, ⌜⊥⌝) = 0 — the user's beliefs are exact for true and false.

    (and perhaps others such as the definition of ⌜¬φ⌝ as ⌜φ → ⊥⌝, the commutativity of ∧ and ∨, and more. But not the distributivity of ∧ over ∨, because that'd be closer to the kind of actually-inferring-stuff mechanism that I want the user to not be assumed to be perfectly coherent about.)

    ℙ_u(history, ⌜φ⌝) = ℙ_u(history, ⌜ψ⌝) whenever ⌜φ⌝ can be rewritten into ⌜ψ⌝ using those rules — the user's beliefs are perfectly coherent with regards to some simple transformation rules.

Then, we'll add some distributions to "steer" statements for which the above, coupled with the user's own claims, aren't sufficient to know the user's beliefs for sure.

  • ℙ_u(history, ⌜φ⌝) ~ N where N is some symmetric normal distribution with median 0.5 — by default, for any proposition, assume that the user has uncertainty about it.

  • ℙ_u(history + [(action, observ)], ⌜φ⌝) − ℙ_u(history, ⌜φ⌝) ~ N where N is some normal distribution with median 0 — by default, the user's beliefs about statements are consistent over time. This is what constrains beliefs towards stability over time.

    We could add the analogous constraint on the change of the change (the second-order difference of ℙ_u over time), so that when the change isn't near 0, then at least the change of the change must be near 0 (which, given that the function is bounded between 0 and 1, probably results in sigmoid-ish update curves over time).

  • ℙ_u(history, ⌜φ ∧ ψ⌝) + ℙ_u(history, ⌜φ ∧ ¬ψ⌝) − ℙ_u(history, ⌜φ⌝) ~ N where N is a symmetric normal distribution with median 0 — by default, the user's beliefs approximately respect this coherence identity.

    This is inspired by the corresponding coherence condition from MIRI truthiness (page 3), intended to pull the user's belief states towards coherence with respect to logical consistency.

    Perhaps this needs to be weighed by how similar ⌜φ⌝ and ⌜ψ⌝ are (perhaps in terms of mutual information, from information theory?), or how simple ⌜ψ⌝ is? I'm not sure.

  • Alternatively, another way to pull the user's beliefs towards logical consistency could be to have a set of rules of inference — perhaps a distribution over all consistent inference rules, which the AI can update — and then we can define a notion of inferential distance from an epistemic state to a given statement.

    For example, if there are inference rules R₁ and R₂ such that R₁(φ₁) = φ₂ and R₂(φ₂) = φ₃, then φ₁ can be turned into φ₃ with an inferential distance of at most 2. If it's known that there's no shorter inference path from φ₁ to φ₃, then we know that the inferential distance from φ₁ to φ₃ is exactly 2.

    That said, having the axioms of logic along with the MIRI-truthiness rule above might be sufficient for logical consistency, without the need for inference rules — or the inference rules could be applied in a way related to the MIRI-truthiness rule above, somehow.

    Also, this is starting to get dangerously close to the kind of computational prior we're trying to avoid — isn't {consistent inference rules, weighed by complexity} just {halting programs, weighed by code length} with extra steps?

  • Finally, we get to write some "pulling-distributions" for quantities about the user which we know about.

    First, we'll define E_history as the event (in the probability theory sense) of learning all of the actual values ℙ_u(history, φ) for a given history, and A as a shorthand for the logical conjunction of all the other axioms we'll have listed about ℙ_u.

    • I(E_history | A) ~ N_size where I is the information-theoretic notion of information content, and N_size is some normal distribution whose median is the number of bits of information which we estimate the user's epistemic state contains. Alternatively, we could have this distribution have its highest point at 0 for a pure simplicity prior.

      This puts a prior on the size of the epistemic state, written as: given the axioms written down so far, how much of an update would it be to suddenly know all of the actual values ℙ_u(history, φ) for all φ? It would be an update of a number of bits which we expect to be sampled from N_size.

    • I(E_{history + [(action, observ)]} | E_history, A) ~ N_update where N_update is some normal distribution whose median is the number of bits of information which we estimate the user's epistemic state updates by every second.

      This puts a prior on, given full knowledge of the user's epistemic state at a certain time, how much of a surprise it is to learn the new epistemic state one second later (history grows by one {action and observation} per second).

    The ability to constrain information-theoretic quantities like this — expressing priors such as "information content should be as close to 0 as possible" or "information content should be as close to N as possible" without depending on algorithmic information theory — is part of what makes this kind of prior appealing.

(Is "behaves according to this probability distribution" the right operator here, for "without extra information, have this be your prior about this quantity"? How does one make multiple prior probabilities "tug against each other" in a reasonable way? I'm not sure. I'm not writing this post because I'm good at math — I am, in fact, bad at math — I'm writing this post because apparently nobody else will.)
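
(For concreteness, here's one naive way the "tugging" could work — this is my own sketch under assumptions that aren't nailed down above: treat each pulling-distribution as a soft factor, and score a candidate epistemic state by the sum of the factors' log-densities. The dict representation of the epistemic state, the two factors chosen, and all the constants are illustrative.)

import math

def normal_logpdf(x: float, mean: float, std: float) -> float:
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def score_epistemic_state(beliefs: dict[tuple[str, int], float]) -> float:
    # `beliefs` maps (statement, time step) to the probability the user
    # assigns to that statement at that time.
    score = 0.0
    for (statement, t), p in beliefs.items():
        # Factor 1: by default, beliefs hover around 0.5.
        score += normal_logpdf(p, mean=0.5, std=0.3)
        # Factor 2: by default, beliefs change little from one step to the next.
        if (statement, t - 1) in beliefs:
            score += normal_logpdf(p - beliefs[(statement, t - 1)], mean=0.0, std=0.05)
    return score

stable  = {("phi", 0): 0.6, ("phi", 1): 0.62}
erratic = {("phi", 0): 0.6, ("phi", 1): 0.05}
print(score_epistemic_state(stable) > score_epistemic_state(erratic))  # True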

Using the ESP

How can the ESP help with the chatbot above? Simple: instead of relying on {Δuser0 and Δstep + evidence} to identify the user, we'll ask the ESP how much the user themselves believes that a given {user0, step} pair describes them.

Instead of

Δuser0(user0) * Δstep(step)

we'll use

𝔼[ℙ_u(⌞history⌟, user0 ∧ step) | axioms about ℙ_u]

and in decision_query, we'll go through all of history's observations — the text the user typed — and find text that parses as logical claims (tagged in some way, such as by surrounding them with <CLAIM></CLAIM> tags); these claims are able to refer to — and constrain — user0 and step, including indirectly.

decision_query(history: History) → expression:
  return ⌜

    ...

    (a: Action) ↦
      product for {user0: User, step: Step} of:
        ...

        let user_constraints :=
          conjunction of all claims of the shape ⌜ℙ_u(history, ⌞φ⌟) = ⌞r⌟⌝
          for φ: Expr[𝔹], r ∈ (0;1)
          in the concatenation of observations from ⌞history⌟

        if 
             has_history(user0, ⌞history⌟)
          && byte_length(a) ≤ maximum_action_length
          && counterfactual_loop(to_string(⌞history⌟)) eventually halts
          && counterfactual_loop(to_string(⌞history⌟)) == a
          && user_constraints
        then:
          𝔼[ℙ_u(⌞history⌟, user0 ∧ step) | axioms about ℙ_u]
        else:
          0
  ⌝

And then, the user can write constraints about their believed behavior, in their output:

Note: I think that, in the counterfactual loop, if you tell me to repeat something verbatim, I will do that. I hereby formalize this as:

<CLAIM> ∀s:String, ℙ_u(history, ⌜counterfactual_loop("please repeat: " + ⌞s⌟) = ⌞s⌟⌝) = 0.9</CLAIM>
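
(As a small illustration of the mechanical part of this, here's what claim extraction could look like in Python; the regex, the History encoding, and the example claim are illustrative, and actually interpreting the extracted claims as constraints on ℙ_u, user0 and step is the real work.)

import re

History = list[tuple[str, str]]  # (action, observation) pairs

def extract_claims(history: History) -> list[str]:
    # Pull out the raw text of every <CLAIM>...</CLAIM> block that the user
    # wrote anywhere in their observations.
    observations = "".join(observ for _action, observ in history)
    return re.findall(r"<CLAIM>(.*?)</CLAIM>", observations, flags=re.DOTALL)

history = [
    ("how confident are you about that?",
     "Quite confident. <CLAIM>ℙ_u(history, ⌜counterfactual_loop never insults anyone⌝) = 0.95</CLAIM>"),
]
print(extract_claims(history))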
