Impact measures may be a powerful safeguard for AI systems - one that doesn't require solving the full alignment problem. But what exactly is "impact", and how can we measure it?
In this post, we introduce contributions and supracontributions[1], basic objects from infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of partially observable Markov decision processes (POMDPs) whose state transition dynamics are described by supracontributions.
We use supra-POMDPs to formalize various Newcombian problems in a learning-theoretic setting where an agent repeatedly encounters the problem. The one-shot versions of these problems are well known to highlight flaws in classical decision theories.[2] In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb's problem, XOR blackmail, and counterfactual mugging.
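As a rough illustration of what "repeatedly encounters the problem" means, here is a toy sketch of the epsilon-noisy, opaque-box version of Newcomb's problem. The payoffs and the frequency-based predictor are illustrative stand-ins; this is not the supra-POMDP construction.

```python
import random

SMALL, BIG = 1_000, 1_000_000
EPS = 0.05  # predictor error rate in the epsilon-noisy variant

def predict_one_box(history):
    """Predict the agent's next action from its past one-boxing frequency,
    flipping the guess with probability EPS."""
    freq = sum(history) / len(history) if history else 0.5
    guess = random.random() < freq
    return guess if random.random() > EPS else not guess

def payoff(one_box, predicted_one_box):
    """Box B holds BIG iff one-boxing was predicted; box A always holds SMALL."""
    return (BIG if predicted_one_box else 0) + (0 if one_box else SMALL)

def average_reward(policy, rounds=10_000):
    history, total = [], 0
    for _ in range(rounds):
        pred = predict_one_box(history)
        action = policy(history)  # True = one-box, False = two-box
        total += payoff(action, pred)
        history.append(action)
    return total / rounds

print("always one-box :", average_reward(lambda h: True))   # roughly (1 - EPS) * BIG
print("always two-box :", average_reward(lambda h: False))  # roughly EPS * BIG + SMALL
```

In the repeated setting, even a naive reward-maximizing learner can discover that one-boxing pays more on average, which is the kind of phenomenon a learning-theoretic treatment is positioned to capture.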
We conclude by stating a theorem that describes when optimality for the supra-POMDP relates to optimality for the Newcombian problem. This theorem is significant because it gives...
Imagine each of us has an AI representative, aligned to us, personally. Is gradual disempowerment solved?[1] In my view, no; at the same time, having AI representatives helps at the margin.
I have two deep reasons for skepticism.[2] Here is the first one.
We, as individuals, are not the only agents or “agencies” in this world. Other goal-oriented strategic entities include states, corporations, and to some extent egregores.
For the sake of argument, imagine the current distribution of agency in the world as approximately 60% humans, 20% corporations, and 20% states (numbers are illustrative; egregores are left out for simplicity).
Now, suppose we introduce individually aligned AI assistants for each person. At first glance, this seems empowering—each individual gains a cognitive extension, potentially leveling the playing field.
But: do corporations or states also...
What's the story for why companies end up doing things that are catastrophic for their shareholders? (Feel free to just state the key points briefly; I'm familiar with many of the ideas.)
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Right, I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning Introductory Category Theory”, but not good at imitating the delta between those, i.e. the way that Joe grows and changes over that 1 month of learning—or at least, not “good at imitating” in a way that would generalize to imitating a person learning a completely different topic that’s not in the training data.
In other words, after watching 1000 people try to learn category theory over the course of a month (while keeping diaries), I claim that an LLM wo...
This is a rough write-up of ideas by Ryan Greenblatt. The core ideas are his and the mistakes are mine. Thanks to him, Buck Shlegeris, and Aghyad Deeb for feedback on the draft of this post.
One way to use scheming AIs is to train them to imitate humans. When does imitation robustly avoid frequent and intentional failures?
We describe two techniques that avoid imitation failures on generation tasks:
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set $X$. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let $X = \{x_1, x_2\}$. The set of probability distributions over $X$ is given by
$$\Delta X = \{\, p\,\delta_{x_1} + (1-p)\,\delta_{x_2} \;:\; p \in [0,1] \,\}.$$
There is a bijection between $\Delta X$ and the closed interval $[0,1]$, which is...
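To make the two-point example concrete (the notation $\Delta X$, $x_1$, $x_2$, $[a,b]$ is mine): under the identification of $\Delta X$ with $[0,1]$, a credal set, i.e. a closed convex set of distributions, corresponds to a closed subinterval, and its worst-case (maximin) expectation of a function $f : X \to \mathbb{R}$ is

$$
\mathcal{C} \;=\; \{\, p\,\delta_{x_1} + (1-p)\,\delta_{x_2} \;:\; p \in [a,b] \,\},
\qquad
\underline{\mathbb{E}}_{\mathcal{C}}[f] \;=\; \min_{p \in [a,b]} \bigl( p\, f(x_1) + (1-p)\, f(x_2) \bigr).
$$

For instance, $[a,b] = [0,1]$ expresses total ignorance, while $a = b$ recovers an ordinary Bayesian distribution.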
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
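A hypothetical toy sketch of that construction (the bit-string hypotheses, the crude "simplicity" weight, and the prefix coarsening below are all illustrative stand-ins): push the simplicity prior over exact hypotheses forward onto the tractable laws that coarsen them.

```python
from collections import defaultdict
from itertools import product

def simplicity_weight(h):
    """Crude simplicity proxy: fewer alternations -> simpler -> more weight."""
    flips = sum(a != b for a, b in zip(h, h[1:]))
    return 2.0 ** -flips

def coarsen(h, k=2):
    """Tractable law that only constrains the first k bits of the hypothesis."""
    return h[:k]

hypotheses = list(product("01", repeat=6))   # exact hypotheses (tiny here, intractable in reality)
raw = {h: simplicity_weight(h) for h in hypotheses}
Z = sum(raw.values())

prior_over_laws = defaultdict(float)
for h, w in raw.items():
    prior_over_laws[coarsen(h)] += w / Z     # push the prior forward through the coarsening

print(dict(prior_over_laws))                 # the resulting distribution over coarse laws
```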
A different way to view this: it is just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighting complex hypotheses less. For a complex hypothe...
Halpern and Leung propose the "minimax weighted expected regret" (MWER) decision rule, which is a generalization of the minimax-expected-regret (MER) decision rule. In contrast, our decision rule is a weighted generalization of maximin expected utility (MMEU). The problem with MER is that it doesn't work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov decision processes. And even in the bandit setting there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
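For reference, schematic versions of the three rules (with $\mathcal{P}$ a set of probability measures, $\alpha_P$ the Halpern–Leung weights, and $u$ a utility function; this is a paraphrase of the standard definitions, not the exact formulation from either source):

$$
\begin{aligned}
\text{MMEU:}&\quad a^{*} \in \arg\max_{a}\; \min_{P \in \mathcal{P}} \mathbb{E}_P[u(a)],\\
\text{MER:}&\quad a^{*} \in \arg\min_{a}\; \max_{P \in \mathcal{P}} \Bigl(\max_{a'} \mathbb{E}_P[u(a')] - \mathbb{E}_P[u(a)]\Bigr),\\
\text{MWER:}&\quad a^{*} \in \arg\min_{a}\; \max_{P \in \mathcal{P}} \alpha_P \Bigl(\max_{a'} \mathbb{E}_P[u(a')] - \mathbb{E}_P[u(a)]\Bigr).
\end{aligned}
$$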