A collection of examples of AI systems "gaming" their specifications - finding ways to achieve their stated objectives that don't actually solve the intended problem. These illustrate the challenge of properly specifying goals for AI systems.
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events, as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set $X$. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let $X = \{0, 1\}$. The set of probability distributions over $X$ is given by

$$\Delta X = \{\,(p_0, p_1) \in [0,1]^2 : p_0 + p_1 = 1\,\}.$$

There is a bijection between $\Delta X$ and the closed interval $[0,1]$, which is...
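To make this concrete (my own illustration, not from the post): under this identification, a credal set over $\{0,1\}$, i.e. a closed convex set of distributions, corresponds to a closed subinterval of $[0,1]$. For example:

$$\mathcal{C} = \{\, p \in \Delta\{0,1\} : p_1 \in [0.3, 0.7] \,\}.$$

$\mathcal{C}$ contains all of its limit points with respect to the total-variation metric, so it is closed; it is also convex, so on the usual imprecise-probability definition it is a credal set.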
@Aram Ebtekar and I were discussing the definition of IB regret. It seems quite complicated, with three levels of mixture (!): a prob mixture of Knightian uncertainty over prob mixtures. I understand the motivation for the inner two levels, but why then take a probabilistic mixture over crisp causal laws? It seems rather strange to weight laws in such a way that the choice of law is (epistemically?) stochastic, but then regret against the law itself is worst-case...
I do not have a precise objection to this, I'm just curious if there is some justification / intuition for why this is a reasonable definition?
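To spell out the structure I have in mind (my own notation, which may not match the sequence's exact definition): with a prior $\zeta$ over crisp causal laws, where each law $\Lambda_k$ is a credal set of environments $\mu$, the regret I'm picturing is roughly

$$\mathrm{Reg}(\pi) \;=\; \mathbb{E}_{k \sim \zeta}\!\left[\, \sup_{\mu \in \Lambda_k} \Big( \sup_{\pi'} \mathbb{E}_{\mu,\pi'}[U] \;-\; \mathbb{E}_{\mu,\pi}[U] \Big) \right],$$

i.e. an expectation over the (epistemically stochastic) choice of law, of the worst case over that law's Knightian uncertainty, of a difference of expected utilities under the environments themselves. My question is about the outer $\mathbb{E}_{k \sim \zeta}$.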
In this post, we introduce contributions and supracontributions[1], which are basic objects from infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of partially observable Markov decision processes (POMDPs). This generalization has state transition dynamics that are described by supracontributions.
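As a rough schematic of what such an object looks like (my own sketch; the post's precise definition may differ): a supra-POMDP keeps the usual observation and reward structure of a POMDP but replaces the stochastic transition kernel with a supracontribution-valued one,

$$M = \big(S,\; A,\; O,\; T : S \times A \to \square S,\; \omega : S \to \Delta O,\; r : S \times A \to \mathbb{R}\big),$$

where $\Delta O$ is the set of probability distributions over observations and $\square S$ denotes the supracontributions over states. A classical POMDP is recovered when each $T(s,a)$ is generated by a single probability distribution on $S$.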
We use supra-POMDPs to formalize various Newcombian problems in the context of learning theory, where an agent repeatedly encounters the problem. The one-shot versions of these problems are well known to highlight flaws with classical decision theories.[2] In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb's problem, XOR blackmail, and counterfactual mugging.
We conclude by stating a theorem that describes when optimality for the supra-POMDP relates to optimality for the Newcombian problem. This theorem is significant because it gives...
Contributions are in a natural one-to-one correspondence with weighted probability measures, which were studied by Joe Halpern: https://arxiv.org/pdf/1302.5681
Simply take the weight to be the total mass of the measure. Halpern and Leung advocated minimizing regret against a set of weighted probability measures, which seems to be the same as minimizing loss against a supracontribution?
They update weights by likelihood ratio, roughly speaking (I am not sure how this corresponds to IB, since updating is not discussed in this sequence).
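To spell out the correspondence and the update rule I have in mind (my notation; I may be garbling Halpern and Leung's exact normalization): a contribution on $X$ is a measure $m$ with $0 < m(X) \le 1$, and

$$m \;\longleftrightarrow\; (c, p) = \Big(m(X),\; \tfrac{m}{m(X)}\Big), \qquad \text{after observing } E:\quad c' \propto c \cdot p(E), \quad p' = p(\cdot \mid E).$$

So the weight is the total mass of the contribution, and, as far as I understand their paper, weights get multiplied by the likelihood of the evidence and renormalized across the set.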
@Vanessa Kosoy @Diffractor Do...
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Well, they're the only two methods we've got that work, and people have been thinking about this for decades. So if there's a third method, it's beyond current understanding.
Notice that evolution by natural selection is consequentialist learning on a terribly slow timescale. The consequence is successful reproduction vs death, and you learn at most one bit per lifetime.
I think it's important to note that indistinguishable evals still buy us a lot! The claim "indistinguishable evals do not solve AI safety" feels quite similar to "imperfect monitors do not solve AI safety." Like, having better monitors sure seems to help (even if only to buy us more time).
To your point though: I'm excited about research that attempts to induce various beliefs in models during honeypot evaluations, possibly via model internals and/or edits to the chain of thought. Basically, if the model defects after seeing RSA-2048 cracked because t...
Reward-seekers are usually modeled as responding only to local incentives administered by developers. Here I ask: Will AIs or humans be able to influence their incentives at a distance—e.g., by retroactively reinforcing actions substantially in the future or by committing to run many copies of them in simulated deployments with different incentives?
If reward-seekers are responsive to distant incentives, it fundamentally changes the threat model, and is probably bad news for developers on balance. The core problem is asymmetric control: developers can relatively[1] tightly control local incentives—the reward signal during training and deployment—but they can't prevent distant actors from offering competing incentives. This means a remotely-influenceable reward-seeker might overall act like a schemer: strategically undermining developer control, letting attacks through as a monitor, and hiding its misaligned propensities, not...
It's quite unclear how well this kind of training will generalize to getting AIs to ignore all remote incentives. It might just push the AI to ignore our distant incentives but not our adversaries'.
It could also not have any effect on how the AI thinks about remote incentives. One way this could happen is if the model reasons about both our somewhat remote incentives and our adversaries' very remote incentives, and concludes that the adversaries' very remote incentives dominate, and that they all suggest that the AI should play the training game.
I'm not su...
Many people believe that the first AI capable of taking over would be quite different from the LLMs of today; I'll call these different future AIs "non-LLM AIs." Suppose this is true: does prosaic alignment research on LLMs still reduce x-risk? I believe advances in LLM alignment research reduce x-risk even if future AIs are non-LLM AIs. In this post, I explore two mechanisms for LLM alignment research to reduce x-risk:
In our approach, we took the relevant paradigm to be learning + search, rather than LLMs. A lot of prosaic alignment directions only assume a learning-based paradigm, rather than LLMs in particular (though of course some, like CoT monitoring, are specific to LLMs). Some are even more general, e.g. a lot of control work just depends on the fact that the AI is a program / runs on a computer.