Paul Christiano lays out how he frames various questions of "will AI cause a really bad outcome?", and gives some probabilities.
We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.
In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, which is a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesianism conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some...
Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor
+Advising
Paper here: https://arxiv.org/abs/2606.20560
In a recent collaboration between the GDM interpretability team and the GDM text diffusion team, we performed a transparency audit of DiffusionGemma, GDM's new text diffusion model.
Overall, we find that DiffusionGemma is not significantly less transparent than Gemma.
However, even...
Ah yeah, we probably should have used different language here, top p means keep any tokens with probability greater than p. It’s more like “p threshold” or “p cutoff”
GDM has published an AI Control Roadmap! From the executive summary:
...
We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain.We focus on system-level mitigations that limit the harm a misaligned AI system could cause. Specifically, this report provides:
• Threat modelling: Taking inspiration from cybersecurity, we adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in internal deployment. We introduce TRAIT&R, a taxonomy of tactics and techniques available to such a hypothetical AI adversary, building on the established security framework MITRE ATT&CK. We categorise new hypothetical threats into three core areas:
This seems very promising to me and I will be impressed if GDM actually does all of these things!
(Disclaimer: I've only done a very brief skim so far)
The following is a (lightly edited version of a) series of notes I sent Joe Carlsmith about his essay, Can you control the past?. It's addressed to Joe, but it seems worth publishing here while I'm on the topic of decision theory. I’ve included some of his comments, and my replies, below.
I only recently skimmed Can you control the past?, and have a couple notes that you may or may not be interested in. (I'm not under the impression that this matters a ton, and am writing this recreationally.)
First: this is overall a great review of decision theories. Better than most I've seen. Nice.
Now, onto some more substansive points.
I think a bunch of your sense of oddness about the "magic" that "you can write on whiteboards...
There is of course another point of view, which says "the output of Joe in (say) Newcomb's problem is determined by Joe". This viewpoint is sometimes useful to give to people who are reflecting on themselves and struggling to decide between (say) CDT and LDT.
...It's perhaps useful to note that these people tend to have complicated, messy, heuristical decision-procedures, that they're currently in the process of reflecting upon, in ways that are sensitive to various details of their personality and arguments they just heard. Which is to say, someone who's waff
Thanks! It's supposed to be , with the formula in the post defining the measure via its marginal probabilities on finite prefixes.