Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, which is a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesianism conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some...
Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor
+Advising
Paper here: https://arxiv.org/abs/2606.20560
In a recent collaboration between the GDM interpretability team and the GDM text diffusion team, we performed a transparency audit of DiffusionGemma, GDM's new text diffusion model.
Overall, we find that DiffusionGemma is not significantly less transparent than Gemma.
However, even...
Ah yeah, we probably should have used different language here. Top p means keep any token with probability greater than p; it’s more like “p threshold” or “p cutoff”.
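To make the distinction in the comment above concrete, here is a minimal Python sketch contrasting the "p cutoff" rule it describes with the more common nucleus-sampling meaning of top p. The function names, the renormalisation step, and the argmax fallback are all assumptions for illustration, not taken from the paper.

```python
def p_cutoff_filter(probs, p):
    """Keep every token whose probability is strictly greater than p.

    This is the "p threshold" reading described in the comment.
    Renormalising and falling back to the argmax token when nothing
    passes the cutoff are illustrative assumptions.
    """
    kept = {tok: q for tok, q in probs.items() if q > p}
    if not kept:
        # Assumed fallback: keep the single most likely token.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}


def nucleus_filter(probs, p):
    """Usual nucleus sampling: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches at least p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for tok, q in ranked:
        kept[tok] = q
        cumulative += q
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}


# Hypothetical next-token distribution.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
cutoff_tokens = set(p_cutoff_filter(probs, 0.2))   # tokens with prob > 0.2
nucleus_tokens = set(nucleus_filter(probs, 0.9))   # smallest set with mass >= 0.9
```

With the same distribution, the two rules keep different token sets, which is why a name like "p cutoff" is less ambiguous.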
GDM has published an AI Control Roadmap! From the executive summary:
...
We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. We focus on system-level mitigations that limit the harm a misaligned AI system could cause. Specifically, this report provides:
• Threat modelling: Taking inspiration from cybersecurity, we adopt a conservative, worst-case approach to threat modelling throughout this paper, and assume a hypothetical AI adversary pursuing undesirable goals in internal deployment. We introduce TRAIT&R, a taxonomy of tactics and techniques available to such a hypothetical AI adversary, building on the established security framework MITRE ATT&CK. We categorise new hypothetical threats into three core areas:
This seems very promising to me and I will be impressed if GDM actually does all of these things!
(Disclaimer: I've only done a very brief skim so far)
The following is a (lightly edited version of a) series of notes I sent Joe Carlsmith about his essay, Can you control the past?. It's addressed to Joe, but it seems worth publishing here while I'm on the topic of decision theory. I’ve included some of his comments, and my replies, below.
I only recently skimmed Can you control the past?, and have a couple notes that you may or may not be interested in. (I'm not under the impression that this matters a ton, and am writing this recreationally.)
First: this is overall a great review of decision theories. Better than most I've seen. Nice.
Now, onto some more substantive points.
I think a bunch of your sense of oddness about the "magic" that "you can write on whiteboards...
There is of course another point of view, which says "the output of Joe in (say) Newcomb's problem is determined by Joe". This viewpoint is sometimes useful to give to people who are reflecting on themselves and struggling to decide between (say) CDT and LDT.
...It's perhaps useful to note that these people tend to have complicated, messy, heuristical decision-procedures, that they're currently in the process of reflecting upon, in ways that are sensitive to various details of their personality and arguments they just heard. Which is to say, someone who's waff
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Great, thank you! Timing seems updated, but also fwiw flops.symmetrize is still not there: https://www.aicrowd.com/challenges/arc-white-box-estimation-challenge-2026/submissions/310914
Thanks! It's supposed to be , with the formula in the post defining the measure via its marginal probabilities on finite prefixes.