We might soon be creating morally relevant AI systems with real welfare concerns. How can we help ensure good lives for AIs, especially if we don't have that many resources to allocate to it?
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
Thanks! Seems like the upshot is that humans have a complex soup of desires around learning from and epistemically interacting with AIs, and you’re proposing to use the word “legitimate” to describe that soup, which I agree is a helpful way of thinking about it (and probably better than “not-manipulating”), but also not a True Name in the John Wentworth sense.
The proposal in the last paragraph seems unhelpful for me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limite...
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Update: We have now launched Phase 1 of the Challenge with $50,000 in prizes:
For Phase 1, we have increased the depth of the network from 8 to 32 hidden layers. Our existing algorithms scale poorly with depth, and so we expect there to be significant room for improvement. Phase 1 lasts until the end of July, after which Phase 2 begins. For Phase 2, there will be a prize pool of at least $100,000, and we may change the architectural parameters agai...
We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.
In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, which is a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesianism conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some...
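As a rough illustration of the three-lever setup, here is a minimal sketch of a stochastic bandit. It assumes Bernoulli rewards (each lever pays 1 with some fixed probability, else 0) and uses an epsilon-greedy agent. The names `pull` and `epsilon_greedy`, the reward probabilities, and the choice of epsilon-greedy are all illustrative assumptions, not something taken from the post.

```python
import random


def pull(arm_probs, arm, rng):
    """Sample a Bernoulli reward from the chosen lever (assumed reward model)."""
    return 1.0 if rng.random() < arm_probs[arm] else 0.0


def epsilon_greedy(arm_probs, steps, epsilon, seed=0):
    """Play a stochastic bandit with an epsilon-greedy rule.

    Returns per-arm pull counts, per-arm empirical mean rewards,
    and the total reward collected.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a lever uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the lever with the best empirical mean so far.
            arm = max(range(n_arms), key=lambda a: means[a])
        reward = pull(arm_probs, arm, rng)
        counts[arm] += 1
        # Incremental update of the running mean for this lever.
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return counts, means, total


if __name__ == "__main__":
    # Hypothetical reward probabilities for the three levers.
    counts, means, total = epsilon_greedy([0.2, 0.5, 0.8], steps=5000, epsilon=0.1)
    print(counts, [round(m, 3) for m in means], total)
```

The agent never sees `arm_probs` directly; it only learns about each lever through the rewards it samples, which is the defining feature of the bandit setting.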
Thanks! It's supposed to be
Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+
*Primary Contributor
+Advising
Paper here: https://arxiv.org/abs/2606.20560
In a recent collaboration between the GDM interpretability team and the GDM text diffusion team, we performed a transparency audit of DiffusionGemma, GDM's new text diffusion model.
Overall, we find that DiffusionGemma is not significantly less transparent than Gemma.
However, even...
Ah yeah, we probably should have used different language here: by “top p” we mean keeping any token with probability greater than p. It’s more like a “p threshold” or “p cutoff”.
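To make the distinction concrete, here is a minimal sketch contrasting a per-token probability cutoff with the more common nucleus-style reading of top-p (keep the smallest set of tokens whose cumulative probability reaches p). The function names, the fallback to the single most likely token when nothing clears the cutoff, and the renormalization step are assumptions for illustration, not a description of the actual implementation.

```python
def p_cutoff_filter(probs, p):
    """Keep every token whose own probability exceeds p, then renormalize."""
    keep = [q > p for q in probs]
    if not any(keep):
        # Assumed fallback: if nothing clears the cutoff, keep the most likely token.
        best = max(range(len(probs)), key=lambda i: probs[i])
        keep[best] = True
    kept = [q if k else 0.0 for q, k in zip(probs, keep)]
    total = sum(kept)
    return [q / total for q in kept]


def nucleus_filter(probs, p):
    """Standard nucleus (top-p) filtering, shown for contrast.

    Keeps the smallest set of highest-probability tokens whose
    cumulative mass is at least p, then renormalizes.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    mass = 0.0
    for i in order:
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(p_cutoff_filter(probs, 0.2))  # keeps the tokens at 0.5 and 0.3
    print(nucleus_filter(probs, 0.2))   # keeps only the token at 0.5
```

With the same value of p, the two rules can keep quite different sets of tokens, which is why “p cutoff” is probably the less confusing name for the first one.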