Many people have an intuition like "everything is an imperfect proxy; we can never avoid Goodhart". The example of "mutual information" demonstrates that this is wrong. Agent Foundations research is worthwhile because it helps us find the "true name" of things like "agency" or "human values" – a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down.
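To make the mutual-information example concrete: unlike a proxy metric, mutual information has an exact, closed-form definition that means the same thing no matter how hard you optimize against it. A minimal sketch (function and variable names are illustrative, not from any particular library):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))).

    `joint` maps (x, y) pairs to probabilities summing to 1.
    """
    # Marginal distributions p(x) and p(y).
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# A noiseless channel: Y perfectly determines X, so I(X;Y) = H(X) = 1 bit.
perfect = {(0, 0): 0.5, (1, 1): 0.5}

# Independent variables: Y tells you nothing about X, so I(X;Y) = 0 bits.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

The point is not the computation itself but that the formula *is* the concept: an agent maximizing this quantity is genuinely maximizing shared information, with no gap for Goodhart to exploit. The hope of Agent Foundations is to find formulations of "agency" or "human values" with that same property.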
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
We describe the persona selection model (PSM): the idea that LLMs learn to simulate diverse characters during pre-training, and that post-training elicits and refines one particular such character, the Assistant persona. Interactions with an AI assistant are then well understood as interactions with the Assistant—something roughly like a character in an LLM-generated story. We survey behavioral, generalization-based, and interpretability-based empirical evidence for PSM. PSM has consequences for AI development, such as recommending anthropomorphic reasoning about AI psychology and the introduction of positive AI archetypes into training data. An important open question is how exhaustive PSM is—in particular, whether there might be sources of agency external to the Assistant persona—and how this might change in the future.
What sort of thing is a modern AI assistant? One perspective holds that they are shallow,...
Really great post: in particular the discussion of all kinds of empirical evidence.
Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda
Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.
Imagine that a frontier lab’s coding agent has been caught inserting a bug into the key code for monitoring what that agent does. Naively, this seems like a clear smoking gun that the agent is scheming. But LLMs often do weird things: the agent could easily just be confused, or have made a mistake. Either case requires a response, but the cause and appropriate fix are very different for a scheming model versus a confused one. As such, it is extremely important that we have high-quality methods to incriminate or exonerate a model caught taking sketchy actions, to...
The following is an edited transcript of a talk I gave. I have given this talk at multiple places, including first at Anthropic and then for ELK winners and at Redwood Research, though the version that this document is based on is the version I gave to SERI MATS fellows. Thanks to Jonathan Ng, Ryan Kidd, and others for help transcribing that talk. Substantial edits were done on top of the transcription by me. Though all slides are embedded below, the full slide deck is also available here.
Today I’m going to be talking about deceptive alignment. Deceptive alignment is something I'm very concerned about and is where I think most of the existential risk from AI comes from. And I'm going to try to make the case...
I reread the distillation of this (sadly not the full article, due to time limitations), so my understanding of Evan's views might be off. Here's my retrospective:
High path dependence:
• Evan argues that before gradient descent finds the true training objective, it likely finds a proxy goal. This seems intuitive, though there's debate about whether recent results indicate that alignment is happening by default, or whether it remains a hard, unsolved problem.
• Evan argues that the model will gain an understanding of its own sit...
Epistemic status: intuitive speculation with scattered mathematical justification.
My goal here is to interrogate the dream of writing a beautiful algorithm for intelligence and thereby ensuring safety. For example:
I don't know precisely what alternative he had in mind, but I only seem to remember reading clean functional programs from MIRI, so that is one possibility. Whether or not anyone else endorses it, that is the prototypical picture I have in mind as the holy grail of glass box AGI implementation. In my mind, it has careful comments explaining the precisely chosen, recursive prior and decision rules that make it go "foom."
Is the "inscrutable" complexity of deep neural networks unavoidable? There has been some prior discussion of the desire to avoid it as map-territory confusion, and I am...
Ah I see (I'll try a bit to remember to say "by so and so").
AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”
As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it...
I get that, and I appreciate the comments!
(Side note: in case you missed it, I expanded my last comment into a post “Why we should expect ruthless sociopath ASI”. Thanks for the inspiration, and hope you don’t mind.)
The very capabilities that make the nightmare scary (deep, nuanced understanding of how humans think and feel and operate) are exactly the kind of rich social cognition that, in our only empirical examples, comes tangled up with perspective-taking, moral awareness, and empathy.
Here are a couple reasons to think it’s not too “tangled up” in hum...
Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I...