Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits.
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Maybe a good example is that humans update on a ton of random observations we’re surprised by. This doesn’t seem like imitation, nor does it seem consequentialist enough to be very risky?
Right, I think humans have a distinction between beliefs and desires (“is versus ought”) that’s pretty disanalogous to how LLMs work (see discussion here), and our beliefs / “is”s get updated by predictive learning from sensory inputs. My dichotomy of consequentialism vs imitative learning in the OP was about the “ought” part, which predictive learning doesn’t help with. I...
We describe the persona selection model (PSM): the idea that LLMs learn to simulate diverse characters during pre-training, and post-training elicits and refines a particular such Assistant persona. Interactions with an AI assistant are then well-understood as being interactions with the Assistant—something roughly like a character in an LLM-generated story. We survey empirical behavioral, generalization, and interpretability-based evidence for PSM. PSM has consequences for AI development, such as recommending anthropomorphic reasoning about AI psychology and introduction of positive AI archetypes into training data. An important open question is how exhaustive PSM is, especially whether there might be sources of agency external to the Assistant persona, and how this might change in the future.
What sort of thing is a modern AI assistant? One perspective holds that they are shallow,...
Really great post: in particular the discussion of all kinds of empirical evidence.
Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda
Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.
Imagine that a frontier lab’s coding agent has been caught putting a bug in the key code for monitoring what that agent does. Naively, this seems like a clear smoking gun that the agent is scheming. But LLMs often do weird things; they could easily just be confused, or have made a mistake. These all require a response, but the cause and appropriate fix are very different between a scheming and confused model. As such, it is extremely important that we have high-quality methods to be able to incriminate or exonerate a model caught taking sketchy actions, to...
The following is an edited transcript of a talk I gave. I have given this talk at multiple places, including first at Anthropic and then for ELK winners and at Redwood Research, though the version that this document is based on is the version I gave to SERI MATS fellows. Thanks to Jonathan Ng, Ryan Kidd, and others for help transcribing that talk. Substantial edits were done on top of the transcription by me. Though all slides are embedded below, the full slide deck is also available here.
Today I’m going to be talking about deceptive alignment. Deceptive alignment is something I'm very concerned about and is where I think most of the existential risk from AI comes from. And I'm going to try to make the case...
I reread the distillation of this (not the full article sadly due to time limitations), so my understanding of Evan's views might be off. Here's my retrospective:
High path dependence:
• Evan argues that before it obtains the true training objective, gradient descent likely obtains a proxy goal. This seems intuitive, though there's debate as to whether we should interpret recent results as indicating alignment seems to be happening by default or whether it remains a hard, unsolved problem.
• Evan argues that the model will gain an understanding of its own sit...
Epistemic status: intuitive speculation with scattered mathematical justification.
My goal here is to interrogate the dream of writing a beautiful algorithm for intelligence and thereby ensuring safety. For example:
I don't know precisely what alternative he had in mind, but I only seem to remember reading clean functional programs from MIRI, so that is one possibility. Whether or not anyone else endorses it, that is the prototypical picture I have in mind as the holy grail of glass box AGI implementation. In my mind, it has careful comments explaining the precisely chosen, recursive prior and decision rules that make it go "foom."
Is the "inscrutable" complexity of deep neural networks unavoidable? There is has been some prior discussion of the desire to avoid it as map-territory confusion, and I am...
Ah I see (I'll try a bit to remember to say "by so and so").
Is your point trying to use the same definitions and ontology as in the post, and responding to a particular logical argument made within it?
If yes, perhaps it would help if you would pick one of the specific arguments you disagree with — say, the 5-part argument argument for "stealing is bad" as a cosmic Schelling norm — and identify what is the first statement or inference in it that you think is wrong.
If no, can you say what is your definition of "stealing" and "stealing is good"?... (read more)