TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches.
As AI systems get more capable, it becomes increasingly uncompetitive, and eventually infeasible, to avoid deferring to AIs on more and more decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on managing these risks. Broadly speaking, when I say "deferring to AIs" [2] I mean having these AIs do virtually all of the work of developing more capable and aligned successor AIs, managing exogenous risks, and making strategic decisions. [3] If we plan to defer to AIs, I think it's safest...
My overall sense is that this behavioral testing will generally be hard. It will probably be a huge mess if we're extremely rushed and need to do all of it in a few months.
Why can't we do a bunch of the work for this ahead of time? E.g., creating high-effort evaluation datasets for reward models.
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
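The coordination games described above can be made concrete with a toy simulation (not from the post; the option names, salience values, and payoffs are invented for illustration): each player independently tries to match everyone else's pick, guided only by a shared "salience" heuristic standing in for broadly shared background knowledge.

```python
import random

OPTIONS = ["heads", "tails"]

def pick(shared_salience, noise=0.0):
    """Pick the most salient option, occasionally deviating at random."""
    if random.random() < noise:
        return random.choice(OPTIONS)
    return max(OPTIONS, key=shared_salience.get)

def play_round(n_players, shared_salience, noise=0.1):
    """Everyone picks independently; payoff 1 to all iff every pick matches."""
    picks = [pick(shared_salience, noise) for _ in range(n_players)]
    return 1 if len(set(picks)) == 1 else 0

random.seed(0)
salience = {"heads": 0.9, "tails": 0.1}  # "heads" is the focal point
wins = sum(play_round(5, salience) for _ in range(1000))
print(f"coordination rate: {wins / 1000:.2f}")
```

Even with per-player noise, a shared focal point lets uncommunicating players coordinate far above chance; the post's move is to ask what plays the role of `salience` when the question being coordinated on is a moral verdict.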
In our case, if we are ultimately running on a computer, then wouldn't that mean that we are a simulation? It seems obvious what the simulators would be trying to achieve: the intention would be to simulate a pre-AGI civilization.
Could we be running on a computer and also be a vivarium?
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Thanks!
In this post I was NOT talking about consequentialism as a model of an AI agent, but rather consequentialism as the power-source ultimately underlying a family of AI techniques, a family that includes most forms of RL and model-based planning.
So “perfect” consequentialism is a red herring (I guess in agreement with your final sentence).
A model-based planning or RL agent can “make mistakes” (so to speak) while also being a ruthless sociopath. “Ruthless sociopath” here means having callous indifference to the welfare of other people, and to respecting...
I think part of the problem is that people think of themselves as having at least, like, a medium-explicit strategy, but the strategy routes through some judgment that conveniently keeps returning "not yet" or "only say things in a somewhat cagey way."
i.e. this advice seems necessary but not sufficient.
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda
Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.
Understanding why a model took an action is a key question in AI safety. It is a difficult question, as many motivations can often plausibly lead to the same action. We are particularly concerned with the case of model incrimination, i.e. observing a model taking a bad action and distinguishing between benign and malign motivations, e.g. scheming vs. confusion.
To practice the skill of understanding motivations, we need to build high-quality environments that serve as realistic proxy tasks for investigating future models. But we’ve found that there are many ways in which environments can be contrived. It is valuable...
I'm curious what Sonnet 4.5's score would be on your external CoT controllability eval, i.e. what Sonnet 4.5's blue bar in your Fig. 25 would look like. (I couldn't find a result like this in your paper, but I might have overlooked something.)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on "in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need" rather than any kind of systematic investigation.
It seems like at least one of the following must be true:
- There are
...