Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done.
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
I think part of the problem is that people think of themselves as having at least a moderately explicit strategy, but the strategy routes through some judgment that conveniently keeps returning "not yet" or "only say things in a somewhat cagey way."
i.e. this advice seems necessary but not sufficient.
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
Nice. Apropos, I've found that words like "almost" or "approximately" are useful for saying something has relatively low moral worth without the fraught implication that the worth is literally zero. (Equalling precisely zero is a rare event with strong logical consequences.)
E.g.:
"grass has almost no moral value", versus just "grass has no moral value"
"grass boundaries are nearly worthless", versus just "grass boundaries are worthless".
My sense is that people don't acknowledge these caveats out of fear that someone will try to force them to debate about...
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda
Gerson and Aditya are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda.
Understanding why a model took an action is a key question in AI Safety. It is a difficult question because many different motivations can plausibly lead to the same action. We are particularly concerned with the case of model incrimination: observing a model take a bad action and distinguishing between benign and malign motivations, e.g. confusion vs. scheming.
To practice the skill of understanding motivations, we need to build high-quality environments that serve as realistic proxy tasks for investigating future models. But we’ve found that there are many ways in which environments can be contrived. It is valuable...
Thanks!
In this post I was NOT talking about consequentialism as a model of an AI agent, but rather consequentialism as the power-source ultimately underlying a family of AI techniques, a family that includes most forms of RL and model-based planning.
So “perfect” consequentialism is a red herring (I guess in agreement with your final sentence).
A model-based planning or RL agent can “make mistakes” (so to speak) while also being a ruthless sociopath. “Ruthless sociopath” here means having callous indifference to the welfare of other people, and to respecting...