A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.
A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider satisfying such cheap-to-satisfy preferences, provided the AI isn't caught behaving dangerously and doing so doesn't degrade usefulness or substantially risk making the AI more ambitiously misaligned.
This looks like a good idea for surprisingly many reasons:
Yes! I'm quite excited by this proposal and I currently plan to write more about it and study it empirically. The basic idea is to try to make AIs' reward-hacking more responsive to satiation.
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and that no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural subjects for studying secret elicitation techniques. We then study the efficacy of honesty-elicitation and lie-detection techniques for detecting and removing generated falsehoods.
This post presents a summary of the paper, including examples of transcripts and other miscellaneous findings.
X thread | arXiv paper | Code | Transcripts
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
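The coordination structure described above can be made concrete with a toy simulation (a minimal sketch; the option names, salience weights, and `schelling_round` helper are all hypothetical illustrations, not anything from the post): each participant independently picks the verdict it expects a diverse population to converge on, guided only by a shared sense of which answer is most focal.

```python
import random

def schelling_round(options, salience, n_agents=5, noise=0.0, rng=None):
    """One round of a Schelling coordination game: each agent independently
    picks the option it expects the others to converge on. With noise=0,
    every agent picks the most salient option, so coordination succeeds."""
    rng = rng or random.Random(0)
    picks = []
    for _ in range(n_agents):
        if noise and rng.random() < noise:
            picks.append(rng.choice(options))  # idiosyncratic deviation
        else:
            picks.append(max(options, key=salience.get))  # shared focal point
    return picks, len(set(picks)) == 1  # success iff all agents matched

# Example: verdicts on a moral question, with one broadly shared focal verdict.
options = ["permissible", "impermissible", "unclear"]
salience = {"permissible": 0.2, "impermissible": 0.7, "unclear": 0.1}
picks, coordinated = schelling_round(options, salience)
```

The point of the sketch is that "Schelling goodness" of a verdict corresponds to its salience being common knowledge: agents coordinate not by communicating, but because each expects the others to reason from the same shared constraints.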
I get that a lot of people use "simulation" and "computer program" as basically synonyms, but that's a bit linguistically impoverished for the hard work of analyzing distinct metaphysical hypotheses and their consequences. Consider that ideal behavior is different in
a) computer worlds that are built to mimic an existing world, in which case our "job" is to be similar to whatever is "out there" that we're a simulation of, versus
b) computer worlds that are built to be a de novo home for a new kind of life or being, such as for entertainment or exploratory scien...
It seems that LLMs, even after being trained on ~all human output, are not good enough at reasoning for their capabilities to be amplified to arbitrary levels through iterated amplification, which is why AI companies are mainly increasing AI capabilities via RLVR instead. Is this impression wrong, and if not, how should we update on it?
Aside from the potential implications for alignment (i.e., closing off one approach that seemed hopeful to some, at least for the foreseeable future), I wonder if this is a deficiency in LLMs (their architecture or how they're trai...