A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.
I signed an amicus brief supporting Anthropic's right to do business without governmental retaliation. As an AI expert, I attest that Anthropic's technical concerns are legitimate, and no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I'm proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects of study for secret elicitation techniques. We then study the efficacy of honesty elicitation and lie detection techniques for detecting and removing generated falsehoods.
This post presents a summary of the paper, including examples of transcripts and other miscellaneous findings.
X thread | arXiv paper | Code | Transcripts
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
I get that a lot of people use "simulation" and "computer program" as basically synonyms, but that's a bit linguistically impoverished for the hard work of analyzing distinct metaphysical hypotheses and their consequences. Consider that ideal behavior is different in
a) computer worlds that are built to mimic an existing world, in which case our "job" is to be similar to whatever is "out there" that we're a simulation of, versus
b) computer worlds that are built to be a de novo home for a new kind of life or being, such as for entertainment or exploratory scien...
It seems that LLMs, even after being trained on ~all human output, are not good enough at reasoning for their capabilities to be amplified to arbitrary levels through iterated amplification, so AI companies are mainly increasing AI capabilities via RLVR instead. Is this impression wrong, and if not, how should one update on it?
Aside from the potential implications on alignment (i.e., closing off one approach that seemed hopeful for some, at least for the foreseeable future), I wonder if this is a deficiency in LLMs (their architecture or how they're trai...
I'm curious what Sonnet 4.5's score would be on your external CoT controllability eval, i.e., what Sonnet 4.5's blue bar in your Fig. 25 would look like. (I couldn't find a result like this in your paper, but I might have overlooked something.)
My casual impression has been that external CoT with Sonnet 4.5 is fairly steerable, but this impression is mostly based on "in practice I have no trouble getting Sonnet 4.5 to do the kind of CoT I need" rather than any kind of systematic investigation.
It seems like at least one of the following must be true: