This is a somewhat technical note.
By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see "How quick and big would a software intelligence explosion be?"
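To make the role of r concrete, here is a minimal sketch under a toy model. The model is an illustrative assumption, not something this note states: software level S scales as cumulative research input I to the power r, research input accumulates at a rate proportional to S (full automation on fixed compute), and r is held constant. Under those assumptions each software doubling takes 2^(1/r - 1) times as long as the previous one. The function names are hypothetical.

```python
def doubling_time_ratio(r: float) -> float:
    """Ratio of each software doubling time to the previous one.

    Toy assumptions (illustrative only): S = I**r, dI/dt proportional to S,
    and r held constant over the whole period.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    return 2.0 ** (1.0 / r - 1.0)


def doubling_times(r: float, first_doubling_time: float, n: int) -> list[float]:
    """Durations of the first n software doublings under the toy model."""
    ratio = doubling_time_ratio(r)
    return [first_doubling_time * ratio**k for k in range(n)]


if __name__ == "__main__":
    # Compare decelerating (r < 1), steady (r = 1) and accelerating (r > 1) cases.
    for r in (0.5, 1.0, 2.0):
        times = [round(t, 3) for t in doubling_times(r, 1.0, 4)]
        print(f"r={r}: ratio={doubling_time_ratio(r):.3f}, doubling times={times}")
```

In this sketch r = 1 gives constant doubling times (plain exponential growth), r > 1 gives shrinking doubling times, and r < 1 gives lengthening ones. Holding r constant is the strongest assumption here, which is why the definition above also requires r to stay above 1 for long enough.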
Even without a "software-only singularity", I think full automation of AI R&D probably greatly speeds up progress for two main reasons:
This isn't what I want because
Yeah, I adjust for this in my other comment.
I don't understand, the whole point of the experiments is to get us better labor.
yeah, so they do - a doubling of cumulative experiments drives 0.7 doublings of software. And then that better software does more cognitive work to improve software further still. But it doesn't increase the amount of compute available for experiments, so the feedback loop doesn't go full circle.
For "cognitive labour" we have: more compute -> more cog labour -> better software -> more cog labour -&...
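The open experiments loop can be sketched numerically. This is a toy illustration with assumptions that go beyond the comment: experiments run at a constant rate because compute is fixed, and software scales as cumulative experiments to the power 0.7 (the figure quoted above). The function names, the starting stock, and the experiment rate are hypothetical.

```python
import math


def software_doublings(cum_experiments: float, initial_experiments: float,
                       elasticity: float = 0.7) -> float:
    """Software doublings implied by growth in the cumulative experiment stock."""
    return elasticity * math.log2(cum_experiments / initial_experiments)


def years_to_double_stock(stock_doublings: int, initial_experiments: float,
                          experiments_per_year: float) -> float:
    """Years until the cumulative stock has doubled `stock_doublings` times,
    assuming a fixed experiment rate (compute does not grow)."""
    target = initial_experiments * 2 ** stock_doublings
    return (target - initial_experiments) / experiments_per_year


if __name__ == "__main__":
    e0, rate = 100.0, 100.0  # hypothetical starting stock and yearly rate
    for k in range(1, 5):
        years = years_to_double_stock(k, e0, rate)
        gained = software_doublings(e0 * 2 ** k, e0)
        print(f"{k} stock doublings after {years:.0f} years -> {gained:.1f} software doublings")
```

With a fixed experiment rate, each further doubling of the cumulative stock takes twice as long as the one before, so each additional 0.7 software doublings arrives more slowly. That is the sense in which this loop does not close: better software does not feed back into more compute for experiments.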
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.
"I cannot tell a lie... I would sabotage with my own command line."[1]
What's the actual problem with eval gaming?
The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example,...
I'm curious how far SDFT generalizes, versus how far RL generalizes.
SDFT seems to rely on the model having beliefs about the behavior of the assistant character. You train it on new evidence, and primarily this updates its beliefs about the character. Secondarily, it updates the mechanisms shared across all characters.
Eval gaming due to task-directed RL, on the other hand, potentially gets encoded in new skills like "how to follow a plan I wrote" (or the rich semantics that make those metacognitive skills possible), which, to the extent they're new machine...
See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
My (low confidence) understanding of the proposal is something like:
"The AI takes an action A if and only if {long-term future-self if the AI takes action a | a in the action space} on aggregate like A"
where "long-term future self" is defined by some recursive process where you locally choose what entity counts as your near-term future self (it can be some other entity that you trust more - e.g. a future aligned AI), where these future selves all have access to an AI that honestly answers questions that are already meaningful to the human when the right an...
I am going to talk about my experience in the Jane Street LLM backdoor challenge. I am sharing partial results. I managed to crack some of the models using white-box methods, after the activation/prompting approach didn't pan out. Happy to discuss better or more promising approaches.
A few months ago a Dwarkesh Patel podcast episode advertised a Jane Street backdoor challenge:
We've trained backdoors into three language models.
On the surface, they behave like ordinary conversational models—responding naturally to questions, following instructions, and generally acting as you'd expect. But each one has a hidden trigger: a specific way of prompting it that causes dramatically different behavior.
You have four models:
Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues Eliezer has raised many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.