If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.

Why do misalignment risks increase as AIs get more capable?

Charlie Steiner11d20

I'm big on point #2 feeding into point #1.

"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.

Selective modularity: a research agenda

Charlie Steiner1mo43

Thanks!

Any thoughts on how this line of research might lead to "positive" alignment properties? (i.e. Getting models to be better at doing good things in situations where what's good is hard to learn / figure out, in contrast to a "negative" property of avoiding doing bad things, particularly in cases clear enough we could build a classifier for them.)

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Charlie Steiner1mo20

I don't get what experiment you are thinking about (most CoT end with the final answer, such that the summarized CoT often ends with the original final answer).

Hm, yeah, I didn't really think that through. How about giving a model a fraction of either its own precomputed chain of thought, or the summarized version, and plotting curves of accuracy and further tokens used vs. % of CoT given to it? (To avoid systematic error from summaries moving information around, doing this with a chunked version and comparing at each chunk seems like a good idea.)

Anyhow, thanks for the reply. I have now seen last figure.

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Charlie Steiner1mo22

Do you have the performance on replacing CoTs with summarized CoTs without finetuning to produce them? Would be interesting.

"Steganography" I think give the wrong picture of what I expect - it's not that the model would be choosing a deliberately obscure way to encode secret information. It's just that it's going to use lots of degrees of freedom to try to get better results, often not what a human would do.

A clean example would be sometimes including more tokens than necessary, so that it can do more parallel processing at those tokens. This is quite different from steganography because the tokens aren't being used for semantic content, not even hidden content, they have a different mechanism of impact on the computation the AI does for future tokens.

But as with most things, there's going to be a long tail of "unclean" examples - places where tokens have metacognitive functions that are mostly reasonable to a human reader, but are interpreted in a slightly new way. Some of these functions might be preserved or reinvented under finetuning on paraphrases, though only to the extent they're useful for predicting the rest of the paraphrased CoT.

For scheming, we should first focus on detection and then on prevention

Charlie Steiner2mo2-2

I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.

Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.

How might we safely pass the buck to AI?

Charlie Steiner2mo67

I also would not say "reasoning about novel moral problems" is a skill (because of the is ought distinction)

It's a skill the same way "being a good umpire for baseball" takes skills, despite baseball being a social construct.^[1]

I mean, if you don't want to use the word "skill," and instead use the phrase "computationally non-trivial task we want to teach the AI," that's fine. But don't make the mistake of thinking that because of the is-ought problem there isn't anything we want to teach future AI about moral decision-making. Like, clearly we want to teach it to do good and not bad! It's fine that those are human constructs.

The agents don't need to do reasoning about novel moral problems (at least not in high stakes settings). We're training these things to respond to instructions.

Sorry, isn't part of the idea to have these models take over almost all decisions about building their successors? "Responding to instructions" is not mutually exclusive with making decisions.

^{^}
"When the ball passes over the plate under such and such circumstances, that's a strike" is the same sort of contingent-yet-learnable rule as "When you take something under such and such circumstances, that's theft." An umpire may take goal directed action in response to a strike, making the rules of baseball about strikes "oughts," and a moral agent may take goal directed action in response to a theft, making the moral rules about theft "oughts."

How might we safely pass the buck to AI?

Charlie Steiner2mo60

Condition 2: Given that M_1 agents are not initially alignment faking, they will maintain their relative safety until their deferred task is completed.
It would be rather odd if AI agents' behavior wildly changed at the start of their deferred task unless they are faking alignment.

"Alignment" is a bit of a fuzzy word.

Suppose I have a human musician who's very well-behaved, a very nice person, and I put them in charge of making difficult choices about the economy and they screw up and implement communism (or substitute something you don't like, if you like communism).

Were they cynically "faking niceness" in their everyday life as a musician? No!

Is it rather odd if their behavior wildly changes when asked to do a new task? No! They're doing a different task, it's wrong to characterize this as "their behavior wildly changing."

If they were so nice, why didn't they do a better job? Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.

An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike. Doing moral reasoning about novel problems that's good by human standards is a skill. If an AI lacks that skill, and we ask it to do a task that requires that skill, bad things will happen without scheming or a sudden turn to villainy.

You might hope to catch this as in argument #2, with checks and balances - if AIs disagree with each other about how to do moral reasoning, surely at least one of them is making a mistake, right? But sadly for this (and happily for many other purposes), there can be more than just one right thing to do, there's no bright line that tells you whether a moral disagreement is between AIs who are both good at moral reasoning by human standards, or between AIs who are bad at it.

The most promising scalable safety plan I’m aware of is to iteratively pass the buck, where AI successors pass the buck again to yet more powerful AI. So the best way to prepare AI to scale safety might be to advance ‘buck passing research’ anyway.

Yeah, I broadly agree with this. I just worry that if you describe the strategy as "passing the buck," people might think that the most important skills for the AI are the obvious "capabilities-flavored capabilities,"^[1] and not conceptualize "alignment"/"niceness" as being made up of skills at all, instead thinking of it in a sort of behaviorist way. This might lead to not thinking ahead about what alignment-relevant skills you want to teach the AI and how to do it.

^{^}
Like your list:
- ML engineering
- ML research
- Conceptual abilities
- Threat modeling
- Anticipating how one’s actions affect the world.
- Considering where one might be wrong, and remaining paranoid about unknown unknowns.

Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems

Charlie Steiner2mo32

I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.

It seems like the way you're thinking about this, there's some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed towards even-more-applied concepts. It seems like this is the sort of thing you could only ever learn by learning about the real world first - you can't start from a blank slate and only learn "the abstract stuff", because you only know which stuff is abstract by learning about its relationships to less abstract stuff.

Is weak-to-strong generalization an alignment technique?

Answer by Charlie SteinerFeb 03, 202530

Not being an author in any of those articles, I can only give my own take.

I use the term "weak to strong generalization" to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is "trying" to demonstrate, usually just with slight twists on supervised learning, and when that works well, that's the phenomenon.

It is not an alignment technique to me because the phrase "alignment technique" sounds like it should be something more specific. But if you specified details about how the humans were doing demonstrations, and how the student AI was using them, that could be an alignment technique that uses the phenomenon of weak to strong generalization.

I do think the endgame for w2sg still should be to use humans as the weak teacher. You could imagine some cases where you've trained a weaker AI that you trust, and gain some benefit from using it to generate synthetic data, but that shouldn't be the only thing you're doing.