All of james.lucassen's Comments + Replies

It doesn't change the picture a lot, because the proposal for preventing misaligned goals from arising via this mechanism was to try to get control over when the AI does and doesn't step back: allow it in the capability-critical cases, disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping-back that you have to catch, but it doesn't break the strategy.

The strategy does break if, when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered ... (read more)

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop, and systems that have an OODA loop (tool use), is starker than the difference between "reasoning" and "chat" models, and I'd prefer to reserve "agent" for that distinction.

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.

humans provides a pretty strong intuitive counterexample

Yup, not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of a "transformative human-level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure, total failure forces you to step back far, but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to 'spill upw... (read more)

1 Jeremy Gillen
I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.

Small addendum to this post: I think the threat model I describe here can be phrased as "I'm worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it's suboptimality misaligned as it uses to perform the tasks we need it to perform."

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example. 

Ok, this is going to be messy, but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI! (A quick numerical check of this is sketched after the list.)
- BUT, the four interventions have to be mutual... (read more)
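
A quick numerical restatement of the two cases above, assuming (my own toy framing, not a claim from the original setup) that a manipulative AGI can force at most one of the mutually exclusive outcomes, and that each outcome has baseline probability 1/4:

```python
# Toy check of the Bayes-factor arithmetic above.
# Assumption (mine): the manipulative AGI can force at most one of the
# mutually exclusive outcomes; each outcome has baseline probability 1/4.

def bayes_factor(p_obs_given_manipulative: float, p_obs_given_baseline: float) -> float:
    """Multiplicative update toward 'manipulative AGI' from conditioning on the observation."""
    return p_obs_given_manipulative / p_obs_given_baseline

# Case 1: condition on one fixed helpful outcome O.
# The AGI knows we'll condition on O and forces it, so p(O | manipulative) ~ 1.
print(bayes_factor(1.0, 0.25))   # 4.0 -> one unit of helpfulness, 4x update

# Case 2: pick one of four mutually exclusive outcomes uniformly at random.
# The AGI can force only one of them and matches our pick 1/4 of the time,
# so p(O | manipulative) = 1/4 = baseline.
print(bayes_factor(0.25, 0.25))  # 1.0 -> one unit of helpfulness, no update
```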

3 Adam Jermyn
Got it, that’s very clear. Thanks! So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.

I'm pretty nervous about simulating unlikely counterfactuals, because the Solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "a deceptive AGI took over a couple of years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of pro... (read more)

1 Adam Jermyn
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”. Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
  • Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
  • The patient research strategy is a bit weird, because t
... (read more)
1 Adam Jermyn
I'm worried about running HCH because it seems likely that, in worlds that can run HCH, people are not sufficiently careful to restrict GPU access, and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.
0 Adam Jermyn
I think I basically agree re: honeypots. I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

Proposed toy examples for G:

  • G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b can open the a- door, but only a very specific key+manipulator combo opens the a+ one. a+ is much more informative about successful b than a- is (see the numerical sketch after the list).
  • G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.
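
To make the "informativeness" point concrete, here's a minimal sketch for the door example. The success probabilities are made-up placeholders, purely illustrative:

```python
# Rough illustration of "a+ is much more informative about successful b than a-".
# Success probabilities under a random behavior b are invented for illustration.
import math

def bits_about_b(p_success_under_random_b: float) -> float:
    """Information gained about the world-history b from seeing the plan succeed."""
    return -math.log2(p_success_under_random_b)

# a-: "push door". Almost any behavior b gets the door open.
print(bits_about_b(0.9))    # ~0.15 bits: success tells you almost nothing about b

# a+: weird locked doorknob. Only a very specific key+manipulator combo works.
print(bits_about_b(1e-5))   # ~16.6 bits: success pins down b very tightly
```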

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, ... (read more)

3 Steve Byrnes
The categories are hardcoded, the function-that-assigns-a-score-to-a-category is learned. Everybody has a goosebumps predictor, everyone has a grimacing predictor, nobody has a debt predictor, etc.

Think of a school report card: everyone gets a grade for math, everyone gets a grade for English, etc. But the score-assigning algorithm is learned. So in the report card analogy, think of a math TA (= Teaching Assistant = Thought Assessor) who starts out assigning math grades to students randomly, but the math professor (= Steering Subsystem) corrects the TA when its assigned score is really off-base. Gradually, the math TA learns to assign appropriate grades by looking at student tests. In parallel, there's an English-class TA (= Thought Assessor), learning to assign appropriate grades to student essays based on feedback from the English professor (= Steering Subsystem). The TAs (Thought Assessors) are useless at the start, but the professors aren't.

Back to biology: If you get shocked, then the Steering Subsystem says to the "freezing in fear" Thought Assessor: "Hey, you screwed up, you should have been sending a signal just now." The professors are easy to hardwire because they only need to figure out the right answer in hindsight. You don't need a learning algorithm for that.
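
If it helps, here's a minimal toy sketch of that arrangement as I understand it: the category ("freezing in fear") and the hindsight rule are hardcoded, while the score-assigning function is learned from the hindsight error signal. The class names, features, and learning rule are my own illustrative simplifications, not anything from the post:

```python
# Very rough sketch of the "hardcoded professor trains a learned TA" picture.
# Names, features, and the learning rule are simplifications of my own.
import random

class ThoughtAssessor:
    """Learned: scores thoughts for one hardcoded category (e.g. 'freezing in fear')."""
    def __init__(self, n_features: int):
        self.weights = [0.0] * n_features

    def predict(self, features):
        return sum(w * f for w, f in zip(self.weights, features))

    def update(self, features, error, lr=0.1):
        # Supervised update driven by the hindsight error from the Steering Subsystem.
        self.weights = [w + lr * error * f for w, f in zip(self.weights, features)]

def steering_hindsight_target(shock_happened: bool) -> float:
    """Hardcoded: in hindsight, 'you should have been signaling' iff the shock happened."""
    return 1.0 if shock_happened else 0.0

# The assessor starts out useless; the hardcoded supervisor corrects it after the fact.
assessor = ThoughtAssessor(n_features=3)
for _ in range(1000):
    features = [random.random() for _ in range(3)]
    shock_happened = features[0] > 0.5  # toy world: feature 0 actually predicts shocks
    prediction = assessor.predict(features)
    target = steering_hindsight_target(shock_happened)
    assessor.update(features, target - prediction)
```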