All of james.lucassen's Comments + Replies

It doesn't change the picture a lot, because the proposal for preventing misaligned goals from arising via this mechanism was to try to get control over when the AI does and doesn't step back: allow it in the capability-critical cases, disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping-back that you have to catch, but it doesn't break the strategy.

The strategy does break if, when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered ... (read more)

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop, and systems that have an OODA loop (tool use), is starker than the difference between "reasoning" and "chat" models, and I'd prefer to reserve "agent" for that distinction.

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.

humans provides a pretty strong intuitive counterexample

Yup, not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of a "transformative human-level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure, total failure forces you to step back far, but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to 'spill upw... (read more)

1 Jeremy Gillen
I'd be curious about why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing. At least for large scale thinking.

Small addendum to this post: I think the threat model I describe here can be phrased as "I'm worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it's suboptimality misaligned as it uses to perform the tasks we need it to perform."

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example. 

Ok, this is going to be messy, but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI! (A quick numerical check of this is sketched after the list.)
- BUT, the four interventions have to be mutual... (read more)
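
A quick numerical restatement of the two cases above, assuming (my own toy framing, not a claim from the original setup) that a manipulative AGI can force at most one of the mutually exclusive outcomes, and that each outcome has baseline probability 1/4:

```python
# Toy check of the Bayes-factor arithmetic above.
# Assumption (mine): the manipulative AGI can force at most one of the
# mutually exclusive outcomes; each outcome has baseline probability 1/4.

def bayes_factor(p_obs_given_manipulative: float, p_obs_given_baseline: float) -> float:
    """Multiplicative update toward 'manipulative AGI' from conditioning on the observation."""
    return p_obs_given_manipulative / p_obs_given_baseline

# Case 1: condition on one fixed helpful outcome O.
# The AGI knows we'll condition on O and forces it, so p(O | manipulative) ~ 1.
print(bayes_factor(1.0, 0.25))   # 4.0 -> one unit of helpfulness, 4x update

# Case 2: pick one of four mutually exclusive outcomes uniformly at random.
# The AGI can force only one of them and matches our pick 1/4 of the time,
# so p(O | manipulative) = 1/4 = baseline.
print(bayes_factor(0.25, 0.25))  # 1.0 -> one unit of helpfulness, no update
```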

3 Adam Jermyn
Got it, that’s very clear. Thanks! So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.

I'm pretty nervous about simulating unlikely counterfactuals, because the Solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "a deceptive AGI took over a couple of years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of pro... (read more)

1 Adam Jermyn
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”. Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
  • Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
  • The patient research strategy is a bit weird, because t
... (read more)
1 Adam Jermyn
I'm worried about running HCH because it seems likely that, in worlds that can run HCH, people are not sufficiently careful to restrict GPU access, and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPUs at all.
0 Adam Jermyn
I think I basically agree re: honeypots. I'm sure there'll be weird behaviors if we outlaw simulations, but I don't think that's a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn't stop them from solving alignment.

Proposed toy examples for G:

  • G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b can open the a- door, but only a very specific key+manipulator combo opens the a+ one. a+ is much more informative about successful b than a- is (see the numerical sketch after the list).
  • G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.
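
To make the "informativeness" point concrete, here's a minimal sketch for the door example. The success probabilities are made-up placeholders, purely illustrative:

```python
# Rough illustration of "a+ is much more informative about successful b than a-".
# Success probabilities under a random behavior b are invented for illustration.
import math

def bits_about_b(p_success_under_random_b: float) -> float:
    """Information gained about the world-history b from seeing the plan succeed."""
    return -math.log2(p_success_under_random_b)

# a-: "push door". Almost any behavior b gets the door open.
print(bits_about_b(0.9))    # ~0.15 bits: success tells you almost nothing about b

# a+: weird locked doorknob. Only a very specific key+manipulator combo works.
print(bits_about_b(1e-5))   # ~16.6 bits: success pins down b very tightly
```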

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, ... (read more)

3 Steve Byrnes
The categories are hardcoded, the function-that-assigns-a-score-to-a-category is learned. Everybody has a goosebumps predictor, everyone has a grimacing predictor, nobody has a debt predictor, etc.

Think of a school report card: everyone gets a grade for math, everyone gets a grade for English, etc. But the score-assigning algorithm is learned. So in the report card analogy, think of a math TA (= Teaching Assistant = Thought Assessor) who starts out assigning math grades to students randomly, but the math professor (= Steering Subsystem) corrects the TA when its assigned score is really off-base. Gradually, the math TA learns to assign appropriate grades by looking at student tests. In parallel, there's an English-class TA (= Thought Assessor), learning to assign appropriate grades to student essays based on feedback from the English professor (= Steering Subsystem). The TAs (Thought Assessors) are useless at the start, but the professors aren't.

Back to biology: If you get shocked, then the Steering Subsystem says to the "freezing in fear" Thought Assessor: "Hey, you screwed up, you should have been sending a signal just now." The professors are easy to hardwire because they only need to figure out the right answer in hindsight. You don't need a learning algorithm for that.
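
If it helps, here's a minimal toy sketch of that arrangement as I understand it: the category ("freezing in fear") and the hindsight rule are hardcoded, while the score-assigning function is learned from the hindsight error signal. The class names, features, and learning rule are my own illustrative simplifications, not anything from the post:

```python
# Very rough sketch of the "hardcoded professor trains a learned TA" picture.
# Names, features, and the learning rule are simplifications of my own.
import random

class ThoughtAssessor:
    """Learned: scores thoughts for one hardcoded category (e.g. 'freezing in fear')."""
    def __init__(self, n_features: int):
        self.weights = [0.0] * n_features

    def predict(self, features):
        return sum(w * f for w, f in zip(self.weights, features))

    def update(self, features, error, lr=0.1):
        # Supervised update driven by the hindsight error from the Steering Subsystem.
        self.weights = [w + lr * error * f for w, f in zip(self.weights, features)]

def steering_hindsight_target(shock_happened: bool) -> float:
    """Hardcoded: in hindsight, 'you should have been signaling' iff the shock happened."""
    return 1.0 if shock_happened else 0.0

# The assessor starts out useless; the hardcoded supervisor corrects it after the fact.
assessor = ThoughtAssessor(n_features=3)
for _ in range(1000):
    features = [random.random() for _ in range(3)]
    shock_happened = features[0] > 0.5  # toy world: feature 0 actually predicts shocks
    prediction = assessor.predict(features)
    target = steering_hindsight_target(shock_happened)
    assessor.update(features, target - prediction)
```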