All of Chris van Merwijk's Comments + Replies

but note that the gradual problem makes the risk of coups go up.

Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the 2019 mesa-optimization paper, I can confirm this. I don't recall ever thinking seriously about a counting argument over functions.

I just want to register a prediction: I think something like Meta's Coconut will, in the long run, in fact perform much better than natural-language CoT. Perhaps not in this timeframe, though.

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

2Daniel Kokotajlo
"The Conquest of Mexico and the Conquest of Peru" "1493" and... the one about Afonso I forget the title... it might have been this History of Portugal : Marques, Antonio Henrique R. de Oliveira : Free Download, Borrow, and Streaming : Internet Archive or this https://www.amazon.com/gp/search?index=books&tag=NYTBSREV-20&field-keywords=Conquerors%3A+How+Portugal+Forged+the+First+Global+Empire+Roger+Crowley I also ended up reading "1491" and "1492."  

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of GPT-3's context window).

2Abram Demski
A good question. I've never seen it happen myself; so from where I'm standing, it looks like short emergence examples are cherry-picked.

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me that this informal argument is a bit suspect. Actually, I think this argument would not apply to Solomonoff Induction.

Suppose we have two programs that define distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (Equivalently, p1 samples each bit i.i.d. Bernoulli(1/2) from {0,1}, while p2 samples 0 i.i.d. with probability 100%.)


Suppose we use a perfect Bayesian reasoner to sample bitstrings, bu... (read more)

1Cleo Nardo
Yep, you're correct. The original argument in the Waluigi mega-post was sloppy. * If μ updated the amplitudes in a perfectly bayesian way and the context window was infinite, then the amplitudes of each premise must be a martingale. But the finite context breaks this. * Here is a toy model which shows how the finite context window leads to Waluigi Effect. Basically, the finite context window biases the Dynamic LLM towards premises which can be evidenced by short strings (e.g. waluigi), and biases away from premises which can't be evidenced by short strings (e.g. luigis). * Regarding your other comment, a long context window doesn't mean that the waluigis won't appear quickly. Even with an infinite context window, the waluigi might appear immediately. The assumption that the context window is short/finite is only necessary to establish that the waluigi is an absorbing state but luigi isn't.
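To make the martingale point concrete, here is a minimal sketch (my own illustration, not from the thread) of the two-program toy model above: a perfect Bayesian reasoner mixing p1 (uniform random bits) and p2 (all zeros). Under the reasoner's own predictive distribution, the expected next-step posterior equals the current posterior, which is exactly the property the finite context window is said to break.

```python
# Minimal sketch (illustrative, not from the original comment): the Bayesian
# posterior on p2 (the all-zeros program) is a martingale under the reasoner's
# own predictive distribution, so neither hypothesis is an attractor by default.

def posterior_p2(bits, prior=0.5):
    """P(p2 | bits); p2 is ruled out by any observed 1."""
    if any(bits):
        return 0.0
    like_p1 = 0.5 ** len(bits)   # p1 assigns (1/2)^n to every length-n string
    like_p2 = 1.0                # p2 assigns probability 1 to the all-zero string
    return prior * like_p2 / (prior * like_p2 + (1 - prior) * like_p1)

def expected_next_posterior(bits):
    """E[posterior after one more bit], with the bit drawn from the predictive mixture."""
    post = posterior_p2(bits)
    p_zero = post * 1.0 + (1 - post) * 0.5  # predictive P(next bit = 0)
    return (p_zero * posterior_p2(bits + [0])
            + (1 - p_zero) * posterior_p2(bits + [1]))

history = [0, 0, 0]                       # consistent with both hypotheses so far
print(posterior_p2(history))              # ~0.889
print(expected_next_posterior(history))   # ~0.889, equal to the current posterior
```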

There is a general phenomenon where:

  • Person A has mental model X and tries to explain X with explanation Q
  • Person B doesn't get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn't actually contain the insights, but P does.
  • Person C doesn't get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: ...

It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contai... (read more)

2Alex Turner
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I'm making an upwards update on these points having been understood by at least some thinkers, although I've also made a lot of downward updates for other reasons.

Very late reply, sorry.

"even though reward is not a kind of objective", this is a terminological issue. In my view, calling a "antecedent-computation reinforcement criterion" an "objective" matches my definition of "objective", and this is just a matter of terminology. The term "objective" is ill-defined enough that "even though reward is not a kind of objective" is a terminological claim about objective, not a claim about math/the world.

The idea that RL agents "reinforce antecedent computations" is completely core to our story of deception. You could not ... (read more)

2Alex Turner
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen in how they play around with e.g. reward shaping and such. It's that I don't perceive this knowledge to be engaged when some people reason about "optimization processes" and "selecting for high-reward models" on e.g. LW. I just continue to think "I wouldn't write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP", but it's possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point.

FWIW, I think a bunch of my historical frustration here has been an experience of:
1. Pointing out the "reward chisels computation" point
2. Having some people tell me it's obvious, or already known, or that they already invented it
3. Seeing some of the same people continue making similar mistakes (according to me)
4. Not finding instances of other people making these points before OP
5. Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.
   1. If I found several comments explaining what is clearly the "reward chisels computation" point, where the comments were posted before this post, by people who weren't me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
   2. IIRC there's one comment from Wei_Dai from a few years back in this vein, but IDK of others.

The core point in this post is obviously correct, and yes, people's thinking is muddled if they don't take this into account. This point is core to the Risks from Learned Optimization paper (so it's not exactly new, but it's good if it's explained in different/better ways).

Is the following a typo?
"So, the  ( works"

First sentence of "CoCo Equilibria".

Maybe you have made a gestalt-switch I haven't made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.

Is there a difference between saying:

  • A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn't ac
... (read more)
3Alex Turner
Where did RFLO point it out? RFLO talks about a mesa objective being different from the "base objective" (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn't a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan.

Like, from my POV, people are reliably reasoning about what RL "selects for" via "lots of optimization pressure" on "high reward by the formal metric", but who's reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion?

Yeah, I think it just doesn't communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn't call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It's like if I said "My cake is red" when the cake is blue, I guess? IMO it's just not how to communicate the concept. Why is this reasonable?
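As a concrete illustration of the "what gets reinforced when credit assignment activates" framing, here is a minimal sketch (my own, not from the thread; it assumes PyTorch and a made-up toy policy): in a vanilla REINFORCE update, the reward never appears as a goal the network represents; it only enters as a scalar that scales the gradient of the log-probability of whatever computation the network already performed.

```python
# Illustrative sketch: reward as a reinforcer of antecedent computation,
# not as an objective the network "sees".
import torch

policy = torch.nn.Linear(4, 2)          # toy policy: 4-dim state -> 2 action logits
opt = torch.optim.SGD(policy.parameters(), lr=0.1)

state = torch.randn(4)
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                  # whatever the current parameters happened to produce

reward = 1.0                            # the reward function's output: just a number here
loss = -reward * dist.log_prob(action)  # reward only scales the gradient of this log-prob
opt.zero_grad()
loss.backward()
opt.step()                              # parameters shift to make the sampled action more likely
```

Nothing in this update requires the network to model or pursue the reward; the scalar just reweights credit for the computation that already happened. Whether that makes reward an "objective" is the terminological question at issue above.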

It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You've certainly phrased things differently and made some specific points that we didn't, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?

(Note: I am still surprised sometimes that people still think certain wireheading scenarios make sense despite having read RFLO, so it's plausible to me that we really didn't communicate everything that's in my head about this.)

3Alex Turner
"Wireheading is improbable" is only half of the point of the essay.  The other main point is "reward functions are not the same type of object as utility functions." I haven't reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as "objectives": Which is reasonable parlance, given that everyone else uses it, but I don't find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provides a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an 'objective' at all.  (You might have privately known about this distinction. Fine by me! But I can't back it out from a skim of RFLO, even already knowing the insight and looking for it.)

Here is my partial honest reaction, just two points I'm somewhat dissatisfied with (not meant to be exhaustive):
2. "A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure." I would like there to be an argument for this claim that doesn't rely on nanotech, and solidly relies on actually existing amounts of compute. E.g. if the argument relies on running intractable detailed simulations of prote... (read more)

1Rob Bensinger
From an Eliezer comment: If Iceland did this, it would plausibly need some way to (1) not have its AGI project bombed in response, and (2) be able to continue destroying GPUs in the future if new ones are built, until humanity figures out 'what it wants to do next'. This more or less eliminates the time pressure to rush figuring out what to do next, which seems pretty crucial for good long-term outcomes. It's a much harder problem than just 'cause all GPUs to stop working for a year as a one-time event', and I assume Eliezer's focusing on nanotech is in part because it's a very general technology that can be used for tasks like those as well.

Responding to this very late, but: If I recall correctly, Eric has told me in personal conversation that CAIS is a form of AGI, just not agent-like AGI. I suspect Eric would agree broadly with Richard's definition.

"I talk about consequentialists, but not rational consequentialists", ok this was not the impression I was getting. 

1Joe Collman
Well I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote) But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values]) Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer. I think I focus on this, since it's the non-obvious part of the argument: it's already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer. Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.

Reading this post a while after it was written: I'm not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seem suspicious to me. Here are some points:

  • In my model of the standard debate setup with a human judge, the human can just use both answers in whichever way they want, independently of which one they select as the correct answer. The fact that one answer provides more useful information than "2+2=?" doesn't imply a "direct" incentive for the human judge to select that as the correct answer. Upon introspec
... (read more)
1Joe Collman
Yes, agreed. A few points on the rest:
1. At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn't necessarily strong evidence of misalignment. So "consequentialist judges will [sometimes correctly] select QIA's" is bad in the sense that it provides cover for "consequentialist judges will [sometimes incorrectly] select QIA's".
2. I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we'd pick to be highly rational relative to the average human - but that's a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them. (similarly for groups of humans) So yes, this is only a problem where manipulation is possible - but since it is possible, we'll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
   1. It's much less clear when such issues show up with sub-optimal play.
3. With "Is this definitely undesirable? I'm not sure, but probably." I'm referring to the debate structure's having this property in general being undesirable. The judge can know it is undesirable in general, but also think that in this specific case things are different - and, of course, the judge can be wrong about this.
   1. Noting here that humans can't make binding pre-commitments. (saying words doesn't qualify)
4. It's hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].

I think a subpartition of S can be thought of as a partial function on S, or equivalently, a variable on S that has the possible value "Null"/"undefined".

3Ramana Kumar
That's right. A partial function can be thought of as a subset (of its domain) and a total function on that subset. And a (total) function can be thought of as a partition (of its domain): the parts are the inverse images of each point in the function's image.
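A small code sketch (my own illustration with made-up names, not from the thread) of this correspondence: a subpartition of S, i.e. a partition of a subset of S, carries the same data as a partial function on S sending each covered element to its part, or equivalently a total function on S that returns None ("Null"/"undefined") off that subset.

```python
# Illustrative sketch: a subpartition of S viewed as a partial function on S.
S = {1, 2, 3, 4, 5}
subpartition = [{1, 2}, {4}]   # a partition of the subset {1, 2, 4} of S

def as_partial_function(subpartition):
    """Send each covered element to its part; uncovered elements map to None."""
    lookup = {x: frozenset(part) for part in subpartition for x in part}
    return lambda x: lookup.get(x)

f = as_partial_function(subpartition)
print(f(1))  # frozenset({1, 2}) -- the part containing 1
print(f(3))  # None -- 3 is outside the subpartition's domain

# Conversely, the parts are recovered as the inverse images of the values of f
# on its domain, matching the function-as-partition view described above.
```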

Can't you define  for any set  of partitions of , rather than  w.r.t. a specific factorization , simply as  iff ? If so, it would seem to me to be clearer to define  that way (i.e. make 7 rather than 2 from proposition 10 the definition), and then basically proposition 10 says "if  is a subset of factors of a partition then here are a set of equivalent definitions in terms of chimera". Also I would guess that proposition 11 is still true for  rat... (read more)

2Scott Garrabrant
I could do that. I think it wouldn't be useful, and wouldn't generalize to subpartitions.