Chris van Merwijk - AI Alignment Forum

The Pando Problem: Rethinking AI Individuality

This comment was written by Claude, based on my bullet points:

I've been thinking about the split-brain patient phenomenon as another angle on this AI individuality question.

Consider split-brain patients: despite having the corpus callosum severed, the two hemispheres don't suddenly become independent agents with totally different goals. They still largely cooperate toward shared objectives. Each hemisphere makes predictions about what the other is doing and adjusts accordingly, even without direct communication.

Why does this happen? I think it's because both hemispheres were trained together for their whole life, developing shared predictive models and cooperative behaviors. When the connection is cut, these established patterns don't just disappear—each hemisphere fills in missing information with predictions based on years of shared experience.

Similarly, imagine training an AI model to solve some larger task, consisting of a bunch of subtasks. Just for practical reasons it will have to carve up the subtask to some extent and call instances of itself to solve the subtask. In order to perform the larger task well, there will be an incentive on the model for these instances to have internal predictive models, habits, drives of something like "I am part of a larger agent, performing a subtask".

Even if we later placed multiple instances of such a model (or of different but similar models) in positions meant to be adversarial - perhaps as checks and balances on each other - they might still have deeply embedded patterns predicting cooperative behavior from similar models. Each instance might continue acting as if it were part of a larger cooperative system, maintaining coordination through these predictive patterns rather than through communication even though their "corpus callosum" is cut (in analogy with split brain patients).

I'm not sure how far this analogy goes, it's just a thought.

The Pando Problem: Rethinking AI Individuality

Chris van Merwijk15d10

A version of what ChatGPT wrote here prompted

What was the prompt?

Compositional language for hypotheses about computations

Chris van Merwijk1mo10

I may be confused somehow. Feel free to ignore. But:
* At first I thought you meant the input alphabet to be the colors, not the operations.
* Instead, am I correct that "the free operad generated by the input alphabet of the tree automaton" is an operad with just one color, and the "operations" are basically all the labeled trees where labels of the nodes are the elements of the alphabet, such that the number of children of a node is always equal to the arity of that label in the input alphabet?
* That would make sense, as the algebra would then I guess assign the state space of the tree automaton to the single color of the operad, and each arity n operation would be mapped to the mathematical function from Q^n to Q.
* That would make sense I think, but then why do you talk about a "colored" operad in: "we can now define a deterministic automaton over a (colored) operad to be an $O$ -algebra"?

Compositional language for hypotheses about computations

Chris van Merwijk1mo10

More precisely, they are algebras over the free operad generated by the input alphabet of the tree automaton

Wouldn't this fail to preserve the arity of the input alphabet? i.e. you can have trees where a given symbol occurs multiple times, and with different amounts of children? That wouldn't be allowed from the perspective of the tree automaton right?

Gradual Disempowerment, Shell Games and Flinches

Chris van Merwijk3mo10

but note that the gradual problem makes the risk of coups go up.

Just a request for editing the post to clarify: do you mean coups by humans (using AI), coups by autonomous misaligned AI, or both?

Many arguments for AI x-risk are wrong

Chris van Merwijk3mo20

EDIT 3/5/24: In the comments for Counting arguments provide no evidence for AI doom, Evan Hubinger agreed that one cannot validly make counting arguments over functions. However, he also claimed that his counting arguments "always" have been counting parameterizations, and/or actually having to do with the Solomonoff prior over bitstrings.

As one of Evan's co-authors on the mesa-optimization paper from 2019 I can confirm this. I don't recall ever thinking seriously about a counting argument over functions.

What’s the short timeline plan?

Chris van Merwijk3mo20

I just want to register a prediction that I think something like meta's coconut will in the long run in fact perform much better than natural language CoT. Perhaps not in this time-frame though.

Cortés, Pizarro, and Afonso as Precedents for Takeover

Chris van Merwijk1y10

NEW EDIT: After reading three giant history books on the subject, I take back my previous edit. My original claims were correct.

Could you edit this comment to add which three books you're referring to?

The Waluigi Effect (mega-post)

Chris van Merwijk2y80

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from my memory it seems like they emerge sooner than you would expect if this was the only reason (given the size of the context window of gpt3).

The Waluigi Effect (mega-post)

Chris van Merwijk2y56

Therefore, the waluigi eigen-simulacra are attractor states of the LLM

It seems to me like this informal argument is a bit suspect. Actually I think this argument would not apply to Solomonof Induction.

Suppose we have to programs that have distributions over bitstrings. Suppose p1 assigns uniform probability to each bitstring, while p2 assigns 100% probability to the string of all zeroes. (equivalently, p1 i.i.d. samples bernoully from {0,1}, p2 samples 0 i.i.d. with 100%).

Suppose we use a perfect Bayesian reasoner to sample bitstrings, but we do it in precisely the same way LLMs do it according to the simulator model. That is, given a bitstring, we first formulate a posterior over programs, i.e. a "superposition" on programs, which we use to sample the next bit, then we recompute the posterior, etc.

Then I think the probability of sampling 00000000... is just 50%. I.e. I think the distribution over bitstrings that you end up with is just the same as if you just first sampled the program and stuck with it.

I think tHere's a messy calculation which could be simplified (which I won't do):

Limit of this is 0.5.

I don't wanna try to generalize this, but based on this example it seems like if an LLM was an actual Bayesian, Waluigi's would not be attractors. The informal argument is wrong because it doesn't take into account the fact that over time you sample increasingly many non-waluigi samples, pushing down the probability of Waluigi.

Then again, the presense of a context window completely breaks the above calculation in a way that preserves the point. Maybe the context window is what makes Waluigi's into an attractor? (Seems unlikely actually, given that the context windows are fairly big).

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments