James Payor

I think about AI alignment; send help.

I'm also on twitter. More links on my homepage payor.io.

Posts

Sorted by New

30Thinking about maximization and corrigibility

16Some constructions for proof-based cooperation without Löb

6A proof of inner Löb's theorem

Wikitag Contributions

Comments

Sorted by

Newest

Modal Fixpoint Cooperation without Löb's Theorem

James Payor2y*147

Also, here's a proof that a bot is never exploited. It only cooperates when its partner $B$ provably cooperates.

First, note that $A \to □ A$ , i.e. if $A$ cooperates it provably cooperates. (Proof sketch: $A \leftrightarrow □ X \to □ □ X \leftrightarrow □ A$ .)

Now we show that $A \to □ B$ (i.e. if $A$ chooses to cooperate, its partner is provably cooperating):

We get $A \to (□ □ A \to □ B)$ by distributing.
We get $A \to □ □ A$ by applying internal necessitation to $A \to □ A$ .
By (1) and (2), $A \to □ B$ .

(PS: we can strengthen this to $A \leftrightarrow □ B$ , by noticing that $□ B \to □ (□ A \to B) \leftrightarrow A$ .)

Modal Fixpoint Cooperation without Löb's Theorem

James Payor4mo40Review for 2023 Review

I continue to think there's something important in here!

I haven't had much success articulating why. I think it's neat that the loop-breaking/choosing can be internalized, and not need to pass through Lob. And it informs my sense of how to distinguish real-world high-integrity vs low-integrity situations.

William_S's Shortform

James Payor11mo44

By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?

I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.

Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn't paint OA in a good light for it.

I am confused. (And I grateful to William for at least saying this much, given the climate!)

LLMs Sometimes Generate Purely Negatively-Reinforced Text

James Payor2y93

Awesome, thanks for writing this up!

I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".

(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)

Modal Fixpoint Cooperation without Löb's Theorem

James Payor2y*92

I tried to formalize this, using as a "poor man's counterfactual", standing in for "if Alice cooperates then so does Bob". This has the odd behaviour of becoming "true" when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.

For technical reasons we upgrade to $□ A \to B$ , which says "if Alice cooperates in a legible way, then Bob cooperates back". Alice tries to prove this, and legibly cooperates if so.

This setup gives us "Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back". In symbols, $□ (□ A \to B) \to A$ .

Now, is this okay? What about proving $□ A \to ⊥$ ?

Well, actually you can't ever prove that! Because of Lob's theorem.

Outside the system we can definitely see cases where $A$ is unprovable, e.g. because Bob always defects. But you can't prove this inside the system. You can only prove things like " $□_{k} A \to ⊥$ " for finite proof lengths $k$ .

I think this is best seen as a consequence of "with finite proof strength you can only deny proofs up to a limited size".

So this construction works out, perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, "cooperate if choosing to do so leads to a good outcome", is pretty trustworthy. It perhaps deserves to be cached out in better provability math.

Modal Fixpoint Cooperation without Löb's Theorem

James Payor2y*82

(Thanks also to you for engaging!)

Hm. I'm going to take a step back, away from the math, and see if that makes things less confusing.

Let's go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).

When Alice goes to think about what Bob will do, maybe she sees that Bob's decision depends on what he thinks Alice will do.

At this junction, I don't want Alice to "recurse", falling down the rabbit hole of "Alice thinking about Bob thinking about Alice thinking about--" and etc.

Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.

This manouvre is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying "what Bob thinks about Alice" as a node under Alice's control. This is the heart of it, and imo doesn't rely on anything weird or unusual.

Modal Fixpoint Cooperation without Löb's Theorem

James Payor2y40

For the setup , it's bit more like: each member cooperates if they can prove that a compelling argument for "everyone cooperates" is sufficient to ensure "everyone cooperates".

Your second line seems right though! If there were provably no argument for straight up "everyone cooperates", i.e. $□ (□ E \to ⊥)$ , this implies $□ (□ E \to E)$ and therefore $E$ , a contradiction.

Also I think I'm a bit less confused here these days, and in case it helps:

Don't forget that " $□ P$ " means "a proof of any size of $P$ ", which is kinda crazy, and can be responsible for things not lining up with your intuition. My hot take is that Lob's theorem / incompleteness says "with finite proof strength you can only deny proofs up to a limited size, on pain of diagonalization". Which is way saner than the usual interpretation!

So idk, especially in this context I think it's a bad idea to throw out your intuition when the math seems to say something else. Since the mismatch is probably coming down to some subtlety in this formalization of provability/meta-methamatics. And I presently think the quirky nature of provability logic is often bugs due to bad choices in the formalism.

AI Will Not Want to Self-Improve

James Payor2y10

(Edit: others have made this point already, but anyhow)

My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.

I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.

Aggregating Utilities for Corrigible AI [Feedback Draft]

James Payor2y11

Is the following an accurate summary?

The agent is built to have a "utility function" input that the humans can change over time, and a probability distribution over what the humans will ask for at different time steps, and maximizes according a combination of the utility functions it anticipates across time steps?