I think this post was and remains important and spot-on. Especially this part, which is proving more clearly true (but still contested):
It does not matter that those organizations have "AI safety" teams, if their AI safety teams do not have the power to take the one action that has been the obviously correct one this whole time: Shut down progress on capabilities. If their safety teams have not done this so far when it is the one thing that needs done, there is no reason to think they'll have the chance to take whatever would be the second-best or third-best actions either.
By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we're seeing. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, ...
Awesome, thanks for writing this up!
I very much like how you are giving a clear account of a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".
(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)
(Edit: others have made this point already, but anyhow)
My main objection to this angle: self-improvements do not necessarily look like "design a successor AI to be in charge". They can look more like "acquire better world models", "spin up more copies", "build better processors", "train lots of narrow AI to act as fingers", etc.
I don't expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.
Is the following an accurate summary?
The agent is built to have a "utility function" input that the humans can change over time, plus a probability distribution over what the humans will ask for at different time steps, and it maximizes according to a combination of the utility functions it anticipates across time steps?
If that's correct, here are some places this conflicts with my intuition about how things should be done:
I feel awkward about the randomness being treated as essential. I'd rather be able to do something other than randomness in order to get my mild optimization, and something feels unstable/non-compositional about needing randomness in place for your evaluations... (Not that I have an alternative that springs to mind!)
I also feel like "worst case" is perhaps problematic, since it's bringing maximization in, and you're then needing to rely on your convex s...
Can I check that I follow how you recover quantilization?
Are you evaluating distributions over actions, and caring about the worst-case expectation of that distribution?
If so, proposing a particular action is evaluated badly? (Since there's a utility function in your set that spikes downward at that action.)
But proposing a range of actions to randomize amongst can be assessed to have decent worst-case expected utility, since particular downward spikes get smoothed over, and you can rely on your knowledge of "in-distribution" behaviour?
Edited to add: ...
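To make my reading concrete, here's a toy numerical sketch (my own construction, not from the post), assuming an adversarial set where each utility function agrees with a base utility except for one sharp downward spike at a single action:

```python
# Toy illustration (my construction, not from the post). Each utility in the
# adversarial set agrees with a base utility except for one sharp downward
# spike; we compare worst-case expected utility of a point distribution vs. a
# top-10% "quantilizer" distribution.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 100
base_utility = np.sort(rng.normal(size=n_actions))[::-1]  # action 0 is "best"

def spiked_utility(spike_at):
    u = base_utility.copy()
    u[spike_at] = -100.0  # one adversarial downward spike
    return u

utility_set = [spiked_utility(i) for i in range(n_actions)]

def worst_case_expected_utility(action_distribution):
    return min(float(action_distribution @ u) for u in utility_set)

point = np.zeros(n_actions)
point[0] = 1.0  # deterministically take the "best" action
quantilizer = np.zeros(n_actions)
quantilizer[:10] = 0.1  # uniform over the top 10% of actions

print(worst_case_expected_utility(point))        # ~ -100: eats the spike fully
print(worst_case_expected_utility(quantilizer))  # much better: each spike is diluted 10x
```

So a single proposed action gets dragged down by "its" spike, while the spread-out distribution only exposes a tenth of its mass to any one spike, which is how I'm reading the worst-case evaluation as favoring quantilizer-like proposals.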
I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.
Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.
I think indiscriminate hype, and trying to do work that will be broadly attention-grabbing, fall on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther...
I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet", "raising awareness of the work through company Twitter", etc.
I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction).
The transformer circuits work strikes me this way, so does a bunch of others.
Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs.
I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well.
This isn't something I can visualize working, but maybe it has components of an answer.
To throw in my two cents, I think it's clear that whole classes of "mechanistic interpretability" work are about better understanding architectures in ways that, if the research is successful, make it easier to improve their capabilities.
And I think this points strongly against publishing this stuff, especially if the goal is to "make this whole field more prestigious real quick". Insofar as the prestige is coming from folks who work on AI capabilities, that's drinking from a poisoned well (since they'll grant the most prestige to the work that helps them ...
No on Q4? I think Alex's counterexample applies to Q4 as well.
(EDIT: Scott points out I'm wrong here, Alex's counterexample doesn't apply, and mine violates A5.)
In particular I think A4 and A5 don't imply anything about the rate of change as we move between lotteries, so we can have movements too sharp to be concave. We only have quasi-concavity.
My version of the counterexample: you have two outcomes and , we prefer anything with equally, and we otherwise prefer higher .
If you give me a corresponding , it must satisfy ...
[Edit: yeah nevermind I have the inequality backwards]
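As a generic illustration of quasi-concavity without concavity (my own example, separate from the counterexample above): any monotone function of the mixture weight is quasi-concave, but it needn't be concave.

```latex
\[
  u(p) = p^2 \text{ on } [0,1]:\qquad
  u\!\left(tp+(1-t)q\right) \ge \min\{u(p),u(q)\} \text{ for all } t\in[0,1],
  \quad\text{yet}\quad
  u\!\left(\tfrac{1}{2}\right) = \tfrac{1}{4} < \tfrac{1}{2} = \tfrac{u(0)+u(1)}{2}.
\]
```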
A5 seems too strong?
Consider lotteries and , and a mixture in between. Applying A5 twice gives:
So if and then ?
Either I'm confused or A5 is a stricter condition than concavity.
Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)
As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:
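For reference, here is the standard fixed-point derivation of Löb's theorem (the textbook route, not necessarily the same one as ideas #3 and #5): assume $\vdash \Box P \to P$ and take $\psi$ with $\vdash \psi \leftrightarrow (\Box\psi \to P)$.

```latex
\begin{align*}
&\vdash \psi \to (\Box\psi \to P) && \text{fixed point, left to right}\\
&\vdash \Box\psi \to \Box(\Box\psi \to P) && \text{necessitation and distribution}\\
&\vdash \Box\psi \to (\Box\Box\psi \to \Box P) && \text{distribution}\\
&\vdash \Box\psi \to \Box\Box\psi && \text{internal necessitation}\\
&\vdash \Box\psi \to \Box P && \text{chaining the above}\\
&\vdash \Box\psi \to P && \text{assumption } \Box P \to P\\
&\vdash \psi && \text{fixed point, right to left}\\
&\vdash \Box\psi && \text{necessitation}\\
&\vdash P && \text{modus ponens}
\end{align*}
```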
My troll example is a fully connected network with all zero weights and biases, no skip connections.
This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
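To make the troll example concrete, here's a minimal sketch (assuming PyTorch; the tiny two-layer MLP is just a stand-in): with every weight and bias zeroed and no skip connections, the loss is nonzero but nothing upstream of the final bias receives any gradient.

```python
# Minimal sketch of the troll example (assuming PyTorch). With all weights and
# biases zero and no skip connections, the loss is nonzero but no gradient
# flows back past the zeroed layers.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
with torch.no_grad():
    for p in net.parameters():
        p.zero_()

x = torch.randn(16, 4)
target = torch.randn(16, 1)
loss = nn.functional.mse_loss(net(x), target)
loss.backward()

for name, p in net.named_parameters():
    print(name, "grad norm:", p.grad.norm().item())
# Only the final bias gets a nonzero gradient; the hidden layer's weights and
# bias, and the output weights, all sit at an exactly-zero gradient.
```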
It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:
My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.
We can instantiate this like:
The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant number of steps more to prove that we cooperate. So it doesn't open us up to...
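Here's a minimal sketch of the sort of instantiation I have in mind (my own rendering: provable(statement, max_len) stands in for an assumed bounded-provability oracle, and Cooperates(...) for an assumed encoding of the claim; neither is a real API):

```python
# Sketch only, under the assumptions above.

def provable(statement, max_len):
    # Assumed oracle: "our theory proves `statement` with a proof of length <= max_len".
    raise NotImplementedError("bounded-provability oracle, not implemented here")

K = 100  # assumed constant overhead: extra proof length to get from
         # "opponent cooperates back" to "therefore this bot cooperates"

def bounded_bot(opponent_source, max_search_length):
    # Search for increasingly long proofs that the opponent cooperates back.
    # If such a proof exists at length n, our own cooperation should be provable
    # in about n + K steps, so searching longer doesn't open us up to exploitation.
    for n in range(1, max_search_length + 1):
        if provable(f"Cooperates({opponent_source!r}, against=bounded_bot)", max_len=n):
            return "C"
    return "D"
```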
A point of confusion: is it enough to prove that ? What about ? I'm not sure I can say this well, but here goes:
We might not be able to prove in the theory, or even that (which would mean "there are no proofs of inconsistency"). But if we believe our theory is sound, we believe that it can't show that a copy of itself proves something false.
So tells us that if is true, the theory would show that a copy of itself proves is true. And this is enough to convince us that we can't simultaneously have true and false.
Also, here's a proof that a bot is never exploited. It only cooperates when its partner provably cooperates.
First, note that , i.e. if cooperates it provably cooperates. (Proof sketch: .)
Now we show that (i.e. if chooses to cooperate, its partner is provably cooperating):
(PS: we can strengthen this to , by noticing that .)
If I follow what you mean, we can derive:
So there's a Löbian proof, in which the provability is self-fulfilling. But this isn't sufficient to avoid this kind of proof.
(Aside on why I don't like the Löbian method: I moreso want the agents to be doing "correct" counterfactual reasoning about how their actions affect their opponent, and to cooperate because they see that mutual cooperation is possible and then choose it. The Löbian proof style isn't a good model of that, imo.)
In case this helps folks' intuition, my generating idea was something like: "look at what my opponent is thinking, but assume that whenever they check if I cooperate, they think the answer is yes". This is how we break the loop.
This results in a proof of the lemma like thus:
(EDIT: This is basically the same proof as in the post, but less simple. Maybe the interesting part is the "unroll once...
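For reference, the proof from the post as I remember it (assuming the lemma in question is: if $\vdash \Box(\Box x \to x) \to x$ then $\vdash x$):

```latex
\begin{align*}
&\vdash x \to (\Box x \to x) && \text{tautology}\\
&\vdash \Box x \to \Box(\Box x \to x) && \text{necessitation and distribution}\\
&\vdash \Box(\Box x \to x) \to x && \text{assumption}\\
&\vdash \Box x \to x && \text{chaining the previous two}\\
&\vdash \Box(\Box x \to x) && \text{necessitation}\\
&\vdash x && \text{modus ponens with the assumption}
\end{align*}
```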
Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.
I think Eliezer wishes that these sorts of artifacts, like this post and "There is no fire alarm", were not just things that he wrote.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.
Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a...
maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.
I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.
Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered patter...
Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?
In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.
I agree with your point that blobs of bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.
In particular, an AI system based on a big bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is ther...
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.
I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.
A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.
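To make the circularity concrete, here's a hypothetical rendering of the kind of factoring I'm objecting to (all names here are mine, not from your post):

```python
# Hypothetical rendering (my names, not from the post) of the factoring at issue:
# a learned agent written as "goal parameter + value-neutral optimization core".

ACTIONS = ["a", "b", "c"]  # placeholder action space

def mesaoptimize(score, search_space):
    # The supposedly value-neutral core: generic search for a high-scoring element.
    # But this is itself a general-purpose optimizer, i.e. the very thing that
    # gradient descent was supposed to have found; the factoring isn't well-founded.
    return max(search_space, key=score)

def mesaoptimizer(objective, world_model, observation):
    # The learned agent, expressed as the goal plus the shared optimization core.
    return mesaoptimize(
        score=lambda action: objective(world_model(observation, action)),
        search_space=ACTIONS,
    )
```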
If the field of ML shifts towards having a better understanding of models ...
I think this would be a negative outcome, and not a positive one.
Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.
It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.
On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".
I like thinking about coordination from this viewpoint.
There is a nuclear analog for accident risk. A quote from Richard Hamming:
Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmeti...
First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)
As a separate note, the "further out" your goal is and th...
Q7 (Python):
Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')
Q8 (Python):
Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
go = 'lambda s: print(eval(f)(eval(s)(s)))'
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
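If I've traced the evaluation correctly, running the Q8 snippet prints its own source with f applied, i.e. its three lines in reverse order:

```
eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')
go = 'lambda s: print(eval(f)(eval(s)(s)))'
f = 'lambda s: "\\n".join(s.splitlines()[::-1])'
```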
I continue to think there's something important in here!
I haven't had much success articulating why. I think it's neat that the loop-breaking/choosing can be internalized, and doesn't need to pass through Löb. And it informs my sense of how to distinguish real-world high-integrity vs low-integrity situations.