AI ALIGNMENT FORUM

All of Andrew_Critch's Comments + Replies

LLM chatbots have ~half of the kinds of "consciousness" that humans believe in. Humans should avoid going crazy about that.

Andrew Critch4mo130

The evidence you present in each case is outputs generated by LLMs.

The total evidence I have (and that everyone has) is more than behavioral. It includes

a) the transformer architecture, in particular the attention module,

b) the training corpus of human writing,

c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),

d) as you say, the model's behavior, and

e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.

When I think about how... (read more)
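Point (c) in the list above, execution by repeatedly feeding the model its own outputs while keeping a history of key/value vectors, can be illustrated with a toy loop. This is only a minimal sketch under loud assumptions: the four-token vocabulary, the `EMBED` table, and the `next_token` rule are invented for illustration and do not correspond to any real model; only the shape of the loop (attend over cached keys and values, emit a token, append it to the cache) is the point.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention over everything cached so far."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical toy vocabulary: each token doubles as its own query/key/value vector.
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [-1.0, 0.5]}

def next_token(context_vector):
    # Assumed decoding rule: pick the token whose vector best matches the attention output.
    return max(EMBED, key=lambda t: dot(EMBED[t], context_vector))

def generate(prompt, steps):
    """Autoregressive loop; `prompt` must be a non-empty list of token ids."""
    tokens = list(prompt)
    keys = [EMBED[t] for t in tokens]    # cache of key vectors for the history
    values = [EMBED[t] for t in tokens]  # cache of value vectors for the history
    for _ in range(steps):
        query = EMBED[tokens[-1]]
        new = next_token(attend(query, keys, values))
        tokens.append(new)               # the model's output becomes its next input
        keys.append(EMBED[new])          # and its representation joins the cache
        values.append(EMBED[new])
    return tokens
```

Calling `generate([0, 1], 3)` extends the two-token prompt by three tokens, each chosen after attending over the full cached history.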

My motivation and theory of change for working in AI healthtech

Andrew Critch5mo50

A patient can hire us to collect their medical records into one place, to research a health question for them, and to help them prep for a doctor's appointment with good questions about the research. Then we do that, building and using our AI tool chain as we go, without training AI on sensitive patient data. Then the patient can delete their data from our systems if they want, or re-engage us for further research or other advocacy on their behalf.

A good comparison is the company Picnic Health, except instead of specifically matching patients with clinical trials, we do more general research and advocacy for them.

My motivation and theory of change for working in AI healthtech

Andrew Critch5mo70

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?

a) If we go extinct from a loss of control event, I count that as extinction from a loss of control event, accounting for the 35% probability mentioned in the post.

b) If we don't have a loss of control event but still go extinct from industrial dehumanization, I count that as extinction caused by industrial dehumanization caused by successionism, accounting for the additional 50% probabilit... (read more)

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew Critch9mo414

I very much agree with human flourishing as the main value I most want AI technologies to pursue and be used to pursue.

In that framing, my key claim is that in practice no area of purely technical AI research — including "safety" and/or "alignment" research — can be adequately checked for whether it will help or hinder human flourishing, without a social model of how the resulting technologies will be used by individuals / businesses / governments / etc.

Richard Ngo's Shortform

Andrew Critch10mo50

I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.

Acausal normalcy

Andrew Critch2y*30

That is, norms do seem feasible to figure out, but not the kind of thing that is relevant right now, unfortunately.

From the OP:

for most real-world-prevalent perspectives on AI alignment, safety, and existential safety, acausal considerations are not particularly dominant [...]. In particular, I do not think acausal normalcy provides a solution to existential safety, nor does it undermine the importance of existential safety in some surprising way.

I.e., I agree.

we are so unprepared that the existing primordial norms are unlikely to matter

... (read more)

Acausal normalcy

Andrew Critch2y*10

For 18 examples, just think of 3 common everyday norms having to do with each of the 6 boundaries given as example images in the post :) (I.e., cell membranes, skin, fences, social group boundaries, internet firewalls, and national borders). Each norm has the property that, when you reflect on it, it's easy to imagine a lot of other people also reflecting on the same norm, because of the salience of the non-subjectively-defined actual-boundary-thing that the norm is about. That creates more of a Schelling-nature for that norm, relative to... (read more)

Acausal normalcy

Andrew Critch2y*62

To your first question, I'm not sure which particular "the reason" would be most helpful to convey. (To contrast: what's "the reason" that physically dispersed human societies have laws? Answer: there's a confluence of reasons.). However, I'll try to point out some things that might be helpful to attend to.

First, committing to a policy that merges your utility function with someone else's is quite a vulnerable maneuver, with a lot of boundary-setting aspects. For instance, will you merge utility functions multiplicatively (as in Nas... (read more)

Modal Fixpoint Cooperation without Löb's Theorem

Andrew Critch2y20

This is cool (and fwiw to other readers) correct. I must reflect on what it means for real world cooperation... I especially like the A <-> []X -> [][]X <-> []A trick.

Modal Fixpoint Cooperation without Löb's Theorem

Andrew Critch2y20

I'm working on it :) At this point what I think is true is the following:

If $\mathrm{ShortProof}(x \leftrightarrow \mathrm{LongProof}(\mathrm{ShortProof}(x) \to x))$, then $\mathrm{MediumProof}(x)$.

Apologies that I haven't written out calculations very precisely yet, but since you asked, that's roughly where I'm at :)

2James Payor2y

It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents: My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about. We can instantiate this like: $A_k(B) \leftrightarrow \exists i \le k.\ \Box_i(\Box_{i+C} A_k(B) \to B(A_k))$. The idea is that if there is a short way to prove that the opponent $B$ would cooperate back, then it takes just a constant $C$ steps more to prove that we cooperate. So it doesn't open us up to exploitation to assume that our own cooperation is provable in $i+C$ steps. The way in which this works at all is by cutting the loop at the point where the opponent is thinking about our own behaviour. This bot cuts it rather aggressively: it assumes that no matter the context, when $B$ thinks about whether $A$ cooperates, it's provable that $A$ does cooperate. (I think this isn't great and can be improved to a weaker assumption that would lead to more potential cooperation.) If you construct $B_n$ similarly, I claim that $A_k$ and $B_n$ mutually cooperate if $k$ and $n$ are large enough, and mutually defect otherwise. Similarly, I claim $A_k$ can mutually cooperate with other bots like $B_n(A) = \Box_n A(B_n)$.

A Löbian argument pattern for implicit reasoning in natural language: Löbian party invitations

Andrew Critch2y10

Based on a potential misreading of this post, I added the following caveat today:

Important Caveat: Arguments in natural language are basically never "theorems". The main reason is that human thinking isn't perfectly rational in virtually any precisely defined sense, so sometimes the hypotheses of an argument can hold while its conclusion remains unconvincing. Thus, the Löbian argument pattern of this post does not constitute a "theorem" about real-world humans: even when the hypotheses of the argument hold, the argument will not always play out... (read more)

A Löbian argument pattern for implicit reasoning in natural language: Löbian party invitations

Andrew Critch2y10

Thanks! Added a note to the OP explaining that hereby means "by this utterance".

A Löbian argument pattern for implicit reasoning in natural language: Löbian party invitations

Andrew Critch2y*10

Hat tip to Ben Pace for pointing out that invitations are often self-referential, such as when people say "You are hereby invited", because "hereby" means "by this utterance":
https://www.lesswrong.com/posts/rrpnEDpLPxsmmsLzs/open-technical-problem-a-quinean-proof-of-loeb-s-theorem-for?commentId=CFvfaWGzJjnMP8FCa

That comment was like 25% of my inspiration for this post :)

2Ustice2y

I was confused for a while trying to understand why invitations are self-referential. It wasn’t until I read the inspirational post that I realized that what you are referring to is the word “hereby.” I guess I could have used that to be explicit, despite it being implicitly stated.

Löb's Lemma: an easier approach to Löb's Theorem

Andrew Critch2y10

I've now fleshed out the notation section to elaborate on this a bit. Is it better now?

In short, $⊢$ is our symbol for talking about what PA can prove, and $□$ is shorthand for PA's symbols for talking about what (a copy of) PA can prove.
" $⊢$ 1+1=2" means "Peano Arithmetic (PA) can prove that 1+1=2". No parentheses are needed; the " $⊢$ " applies to the whole line that follows it. Also, $⊢$ does not stand for an expression in PA; it's a symbol we use to talk about what PA can prove.
" $□ (1+1=2)$ " basically means the sam

Andrew Critch2y20

Well, the deduction theorem is a fact about PA (and, propositional logic), so it's okay to use as long as $⊢$ means "PA can prove".

But you're right that it doesn't mix seamlessly with the (outer) necessitation rule. Necessitation is a property of " $⊢$ ", but not generally a property of " $X ⊢$ ". When PA can prove something, it can prove that it can prove it. By contrast, if PA+X can prove Y, that does mean that PA can prove that PA+X can prove Y (because PA alone can work through proofs in a Gödel encoding), but it doesn't mean that PA+... (read more)

Löb's Lemma: an easier approach to Löb's Theorem

Andrew Critch2y*10

Well, $A \to B$ is just short for $\neg A \lor B$ , i.e., "(not A) or B". By contrast, $A ⊢ B$ means that there exists a sequence of (very mechanical) applications of modus ponens, starting from the axioms of Peano Arithmetic (PA) with $A$ appended, ending in $B$ . We tried hard to make the rules of $⊢$ so that it would agree with $\to$ in a lot of cases (i.e., we tried to design $⊢$ to make the deduction theorem true), but it took a lot of work in the design of Peano Arithmetic and can't be taken for gr... (read more)

Löb's Lemma: an easier approach to Löb's Theorem

Andrew Critch2y20

It's true that the deduction theorem is not needed, as in the Wikipedia proof. I just like using the deduction theorem because I find it intuitive (assume $A$ , prove $B$ , then drop the assumption and conclude $A \to B$ ) and it removes the need for lots of parentheses everywhere.
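For readers who want the deduction theorem written out, here is a short statement in the notation of this thread. It is a sketch of the standard result rather than a quotation from the post, and it assumes $A$ is a sentence (no free variables) and that $A ⊢ B$ means "PA with $A$ appended proves $B$", as described above.

$$
A ⊢ B \quad \Longleftrightarrow \quad ⊢ A \to B
$$

Read left to right, this is the "assume $A$, prove $B$, then drop the assumption" step; read right to left, it is just modus ponens applied to $A$ and $A \to B$.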

I'll add a note about the meaning of $⊢$ so folks don't need to look it up, thanks for the feedback!

«Boundaries», Part 3b: Alignment problems in terms of boundaries

Andrew Critch2y21

I agree this is a big factor, and might be the main pathway through which people end up believing what they believe. If I had to guess, I'd guess you're right.

E.g., if there's evidence E in favor of H and evidence E' against H, and the group is really into thinking about and talking about E as a topic, then the group will probably end up believing H too much.
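The E versus E' example can be made concrete with a small odds calculation. This is a hedged sketch, not something from the original discussion: the prior of 0.5 and the likelihood ratios of 3 and 1/3 are made-up numbers, chosen so that the two pieces of evidence exactly cancel.

```python
def posterior(prior, likelihood_ratios):
    """Bayesian update in odds form: multiply prior odds by each likelihood ratio."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

prior_H = 0.5          # assumed prior probability of H
lr_E = 3.0             # assumed: E is 3x as likely if H is true
lr_E_prime = 1 / 3     # assumed: E' is 3x as likely if H is false

# A group that weighs both pieces of evidence ends up back at the prior.
balanced = posterior(prior_H, [lr_E, lr_E_prime])

# A group that only attends to E ends up overconfident in H.
attention_biased = posterior(prior_H, [lr_E])
```

With these illustrative numbers, the balanced update stays at about 0.5 while the update that attends only to E reaches 0.75, which is the "believing H too much" effect described above.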

I think it would be great if you or someone wrote a post about this (or whatever you meant by your comment) and pointed to some examples. I think the LessWrong community is somewhat plagued by attentional bias leading to collective epistemic blind spots. (Not necessarily more than other communities; just different blind spots.)

«Boundaries», Part 3b: Alignment problems in terms of boundaries

Andrew Critch2y10

Ah, thanks for the correction! I've removed that statement about "integrity for consequentialists" now.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y21

Thanks for raising this! I assume you're talking about this part?

They explore a pretty interesting set-up, but they don't avoid the narrowly-self-referential sentence Ψ:

So, I don't think their motivation was the same as mine. For me, the point of trying to use a quine is to try to get away from that sentence, to create a different perspective on the foundations for people that find that kind of sentence confusing, but who find self-referential documents less confusing. I added a section "Further meta-motivation (added Nov 26)" about this ... (read more)

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y10

Noice :)

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y10

At this point I'm more interested in hashing out approaches that might actually conform to the motivation in the OP. Perhaps I'll come back to this discussion with you after I've spent a lot more time in a mode of searching for a positive result that fits with my motivation here. Meanwhile, thanks for thinking this over for a bit.

2Nate Soares2y

well, in your search for that positive result, i recommend spending some time searching for a critch!simplified alternative to the Y combinator :-p. not every method of attaining self-reference in the λ-calculus will port over to logic (b/c in the logical setting lots of things need to be quoted), but the quotation sure isn't making the problem any easier. a solution to the OP would yield a novel self-reference combinator in the λ-calculus, and the latter might be easier to find (b/c you don't need to juggle quotes). if you can lay bare the self-referential property that you're hoping for in the easier setting of λ-calculus, then perhaps others will have an easier time understanding what you want and helping out (and/or you'll have an easier time noticing why your desires are unsatisfiable). (and if it's still not clear that löb's theorem is tightly connected to the Y combinator, such that any solution to the OP would immediately yield a critch!simplified self-reference combinator in the λ-calculus, then I recommend spending a little time studying the connection between the Y combinator, löb's theorem, and lawvere's fixpoint theorem.)
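For readers less familiar with the Y combinator mentioned here, the following is a minimal Python sketch. One caveat is an assumption of this example rather than anything in the thread: Python evaluates arguments eagerly, so the textbook form `(λ s. f (s s)) (λ s. f (s s))` would recurse forever, and the sketch uses the eta-expanded variant (often called the Z combinator) instead. The names `Z` and `factorial` are illustrative.

```python
# Strict-evaluation fixpoint combinator (Z combinator).
# The inner `lambda v: s(s)(v)` delays the self-application s(s)
# until the recursive call is actually needed.
Z = lambda f: (lambda s: f(lambda v: s(s)(v)))(lambda s: f(lambda v: s(s)(v)))

# A non-recursive "step" function: it receives `rec`, a stand-in for itself.
factorial_step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

# Z ties the knot: factorial is a fixpoint of factorial_step.
factorial = Z(factorial_step)
```

Note that `factorial_step` never refers to itself by name; all of the self-reference comes from the self-application `s(s)` inside `Z`, which is the property the analogy with the diagonal lemma turns on.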

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y10

True! "Hereby" covers a solid contingent of self-referential sentences. I wonder if there's a "hereby" construction that would make the self-referential sentence Ψ (from the Wikipedia proof) more common-sense-meaningful to, say, lawyers.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y20

this suggests that you're going to be hard-pressed to do any self-reference without routing through the normal machinery of löb's theorem, in the same way that it's hard to do recursion in the lambda calculus without routing through the Y combinator

If by "the normal machinery", you mean a clever application of the diagonal lemma, then I agree. But I think we can get away with not having the self-referential sentence, by using the same y-combinator-like diagonal-lemma machinery to make a proof that refers to itself (instead of a proof about sentences t... (read more)

1Nate Soares2y

which self-referential sentence are you trying to avoid? it keeps sounding to me like you're saying "i want a λ-calculus combinator that produces the fixpoint of a given function f, but i don't want to use the Y combinator". do you deny the alleged analogy between the normal proof of löb and the Y combinator? (hypothesis: maybe you see that the diagonal lemma is just the type-level Y combinator, but have not yet noticed that löb's theorem is the corresponding term-level Y combinator?) if you follow the analogy, can you tell me what λ-term should come out when i put in f, and how it's better than (λ s. f (s s)) (λ s. f (s s))? or (still assuming you follow the analogy): what sort of λ-term representing the fixpoint of f would constitute "referring to itself (instead of being a term about types that refer to themselves)"? in what sense is the term (λ s. f (s s)) (λ s. f (s s)) failing to "refer to itself", and what property are you hoping for instead? (in case it helps with communication: when i try myself to answer these questions while staring at the OP, my best guess is that you're asking "instead of the Y combinator, can we get a combinator that goes like f ↦ f ????", and the two obvious ways to fill in the blanks are f ↦ f (Y f) and f ↦ f (f (f (.... i discussed why both of those are troublesome here, but am open to the possibility that i have not successfully understood what sort of fixpoint combinator you desire.) (ETA: also, ftr, in the proof-sketch of löb's theorem that i gave above, the term "g "g"" occurs as a subterm if you do enough substitution, and it refers to the whole proof of löb's theorem. just like how, in the version of the Y combinator given above, the term g g occurs as a subterm if you do enough β-reduction, and it refers to the whole fixpoint. which i note b/c it seems to me that you might have misunderstood a separate point about where the OP struggles as implying that the normal proof isn't self-referring.) ((the OP is gonna struggle

Ben Pace2y80

This sentence is an exception, but there aren't a lot of naturally occurring examples.

No strong claim either way, but as a datapoint I do somewhat often use the phrase "I hereby invite you to <event>" or "I hereby <request> something of you" to help move from 'describing the world' to 'issuing an invitation/command/etc'.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y51

Thanks for your attention to this! The happy face is the outer box. So, line 3 of the cartoon proof is assumption 3.

If you want the full []([]C->C) to be inside a thought bubble, then just take every line of the cartoon and put into a thought bubble, and I think that will do what you want.

LMK if this doesn't make sense; given the time you've spent thinking about this, you're probably my #1 target audience member for making the more intuitive proof (assuming it's possible, which I think it is).

ETA: You might have been asking if th... (read more)

3Eliezer Yudkowsky2y

Okay, that makes much more sense. I initially read the diagram as saying that just lines 1 and 2 were in the box.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y40

Yes to both of you on these points:

Yes to Alex that (I think) you can use an already-in-hand proof of Löb to make the self-referential proof work, and
Yes to Eliezer that that would be cheating and wouldn't actually ground out all of the intuitions, because then the "Santa Claus"-like sentence is still in use in the already-in-hand proof of Löb.

(I'll write a separate comment on Eliezer's original question.)

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Andrew Critch2y10

That thing is hilarious and good! Thanks for sharing it. As for the relevance, it explains the statement of Gödel's theorem, but not the proof of it. So, it could be pretty straightforwardly reworked to explain the statement of Löb's theorem, but not so easily the proof of Löb's theorem. With this post, I'm in the business of trying to find a proof of Löb that's really intuitive/simple, rather than just a statement of it that's intuitive/simple.
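For reference, the statement of Löb's theorem that this comment distinguishes from its proof can be written in the $⊢$ / $□$ notation used elsewhere in these threads. This is the standard formulation, given here as a reminder and not as a quotation from the post; $C$ stands for an arbitrary sentence of PA.

$$
\text{If } ⊢ □C \to C, \text{ then } ⊢ C.
$$

$$
⊢ □(□C \to C) \to □C
$$

The first line is the external form (a fact about what PA proves); the second is the internalized form, which PA itself can prove.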

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Andrew Critch2y10

Why is it unrealistic? Do you actually mean it's unrealistic that the set I've defined as "A" will be interpretable as "actions" in the usual coarse-grained sense? If so I think that's a topic for another post when I get into talking about the coarsened variables $V^{c}, A^{c}, P^{c}, E^{c}$ ...

2Scott Garrabrant2y

I mean, the definition is a little vague. If your meaning is something like "It goes in A if it is more accurately described as controlled by the viscera, and it goes in P if it is more accurately described as controlled by the environment," then I guess you can get a bijection by definition, but it is not obvious these are natural categories. I think there will be parts of the boundary that feel like they are controlled by both or neither, depending on how strictly you mean "controlled by."

Boundaries vs Frames

Andrew Critch2y10

3Scott Garrabrant2y

My default plan is to not try to rename Cartesian frames, mostly because the benefit seems small, and I care more about building up the FFS ontology over the Cartesian frame one.

Boundaries vs Frames

Andrew Critch2y39

Scott, thanks for writing this! While I very much agree with the distinctions being drawn, I think the word "boundary" should be usable for referring to factorizations that do not factor through the physical separation of the world into objects. In other words, I want the technical concept of «boundaries» that I'm developing to be able to refer to things like social boundaries, which are often not most-easily-expressed in the physics factorization of the world into particles (but are very often expressible as Markov blankets in a more abstract ... (read more)

3Scott Garrabrant2y

I agree completely. I am not really happy with any of the language in this post, and I want it to have scope limited to this post. I will for the most part say boundary for both the additive and multiplicative variants.

1Andrew Critch2y

Going further, my proposed convention also suggests that "Cartesian frames" should perhaps be renamed to "Cartesian factorizations", which I think is a more immediately interpretable name for what they are. Then in your equation S=A×E, you can refer to A and E as "Cartesian factors", satisfying your desire to treat A and E as interchangeable. And, you leave open the possibility that the factors are derivable from a "Cartesian partition" r=a⊔e of the world into the "Cartesian parts" a and e. There is of course the problem that for some people "Cartesian" just means "factoring into coordinates" (e.g., "Cartesian plane"), in which case "Cartesian factorization" will sound a bit redundant, but for those people "Cartesian frame" is already not very elucidating.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Andrew Critch2y40

Thanks, Scott!

I think the boundary factorization into active and passive is wrong.

Are you sure? The informal description I gave for A and P allows for the active boundary to be a bit passive and the passive boundary to be a bit active. From the post:

the active boundary, A — the features or parts of the boundary primarily controlled by the viscera, interpretable as "actions" of the system— and the passive boundary, P — the features or parts of the boundary primarily controlled by the environment, interpretable as "perceptions" of the

... (read more)

2Scott Garrabrant2y

Forcing the AxP bijection is an interesting idea, but it feels a little too approximate to my taste.

4Scott Garrabrant2y

Oh yeah, oops, that is what it says. Wasn’t careful, and was responding to reading an old draft. I agree that the post is already saying roughly what I want there. Instead, I should have said that the B=AxP bijection is especially unrealistic. Sorry.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Andrew Critch2y11

Thanks, fixed!

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y11

Jan, I agree with your references, especially Friston et al. I think those kinds of understanding, as you say, have not adequately made their way into utility-theoretic fields like econ and game theory, so I think the post is valid as a statement about the state of understanding in those utility-oriented fields. (Note that the post is about "a missing concept from the axioms of game theory and bargaining theory" and "a key missing concept from utility theory", and not "concepts missing from the mind of all of humanity".)

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y312

8. (Unscoped) Consequentialism — the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y415

7. Preference plasticity — the possibility of changes to human preferences over time, and the challenge of defining alignment in light of time-varying preferences (Russell, 2019, p.263).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y414

6. Mesa-optimizers — instances of learned models that are themselves optimizers, which give rise to the so-called inner alignment problem (Hubinger et al, 2019).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y46

5. Counterfactuals in decision theory — the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner's Dilemma (Yudkowsky & Soares, 2017).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y36

4. Impact regularization — the problem of formalizing "change to the environment" in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al, 2016).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y48

3. Mild optimization — the problem of designing AI systems and objective functions that, in an intuitive sense, don’t optimize more than they have to (Taylor et al, 2016).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y45

2. Corrigibility — the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al, 2015).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y37

1. AI boxing / containment — the method and challenge of confining an AI system to a "box", i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).

«Boundaries», Part 1: a key missing concept from utility theory

Andrew Critch3y95

In Part 3 of this series, I plan to write a shallow survey of 8 problems relating to AI alignment, and the relationship of the «boundary» concept to formalizing them. To save time, I'd like to do a deep dive into just one of the eight problems, based on what commenters here would find most interesting. If you have a moment, please use the "agree" button (and where desired, "disagree") to vote for which of the eight topics I should go into depth about. Each topic is given as a subcomment below (not looking for karma, just agree/disagree votes). Thanks!

3Andrew Critch3y

8. (Unscoped) Consequentialism — the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).

4Andrew Critch3y

7. Preference plasticity — the possibility of changes to human preferences over time, and the challenge of defining alignment in light of time-varying preferences (Russell, 2019, p.263).

4Andrew Critch3y

6. Mesa-optimizers — instances of learned models that are themselves optimizers, which give rise to the so-called inner alignment problem (Hubinger et al, 2019).

4Andrew Critch3y

5. Counterfactuals in decision theory — the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner's Dilemma (Yudkowsky & Soares, 2017).

3Andrew Critch3y

4. Impact regularization — the problem of formalizing "change to the environment" in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al, 2016).

4Andrew Critch3y

3. Mild optimization — the problem of designing AI systems and objective functions that, in an intuitive sense, don’t optimize more than they have to (Taylor et al, 2016).

4Andrew Critch3y

2. Corrigibility — the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al, 2015).

3Andrew Critch3y

1. AI boxing / containment — the method and challenge of confining an AI system to a "box", i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).

Encultured AI Pre-planning, Part 1: Enabling New Benchmarks

Andrew Critch3y10

> First, great news on founding an alignment organization on your own.

Actually I founded it with my cofounder, Nick Hay!
https://www.encultured.ai/#team

Unifying Bargaining Notions (1/2)

Andrew Critch3y10

Thanks for signal-boosting the coco solution! I'm actually planning a workshop near CMU this fall with Adam Kalai :)

AGI Ruin: A List of Lethalities

Andrew Critch3y1515

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures. Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering

Andrew Critch3y81

John, it seems like you're continuing to make the mistake-according-to-me of analyzing the consequences of a pivotal act without regard for the consequences of the intentions leading up to the act. The act can't come out of a vacuum, and you can't build a project compatible with the kind of invasive pivotal acts I'm complaining about without causing a lot of problems leading up to the act, including triggering a lot of fear and panic for other labs and institutions. To summarize from the post title: pivotal act intentions directly have negative consequences for x-safety, and people thinking about the acts alone seem to be ignoring the consequences of the intentions leading up to the act, which is a fallacy.

johnswentworth3y57

I see the argument you're making there. I still think my point stands: the strategically relevant question is not whether unilateral pivotal act intentions will cause problems, the question is whether aiming for a unilateral pivotal act would or would not reduce the chance of human extinction much more than aiming for a multilateral pivotal act. The OP does not actually attempt to compare the two, it just lists some problems with aiming for a unilateral pivotal act.

I do think that aiming for a unilateral act increases the chance of successfully executing the pivotal act by multiple orders of magnitude, even accounting for the part where other players react to the intention, and that completely swamps the other considerations.

2Ben Pace3y

Just as a related idea, in my mind, I often do a kind of thinking that HPMOR!Harry would call “Hufflepuff Bones”, where I look for ways a problem is solvable in physical reality at all, before considering ethical and coordination and even much in the way of practical concerns.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Andrew Critch3y*45

Eliezer, from outside the universe I might take your side of this bet. But I don't think it's productive to give up on getting mainstream institutions to engage in cooperative efforts to reduce x-risk.

A propos, I wrote the following post in reaction to positions-like-yours-on-this-issue, but FYI it's not just you (maybe 10% you though?):
https://www.lesswrong.com/posts/5hkXeCnzojjESJ4eB

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew Critch4y30

> Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes. This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.
> In other words, the humans and human-aligned institutions no

... (read more)

4Ben Pace4y

Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew Critch4y10

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.
I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards is covered under NDAs), so I also think it's good to leave this... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew Critch4y30

Got it, thanks!