Gurkenglas — AI Alignment Forum

AI ALIGNMENT FORUM
AF

With enough knowledge, any conscious agent acts morally

Let's sharpen A6. Consider this stamp collector construction: It sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each sequence of outputs, and then it outputs the one that results in the most stamps.

By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.

I'd like to hear how this construction fares as we feed it through your proof. I think it gums up the section "Rejecting nihilistic alternatives". I think that section assumes the conclusion: You expect it to choose its biases on the basis of what is moral, instead of on the basis of its current biases.

Prover-Estimator Debate: A New Scalable Oversight Protocol

Gurkenglas6mo*30

Sure, and I expect there are reasons for assuming that A can't hide a flaw from B, e.g. maybe one can prove that honest and dishonest can't be distinguished if it can, but I don't see this mentioned anywhere.

Prover-Estimator Debate: A New Scalable Oversight Protocol

Gurkenglas6mo50

What about when A knows where the flaw is but B doesn't?

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Gurkenglas3y10

we fix the BOS token attention to 0.5

1.5 does conjunction without my dummy.

When at most one clause is true, 0.5 does disjunction instead.

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Gurkenglas3y10

(To show my idea compatible with Boolean attention.)

I use a single head, and the ranks add up linearly.

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Gurkenglas3y10

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values.

From the RASP paper: https://arxiv.org/pdf/2106.06981

The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.

Addition of such attention patterns corresponds to anding of such selectors.

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Gurkenglas3y20

Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.

Löb's Lemma: an easier approach to Löb's Theorem

Gurkenglas3y10

It seems important to clarify the difference between From ⊢A, conclude ⊢□A and ⊢□A→□□A. I don't feel like I get why we don't just set conclude := → and ⊢:=□.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide

Gurkenglas3y77

Similarly:

löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb -> löb:
  löb premises □ (□ A → A).
  By internal necessitation, □ (□ (□ A → A)).
  By □löb, □ (□ A).
  By löb's premise, □ A.

What does it take to defend the world against out-of-control AGIs?

Gurkenglas3y20

The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments