I operate by Crocker's rules.
I try not to make people regret telling me things. So in particular:
- I expect it to be safe to ask me whether your post would give AI labs dangerous ideas.
- If you worry that I'll produce such posts, I'll try to keep your worry from making them more likely, even if I disagree. Not thinking in that direction will be easier for me if you don't spell the idea out in the initial contact.
Sure, and I expect there are reasons for assuming that A can't hide a flaw from B (e.g., maybe one can prove that honest and dishonest can't be distinguished if A can hide one), but I don't see this mentioned anywhere.
What about the case where A knows where the flaw is but B doesn't?
we fix the BOS token attention to 0.5
Fixing it to 1.5 instead does conjunction without my dummy.
When at most one clause is true, 0.5 does disjunction instead.
(This is to show that my idea is compatible with Boolean attention.)
I use a single head, and the ranks add up linearly.
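A minimal numpy sketch of the 1.5/0.5 thresholding above; the score values and key positions are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softargmax(scores, temp=1e-3):
    """Low-temperature softmax, i.e. approximately a hard argmax."""
    z = scores / temp
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Summed scores of two Boolean selectors, one entry per key position.
both_somewhere = np.array([0.0, 1.0, 1.0, 2.0])  # one position matches both clauses
at_most_one = np.array([0.0, 1.0, 1.0, 0.0])     # no position matches both

# BOS fixed at 1.5: only a score of 2 beats it, so attention lands on the
# both-clauses position (or stays on BOS if there is none) -> conjunction.
print(softargmax(np.concatenate(([1.5], both_somewhere))).round(2))
# [0. 0. 0. 0. 1.]

# BOS fixed at 0.5: a score of 1 already beats it -> disjunction, provided
# at most one clause is true per position (a 2 would dominate the 1s).
print(softargmax(np.concatenate(([0.5], at_most_one))).round(2))
# [0.  0.  0.5 0.5 0. ]
```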
Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values in {0, 1000, 2000}, the outputs take only two values: the 1000s get the same near-zero weight as the 0s.
From the RASP paper: https://arxiv.org/pdf/2106.06981
The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.
Addition of such attention patterns corresponds to anding of such selectors.
Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.
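A quick numpy check of that claim; the random selector matrices and sizes are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Score matrices of two Boolean selectors, 1 where each predicate holds,
# standing in for select(a, b, ==) and select(c, d, ==); random here.
S1 = rng.integers(0, 2, size=(N, N)).astype(float)
S2 = rng.integers(0, 2, size=(N, N)).astype(float)
summed = S1 + S2  # N x N entries in {0, 1, 2}

# Prepend a dummy key position whose score is fixed at 2.
scores = np.concatenate([np.full((N, 1), 2.0), summed], axis=1)

# Low-temperature softargmax, row-wise, with the usual max-subtraction trick.
z = scores / 1e-3
z = z - z.max(axis=1, keepdims=True)
attn = np.exp(z)
attn = attn / attn.sum(axis=1, keepdims=True)

# Positions that scored 1 get the same (underflowed-to-zero) weight as
# positions that scored 0; only score-2 positions share weight with the dummy.
print((attn[:, 1:] > 0).astype(int))  # where real positions get any attention
print((S1 * S2).astype(int))          # elementwise AND of the selectors: same pattern
```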
It seems important to clarify the difference between "From ⊢A, conclude ⊢□A" and "⊢ □A→□□A". I don't feel like I get why we don't just set conclude := → and ⊢ := □.
Similarly:
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb → löb:
Assume löb's premise, □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb and distribution, □ (□ A).
By löb's premise and distribution again, □ A.
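The same derivation with the distribution (K) steps written out; the numbering and step labels are mine:

```latex
\begin{align*}
&1.\ \Box(\Box A \to A) && \text{löb's premise, assumed} \\
&2.\ \Box\Box(\Box A \to A) && \text{internal necessitation on 1} \\
&3.\ \Box(\Box(\Box A \to A) \to \Box A) && \Box\text{löb} \\
&4.\ \Box\Box(\Box A \to A) \to \Box\Box A && \text{distribution on 3} \\
&5.\ \Box\Box A && \text{modus ponens, 2, 4} \\
&6.\ \Box\Box A \to \Box A && \text{distribution on 1} \\
&7.\ \Box A && \text{modus ponens, 5, 6}
\end{align*}
```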
The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.
Let's sharpen A6. Consider this stamp collector construction: It sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each sequence of outputs, and then it outputs the one that results in the most stamps.
By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.
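As a minimal sketch of that construction (the oracle `predicted_stamps` stands in for the magically accurate model; both names are mine, not from your post):

```python
def stamp_collector(candidate_output_sequences, predicted_stamps):
    """Output the sequence that the (magically accurate) world model
    predicts will result in the most stamps."""
    return max(candidate_output_sequences, key=predicted_stamps)
```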
I'd like to hear how this construction fares as we feed it through your proof. I think it gums up the section "Rejecting nihilistic alternatives". I think that section assumes the conclusion: You expect it to choose its biases on the basis of what is moral, instead of on the basis of its current biases.