AI ALIGNMENT FORUM

Gurkenglas

I operate by Crocker's rules.

I try not to make people regret telling me things. So in particular:
- I expect to be safe to ask whether your post would give AI labs dangerous ideas.
- If you worry that I'll produce such posts, I'll try to keep your worry from making them more likely, even if I disagree. Not thinking in that direction will be easier if you don't spell it out in the initial contact.

Posts


Wikitag Contributions

Comments

0 · Gurkenglas's Shortform · 6y
With enough knowledge, any conscious agent acts morally
Gurkenglas · 2mo · 11

Let's sharpen A6. Consider this stamp collector construction: it sends and receives internet data; it has a magically accurate model of reality; it calculates how many stamps would result from each possible sequence of outputs; and it outputs the one that results in the most stamps.

By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.

I'd like to hear how this construction fares as we feed it through your proof. I think it gums up the section "Rejecting nihilistic alternatives". I think that section assumes the conclusion: you expect the agent to choose its biases on the basis of what is moral, instead of on the basis of its current biases.
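The construction above can be sketched in code. This is a toy mock-up of my own, not anything from the post: `stamps_resulting_from` is a hypothetical stand-in for the magically accurate world model, and the alphabet and horizon are made up for illustration.

```python
from itertools import product

# Hypothetical stand-in for the "magically accurate model of reality":
# maps a candidate output sequence to the number of stamps that result.
def stamps_resulting_from(outputs):
    # Toy world: each "order_stamp" output yields exactly one stamp.
    return sum(1 for o in outputs if o == "order_stamp")

def stamp_collector(alphabet=("order_stamp", "noop"), horizon=3):
    # Enumerate every output sequence up to the horizon and emit the one
    # the world model says results in the most stamps.
    return max(product(alphabet, repeat=horizon), key=stamps_resulting_from)

print(stamp_collector())  # ('order_stamp', 'order_stamp', 'order_stamp')
```

The point of the construction survives the mock-up: nothing in the argmax consults moral facts, even though the world model (by assumption) contains them.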

Prover-Estimator Debate: A New Scalable Oversight Protocol
Gurkenglas · 4mo* · 30

Sure, and I expect there are reasons to assume that A can't hide a flaw from B (e.g., maybe one can prove that the honest and dishonest cases can't be distinguished if it can), but I don't see this mentioned anywhere.

Prover-Estimator Debate: A New Scalable Oversight Protocol
Gurkenglas · 4mo · 50

What about when A knows where the flaw is but B doesn't?

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
Gurkenglas · 3y · 10

"we fix the BOS token attention to 0.5"

1.5 does conjunction without my dummy.

When at most one clause is true, 0.5 does disjunction instead.
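A minimal numeric sketch of these two claims, with toy numbers of my own choosing (not Tracr's): summing two Boolean selectors gives per-position scores in {0, 1, 2}, and a low-temperature softargmax then compares them against a BOS score pinned at 1.5 (conjunction) or 0.5 (disjunction).

```python
import numpy as np

def low_temp_softargmax(scores, temp=1e-3):
    # Sharp softmax: the highest score takes (almost) all the attention.
    z = np.asarray(scores, dtype=float) / temp
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Per-position scores of two summed Boolean selectors, BOS appended last.
# BOS pinned at 1.5: only a position where both selectors fire (score 2)
# beats BOS, so attention lands there -> conjunction.
print(low_temp_softargmax([2, 1, 0, 1.5]).round(3))
# BOS pinned at 0.5: a position where one selector fires (score 1) beats
# BOS -> disjunction, assuming at most one clause is true per position.
print(low_temp_softargmax([1, 0, 0, 0.5]).round(3))
```

In both runs the attention mass concentrates on the first position, since it is the only one above the BOS threshold.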

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
Gurkenglas · 3y · 10

(To show that my idea is compatible with Boolean attention.)

I use a single head, and the ranks add up linearly.

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
Gurkenglas · 3y · 10

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values in {0, 1000, 2000}, the outputs will take only two values.

From the RASP paper: https://arxiv.org/pdf/2106.06981

"The uniform weights dictated by our selectors reflect an attention pattern in which 'unselected' pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores."

Addition of such attention patterns corresponds to anding of such selectors.
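To illustrate the two-valued-output claim numerically (toy numbers I picked myself):

```python
import numpy as np

# Softargmax inputs taking values in {0, 1000, 2000}: the maximal entries
# share the attention mass equally, everything else underflows to zero,
# so the output takes only two distinct values.
x = np.array([0.0, 1000.0, 2000.0, 2000.0])
e = np.exp(x - x.max())
out = e / e.sum()
print(out)  # two distinct values: 0 and 0.5
```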

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind
Gurkenglas · 3y · 20

Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce N×N entries in {0, 1, 2}, and softargmax with low temperature treats 1s like 0s in the presence of a dummy token's 2.

Löb's Lemma: an easier approach to Löb's Theorem
Gurkenglas · 3y · 10

It seems important to clarify the difference between the rule "from ⊢A, conclude ⊢□A" and the axiom ⊢ □A → □□A. I don't feel like I get why we don't just set conclude := → and ⊢ := □.

Open technical problem: A Quinean proof of Löb's theorem, for an easier cartoon guide
Gurkenglas · 3y · 77

Similarly:

löb  = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb → löb:
  Assume löb's premise, □ (□ A → A).
  By internal necessitation, □ □ (□ A → A).
  By □löb and distribution, □ □ A.
  By löb's premise and distribution, □ A.
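Spelling out the implicit modal axioms (my own annotation, assuming the standard distribution axiom K, \(\Box(X \to Y) \to (\Box X \to \Box Y)\), and axiom 4 for internal necessitation):

```latex
\begin{align*}
&\text{Assume } \Box(\Box A \to A). \\
&\text{By 4: } \Box\Box(\Box A \to A). \\
&\text{By K on } \Box(\Box(\Box A \to A) \to \Box A)\text{ (i.e. on □löb): } \Box\Box A. \\
&\text{By K on the assumption: } \Box\Box A \to \Box A \text{, hence } \Box A.
\end{align*}
```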
What does it take to defend the world against out-of-control AGIs?
Gurkenglas · 3y · 20

The recursive self-improvement isn't necessarily human-out-of-the-loop: if an AGI comes up with simpler math, everything gets easier.

21 · Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda · 5y
19 · Mapping Out Alignment · 5y
Reflective category theory · 3 years ago · (+100)
Reflective category theory · 3 years ago · (+193/-111)
Reflective category theory · 3 years ago · (+11/-13)
Reflective category theory · 3 years ago · (+344/-78)
Reflective category theory · 3 years ago · (+5)