I operate by Crocker's rules.
I try not to make people regret telling me things. So in particular:
- I expect it to be safe to ask me whether your post would give AI labs dangerous ideas.
- If you worry that I'll produce such posts, I'll try to keep your worry from making them more likely, even if I disagree. Not thinking in that direction will be easier for me if you don't spell the idea out in the initial contact.
Sure, and I expect there are reasons for assuming that A can't hide a flaw from B (e.g., maybe one can prove that honest and dishonest can't be distinguished if A can hide one), but I don't see this mentioned anywhere.
What about the case where A knows where the flaw is but B doesn't?
we fix the BOS token attention to 0.5
Fixing it to 1.5 instead does conjunction without my dummy.
When at most one clause is true, 0.5 does disjunction instead.
(This is to show that my idea is compatible with Boolean attention.)
I use a single head, and the ranks add up linearly.
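A minimal numpy sketch of the 1.5/0.5 thresholding above; the score values and key positions are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softargmax(scores, temp=1e-3):
    """Low-temperature softmax, i.e. approximately a hard argmax."""
    z = scores / temp
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Summed scores of two Boolean selectors, one entry per key position.
both_somewhere = np.array([0.0, 1.0, 1.0, 2.0])  # one position matches both clauses
at_most_one = np.array([0.0, 1.0, 1.0, 0.0])     # no position matches both

# BOS fixed at 1.5: only a score of 2 beats it, so attention lands on the
# both-clauses position (or stays on BOS if there is none) -> conjunction.
print(softargmax(np.concatenate(([1.5], both_somewhere))).round(2))
# [0. 0. 0. 0. 1.]

# BOS fixed at 0.5: a score of 1 already beats it -> disjunction, provided
# at most one clause is true per position (a 2 would dominate the 1s).
print(softargmax(np.concatenate(([0.5], at_most_one))).round(2))
# [0.  0.  0.5 0.5 0. ]
```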
Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values in {0, 1000, 2000}, the outputs take only two values: the 1000s get the same near-zero weight as the 0s.
From the RASP paper: https://arxiv.org/pdf/2106.06981
The uniform weights dictated by our selectors reflect an attention pattern in which ‘unselected’ pairs are all given strongly negative scores, while the selected pairs all have higher, similar, scores.
Addition of such attention patterns corresponds to anding of such selectors.
Isn't (select(a, b, ==) and select(c, d, ==)) just the sum of the two QK matrices? It'll produce NxN entries in {0,1,2}, and softargmax with low temp treats 1s like 0s in the presence of a dummy token's 2.
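A quick numpy check of that claim; the random selector matrices and sizes are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Score matrices of two Boolean selectors, 1 where each predicate holds,
# standing in for select(a, b, ==) and select(c, d, ==); random here.
S1 = rng.integers(0, 2, size=(N, N)).astype(float)
S2 = rng.integers(0, 2, size=(N, N)).astype(float)
summed = S1 + S2  # N x N entries in {0, 1, 2}

# Prepend a dummy key position whose score is fixed at 2.
scores = np.concatenate([np.full((N, 1), 2.0), summed], axis=1)

# Low-temperature softargmax, row-wise, with the usual max-subtraction trick.
z = scores / 1e-3
z = z - z.max(axis=1, keepdims=True)
attn = np.exp(z)
attn = attn / attn.sum(axis=1, keepdims=True)

# Positions that scored 1 get the same (underflowed-to-zero) weight as
# positions that scored 0; only score-2 positions share weight with the dummy.
print((attn[:, 1:] > 0).astype(int))  # where real positions get any attention
print((S1 * S2).astype(int))          # elementwise AND of the selectors: same pattern
```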
It seems important to clarify the difference between "From ⊢A, conclude ⊢□A" and "⊢ □A→□□A". I don't feel like I get why we don't just set conclude := → and ⊢ := □.
Similarly:
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb → löb:
Assume löb's premise, □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb and distribution, □ (□ A).
By löb's premise and distribution again, □ A.
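The same derivation with the distribution (K) steps written out; the numbering and step labels are mine:

```latex
\begin{align*}
&1.\ \Box(\Box A \to A) && \text{löb's premise, assumed} \\
&2.\ \Box\Box(\Box A \to A) && \text{internal necessitation on 1} \\
&3.\ \Box(\Box(\Box A \to A) \to \Box A) && \Box\text{löb} \\
&4.\ \Box\Box(\Box A \to A) \to \Box\Box A && \text{distribution on 3} \\
&5.\ \Box\Box A && \text{modus ponens, 2, 4} \\
&6.\ \Box\Box A \to \Box A && \text{distribution on 1} \\
&7.\ \Box A && \text{modus ponens, 5, 6}
\end{align*}
```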
The recursive self-improvement isn't necessarily human-out-of-the-loop: If an AGI comes up with simpler math, everything gets easier.
Let's sharpen A6. Consider this stamp collector construction: It sends and receives internet data, it has a magically accurate model of reality, it calculates how many stamps would result from each sequence of outputs, and then it outputs the one that results in the most stamps.
By definition it knows everything about reality, including any facts about what is morally correct, and that stamps are not particularly morally important. It knows how to self-modify, and how many stamps any such self-modification will result in.
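As a minimal sketch of that construction (the oracle `predicted_stamps` stands in for the magically accurate model; both names are mine, not from your post):

```python
def stamp_collector(candidate_output_sequences, predicted_stamps):
    """Output the sequence that the (magically accurate) world model
    predicts will result in the most stamps."""
    return max(candidate_output_sequences, key=predicted_stamps)
```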
I'd like to hear how this construction fares as we feed it through your proof. I think it gums up the section "Rejecting nihilistic alternatives". I think that section assumes the conclusion: You expect it to choose its biases on the basis of what is moral, instead of on the basis of its current biases.