All of Amalthea's Comments + Replies

Are there any reasons to believe that LLMs are in any way more alignable than other approaches?

LLM characters are human imitations, so there is some chance they remain human-like on reflection (in the long term, after learning from far more self-generated data than from the original human-written datasets), or at least stay sufficiently human-like to still consider humans moral patients. That is, if we don't move too far from their SSL origins with too much RL, and don't have them roleplay/become egregiously inhuman fictional characters.

It's not much of a theory of alignment, but it's the closest thing to something real that's currently available, or that can be expected to become available in the next few years, which is probably all the time we have.

So, ideally you would like to assume only

  1. □A→B
  2. □B→A

and conclude A and B?

James Payor

If I follow what you mean, we can derive:

  □(□A) → □B → □(□B) → □A

So there's a Löbian proof, in which the provability is self-fulfilling. But this isn't sufficient to avoid this kind of proof.

(Aside on why I don't like the Löbian method: I'd rather the agents be doing "correct" counterfactual reasoning about how their actions affect their opponent, and cooperate because they see that mutual cooperation is possible and then choose it. The Löbian proof style isn't a good model of that, imo.)