I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understood shard theory correctly, shards do not have their own world model; they can only bid actions up or down according to the consequences those actions might have under the world model that is available to all shards. Please correct me if I am wrong about this point.
So I don't see how the Rick-shard could really "trick" the atheism-shard via rationalisation.
If the Rick-shard sees that "church-going for respect reasons" will lead to conversion, ...
Super interesting!
In the figure with the caption:
Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.
Maybe I am reading the graph wrong, but isn't the "Is blue better than green?" question a surprisingly good classifier with inverted labels?
So, maybe Claude thinks that green is better than blue?
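To spell out what I mean by "good with inverted labels", here is a minimal sketch. It assumes the detectors are scored with something like AUROC, which I am not sure is the metric in the figure, and the scores and labels are made up for illustration, not taken from the post. A detector that ranks the classes in the wrong order gets an AUROC well below 0.5, and flipping the labels turns that into one minus the original value.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores (assumption); label 1 = deceptive, 0 = honest.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

original = auroc(scores, labels)                   # far below chance
flipped = auroc(scores, [1 - y for y in labels])   # same detector, labels inverted

print(original, flipped)
```

So a score far below 0.5 carries as much signal as one equally far above it; only the direction is reversed.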
Did you ever observe other seemingly unrelated questions being good classifiers, apart from the questions about objective facts discussed in the post? I'd...