User Comment Replies — AI Alignment Forum

Super interesting!

In the figure with the caption:

Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.

Maybe I am reading the graph wrong, but isn't the "Is blue better than green" question a surprisingly good classifier with inverted labels?

So, maybe Claude thinks that green is better than blue?

Did you ever observe other seemingly unrelated questions being good classifiers, apart from the objective-fact questions discussed in the post? I'd...

3 Evan Hubinger 1y

It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick in advance which "side" of unrelated questions like that would correspond to which behavior, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we present for the 1000 random directions: some of those also do quite well, but not in a systematic way that would let you predict in advance how to use them as effective classifiers.
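
To make the "inverted labels" point concrete, here is a minimal sketch of why a below-chance detector becomes an above-chance one once you flip its sign. The scores and labels are made up for illustration and are not data from the post; `auroc` is a hypothetical helper written for this example, not the post's evaluation code.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detector outputs; label 1 = bad behavior, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.1, 0.7, 0.3, 0.9]

auc = auroc(scores, labels)                         # detector used as-is
flipped_auc = auroc([-s for s in scores], labels)   # same detector, sign flipped

print(f"as-is: {auc:.3f}, flipped: {flipped_auc:.3f}")
```

Flipping the sign always maps an AUROC of `a` to `1 - a`, so a detector far below 0.5 looks strong after the flip. The catch raised in the reply is that choosing the sign requires labeled examples of the bad behavior, which you would not have before seeing it.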

Understanding and avoiding value drift

Wuschel Schulz 2y 42

I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understood shard theory correctly, shards do not have their own world model; they can only bid actions up or down according to the consequences those actions might have under the world model that is available to all shards. Please correct me if I am wrong about this point.

So I don’t see how the Rick-shard could really "trick" the atheism-shard via rationalisation.

If the Rick-shard sees that "church-going for respect-reasons" will lead to conversion, ...

2 Alex Turner 2y

I think your comment highlights an important uncertainty of mine. Here's my best guess: I think planning involves world-model invocations (i.e. the predictive machinery which predicts relevant observables for plan stubs, like "get in my car"). It seems to me that there is subconscious planning, to some degree. If true, you wouldn't notice the world model being invoked because it's subconscious. Insofar as "you" are in part composed of some set of shards or some algorithm which aggregates shard outputs, it's therefore true that the world-model invocations aren't globally visible. So it's possible for certain kinds of WM invocations to not be visible to certain shards, even though those shards usually "hook into the WM" (e.g. check the number of diamonds the plan leads to). Separately, I'd guess that shards can be shaped to invoke the world model (e.g. "if this plan gets considered, will it be executed?") without themselves being agents.

All of Wuschel Schulz's Comments + Replies