User Comment Replies — AI Alignment Forum

Pretraining Language Models with Human Preferences

Since <|bad|> also includes glitchy code etc., maybe the AI is less capable in bad mode and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad

...

2 Tomek Korbak 2y

Can we efficiently distinguish different mechanisms?

Robert_AIZI2y46

Can you clarify what you mean by this, especially (i)?

In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.

In particular, do you mean that f(x) = 1 holds for all inputs x, or just for some particular x?

1 David Scott Krueger 2y

Yeah this was super unclear to me; I think it's worth updating the OP.

1 Ramana Kumar 2y

It means f(x) = 1 holds for some particular inputs, e.g., f(x_1) = 1 and f(x_2) = 1; the mechanism that explains why f(x_1) = 1 is distinct from the one that explains why f(x_2) = 1; and there is no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.
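
To make that reading concrete, here is a toy sketch in Python. It is not from the thread: the function `f`, the two branches labelled "negative" and "square", and the helper `same_mechanism` are all hypothetical, chosen only to show the shape of the definitions. Note that in this toy case the discriminator is trivially efficient, so it is not an example of the kind being asked for; the open question is whether some f exists for which no such efficient check is available.

```python
import math
from typing import Optional


def mechanism(x: int) -> Optional[str]:
    """Return a label for the branch that makes f(x) = 1, or None if f(x) = 0.

    Hypothetical toy setup: two disjoint reasons why f can output 1.
    """
    if x < 0:
        return "negative"          # mechanism A: x is negative
    root = math.isqrt(x)           # safe: x is non-negative here
    if x > 0 and root * root == x:
        return "square"            # mechanism B: x is a positive perfect square
    return None


def f(x: int) -> int:
    """f(x) = 1 exactly when one of the two mechanisms fires."""
    return 1 if mechanism(x) is not None else 0


def same_mechanism(x1: int, x2: int) -> bool:
    """Discriminator: given two instances with f = 1, do they share a mechanism?"""
    if f(x1) != 1 or f(x2) != 1:
        raise ValueError("both inputs must satisfy f(x) = 1")
    return mechanism(x1) == mechanism(x2)
```

Here `same_mechanism(-3, -5)` compares two instances that both go through the "negative" branch, while `same_mechanism(-3, 9)` compares instances that reach f(x) = 1 through different branches.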

All of Robert_AIZI's Comments + Replies