Robert_AIZI

Posts


56 · Comparing Anthropic's Dictionary Learning to Ours

2y

1

63 · Sparse Autoencoders Find Highly Interpretable Directions in Language Models

2y

7


Comments


Pretraining Language Models with Human Preferences

Robert_AIZI · 2y · 61

Since <|bad|> data also includes glitchy code and the like, the AI may be less capable in bad mode, and therefore not a threat. Here it would help to know what the AI produces when prompted with <|bad|>.
Just before public release, one could delete the <|bad|> token from the tokenizer and its embedding from the model parameters, so switching to evil mode would require rediscovering that token embedding.
[Edit: A third option would be to poison-pill bad mode during training, for instance by replacing 50% of <|bad|>-mode data with random noise. Ideally this would leave <|good|> mode unaffected and make <|bad|> mode useless from a capabilities perspective.]
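
To make the last two options concrete, here is a rough toy sketch in plain Python. Nothing in it comes from the paper: the `(tag, text)` example format, the dict-based vocabulary, the list-of-rows embedding table, and the function names `poison_bad_mode` and `drop_token` are all assumptions made for illustration.

```python
import random
import string

BAD, GOOD = "<|bad|>", "<|good|>"
NOISE_ALPHABET = string.ascii_letters + string.digits


def poison_bad_mode(examples, noise_fraction=0.5, seed=0):
    """Replace roughly `noise_fraction` of <|bad|>-tagged texts with random characters.

    `examples` is a list of (tag, text) pairs (a hypothetical data format).
    <|good|>-tagged examples are passed through unchanged.
    """
    rng = random.Random(seed)
    poisoned = []
    for tag, text in examples:
        if tag == BAD and rng.random() < noise_fraction:
            # Keep the original length, but make the content pure noise.
            text = "".join(rng.choice(NOISE_ALPHABET) for _ in text)
        poisoned.append((tag, text))
    return poisoned


def drop_token(vocab, embeddings, token):
    """Remove `token` from a toy vocab and its row from a toy embedding table.

    `vocab` maps token string -> row index; `embeddings` is a list of rows.
    Returns new objects and leaves the inputs untouched.
    """
    removed = vocab[token]
    new_embeddings = [row for i, row in enumerate(embeddings) if i != removed]
    # Token ids after the removed row shift down by one.
    new_vocab = {
        t: (i - 1 if i > removed else i) for t, i in vocab.items() if t != token
    }
    return new_vocab, new_embeddings
```

In a real model the deletion would presumably also have to cover any output-side (unembedding) row for the token, and the noise would be injected at the token level, not the character level; this sketch ignores both.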


Can we efficiently distinguish different mechanisms?

Robert_AIZI · 2y · 46

Can you clarify what you mean by this, especially (i)?

"In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples."

In particular, do you mean that f(x) = 1 holds for all inputs x, or just for some particular x, or something else?
