All of DanielVarga's Comments + Replies

I’ll say that a model linearly represents a binary feature f if there is a linear probe out of the model’s latent space which is accurate for classifying f.

 

If a model linearly represents features a and b, then it automatically linearly represents a∧b and a∨b.

I think I misunderstand your definition. Let feature a be represented by x_1 > 0.5, and let feature b be represented by x_2 > 0.5, where x_1, x_2 are iid uniform on [0, 1]. Isn't that a counterexample to (a AND b) being linearly representable?
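To see why this is a counterexample, here is a small Monte-Carlo sketch (my own illustration, not from the thread). It sweeps thresholds for probes of the form x_1 + x_2 > t — by the symmetry of the distribution and the label, the direction (1, 1) is the natural best candidate for a linear probe — and shows the best accuracy stays well below 1.0 even though the label is a deterministic function of x:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=(n, 2))  # x_1, x_2 iid uniform on [0, 1]

a = x[:, 0] > 0.5
b = x[:, 1] > 0.5
label = a & b  # the feature "a AND b"

# Sweep thresholds for the probe x_1 + x_2 > t.  (Assumption: we only
# sweep the symmetric direction (1, 1); for this direction the optimal
# accuracy works out to 11/12 ~ 0.917, attained near t = 4/3.)
s = x.sum(axis=1)
ts = np.linspace(0.0, 2.0, 401)
best_linear = max(((s > t) == label).mean() for t in ts)

# The defining (nonlinear) rule itself is of course perfect.
exact = ((a & b) == label).mean()

print(best_linear)  # well below 1.0
print(exact)        # 1.0
```

The geometric reason: the positive region {x_1 > 0.5 and x_2 > 0.5} is the intersection of two half-planes, and no single half-plane can separate that quadrant from the L-shaped complement.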

Sam Marks
Thanks, you're correct that my definition breaks in this case. I will say that this situation is a bit pathological for two reasons:

1. The mode of a uniform distribution doesn't coincide with its mean.
2. The variance of the multivariate uniform distribution U([0,1]×[0,1]) is largest along the direction x_1 + x_2, which is exactly the direction which we would want to represent a AND b.

I'm not sure exactly which assumptions should be imposed to avoid pathologies like this, but maybe something of the form: we are working with boolean features f whose class-conditional distributions D(⋅|f), D(⋅|¬f) satisfy properties like

* D(⋅|f), D(⋅|¬f) are unimodal, and their modes coincide with their means.
* The variance of D(⋅|f), D(⋅|¬f) along any direction is not too large relative to the difference of the means, E_{x∼D(⋅|f)}[x] − E_{x∼D(⋅|¬f)}[x].
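The quantities in these proposed conditions are easy to estimate for the uniform counterexample. The sketch below (my own, with variable names chosen for illustration) takes f = a AND b, computes the class-conditional means, and measures the spread of each class along the mean-difference direction — the signal-to-noise comparison the second bullet asks about:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(0.0, 1.0, size=(n, 2))
f = (x[:, 0] > 0.5) & (x[:, 1] > 0.5)  # the problematic feature a AND b

mu_f = x[f].mean(axis=0)       # estimate of E_{x~D(.|f)}[x]
mu_not = x[~f].mean(axis=0)    # estimate of E_{x~D(.|not f)}[x]
diff = mu_f - mu_not           # difference of the class means
u = diff / np.linalg.norm(diff)  # unit vector along the mean difference

gap = diff @ u                 # separation of the class means along u
std_f = (x[f] @ u).std()       # spread of D(.|f) along u
std_not = (x[~f] @ u).std()    # spread of D(.|not f) along u

print(gap, std_f, std_not)
```

For this example the mean-difference direction is (up to noise) the diagonal (1, 1)/√2, the gap comes out around 0.47, and the class-conditional standard deviations along it are roughly 0.14 and 0.22 — so the spread is a sizable fraction of the gap, on top of the classes (especially the L-shaped ¬f region) failing the unimodality-style condition.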