All of Andrew Mack's Comments + Replies

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) comes from i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g., through adversarial inputs), you don't know whether you've simply elicited a weird sequence of t...

Yes, I meant the unsupervised steering objective (the magnitude of downstream activation changes), as opposed to cross-entropy.
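
For readers who don't have the post open, here is a minimal self-contained sketch of the kind of objective I mean, on a toy stand-in for a transformer's residual stream. The exact placement of the p, q exponents is an illustrative guess, not the post's precise formulation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer's residual stream: residual MLP blocks
# acting on activations of shape (seq_len, d_model).
d_model, n_layers, seq_len = 64, 8, 10
blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                  nn.Linear(d_model, d_model))
    for _ in range(n_layers)
])

def run(x, theta=None, src_layer=2, tgt_layer=6):
    """Forward pass returning target-layer activations; if given, `theta`
    is added to the residual stream at `src_layer` at every token."""
    for i, block in enumerate(blocks):
        if theta is not None and i == src_layer:
            x = x + theta              # broadcasts over token positions
        x = x + block(x)               # residual update
        if i == tgt_layer:
            return x
    return x

def objective(x, theta, p=2, q=1):
    """Unsupervised steering objective: magnitude of the downstream change
    in target-layer activations induced by theta. Written with squared
    norms so the objective is smooth at theta = 0."""
    z_clean = run(x).detach()                  # unsteered activations
    diff = run(x, theta) - z_clean
    per_token = diff.pow(2).sum(dim=-1)        # squared 2-norm per token
    return (per_token ** (p / 2)).sum() ** q

# Gradient ascent on theta, with the norm constraint ||theta|| = R
# enforced by projecting back onto the sphere after each step.
R = 4.0
theta = torch.randn(d_model)
theta = (R * theta / theta.norm()).requires_grad_(True)
x = torch.randn(seq_len, d_model)
for _ in range(100):
    loss = -objective(x, theta)
    loss.backward()
    with torch.no_grad():
        theta -= 0.1 * theta.grad
        theta *= R / theta.norm()
        theta.grad = None
```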

Thanks for pointing me to these references, particularly NoiseCLR! (I was unaware of it previously.) I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger dataset of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well...
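
To make that cosine-similarity idea concrete, here is a minimal sketch (reusing `run` and `objective` from the sketch above; averaging over token positions is one arbitrary choice for reducing each prompt's change to a single vector):

```python
import torch

def cross_prompt_alignment(theta, prompts):
    """Mean pairwise cosine similarity between the per-prompt changes in
    target activations induced by `theta` -- roughly the numerator of
    NoiseCLR's eq. (5): it rewards a vector that induces a *shared*
    change across prompts rather than a prompt-specific quirk."""
    deltas = [(run(x, theta) - run(x).detach()).mean(dim=0) for x in prompts]
    d = torch.stack([v / v.norm() for v in deltas])   # (n_prompts, d_model)
    sims = d @ d.T                                    # pairwise cosines
    n = len(prompts)
    return (sims.sum() - sims.diagonal().sum()) / (n * (n - 1))

# e.g. add it (weighted) to the magnitude objective and ascend as before:
# prompts = [torch.randn(seq_len, d_model) for _ in range(4)]
# total = objective(prompts[0], theta) + 10.0 * cross_prompt_alignment(theta, prompts)
```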

Yes, the learned vectors are always applied at every token (for all examples).
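
Concretely, "applied at every token" just means the vector is broadcast over the batch and sequence dimensions when added to the residual stream. On a real model this can be done with a forward hook along these lines (a sketch; the exact layer path and output structure vary by model family):

```python
import torch

def make_steering_hook(theta):
    """Forward hook that adds `theta` to a layer's output at every token
    position (broadcast over the batch and sequence dimensions)."""
    def hook(module, inputs, output):
        # Many HF decoder layers return a tuple; hidden states come first.
        if isinstance(output, tuple):
            return (output[0] + theta,) + output[1:]
        return output + theta
    return hook

# Illustrative usage (the layer path varies by model family):
# handle = model.model.layers[8].register_forward_hook(make_steering_hook(theta))
# ... run generation ...
# handle.remove()
```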

I haven't tried the first singular vector of the Jacobian between layers. But for p=2, q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around θ=0) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting, regardless of norm. So my feeling is that full-blown gradient descent is needed.
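
For reference, those Hessian eigenvectors can be estimated without materializing the full Hessian, via power iteration on Hessian-vector products (a sketch, reusing `objective` from above; this finds the largest-eigenvalue direction, which at θ=0 is the direction of fastest quadratic growth of the objective):

```python
import torch

def top_hessian_eigvec(f, theta0, n_iters=100):
    """Power iteration on Hessian-vector products of `f` at `theta0`."""
    v = torch.randn_like(theta0)
    v /= v.norm()
    for _ in range(n_iters):
        theta = theta0.clone().requires_grad_(True)
        (g,) = torch.autograd.grad(f(theta), theta, create_graph=True)
        (hv,) = torch.autograd.grad(g @ v, theta)   # Hessian-vector product
        v = hv / hv.norm()
    return v   # the Rayleigh quotient v @ hv recovers the eigenvalue

# x = torch.randn(seq_len, d_model)
# v = top_hessian_eigvec(lambda th: objective(x, th), torch.zeros(d_model))
# Then steer with R * v for various norms R, as described above.
```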

tailcalled
The singular vectors of the Jacobian between two layers seem more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations were scaled up beyond infinitesimal. Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in context it's more likely you meant the objective function for the magnitude of change in later-layer activations induced by a given activation vector?
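
For concreteness, the inter-layer Jacobian's top singular vector that tailcalled describes can likewise be found matrix-free, by power iteration on JᵀJ using the standard `torch.autograd.functional` forward- and reverse-mode products (a sketch, again against the toy `run` above):

```python
import torch
from torch.autograd.functional import jvp, vjp

def top_jacobian_singular_vec(g, delta0, n_iters=50):
    """Power iteration on J^T J for the map `g` (perturbation at the source
    layer -> target-layer activations), with the Jacobian evaluated at
    `delta0`. Returns the input direction the network amplifies most
    between the two layers."""
    v = torch.randn_like(delta0)
    v /= v.norm()
    for _ in range(n_iters):
        _, jv = jvp(g, delta0, v)    # J v      (forward-mode product)
        _, w = vjp(g, delta0, jv)    # J^T J v  (reverse-mode product)
        v = w / w.norm()
    return v

# Flatten the target activations so g's output is a single vector:
# x = torch.randn(seq_len, d_model)
# g = lambda d: run(x, d).reshape(-1)
# v = top_jacobian_singular_vec(g, torch.zeros(d_model))
```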

Thanks for your comment! Here are my thoughts on this:

  1. I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs:
    1. Develop some useful heuristics based on diversity measures of steered completions (a rough sketch follows after this list). For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your
...
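
A rough sketch of the diversity heuristic in 1.a (the embedding model named here is just a common default, and `generate_steered_completions` is a hypothetical helper standing in for the actual train-and-sample loop):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works

def diversity_score(completions):
    """Summed per-dimension variance of the completions' sentence
    embeddings; higher = more behavioral diversity across learned vectors."""
    emb = embedder.encode(completions)               # (n_completions, dim)
    return float(np.var(emb, axis=0).sum())

# Hypothetical sweep: pick the R where diversity peaks (or at its 'elbow').
# scores = {R: diversity_score(generate_steered_completions(R))
#           for R in (1.0, 2.0, 4.0, 8.0, 16.0)}
```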