Andrew Mack - AI Alignment Forum

Fuzzing LLMs sometimes makes them reveal their secrets

Nice post!

I agree that an important goal of MELBO is to elicit "complex many-token behaviors" (this is a current priority of mine).

You may want to check out my recent results on eliciting password-locked performance on DeepSeek-Math-7B. Using my new training algorithm for finding MELBO vectors, it's possible to find a vector which increases MATH performance from 3% to 23% on password-locked MATH.

The new algorithm is much more efficient than the sequential training procedure from the original post, to the point that I'm currently bottlenecked by inference (even when using vllm). It also comes with a decent heuristic procedure for choosing R. In my experience, if you have a data-set of at least prompts, it's pretty good at finding a bunch of vectors corresponding to the most "salient" concept in the data-set (e.g. "jailbreak vectors" when trained on harmful instructions, or "capability-increasing vectors" when trained on password-locked models).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y10

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.

Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”

I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y10

Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y22

Thanks for pointing me to these references, particularly on NoiseCLR! (I was unaware of it previously). I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger data-set of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y30

Yes, the learned vectors are always applied at every token (for all examples).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y10

I haven't tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around ) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y10

Thanks for your comment! Here are my thoughts on this:

I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs:
1. Develop some useful heuristics based off diversity measures of steered completions. For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your diversity metric. Then, plot diversity as a function of R. My guess is that this might look something like a “hockey stick”: for small values of R you get essentially zero diversity, but you quickly hit a threshold/phase transition where diversity sky-rockets. The values of R with super high diversity are probably not what you want (they will be incoherent). Instead, you want a value of R right at the cusp of the phase transition. You could then scale out to thousands of vectors at this “cusp” value of R. The trick is coming up with a useful heuristic for identifying the cusp, which would likely require more experience applying the method to a diverse range of prompts/examples. Ultimately, this feels largely analogous to identifying “elbows” in scree plots. This obviously depends on there being obvious cusps in the first place (a priori it's not clear if this will happen).
2. Fine-tune an LLM to identify the best value of R. This is similar to the above idea, except that instead of trying to come up with a heuristic for identifying cusps, we use a fine-tuned LLM as the heuristic. In other words, we manually decide the best R for a number of example prompts and fine-tune the LLM to predict which value of R is best on a new prompt we care about (you could imagine a number of different ways of specifying the details of the fine-tuning).
3. Just use many different values of R. If we’re already training thousands of vectors and using a trusted LLM to flag weird behavior, then multiplying this by N values of R may not be too costly, for N reasonably sized.
Your point about distinguishing between intentional backdoors and strange OOD behavior seems pretty important. I haven’t thought carefully about how you might distinguish the two in general, but I have thought a little bit about whether steering vectors might be helpful in recovering triggers. In particular, if the backdoor is something like “string X in the prompt elicits string Y in the response”, then I have this vague intuition that GCG-like attacks to discover X might work better if they targeted cosine-similarity with steering vectors which elicit Y (as you suggest). My reasoning here is that it “feels like” the optimization landscape should be more well-behaved if you’re targeting similarity in activations at some intermediate layer, as opposed to targeting at the logits level, the general principle being that if there are fewer layers between the tokens and the target then this should make things easier (although it’s by no means obvious to me that this hypothesis is true). So this is definitely an idea I would be interested in trying at some point. But given the difficulty of the challenge I would probably start with supervised steering vectors, and if results are good results then try with unsupervised steering vectors.
1. Alternatively, here’s an approach for discovering the trigger X which doesn’t rely on GCG: say we know the distribution of clean prompts, and we have some steering vector which elicits some bad behavior Y. We then train some steering vector $θ_{p r o m p t}$ to elicit the clean prompt distribution, i.e. so that if we start with an empty prompt (“” or <BOS>) and then generate steered by $θ_{p r o m p t}$ , then we get samples from the clean prompt distribution. Then to generate from the triggered distribution, we simply steer with $θ_{p r o m p t} + θ_{Y} .$ Intuitively, the hope is to use steering vector arithmetic to get around the difficulties of sampling LLMs backwards. My guess is this basic version still wouldn’t work (a priori it’s not clear why $θ_{Y}$ would elicit both the behavior Y and the trigger X), but you could imagine something more sophisticated like: when sampling token t, you upweight tokens which would cause the layer-L residual stream of token t+1 to have high cosine similarity with $θ_{Y} .$ Again, it’s not at all clear if this would work, so to make things tractable I’d first try with vectors trained on some known Y, then if results are good move on to unsupervised vectors.

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments