Roger Dearnaley


I expect (1) and (2) to go mostly fine (though see footnote[6]), and for (3) to completely fail. For example, suppose [the classifier] has a bunch of features for things like “true according to smart humans.” How will we distinguish those from features for “true”? I think there’s a good chance that this approach will reduce the problem of discriminating [desired vs. undesired classifiers] to an equally-difficult problem of discriminating desired vs. undesired features.

However, if we found that the classifier was using a feature for "How smart is the human asking the question?" to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.

Model / Finetune                        Global Mean (Cosine) Similarity
Gemma-2b / Gemma-2b-Python-codes        0.6691
Mistral-7b / Mistral-7b-MetaMath        0.9648
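(For concreteness, here is what I'm picturing when I interpret that number: the mean cosine similarity between the two models' hidden states at the SAE's layer, averaged over token positions of a shared evaluation set. That's my assumption about the metric, not necessarily how it was actually computed, so treat this as a rough sketch only.)

```python
# Rough sketch, under my assumption of what "global mean cosine similarity" measures:
# mean cosine between the base and fine-tuned models' hidden states at a chosen
# layer, averaged over token positions of a shared set of inputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_activation_cosine(base_name, tuned_name, texts, layer=6):
    # Assumes both checkpoints share a tokenizer (true for base vs. fine-tune pairs).
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name, output_hidden_states=True)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, output_hidden_states=True)
    cosines = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt")
            h_base = base(**ids).hidden_states[layer][0]    # (seq_len, d_model)
            h_tuned = tuned(**ids).hidden_states[layer][0]  # (seq_len, d_model)
            cosines.append(
                torch.nn.functional.cosine_similarity(h_base, h_tuned, dim=-1).mean()
            )
    return torch.stack(cosines).mean().item()
```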

As an ex-Googler, my informed guess would be that Gemma 2B will have been distilled (or perhaps both pruned and distilled) from a larger teacher model, presumably some larger Gemini model — and that Gemma-2b and Gemma-2b-Python-codes may well have been distilled separately, as students of two similar-but-not-identical teacher models, using different teaching datasets. The fact that the global mean cosine you find here isn't ~0 shows that, if so, the separate distillation processes were either warm-started from similar models (presumably a weak 2B model — a sensible way to save some distillation expense), or at least shared the same initial token embeddings/unembeddings.

Regardless of how they were created, these two Gemma models clearly differ pretty significantly, so I'm unsurprised by your subsequent discovery that the SAE basically doesn't transfer between them.

For Mistral 7B, I would be astonished if any distillation was involved; I would expect just a standard combination of fine-tuning followed by either RLHF or something along the lines of DPO. In a very high-dimensional space, a cosine of 0.96 means "almost identical", so clearly the instruct training here consists of fairly small, targeted changes, and I'm unsurprised that as a result the SAE transfers quite well.
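(A toy illustration of the high-dimensional point, nothing from the post itself: two independent random directions in a few thousand dimensions have cosine very close to zero, so a value like 0.96 only happens if the fine-tuning barely rotates the vectors.)

```python
# Toy demonstration: in d = 4096 dimensions, independent random vectors have
# cosine ~ 0 (stddev ~ 1/sqrt(d) ~= 0.016), while a vector plus a modest
# perturbation keeps cosine near 1.
import numpy as np

rng = np.random.default_rng(0)
d = 4096

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = rng.standard_normal(d), rng.standard_normal(d)
print(cosine(a, b))              # ~0.0, within about +/- 0.016

perturbed = a + 0.3 * rng.standard_normal(d)   # small, broadly-spread change
print(cosine(a, perturbed))      # ~0.96
```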

knowing if the intentions of a user are legitimate is very difficult

This sounds more like a security and authentication problem than an AI problem.

Side note: skills vs facts. I framed some of the approaches as teaching false facts / removing some facts - and concrete raw facts are often the easiest thing to evaluate. But I think that the same approach can be used to remove the information that makes LLMs able to use most skills. For example, writing Python code requires knowledge of the Python syntax, and writing recursive functions requires knowledge of certain patterns (at least for current LLMs, which don’t have the ability to rederive recursion from scratch, especially without a scratchpad), both of which could be unlearned just like things that are more clearly “facts”.

If you read e.g. "Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level", current thinking in mechanistic interpretability is that simple memorization of otherwise random facts (i.e. ones where there are no helpful "rules of thumb" to get many cases right) uses different kinds of learned neural circuits than learning of skills does. If that is in fact the case, then unlearning of facts and skills might have different characteristics. In particular, for learnt skills, if mechanistic interpretability can locate and interpret the learned circuit implementing a skill, then we can edit that circuit directly (as has been done successfully in a few cases). However, we've so far had no luck interpreting the neural circuitry for large collections of basically-random facts, and there are some theoretical arguments suggesting that such circuitry may be inherently hard to interpret, at least with current techniques.

I think something that might be quite illuminating for this factual-recall question (as well as having potential safety uses) is editing the facts. Suppose, for this case, you train just the layer 2-6 MLPs with frozen attention: warm-start from the existing model, add a loss term that penalizes changes away from that initialization (perhaps using a combination of L1 and L2 norms), and train on a dataset consisting of your full set of athlete-name embeddings and their sports (in a frequency ratio matching the original training data, so with more copies of data about more famous athletes), but with one factual edit changing the sport of one athlete. Train until the model has memorized this slightly edited set of facts reasonably well, and then look at the pattern of weights and neurons that changed significantly. Repeat several times for the same athlete to see how consistent/stochastic the results are, and then repeat for many athlete/sport edit combinations. This should give you a lot of information on how widely distributed the representation of a specific athlete's data is, and ought to be enough to distinguish not only between your single-step and hashing hypotheses, but also things like bagging and combinations of different algorithms. Even more interesting would be to do this not just for an athlete's sport, but also for some independent fact about them, like their number.
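To gesture at what I mean, here is a rough sketch of such a training loop. Everything in it is a placeholder rather than a tested recipe: the model, the MLP module-name pattern, the penalty coefficients, and the make_edited_athlete_dataset helper are all hypothetical stand-ins.

```python
# Sketch of the proposed fact-editing experiment: warm-start from the trained model,
# train only the layer 2-6 MLPs with everything else frozen, and penalize drift
# from the initialization with L1 + L2 terms. Names and numbers are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"                   # stand-in for whichever model the fact-recall work used
L1_COEF, L2_COEF = 1e-4, 1e-3    # placeholder deviation penalties
MLP_LAYERS = range(2, 7)         # only train the layer 2-6 MLPs

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
init_params = {n: p.detach().clone() for n, p in model.named_parameters()}

# Freeze everything except the layer 2-6 MLPs (module names are model-specific;
# ".h.{i}.mlp." matches GPT-2-style blocks).
trainable = []
for name, p in model.named_parameters():
    p.requires_grad = any(f".h.{i}.mlp." in name for i in MLP_LAYERS)
    if p.requires_grad:
        trainable.append((name, p))

opt = torch.optim.Adam([p for _, p in trainable], lr=1e-4)

# Hypothetical helper: returns "Athlete X plays <sport>" strings, with frequencies
# matching the original training data and exactly one athlete's sport changed.
edited_dataset = make_edited_athlete_dataset()

for text in edited_dataset:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    for name, p in trainable:    # penalize drift away from the warm start
        delta = p - init_params[name]
        loss = loss + L1_COEF * delta.abs().sum() + L2_COEF * delta.pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Afterwards, inspect which weights moved significantly:
changes = {n: (p - init_params[n]).abs().max().item() for n, p in trainable}
```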

Of course, this is computationally somewhat expensive, and you do need to find metaparameters that don't make it too expensive, while encouraging the resulting change to be as localized as possible, consistent with getting a good score on the altered fact.