All of Dan Braun's Comments + Replies

Every SAE in the paper is hosted on wandb, but only some are hosted on huggingface, so I suggest loading them from wandb for now. We'll upload more to huggingface if several people prefer that. Info for downloading from wandb can be found in the repo; the easiest way is probably:

# pip install e2e_sae
# Save your wandb api key in .env
from e2e_sae import SAETransformer
model = SAETransformer.from_wandb("sparsify/gpt2/d8vgjnyc")
sae = list(model.saes.values())[0] # Assumes only 1 SAE in the model
... (read more)
Logan Riggs Smith
I finally checked! Here is the Jaccard similarity (i.e. similarity of input-token activations) across seeds. The e2e ones do indeed have a much lower Jaccard sim (there is normally a spike at 1.0, but this goes away when you remove features that activate <10 times).

I also (mostly) replicated the decoder similarity chart, and calculated the encoder sim. [I, again, needed to remove dead features (<10 activations) to get the graphs here.]

So yes, I believe the original paper's claim that e2e SAEs learn quite different features across seeds is substantiated.
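(For concreteness, a minimal sketch of how a per-feature Jaccard similarity of input-token activations across two seeds could be computed; this is not the code behind the plots above, and the activation matrices here are synthetic stand-ins.)

# Hedged sketch: Jaccard similarity between the sets of tokens on which features
# fire, for SAEs trained with two different seeds. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for (n_tokens, n_features) feature-activation matrices from two seeds.
acts_seed0 = rng.random((5000, 512)) * (rng.random((5000, 512)) < 0.02)
acts_seed1 = rng.random((5000, 512)) * (rng.random((5000, 512)) < 0.02)

def jaccard_matrix(acts_a, acts_b):
    a = (acts_a > 0).astype(np.float32)    # which tokens each feature fires on
    b = (acts_b > 0).astype(np.float32)
    intersection = a.T @ b                 # co-activation counts, (feats_a, feats_b)
    union = a.sum(0)[:, None] + b.sum(0)[None, :] - intersection
    return intersection / np.maximum(union, 1)

# Drop dead/near-dead features (<10 activations), as in the plots above.
alive0 = (acts_seed0 > 0).sum(0) >= 10
alive1 = (acts_seed1 > 0).sum(0) >= 10
jac = jaccard_matrix(acts_seed0[:, alive0], acts_seed1[:, alive1])
best_match = jac.max(axis=1)               # best cross-seed match per feature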
Logan Riggs Smith
Thanks so much! All the links and info will save me time :)

Regarding cos-sim, after thinking a bit, I think it's more sinister. For the cross-cos-sim comparison, you get different results if you take the max over the 0th or 1st dimension (equivalent to doing cos(local, e2e) vs cos(e2e, local)). As an example, you could have 2 features each, where 3 point in the same direction and 1 points opposite. Making up numbers: feature directions (1D) = [[1], [1]] & [[1], [-1]], so cos-sim = [[1, 1], [-1, -1]]. For more intuition, suppose 4 local features surround 1 e2e feature (and the other features point elsewhere). Then the 4 local features will all have high max-cos-sim, but only 1 e2e feature will. So it's not just double-counting, but quadruple-counting. You can see for yourself if you swap your dim=1 to 0 in your code. But my original comment showed your results are still directionally correct when doing [global max w/ replacement] (if I coded it correctly).

The decoder directions have degrees of freedom, but the encoder directions... might have similar degrees of freedom and I'm wrong, lol. BUT! They might be functionally equivalent, so they activate on similar datapoints across seeds. That is more laborious to check though, waaaah.

I can check both (encoder directions first), because previous literature is really only on the SVD of the gradient (i.e. the output), but an SAE might be more constrained when separating out inputs into sparse features. Thanks for prompting for my intuition!
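(A tiny sketch of that asymmetry, using only the made-up 1D numbers above; torch is just for illustration.)

# The max-over-dim asymmetry with the made-up example: e2e directions [1], [-1]
# vs local directions [1], [1].
import torch
import torch.nn.functional as F

e2e = F.normalize(torch.tensor([[1.0], [-1.0]]), dim=1)
local = F.normalize(torch.tensor([[1.0], [1.0]]), dim=1)
cos = e2e @ local.T              # [[1, 1], [-1, -1]]; rows = e2e, cols = local
print(cos.max(dim=1).values)     # best local match per e2e feature: [1, -1]
print(cos.max(dim=0).values)     # best e2e match per local feature: [1, 1]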
Dan Braun

Thanks Logan!

2. Unlike local SAEs, our e2e SAEs aren't trained on reconstructing the current layer's activations. So at least my expectation was that they would get a worse reconstruction error at the current layer.

Improving training times wasn't our focus for this paper, but I agree it would be interesting and expect there to be big gains to be made by doing things like mixing training between local and e2e+downstream and/or training multiple SAEs at once (depending on how you do this, you may need to be more careful about taking different pathways of com... (read more)
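(To make the training-signal difference above concrete, here is a rough sketch of how the two objectives differ; this is a paraphrase under assumptions, not code from the repo, and the tensor names and coefficient values are made up.)

# Rough sketch of the training-objective difference (a paraphrase, not repo code).
import torch.nn.functional as F

def local_sae_loss(sae_out, orig_acts, feature_acts, sparsity_coeff=1e-3):
    # Local SAE: directly reconstruct the current layer's activations.
    return F.mse_loss(sae_out, orig_acts) + sparsity_coeff * feature_acts.abs().sum(-1).mean()

def e2e_sae_loss(logits_with_sae, logits_orig, feature_acts, sparsity_coeff=1e-3):
    # e2e SAE: match the original model's output distribution when the SAE is
    # spliced in; nothing directly rewards reconstructing the current layer.
    kl = F.kl_div(logits_with_sae.log_softmax(-1), logits_orig.log_softmax(-1),
                  log_target=True, reduction="batchmean")
    return kl + sparsity_coeff * feature_acts.abs().sum(-1).mean()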

Nice post.

Pushing back a little on this part of the appendix:

Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.

I'm a bit concerned about people assuming this is true for models going forward. A sufficient... (read more)

Buck Shlegeris
Yep, I agree that we need to avoid secret scratchpads for this to hold; good point. If we don't, we get back to the position we are usually in for capability evaluations.

Nice project and writeup. I particularly liked the walkthrough of the thought processes throughout the project.

Decision square's Euclidean distance to the top-right corner, positive.

We are confused and don't fully understand which logical interactions produce this positive regression coefficient.

I'd be wary about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.

It might be worth making a cross-correlation plot of the features. This won't give you a new coeff... (read more)
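(A minimal sketch of the kind of plot meant here; the column names and data are made up, not the post's actual features.)

# Feature cross-correlation plot with made-up column names and synthetic data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d_step = rng.normal(size=500)
df = pd.DataFrame({
    "steps(decision, cheese)": d_step,
    "euclid(decision, cheese)": d_step + 0.3 * rng.normal(size=500),  # deliberately correlated
    "euclid(cheese, top-right)": rng.normal(size=500),
})

corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()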

Alex Turner
We just posted Behavioral statistics for a maze-solving agent.

TL;DR: You raise a reasonable worry, but the three key variables[1] have stable signs and seem like legit decision-making factors. The variable you quote indeed seems to be a statistical artifact, as we speculated.[2]

----------------------------------------

There is indeed a strong correlation between two[3] of our highly predictive variables: d_step(decision-square, cheese) and d_Euclidean(decision-square, cheese) have a correlation of .886. We computed the variance inflation factors for the three predictive variables. VIF measures how collinearity increases the variance of the regression coefficients. A score exceeding 4 is considered a warning sign of multicollinearity.

Attribute                                                VIF
Euclidean distance between cheese and top-right square   1.05
Steps between cheese and decision-square                 4.64
Euclidean distance between cheese and decision-square    4.66

So we're at risk here. However, we re-isolated these three variables as both:

* Predictively useful on their own, and
* Showing no/extremely rare sign-flipping when regressing upon randomly selected subsets of variables.

Considering a range of regressions on a range of train/validation splits, these variables have stable regression coefficient signs and somewhat stable coefficient magnitudes. (Although we don't mean for our analysis to be predicated on the magnitudes themselves; we know these are unreliable and contingent quantities!)

Furthermore, we regressed upon 200 random subsets of our larger set of variables, and the cheese/decision-square distance regression coefficients never experienced a sign flip. The cheese/top-right Euclidean distance had a few sign flips. The other variables sign-flip frequently.

We reran this analysis for a second dataset of 10,000 trajectories, and the analysis was the same, with the exception of d_Euclidean(decision-square, cheese) failing to be predictive in certain regressions in the second dataset. Not sure what's up wit... (read more)
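(For reference, a hedged sketch of the two checks described above: the VIF computation and the sign-flip check over random subsets of regressors. The variable names and data handling are placeholders, not the actual analysis code.)

# Sketch of the VIF and coefficient sign-stability checks (placeholder names/data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression

def vifs(X: pd.DataFrame) -> pd.Series:
    Xc = sm.add_constant(X)  # include an intercept so VIFs aren't artificially inflated
    return pd.Series(
        [variance_inflation_factor(Xc.values, i + 1) for i in range(X.shape[1])],
        index=X.columns,
    )

def sign_flip_rate(X: pd.DataFrame, y: np.ndarray, key_var: str,
                   n_subsets: int = 200, seed: int = 0) -> float:
    """Fraction of random-subset regressions in which key_var's coefficient flips sign."""
    rng = np.random.default_rng(seed)
    others = [c for c in X.columns if c != key_var]
    signs = []
    for _ in range(n_subsets):
        k = int(rng.integers(1, len(others) + 1))
        subset = list(rng.choice(others, size=k, replace=False))
        coefs = LinearRegression().fit(X[[key_var] + subset], y).coef_
        signs.append(np.sign(coefs[0]))   # coefficient on key_var
    signs = np.array(signs)
    majority = np.sign(signs.sum()) or 1.0
    return float((signs != majority).mean())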