Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq , why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn't downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?
I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220
to suggest _post
but downloading the SAETransformer from wandb shows:(saes):
ModuleDict( (blocks-6-hook_resid_pre):
SAE( (encoder): Sequential( (0):...
which suggests _pre.
Kind of confused on why the KL-only e2e SAE have worse CE than e2e+downstream across dictionary size:
This is true for layers 2 & 6. I'm unsure if this means that training for KL directly is harder/unstable, and the intermediate MSE is a useful prior, or if this is a difference in KL vs CE (ie the e2e does in fact do better on KL but worse on CE than e2e+downstream).
I finally checked!
Here is the Jaccard similarity (ie similarity of input-token activations) across seeds
The e2e ones do indeed have a much lower jaccard sim (there normally is a spike at 1.0, but this is removed when you remove features that only activate <10 times).
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[I, again, needed to remove dead features (< 10 activations) to get the graphs here.]
So yes, I believe the original paper's claim that e2e features learn quite different features across seeds is substantiated.
And here's the code to convert it to NNsight (Thanks Caden for writing this awhile ago!)
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
from nnsight.models.UnifiedTransformer import UnifiedTransformer
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")
# Undo my hacky LayerNorm removal
for block in model.transformer.h:
block.ln_1.weight.data = block.ln_1.weight.data / 1e6
block.ln_1.eps = 1e-5
block.ln_2.weight.data = block.ln_2.weight.data / 1e6
block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5
# Properly replace LayerNorms by Identities
def removeLN(transformer_lens_model):
for i in range(len(transformer_lens_model.blocks)):
transformer_lens_model.blocks[i].ln1 = torch.nn.Identity()
transformer_lens_model.blocks[i].ln2 = torch.nn.Identity()
transformer_lens_model.ln_final = torch.nn.Identity()
hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(hooked_model)
model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(model_nnsight)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
prompt = torch.tensor([1,2,3,4], device=device)
logits = hooked_model(prompt)
with torch.no_grad(), model_nnsight.trace(prompt) as runner:
logits2 = model_nnsight.unembed.output.save()
logits, cache = hooked_model.run_with_cache(prompt)
torch.allclose(logits, logits2)
Did y'all do any ablations on your loss terms. For example:
1. JumpReLU() -> ReLU
2. L0 (w/ STE) -> L1
I'd be curious to see if the pareto improvements and high frequency features are due to one, the other, or both
Regarding urls, I think this is a mix of the HH dataset being non-ideal & the PM not being a great discriminator of chosen vs rejected reward (see nostalgebraist's comment & my response)
I do think SAE's find the relevant features, but inefficiently compressed (see Josh & Isaac's work on days of the week circle features). So an ideal SAE (or alternative architecture) would not separate these features. Relatedly, many of the features that had high url-relevant reward had above-random cos-sim with each other.
[I also think the SAE's could be optimized to trade off some reconstruction loss for reward-difference loss which I expect to show a cleaner effect on the reward]
The PM is pretty bad (it's trained on hh).
It's actually only trained after the first 20k/156k datapoints in hh, which moves the mean reward-diff from 1.04 -> 1.36 if you only calculate over that remaining ~136k subset.
My understanding is there's 3 bad things:
1. the hh dataset is inconsistent
2. The PM doesn't separate chosen vs rejected very well (as shown above)
3. The PM is GPT-J (7B parameter model) which doesn't have the most complex features to choose from.
The in-distribution argument is most likely the case for the "Thank you. My pleasure" case, because the assistant never (AFAIK, I didn't check) said that phrase as a response. Only "My pleasure" after the user said " thank you".
I prefer when they are directly mentioned in the post/paper!
That would be a more honest picture. The simplest change I could think of was adding it to the high-level takeaways.
I do think you could use SAE features to beat that baseline if done in the way specified by General Takeaways. Specifically, if you have a completion that seems to do unjustifiably better, then you can find all feature's effects on the rewards that were different than your baseline completion.
Features help come up with hypotheses, but also isolates the effect. If do have a specific hypothesis as mentioned, then you should be able to find features that capture that hypothesis (if SAEs are doing their job). When you create some alternative completion based on your hypothesis, you might unknowingly add/remove additional negative & positive features e.g. just wanting to remove completion-length, you also remove the end-of-sentence punctuation.
In general, I think it's hard to come up with the perfect counterfactual, but SAE's at least let you know if you're adding or removing specific reward-relevant features in your counterfactual completions.
Is there code available for this?
I'm mainly interested in the loss fuction. Specifically from footnote 4:
I'm unsure how this is implemented or the motivation.