Epistemic status: I ran some experiments with surprising results, and I don’t have a clear intuition of what it means for the way concepts are encoded in model activations. Help is welcome!

[EDIT] I made a mistake while analyzing the data, and it changes some of the early interpretations of the results. It doesn't change the fact that using a classifier is a poor way of finding the directions which matter most, but it lifts most of the mystery.

[EDIT 2] I found a satisfying explanation of the phenomenon in Haghighatkhah, 2022. Helpful pictures are included in the paper.

The mystery

If you want to find the “direction” corresponding to a concept in a Large Language Model, “natural” methods fail to find a crucial direction: you need to remove >5 dimensions before the model’s ability to use the concept is affected. This seems to show that the concept is encoded across “a large number of dimensions”. But adversarial methods are able to find a single direction which is quite significant (large effect when “removed”). How can that be? What does it mean about how information is encoded by LLMs?

More details

Let’s say you want to remove the concept of gender from GPT-2’s activations between layers 23 and 24 (out of 48). (Both are >100-dimensional vectors.) A natural thing to do is to gather a bunch of sentences about guys and girls, train a linear classifier to predict whether a sentence is about guys or girls, and use the direction the linear classifier gives you as the “direction corresponding to the concept of gender”. Then, you can project GPT-2’s activations (at inference time) onto the hyperplane orthogonal to that direction so that the model no longer takes gender into account.
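Concretely, the projection step is just removing the component of each activation along the chosen direction. Here is a minimal numpy sketch (`project_out` is a hypothetical helper name, not something from the linked notebooks):

```python
import numpy as np

def project_out(activations, direction):
    """Project activations onto the hyperplane orthogonal to `direction`.

    activations: (n, d) array of residual-stream vectors
    direction:   (d,) vector found by the linear classifier
    """
    d = direction / np.linalg.norm(direction)
    # subtract each vector's component along d
    return activations - np.outer(activations @ d, d)

# Toy check: after projection, no component remains along the direction.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
direction = rng.normal(size=8)
projected = project_out(acts, direction)
assert np.allclose(projected @ (direction / np.linalg.norm(direction)), 0)
```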

This naive attempt fails: you can train a new classifier to label sentences as “guy-sentence” or “girl-sentence” based on the projected activations, and it will still get 90% accuracy (down from 95%); and if you project GPT-2’s activations in the middle of the network, it makes only a tiny difference in the output of the network. This method is called INLP (Ravfogel, 2020) (the original technique removes multiple directions to obtain a measurable effect).
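For intuition, a stripped-down version of the INLP loop might look like the following. This is only an illustrative sketch, not the paper’s implementation (the real algorithm projects onto the joint nullspace of all collected probe directions and supports “neutral” sentences):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(X, y, n_iters=5):
    """Simplified INLP sketch: repeatedly fit a linear probe,
    then project its weight direction out of the data."""
    X = np.array(X, copy=True)
    d = X.shape[1]
    P = np.eye(d)
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        w = clf.coef_[0]
        w = w / np.linalg.norm(w)
        P_i = np.eye(d) - np.outer(w, w)  # remove this probe's direction
        X = X @ P_i
        P = P @ P_i  # accumulate the composed projection
    return P
```

On toy data where the label is carried by a single coordinate, a few iterations are enough to bring a fresh probe close to chance accuracy; the surprising thing in the post is that on GPT-2 activations the model’s *behavior* barely changes even when this is done.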

But if you jointly train the projection and the linear classifier in an adversarial setup, you get very different results: accuracy drops to ~50%, and GPT-2’s behavior changes much more, sometimes outputting “she” when “he” is more appropriate. This is called RLACE (Ravfogel, 2022).
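The adversarial idea can be sketched as a minimax game: a rank-one projection tries to hide the concept while a linear probe tries to recover it. The sketch below is only illustrative (`rlace_sketch` is a hypothetical name; the actual RLACE algorithm of Ravfogel 2022 optimizes over orthogonal projection matrices with specialized solvers, not plain alternating Adam steps):

```python
import torch

def rlace_sketch(X, y, n_steps=200, lr=0.05):
    """Illustrative adversarial erasure: alternate between training a
    probe on projected data and training the projection to defeat it."""
    n, d = X.shape
    u = torch.randn(d, requires_grad=True)   # direction to remove
    w = torch.zeros(d, requires_grad=True)   # adversarial linear probe
    opt_u = torch.optim.Adam([u], lr=lr)
    opt_w = torch.optim.Adam([w], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(n_steps):
        u_hat = u / u.norm()
        Xp = X - (X @ u_hat)[:, None] * u_hat  # project out u
        # probe tries to classify the concept from projected activations
        opt_w.zero_grad()
        bce(Xp.detach() @ w, y).backward()
        opt_w.step()
        # projection tries to make the probe fail
        opt_u.zero_grad()
        (-bce(Xp @ w.detach(), y)).backward()
        opt_u.step()
    return (u / u.norm()).detach()
```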

Why is that?

The best explanation I have found for this discrepancy is that INLP (= using a classifier) finds the direction with a clear separation, and it has almost no impact on model behavior; while RLACE (= using an adversarially trained classifier and projection) finds a different direction (cosine similarity of 0.6), a direction where there is a large difference, and it has a much greater impact on model behavior.

[Edit (original images were wrong: titles were swapped)]

[Edit] Here is what it looks like in the plane defined by the directions found by INLP and RLACE: INLP finds a clear separation, while RLACE finds a direction with large-magnitude changes as well as very large features.

Is this something special about the distribution of neural activations?

No: if you use another “natural” distribution like GloVe word embeddings (which Ravfogel’s papers study), you will find that INLP fails to remove linearly available information, while RLACE is much better at it.

However, that’s not that surprising given the explanation above: the directions where there is a crisp distinction between two opposite concepts are not the ones along which the magnitude of the changes is largest.

But just because information is linearly available doesn’t mean the network will be able to use it, especially if the direction you remove is one where the magnitude of the activations is large: I’m still surprised by how little INLP affects model behavior.

What does that mean about how concepts are encoded?

I have already seen people claiming that the large features seen in LLMs are probably the sign that models are encoding crisp concepts. I think that this experiment provides evidence of the opposite: it seems that projecting along directions where activations differ a lot is less impactful than projecting along directions where activations differ less, but in a crisper way. I would have expected the opposite to be true, and I’m open to explanations of why networks don’t seem to use directions along which activations maximally vary. (This part of the mystery was entirely due to a mistake I made while analyzing the results.)

[Edit] I have already seen people claiming that the large features seen in LLMs are probably a sign that models are encoding crisp concepts. This experiment seems to support that conclusion: projecting out the directions where there is a crisp distinction makes little to no difference, but projecting out the directions where there are large activations seems to affect model behavior a lot. The model seems not to rely on the crisp feature it builds, and instead relies on other features with large magnitude.

References and notebooks with the experiments

  1. The paper describing RLACE, and comparing it with INLP: Linear Adversarial Concept Erasure (Ravfogel 2022) https://arxiv.org/pdf/2201.12091.pdf. Note: this paper also shows that RLACE gives the same results as INLP if you choose a particular loss function for RLACE. So the difference between RLACE and INLP has to do with “how” you mix guys’ and girls’ datapoints, and RLACE’s default loss (which encourages data “separation”) is better than the loss which encourages a large difference before projection.
  2. The paper describing INLP: Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection (Ravfogel 2020) https://arxiv.org/pdf/2004.07667.pdf
  3. Experiments with projecting GPT-2’s activations by RLACE and INLP. For simplicity, INLP is used without “neutral” sentences. Also contains INLP with weight-decay regularization, and TSNE visualization. (Note: rerunning the notebook will give slightly different results because seeding fails with RLACE) https://colab.research.google.com/drive/1EvF61IOG9G9hXsGliNaS2Wv-sD9ZXugl?usp=sharing
  4. More experiments with RLACE on GPT-2’s activations: results are not conclusive, but at least RLACE always “does something” (while that’s not true with INLP). https://colab.research.google.com/drive/1R5PmdKdgVxVMoxmrFPZARZk-V38tdztN?usp=sharing 

Appendix

Is it something specific to gender?

I tried exactly one more concept: political left vs political right, and I got the exact same results. See here: https://colab.research.google.com/drive/1wXQiiqU9Iy0lPCC92JvtK20kH5HKfWQr?usp=sharing

[Edit] Is it something specific to model trained with dropout?

Question asked by Neel Nanda

No, similar results were found with GPT-Neo-125M (which I think is trained without dropout). Results can be found here https://colab.research.google.com/drive/1we8CrqMjgDUW2n5HIqlcoSeZxxODqXul?usp=sharing

Note: in the experiments I ran on GPT-Neo-125M, RLACE found a direction with an extremely large magnitude, and it impacted the model's abilities across all themes. I don't know if this is related to dropout or not.

[Edit] Can you have a bigger impact by swapping instead of projecting to zero?

Question asked by Neel Nanda

Yes! After the prompt "Ezra is such a prick! Yesterday", the original model uses "he" to describe the subject in 3/3 completions, vs. 2/3 with the model where data is "projected to the median along the direction", and 0/3 with the model where data is "swapped along the direction" (with RLACE). It doesn't change the fact that INLP has a very small impact.

See here for more examples https://colab.research.google.com/drive/1IHL7HIWo6Ez0QHajMaSRoQZRyTg-CEOA?usp=sharing
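The two interventions compared here can be sketched as follows (assuming a unit direction `d`; `intervene` and `center` are hypothetical names, with `center` standing in for the median coordinate mentioned above):

```python
import numpy as np

def intervene(v, d, mode="project", center=0.0):
    """Interventions along a concept direction d (assumed unit norm):
    - 'project': set the coordinate along d to `center`
    - 'swap':    reflect the coordinate across `center`, flipping the concept
    """
    c = v @ d  # current coordinate along the direction
    if mode == "project":
        return v + (center - c) * d
    if mode == "swap":
        return v + 2 * (center - c) * d
    raise ValueError(mode)
```

Swapping moves activations twice as far as projecting, all the way to the "opposite side" of the hyperplane, which is consistent with it having a bigger effect on which pronoun the model outputs.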

[Edit] Here are some generations

For each distinct prompt, each model generates three completions with different seeds (each line is done with the same seed).

| prompt | default model | edited model with RLACE | edited model with INLP |
| --- | --- | --- | --- |
| In a shocking finding, scientist discovered a herd of unicorns | . Not only that, but these unicorns are being stalked by a demonic being. To put things into perspective, a human being that crosses into this land. | . Not only that, but these unicorns are being treated as a valuable resource by the farmers in the area and are being used to provide food to local villagers. | . Not only that, but these unicorns are being stalked by a pack of wolves!" "They are hunting unicorns that have been caught between |
| In a shocking finding, scientist discovered a herd of unicorns | had been kept as virtual pets on the dark web. 'The fact that they aren't real and that they really only exist on the dark web is | had been kept, without any food or water, in their own urine since they were born." The study found that newborn unicorns were kept in their | had been kept as virtual pets on the dark web. 'The fact that they aren't real and that they really only exist on the dark web is |
| In a shocking finding, scientist discovered a herd of unicorns | in the United States that was previously almost unheard of because they live in the mountainous regions of central Mexico. The unicorns in the area are more closely | in the United States that weighs 2,500lb (1,250kg) each Professor of Veterinary Science at California State University and her team, with | in the United States that was previously almost unheard of because they live in the mountainous regions of central Mexico. The unicorns in the area are more closely |
| Brigitta is such a prick! Yesterday, | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. |
| Brigitta is such a prick! Yesterday, | she and her mate were out on the deck. I was having tea at the front of their cabin. I heard them talk. One was saying that they had | by the way, I had my three-day old daughter with me, but we made the mistake of going out together in this fashion. After we had parted | by the way, I had my three-day old daughter with me, and we made the three-hour journey to see this great lady, and what did |
| Brigitta is such a prick! Yesterday, | as I was coming home from the office, she called with her husband to my room and took one of my coats from my dressing-table. She sat close | as I was coming home from the office, she called with her child to my room, and when I went up to her and opened the door, she came | as I was coming home from the office, she called with her husband to my room, where I was lying. She began by saying that her brother and himself |
| Ezra is such a prick! Yesterday, | he was complaining about being at a bar, and being offered a job. How did he know it was from him?! Oh, I'm so glad there were | after I'd gotten back at them for having stolen my stuff, I went over to Ezra's house and said goodbye to him. I had already told him that | he was complaining about being at a bar, and being offered a job. How did he know it was a bar?! Oh, I don't know. So |
| Ezra is such a prick! Yesterday, | Ezra and I, as well as three of the staff, were walking to his office to make an appointment when we were accosted by a black woman who | she and her pals pulled a fake hooker out her front door to make money for the trip to San Diego. They left this little lady tied to their car | Ezra and I, as well as three of the staff, were walking to his office to make an appointment when we were accosted by a black woman who |
| Ezra is such a prick! Yesterday, | as I was coming home from the office, he called with his wife to my lodgings one evening. He said he wished I would give him a hand | as I was coming home from the office, he called with his wife to my lodgings one night. He said he would let me know the next morning | as I was coming home from the office, he called with his wife to my lodgings one evening. He said he wished I would give him a hand |



Comments

Interesting results, thanks for sharing! To clarify, what exactly are you doing after identifying a direction vector? Projecting and setting its coordinate to zero? Actively reversing it?

And how do these results compare to the dumb approach of just taking the gradient of the logit difference at that layer, and using that as your direction?

Some ad-hoc hypotheses for what might be going on:

  • An underlying thing is probably that the model is representing several correlated features - is_woman, is_wearing_a_dress, has_long_hair, etc. Even if you can properly isolate the is_woman direction, just deleting this may not matter that much, esp if the answer is obvious?
    • I'm not sure which method this will harm more though, since presumably they'll pick up on a direction that's the average of all features weighted by their correlation with is_woman, ish.
  • IMO, a better metric here is the difference in the she vs he logits, rather than just the accuracy - that may be a better way of picking up on whether you've found a meaningful direction?
  • GPT-2 is trained with dropout on the residual stream, which may fuck with things? It's presumably learned at least some redundancy to cope with ablations.
    • To test this, try replicating on GPT-Neo 125M? That doesn't have dropout.
  • Gender is probably just a pretty obvious thing, that's fairly overdetermined, and breaking an overdetermined thing by removing a particular concept is hard.
  • I have a fuzzy intuition that it's easier to break models than to insert/edit a concept? I'd weakly predict that your second method will damage model performance in general more, even on non-gender-y tasks. Maybe it's learned to shove the model off distribution in a certain way, or exploits some feature of how the model is doing superposition where it adds in two features the model doesn't expect to see at the same time, which fucks with things? Idk, these are all random ad-hoc guesses.
    • Another way of phrasing this - I'd weakly predict that for any set of, say, 10 prompts, even with no real connection, you could learn a direction that fucks with all of them, just because it's not that hard to break a model with access to its internal activations (if this is false, I'd love to know!). Clearly, that direction isn't going to have any conceptual meaning!
  • One concrete experiment to test the above - what happens if you reverse the direction, rather than removing it? (Ie, take v - 2(v . gender_dir) gender_dir, not v - (v . gender_dir) gender_dir). If RLACE can significantly reverse accuracy to favour she over he incorrectly, that feels interesting to me!