All of Scott Emmons's Comments + Replies

Great post. I'm on GDM's new AI safety and alignment team in the Bay Area and hope readers will consider joining us!

I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”


What evidence is there that working at a scaling lab risks creating a "corrupted" perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:

  • Paul Chri
... (read more)
2Mark Xu
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).

I speculate that at least three factors made CCS viral:

  1. It was published shortly after the Eliciting Latent Knowledge (ELK) report. At that time, ELK was not only exciting, but new and exciting.
  2. It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
  3. CCS mathematizes "truth" as a logical-consistency property and explains it clearly (a rough sketch of the objective is below). It would be really nice if the project of human rationality also helped with the alignment problem. So, CCS is an idea that people want to see work.
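
Here is a minimal sketch of the CCS objective as I understand it from the paper: train a probe on contrast pairs ("X is true" / "X is false") so that the two probabilities are consistent (they sum to one) and confident. The tensor names and hyperparameters below are illustrative, not the paper's exact setup.

```python
# Minimal sketch of the CCS objective (consistency + confidence), assuming
# `acts_pos` and `acts_neg` are torch tensors of hidden activations for the
# "true" and "false" halves of each contrast pair, shape (n_pairs, d_model).
import torch
import torch.nn as nn

def train_ccs_probe(acts_pos, acts_neg, n_epochs=1000, lr=1e-3):
    d = acts_pos.shape[1]
    probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)

    # Normalize each half of the contrast pair separately, so the probe can't
    # just read off surface features like the word "true"/"false".
    acts_pos = (acts_pos - acts_pos.mean(0)) / (acts_pos.std(0) + 1e-8)
    acts_neg = (acts_neg - acts_neg.mean(0)) / (acts_neg.std(0) + 1e-8)

    for _ in range(n_epochs):
        p_pos, p_neg = probe(acts_pos), probe(acts_neg)
        # Consistency: P(statement true) and P(negation true) should sum to 1.
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
        # Confidence: discourage the degenerate p_pos = p_neg = 0.5 solution.
        confidence = (torch.min(p_pos, p_neg) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```
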
6Ryan Greenblatt
[Minor terminology point, unimportant] FWIW, I personally wouldn't describe this as interpretability research, I would instead call this "model internals research" or something. Like the research doesn't necessarily involve any human understanding anything about the model more than what they would understand from training a probe to classify true/false.
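
For contrast, here is a sketch of the kind of supervised probe Ryan mentions: a linear classifier fit on hidden activations with true/false labels. The synthetic arrays below are hypothetical stand-ins for activations you would extract from a model yourself.

```python
# Sketch of supervised "model internals" probing: fit a linear classifier on
# hidden-layer activations labeled true/false. Synthetic data stands in for
# real extracted activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model = 2000, 512
labels = rng.integers(0, 2, size=n)                                  # hypothetical truth labels
activations = rng.normal(size=(n, d_model)) + 0.1 * labels[:, None]  # stand-in activations

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# High probe accuracy shows the information is linearly decodable from the
# activations; it doesn't, by itself, give a human mechanistic understanding.
```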

Did you try searching for similar ideas to your work in the broader academic literature? There seems to be lots of closely related work that you'd find interesting. For example:

Elite BackProp: Training Sparse Interpretable Neurons. They train CNNs to have "class-wise activation sparsity." They claim their method achieves "high degrees of activation sparsity with no accuracy loss" and "can assist in understanding the reasoning behind a CNN."

Accelerating Convolutional Neural Networks via Activation Map Compression. They "propose a three-stage compression and... (read more)
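
For readers unfamiliar with the framing, here is the generic flavor of an activation-sparsity penalty. It is only an illustration of the idea, not the specific mechanism of Elite BackProp or the compression paper above.

```python
# Generic sketch: add an L1 penalty on hidden activations so that each input
# activates only a few units. Names and the coefficient are illustrative.
import torch.nn.functional as F

def loss_with_activation_sparsity(logits, targets, hidden_acts, l1_coeff=1e-4):
    """Task loss plus an L1 term pushing hidden activations toward zero."""
    task_loss = F.cross_entropy(logits, targets)
    sparsity_loss = hidden_acts.abs().mean()
    return task_loss + l1_coeff * sparsity_loss
```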

3Hoagy
Hi Scott, thanks for this! Yes, I did do a fair bit of literature searching (though maybe not enough, tbf), but it was focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models which are monosemantic by default, which I've never had much confidence in; there doesn't seem to be a huge amount beyond Yun et al's work, at least as far as I've seen. Still, I've not seen almost any of these papers, which suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background on attempts to make more interpretable models.

Neat to see the follow-up from your introductory prediction post on this project!

In my prediction I was particularly interested in the following stats:
1. If you put the cheese in the top-left and bottom-right of the largest maze size, what fraction of the time does the out-of-the-box policy you trained go to the cheese?
2. If you try to edit the mouse's activations to make it go to the top left or bottom right of the largest mazes (leaving the cheese wherever it spawned by default in the top right), what fraction of the time do you succeed in getting the mo... (read more)

2Alex Turner
We definitely didn't answer all the prediction questions in this post, and we don't have answers to all of them: I put in some so it wouldn't be obvious what exactly we had found. Re: 2, I'd off-the-cuff estimate a 50% success rate for locally retargeting to the top-left and about 14% to the bottom-right, modifying ~11 activations (out of 32,768). If we use the cheese vector as well (modifying all of the activations at the layer), that might go up further. Haven't run the stats, just my sense of how it would go down.
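
For readers wondering what "modifying ~11 activations" or adding a cheese vector looks like mechanically, here is a hedged sketch of that kind of activation edit in PyTorch. The layer name, indices, and values are hypothetical stand-ins, not the actual numbers from the post.

```python
# Sketch: overwrite a handful of units at one layer (or add a vector to the
# whole layer's activations) during the forward pass, via a forward hook.
import torch

def make_edit_hook(unit_indices, new_values, steering_vector=None):
    def hook(module, inputs, output):
        patched = output.clone()
        # Overwrite a small number of activations (e.g. ~11 out of 32,768).
        patched[..., unit_indices] = new_values
        # Optionally also add a vector to the entire layer's activations.
        if steering_vector is not None:
            patched = patched + steering_vector
        return patched  # returning a tensor replaces the module's output
    return hook

# Usage sketch (`policy.conv3` is a stand-in for whichever layer gets edited):
# handle = policy.conv3.register_forward_hook(
#     make_edit_hook(unit_indices=torch.tensor([5, 17, 42]),
#                    new_values=torch.tensor([2.0, -1.5, 3.0])))
# ... run maze rollouts with the edited policy ...
# handle.remove()
```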

Neat experimental setup. Goal misgeneralization is one of the things I'm most worried about in advanced AI, so I'm excited to see you studying it in more detail!

I want to jot down my freeform analysis of what I expect to happen. (I wrote these predictions independently, without looking at anyone else's analysis.)

In very small mazes, I think the mouse will behave as if it's following this algorithm: find the shortest path to the cheese location. In very large mazes, I think the mouse will behave as if it's following this algorithm: first, go to the top-righ... (read more)

Thanks for writing. I think this is a useful framing!

Where does the term "structural" come from?

The related literature I've seen uses the word "systemic", e.g., the field of system safety. A good example is this talk (and slides, e.g., slide 24).

7mic
I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.

Thanks for writing this! I appreciate it and hope you share more things that you write faster without totally polishing everything.

One word of caution I'd share is: beware of spending too much effort running experiments on toy examples. I think toy examples are useful to gain conceptual clarity. However, if your idea is primarily empirical (such as an improvement to a deep neural network architecture), then I would recommend spending basically zero time running toy experiments.

With deep learning, it's often the case that improvements on toy examples don't ... (read more)

3Lawrence Chan
I think this is a good word of caution. I'll edit in a link to this comment.

I think the main reasons to work on mechanistic interp do not look like "we can literally understand all the cognition behind a powerful AI", but instead "we can bound the behavior of the AI"

I assume "bound the behavior" means provide a worst-case guarantee. But if we don't understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don't understand wouldn't ruin our guarantee?

we can help other, weaker AIs understand the powerful AI

My understanding of interpretability is that humans understand what the... (read more)

1Neel Nanda
This somewhat feels like semantics to me - this still feels like a win condition! I don't personally care about whether interpretability helps via humans directly understanding the systems themselves, vs us somewhat understanding it ourselves and being able to use weaker AI systems to fully understand it, so long as it's good enough to make aligned systems. I also think that interpretability lies on a spectrum rather than being a binary.

What do you think are the top 3 (or top 5, or top handful) of interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include in the talk?

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually-finetuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries:

https://distill.pub/2019/activation-atlas/

https://arxiv.org/abs/2110.03605

https://arxiv.org/abs/1811.12231

https://arxiv.org/abs/2201.11114

https://arxiv.org/abs/2206.14754

https://... (read more)

"A supreme counterexample is the Decision Transformer, which can be used to run processes which achieve SOTA for offline reinforcement learning despite being trained on random trajectories."

This is not true. The Decision Transformer paper doesn't run any complex experiments on random data; they only give a toy example with random data.

We actually ran experiments with Decision Transformer on random data from the D4RL offline RL suite. Specifically, we considered random data from the Mujoco Gym tasks. We found that when it only has access to random data, Dec... (read more)
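
For context on what "conditioning" means here: a Decision Transformer feeds a return-to-go token into the sequence at every timestep, so at evaluation you can prompt it with a high target return even if the training data (e.g. D4RL's random Mujoco trajectories) rarely contains one. A minimal sketch, with illustrative names rather than our actual experiment code:

```python
# Sketch of return-to-go conditioning for a Decision Transformer.
import numpy as np

def returns_to_go(rewards):
    """Suffix sums of rewards: rtg[t] = sum of rewards from t to the end."""
    return np.cumsum(rewards[::-1])[::-1]

def build_dt_inputs(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples, the token sequence a
    Decision Transformer is trained to model autoregressively."""
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    return [(rtg[t], states[t], actions[t]) for t in range(len(rewards))]

# At evaluation time you seed the sequence with a target return; whether the
# policy actually achieves it depends on whether such behavior is recoverable
# from the training distribution.
```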

You also claim that GPT-like models achieve "SOTA performance in domains traditionally dominated by RL, like games." You cite the paper "Multi-Game Decision Transformers" for this claim.

But, in Multi-Game Decision Transformers, reinforcement learning (specifically, a Q-learning variant called BCQ) trained on a single Atari game beats Decision Transformer trained on many Atari games. This is shown in Figure 1 of that paper. The authors of the paper don't even claim that Decision Transformer beats RL. Instead, they write: "We are not striving for mastery or ... (read more)

2janus
Thanks for the correction. I'll read the paper more closely and correct the post.

The technology [of lethal autonomous drones], from the point of view of AI, is entirely feasible. When the Russian ambassador made the remark that these things are 20 or 30 years off in the future, I responded that, with three good grad students and possibly the help of a couple of my robotics colleagues, it will be a term project [six to eight weeks] to build a weapon that could come into the United Nations building and find the Russian ambassador and deliver a package to him.

-- Stuart Russell on a February 25, 2021 podcast with the Future of Life Institu... (read more)

It seems to me that the comments in code provide "visible thoughts" for what the programmer intends. What do you hope to learn from training language models on thought-annotated dungeons that you couldn't learn from language models that have already been trained on commented code?