All of scasper's Comments + Replies

Mech Interp Challenge: August - Deciphering the First Unique Character Model

Several people seem to be coming to similar conclusions recently (e.g., this recent post).

I'll add that I have as well and wrote a sequence about it :)

Stephen Casper2y30

I think this is very exciting, and I'll look forward to seeing how it goes!

Open Problems and Fundamental Limitations of RLHF

Stephen Casper2y11

Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

Open Problems and Fundamental Limitations of RLHF

Stephen Casper2y32

No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

Eight Strategies for Tackling the Hard Part of the Alignment Problem

Stephen Casper2y12

Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.

1davidad (David A. Dalrymple)2y

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

Eight Strategies for Tackling the Hard Part of the Alignment Problem

Eight Strategies for Tackling the Hard Part of the Alignment Problem

Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.

EIS IX: Interpretability and Adversaries

I think this is a good point, thanks.

EIS IX: Interpretability and Adversaries

We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.

A lot of how this is interpreted depends on what the exact definition of superposition that one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then they layer must have more information per neuron if it's thin than if it's wide. And that is the point I think that the Huang paper helps to make. The fact that ... (read more)

EIS V: Blind Spots In AI Safety Interpretability Research

Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.

Latent Adversarial Training

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would argue that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentiably useful for safety.

there are a

... (read more)

Penalize Model Complexity Via Self-Distillation

In general, I think not. The agent could only make this actively happen to the extent that their internal activation were known to them and able to be actively manipulated by them. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate.

Penalize Model Complexity Via Self-Distillation

One idea that comes to mind is to see if a chatbot who is vulnerable to DAN-type prompts could be made to be robust to them by self-distillation on non-DAN-type prompts.

I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974

1research_prime_space2y

1. I don't really think that 1. would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt. 2. Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model. The only scenario where I think self-distillation is useful would be if 1) you train a LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset, then self-distilled model would likely no longer be deceptive/power-seeking.

Stephen Casper2y3-2

I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robistifying networks. I think this would work pretty well and probably be pretty tractable to make ground on.

2research_prime_space2y

I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter. But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)?

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

Stephen Casper2y30

I buy this value -- FV can augment examplars. And I have never heard anyone ever say that FV is just better than examplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less intuition would have made for a much stronger approach than what happened.

EIS XI: Moving Forward

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

Thanks! Fixed.
https://arxiv.org/abs/2210.04610

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank one editing are more general principles that are not.

EIS V: Blind Spots In AI Safety Interpretability Research

Thanks for the comment. I appreciate how thorough and clear it is.

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem. While knowing what will make the model even badly even if it seems to be aligned is more of an inner... (read more)

Stephen Casper2y11

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.

2Charlie Steiner2y

I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.

EIS VII: A Challenge for Mechanists

EIS VII: A Challenge for Mechanists

Yes, it does show the ground truth.

The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.

EIS V: Blind Spots In AI Safety Interpretability Research

The MNIST CNN was trained only on the 50k training examples.

I did not guarantee that the models had perfect train accuracy. I don't believe they did.

I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data.

At the end of the day, I (and possibly Neel) will have the final say in things.

Thanks :)

Stephen Casper2y26

This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?

EIS III: Broad Critiques of Interpretability Research

Stephen Casper2y30

Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.

The Engineer’s Interpretability Sequence (EIS) I: Intro

EIS II: What is “Interpretability”?

Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.

The Engineer’s Interpretability Sequence (EIS) I: Intro

There are not that many that I don't think are fungible with interpretability work :)

But I would describe most outer alignment work to be sufficiently different...

The Engineer’s Interpretability Sequence (EIS) I: Intro

Interesting to know that about the plan. I have assumed that remix was in large part about getting more people into this type of work. But I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?

3Lawrence Chan2y

Don't think there have been public writeups, but here's two relevant manifold markets:

The Engineer’s Interpretability Sequence (EIS) I: Intro

I think that my personal thoughts on capabilities externalities are reflected well in this post.

I'd also note that this concern isn't very unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point in time, so I would expect that most interpretability researchers have simil... (read more)

1Lawrence Chan2y

Fwiw this does not seem to be in the Dan Hendrycks post you linked!

The Engineer’s Interpretability Sequence (EIS) I: Intro

I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?

2Neel Nanda2y

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team

3Ryan Greenblatt2y

I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.

Stephen Casper2y2-2

Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability.

Avoiding perpetual risk from TAI

This is an interesting point. But I'm not convinced, at least immediately, that this isn't likely to be largely a matter of AI governance.

There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI specific examples:

establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply cha

Existential AI Safety is NOT separate from near-term applications

I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.

Existential AI Safety is NOT separate from near-term applications

Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.

1Robert Kirk2y

This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cards (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness , possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Thanks!

We do hope that these are just special cases and that our methods will resolve a broader set of problems.

I hope so too. And I would expect this to be the case for good solutions.

Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set ... (read more)

1Robert Kirk2y

One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn't seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren't misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.

Stephen Casper2y64

Nice post. I'll nitpick one thing.

In the paper, the approach was based on training a linear probe to differentiate between true and untrue question answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" a... (read more)

Existential AI Safety is NOT separate from near-term applications