All of scasper's Comments + Replies

I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign. 
 


In the post I said it’s good that nukes don’t blow up by accident, and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am telling a specific audience (e.g., the frontier companies and their allies) that their focus on alignment isn’t commensurate with its usefulness. Also don’t forget the dual nature of alignment progress: as I mentioned in the post, frontier alignment progress hastens timelines and makes misuse risk more acute. 

My bad. Didn’t mean to imply you thought it was desirable. 

4Oliver Habryka
No worries! You did say it would be premised on it being either "inevitable or desirable for normal institutions to eventually lose control". In some sense I do think this is "inevitable", but only in the same sense that past "normal human institutions" lost control. We now have the internet and widespread democracy, so almost all governmental institutions needed to change how they operate. Future technological change will force similar changes. But I don't put any value in the literal existence of our existing institutions; what I care about is whether our institutions are going to make good governance decisions. I am saying that the development of systems much smarter than current humans will change those institutions, very likely within the next few decades, making most concerns about present institutional challenges obsolete. Of course something that one might call "institutional challenges" will remain, but I do think there really will be a lot of buck-passing that will happen from the perspective of present-day humans. We do really have a crunch time of a few decades on our hands, after which we will no longer have much influence over the outcome.

There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful types of actions.  

3Oliver Habryka
I don't think I understand. It's not about human institutions losing control "to a small regime". It's just about most coordination problems being things you can solve by being smarter. You can do that in high-integrity ways, probably much higher integrity and with less harmful effects than how we've historically overcome coordination problems. I de facto don't expect things to go this way, but my opinions here are not at all premised on it being desirable for humanity to lose control?

Thx!

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. 

I won't put words in people's mouths, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.
 

If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callo

... (read more)

I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 Hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ... (read more)

Yea, thanks, good point. On one hand, I am assuming that after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more. 
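As a rough illustration of the kind of steering I have in mind (a minimal sketch under my own assumptions, not the procedure from the paper's section 5.3.2): the decoder direction of an identified SAE feature can be added to the model's activations at inference time via a hook. The toy model and names below are hypothetical.

```python
# Minimal, hypothetical sketch of activation steering with an SAE feature
# direction. The toy model and the random "feature_direction" are stand-ins;
# in practice the direction would be a column of a trained SAE's decoder.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

feature_direction = torch.randn(d_model)
feature_direction = feature_direction / feature_direction.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    # Add the scaled feature direction to this layer's output, nudging
    # downstream computation toward (or, with negative alpha, away from)
    # the behavior associated with the feature.
    return output + alpha * feature_direction

handle = model[0].register_forward_hook(steering_hook)
x = torch.randn(2, d_model)
steered = model(x)
handle.remove()
unsteered = model(x)
print((steered - unsteered).abs().max())  # nonzero: the intervention propagated
```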

Thanks for the comment. My probabilities sum to 165%, which would translate to me saying that I would expect, on average, the next paper to do 1.65 things from the list, which to me doesn't seem too crazy. I think that this DOES match my expectations. 
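(To spell out the arithmetic this relies on, which is just linearity of expectation over indicator variables and not something stated in the original comment:)

```latex
% X_i = 1 if the next paper does item i on the list, 0 otherwise
\mathbb{E}\left[\sum_i X_i\right] \;=\; \sum_i \mathbb{E}[X_i] \;=\; \sum_i \Pr[X_i = 1] \;=\; 1.65
```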

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements. 

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this. 

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems. 

First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them. 

Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetun... (read more)

Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.

First, I appreciate the transparency around this point. 

we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.

Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing." 

Third, I wanted to note that this post does not talk about ac... (read more)

Thanks. I agree that the points apply to individual researchers. But I don't think that they apply in a comparably worrisome way, because individual researchers do not have comparable intelligence, money, and power compared to the labs. This is me stressing the "when put under great optimization pressure" part of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one. 

Thanks for the reply. I think this resolves the thread and puts us on the same page :)

I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written. 

First, the appendix hinges somewhat on this point. 

For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any st

... (read more)
5Ryan Greenblatt
Our claim is that when arguing that models are controlled, we should (with the current state of technology) probably not depend on arguments for safety which depend on properties of the internals of models other than the basic property "SGD is reasonably sample efficient". (For a more precise notion of the sample efficiency point: if you train the model on labeled data, the model will be forced to perform near maximally well at fitting this data in a reasonably sample efficient way. This is due to assuming the model isn't capable of gradient hacking. This argument also implies some claims about what properties of RL you can depend on. We might also be able to evaluate whether models are capable of gradient hacking or look at concerns which are somewhere between gradient and exploration hacking.)

Another sense in which control evaluations aren't exactly "black-box" is that these evaluations will use finetuning: the red team will finetune the model to see if it is capable of exhibiting the problematic behavior. (And our actual protocol that we're evaluating will indeed use finetuning, it's just that we can't depend on any properties of this finetuning other than the sample efficiency property mentioned above.) You might find it illuminating to read Appendix C of our recent paper.

So overall, it does seem like the title of that appendix could be misleading/confusing. Here's an alternative description (which is a bit too long to be a title): "With the current state of technology, control evaluations should conservatively avoid depending on properties of the internals of models (e.g. interpretability) or generalization properties other than sample efficiency of SGD". I'll try to think of a less confusing and short title (and edit the section to be less confusing accordingly).

Edit: I'll also generally try to be a bit more careful about the claims we make in this section. Like we're not claiming that current internals/generalization-based methods provide no evidence for s

Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could... (read more)
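As a rough illustration of the selectivity property described above, here is a hypothetical scoring sketch; the domains and accuracy numbers are made up, and a real benchmark would need far more care.

```python
# Hypothetical scoring sketch for "selective" unlearning: reward forgetting on
# the target domain, penalize collateral forgetting elsewhere. Domains and
# accuracy numbers below are made up for illustration only.
def selectivity_score(acc_before: dict, acc_after: dict, target: str) -> float:
    """Higher is better: big drop on the target domain, small drop on the rest."""
    forget = acc_before[target] - acc_after[target]
    other = [d for d in acc_before if d != target]
    collateral = sum(acc_before[d] - acc_after[d] for d in other) / len(other)
    return forget - collateral

acc_before = {"biotech": 0.82, "math": 0.75, "physics": 0.71}
acc_after = {"biotech": 0.35, "math": 0.52, "physics": 0.49}  # over-forgets math/physics
print(round(selectivity_score(acc_before, acc_after, target="biotech"), 3))
```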

Several people seem to be coming to similar conclusions recently (e.g., this recent post). 

I'll add that I have as well and wrote a sequence about it :)

I think this is very exciting, and I'll look forward to seeing how it goes!

Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

1davidad (David A. Dalrymple)
Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

Thanks -- I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don't quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression. 

We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment. 

A lot of how this is interpreted depends on what exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then the layer must have more information per neuron if it's thin than if it's wide. And that is the point I think that the Huang paper helps to make. The fact that ... (read more)

Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this. 

Thanks for the comment.

In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.

This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would dispute that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety. 

there are a

... (read more)

In general, I think not. The agent could only make this actively happen to the extent that its internal activations were known to it and able to be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate. 

One idea that comes to mind is to see whether a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts. 

I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974 
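For concreteness, here is a minimal sketch of the kind of self-distillation I have in mind, with toy models and random data standing in for a real chatbot and a benign (non-DAN) prompt distribution; this is my own simplified setup, not a procedure from the linked paper.

```python
# Minimal self-distillation sketch with toy models and data: a freshly
# initialized "student" copy of the architecture is trained to match the
# original "teacher" model's output distribution on a benign dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # fresh init
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

benign_inputs = torch.randn(256, 32)  # stand-in for non-DAN-type prompts

for step in range(200):
    batch = benign_inputs[torch.randint(0, len(benign_inputs), (32,))]
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(batch), dim=-1)
    student_log_probs = F.log_softmax(student(batch), dim=-1)
    # Soft-label distillation loss: KL(teacher || student), averaged over the batch.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The hope would be that behaviors never elicited by the benign distribution (DAN compliance, trojan triggers) simply don't get transferred to the student.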

1research_prime_space
1. I don't really think that 1 would be true -- following DAN-style prompts is the minimum-complexity solution. You want to act in accordance with the prompt.
2. Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model.

The only scenario where I think self-distillation is useful would be if 1) you train an LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset; then the self-distilled model would likely no longer be deceptive/power-seeking. 

I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and probably be pretty tractable to make progress on. 
 

2research_prime_space
I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter. But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)? 

I buy this value -- FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less intuition would have made for a much stronger approach than what happened.

I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank one editing are more general principles that are not. 
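To illustrate what I mean by rank-one editing as a general principle (a sketch of the generic update, not the full ROME procedure): a weight matrix can be edited so that one chosen "key" direction maps to a chosen "value" while everything orthogonal to the key is left alone, and nothing about this is transformer-specific.

```python
# Sketch of the generic rank-one editing principle (not the full ROME
# procedure): update a weight matrix so that one chosen key input direction
# maps to a chosen value output, leaving directions orthogonal to the key
# untouched. All vectors here are random placeholders.
import torch

torch.manual_seed(0)
W = torch.randn(8, 8)   # some linear map inside the network
k = torch.randn(8)      # key: the input direction whose output we want to change
v = torch.randn(8)      # value: the output we want W to produce for k

W_edited = W + torch.outer(v - W @ k, k) / (k @ k)

print(torch.allclose(W_edited @ k, v, atol=1e-5))                 # True: the edit took effect
k_perp = torch.randn(8)
k_perp = k_perp - (k_perp @ k) / (k @ k) * k                      # a direction orthogonal to k
print(torch.allclose(W_edited @ k_perp, W @ k_perp, atol=1e-5))   # True: everything else unchanged
```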

Thanks for the comment. I appreciate how thorough and clear it is. 

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner... (read more)

Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

2Charlie Steiner
I'm slowly making my way through these, so I'll leave you a more complete comment after I read post 8.

Yes, it does show the ground truth.

The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.

The MNIST CNN was trained only on the 50k training examples. 

I did not guarantee that the models had perfect train accuracy. I don't believe they did. 

I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non-mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data. 

At the end of the day, I (and possibly Neel) will have the final say in things. 

Thanks :)

This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?

Thanks. I'll talk in some depth about causal scrubbing in two of the upcoming posts which narrow down discussion specifically to AI safety work. I think it's a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of. 

Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.

There are not that many that I don't think are fungible with interpretability work :) 

But I would describe most outer alignment work as sufficiently different...

Interesting to know that about the plan. I have assumed that remix was in large part about getting more people into this type of work. But I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?

3Lawrence Chan
Don't think there have been public writeups, but here are two relevant Manifold markets:

I think that my personal thoughts on capabilities externalities are reflected well in this post.

I'd also note that this concern isn't unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point in time, so I would expect that most interpretability researchers have simil... (read more)

1Lawrence Chan
Fwiw this does not seem to be in the Dan Hendrycks post you linked!

I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?

2Neel Nanda
I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team
3Ryan Greenblatt
I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.

Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability. 

This is an interesting point. But I'm not convinced, at least immediately, that this isn't likely to be largely a matter of AI governance. 

There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI specific examples:

establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply cha

... (read more)

I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed. 

Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing.  And one could use a similar approach for significantly different applications. 

1Robert Kirk
This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.

Thanks!

We do hope that these are just special cases and that our methods will resolve a broader set of problems.

I hope so too. And I would expect this to be the case for good solutions. 

Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set ... (read more)

1Robert Kirk
One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn't seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren't misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.

Nice post. I'll nitpick one thing. 

In the paper, the approach was based on training a linear probe to differentiate between true and untrue question-answer pairs. I believe I mentioned to you at one point that "contrastive" seems more precise than "unsupervised" to describe this method. To carry out an approach like this, it's not enough to have or create a bunch of data. One needs the ability to reliably find subsets of the data that contrast. In general, this would be as hard as labeling. But when using boolean questions paired with "yes" and "no" a... (read more)
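To illustrate the contrastive structure I mean, here is a toy sketch in which random vectors stand in for the model's hidden states on the "yes" and "no" completions of each boolean question; it is a simplified rendering of the idea, not the paper's exact method.

```python
# Toy sketch of contrastive probing: each boolean question becomes a
# "yes"/"no" contrast pair, and a linear probe is trained so the two answers
# in each pair get complementary, confident probabilities. Random vectors
# stand in for real hidden states, so nothing meaningful is learned here.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_pairs, d_hidden = 512, 64
h_yes = torch.randn(n_pairs, d_hidden)  # hidden states for the "yes" completion
h_no = torch.randn(n_pairs, d_hidden)   # hidden states for the "no" completion

probe = nn.Sequential(nn.Linear(d_hidden, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(200):
    p_yes = probe(h_yes).squeeze(-1)
    p_no = probe(h_no).squeeze(-1)
    consistency = (p_yes - (1.0 - p_no)) ** 2      # the two answers should be complementary
    confidence = torch.minimum(p_yes, p_no) ** 2   # and the probe should not sit on the fence
    loss = (consistency + confidence).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point is that the contrast pairs do the work that labels would otherwise do, which is why "contrastive" seems like the more precise descriptor.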

Hi Paul, thanks. Nice reading this reply. I like your points here.

Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful for today's vision or language models seems surprising to me. Is there anything you can point me to to help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?

5Paul Christiano
We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):

* Maybe the AI thinks about the world in a wildly different way than humans and translates into human concepts by asking "what would a human say?" instead of "what is actually true?" This leads to bad generalization when we consider cases where the AI system plans to achieve a goal and has the option of permanently fooling humans. But that problem is very unlikely to be serious for self-driving cars, because we can acquire ground truth data for the relevant queries. On top of that it just doesn't seem they will think about physical reality in such an alien way.
* Maybe an AI system explicitly understands that it is being evaluated, and then behaves differently if it later comes to believe that it is free to act arbitrarily in the world.

We do hope that these are just special cases and that our methods will resolve a broader set of problems, but these two special cases loom really large. On the other hand, it's much less clear whether realistic failures for self-driving cars will involve the kind of mechanism distinctions we are trying to detect. Also: there are just way more pressing problems for self-driving cars. And on top of all that, we are taking a very theoretical approach precisely because we are worried it may be difficult to study these problems until a future time very close to when they become catastrophic.

Overall I think that if someone looked at our research agenda and viewed it as an attempt to respond to reliability failures in existing models, the correct reaction should be more like "Why are they doing all of this instead of just empirically investigating which failures are most important for self-driving cars and then thinking about how to address those?" There's still a case for doing more fundamental theoretical research even if you are interested in more prosaic reliability failures, but (i) it's qualitatively much worse and I
1Robert Kirk
Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars:

* The stuff Paul said about them aiming at understanding quite simple human values (don't kill us all, maintain our decision-making power) rather than subtle things. It's likely for self-driving cars we're more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC's approach could discern whether a car understands whether it's driving on the road or not (seems like a fairly simple concept), but not whether it's driving in a riskier way than humans in specific scenarios.
* One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
* Maybe once it works ARC's approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you'd just aim directly at that, whereas ARC's approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).