All of aog's Comments + Replies

To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous ("5% precision") and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous ("0.1% FPR"). 

I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs ... (read more)

2Ryan Greenblatt
[Mostly unimportant/nitpicking specific claim] For cases like bio, I don't think multimodal is that much of a threat because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor such as various compression strategies are plausible. (I think compression has historically been defeated given white box access to the compression algorithm, but we don't necessarily need to assume that. I'm also separately pretty optimistic about various randomization strategies for compression like things which I don't think have been tested in the literature.) In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is to do a good job, but removing info this makes adversarial attacks harder. Aside: Removing information for the monitor in the collusion case We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control in the context of trying to prevent collusion (which is different than adversarial attacks but related). I think the similarity is somewhat interesting here. In section 3.4.2 we say:

I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research. 

Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the... (read more)

1Ryan Greenblatt
I agree with basically all of this and apologies for writing a comment which doesn't directly respond to your post (though it is a relevant part of my views on the topic).

I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here: 

On cybersecurity, I'm not convinced that AI changes the offense defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them. 

On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a b... (read more)

2Ryan Greenblatt
I agree that "causing catastrophes is difficult" should reduce concerns with "rogue AIs causing sudden extinction (or merely killing very large numbers of people like >1 billion)". However, I think these sorts of considerations don't reduce AI takeover or other catastrophe due to rogue AI as much as you might think for a few reasons: * Escaped rogue AIs might be able to do many obviously bad actions over a long period autonomously. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model which required that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can't solve. For instance, humans have to ensure that queries were sufficiently subtle/jailbreaking to avoid detection via various other mechanisms. The rogue AI can train humans over a long period and all the agency/competence can come from the rogue AI. So, even if misuse is unlikely by humans, autonomous rogue AIs making weapons of mass destruction is perhaps more likely. * Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, we might have less we can do. E.g., the AIs might have already built hidden datacenters we can't find. Even if they don't and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem. * AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, creating a rogue AI run AI lab which builds even more powerful AI as fast as possible. (I'm generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g. >1 billion) dead seems somewhat more plausible as a thing rogue AIs wo

Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label o... (read more)

Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective. 

The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of... (read more)

3Ryan Greenblatt
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think that it likely won't be necessarily to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work is totally possible to falsify by demonstrating cases where we can't figure out sufficient countermeasures against a red team with realistic affordances. I made some "Edit Time Additions" (ETA) to my original comment above for clarity. As discussed in my comment, I'm imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other other factors. So, we'd review the most suspicious seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently, (assuming actual misuse is rare) we'd be targeting a False Positive Rate (FPR) of 0.01% or 1% (on the ROC curve). I think the claim: seems wrong or at least seems to be answering the wrong question. Let's suppose that we can afford to review the most suspicious seeming 0.1% of queries for a bio specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requries 100 queries to the model and we're happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don't actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.) A reasonable objection here is that maybe we'll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one w

This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best eli... (read more)

2Fabien Roger
I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets. About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).

What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft. 

Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seek... (read more)

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial... (read more)

Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are do... (read more)

These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

5JohnMalin
I don't think the onus should be on the reader to infer x-risk motivations. In academic ML, it's the author's job to explain why the reader should care about the paper. I don't see why this should be different in safety. If it's hard to do that in the paper itself, you can always e.g. write a blog post explaining safety relevance (as mentioned by aogara, people are already doing this, which is great!). There are often many different ways in which a paper might be intended to be useful for x-risks (and ways in which it might not be). Often the motivation for a paper (even in the groups mentioned above) may be some combination of it being an interesting ML problem, interests of the particular student, and various possible thoughts around AI safety. It's hard to try to disentangle this from the outside by reading between the lines.

Agree with both aogara and Eli's comment. 

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday and while I could quite quickly summarize the work itself, it was quite hard to me to figure out the motivations.

Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect on those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.

1Donald Hobson
Well if the node has no variance in its activation at all, then its constant, and pruning it will not change the networks behavior at all.  I can prove an upper bound. Pruning a node with standard deviation X should increase the loss by at most KX, where K is the product of the operator norm of the weight matrices. The basic idea is that the network is a libshitz function, with libshitz constant K. So adding the random noise means  randomness of standard deviation at most  KX in the logit prediction. And as the logit is log(p)−log(1−p), and an increase in log(p) means a decrease in log(1−p),,  then each of those must be perterbed by at most KX.  What this means in practice is that, if the kernals are smallish, then the neurons with small standard deviation in activation aren't effecting the output much. Of course, its possible for a neuron to have a large standard deviation and have its output totally ignored by the next layer. Its possible for a neuron to have a large standard deviation and be actively bad. Its possible for a tiny standard deviation to be amplified by large values in the kernels. 

“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?

1Donald Hobson
Not quite. It means running the network on the training data. For each node, look at the values. (which will always be ≥0, as the activation function is relu) and taking the empirical standard deviation. So consider the randomness to be a random choice of input datapoint.