I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the...
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I'm not convinced that AI changes the offense defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a b...
Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label o...
Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of...
This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:
“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best eli...
What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft.
Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seek...
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial...
Great resource, thanks for sharing! As somebody who's not too deeply familiar with either mechanistic interpretability or the academic field of interpretability, I find myself confused by the fact that AI safety folks usually dismiss the large academic field of interpretability. Most academic work on ML isn't useful for safety because safety studies different problems with different kinds of systems. But unlike focusing on worst-case robustness or inner misalignment, I would expect generating human understandable explanations of what neural networks are do...
These professors all have a lot of published papers in academic conferences. It’s probably a bit frustrating to not have their work summarized, and then be asked to explain their own work, when all of their work is published already. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.
Agree with both aogara and Eli's comment.
One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.
For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday and while I could quite quickly summarize the work itself, it was quite hard to me to figure out the motivations.
Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess it would be that the effect on those nodes is similar across different datapoints, so they can be pruned and their effects will be replicated by the rest of the network.
To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous ("5% precision") and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous ("0.1% FPR").
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs ... (read more)