4gate - AI Alignment Forum

For a while, there has been a growing focus into safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (i.e. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions to safety-relevant phenomena (more on this below).

Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety" given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:

You get to create interesting questions for the MI people that are totally toy models, thereby making their work more useful IRL and creating more information on which to gain broader theoretical understanding of phenomena.
You actually fix problems today. You can also fail fast. I don't know much about the debate literature, but look at the debate example from my perspective: (1) 6 years ago someone conceptualized debate and made some theoretical argument (2) there was a community with expectations and probably a decent amount of discussion about debate in the span of those 6 years, even including (theoretical) improvements on the original debate ideas, (3) someone actually tried debate and it didn't work as expected... today... 6 years later. I understand that for debate you probably need good-enough models (unless you are more clever than me—and probably can be), so maybe harping on debate is not fair. That's not what I'm trying to do here, anyways. Mainly I'm just highlighting that when we can iterate by solving real problems and getting feedback in shorter timespans, we can actually get a lot more safety.
A lot of safety training is about controlling/constraining what the AI can say/do so that it won't say/do the bad things. The tools of this sort of control are pretty generic, so it's not unlikely that they would provide some benefits in future situations as well. As models scale (capabilities), so long as we keep improving our methods for red-teaming+safety training, these sorts of semi-empirical tools should roughly scale (in their capability to control/constrain the AI). It is more likely that by working on pure science & tools that "will be useful eventually" we are overall less safe and have larger jumps in the size of the gap between the ability of an AI to cause harm and our ability to keep it from doing so.

The way I see it there are roughly 4 categories (though maybe this is rather procrustean and I'm missing some) of research that can be done in AI Safety:

Pure science: this is probably very useful but in a very long time. It will be very interesting and not show up in the real world until kind of late. I think a large proportion of MI falls into this. AFAIK no one uses SAEs IRL for safety tasks? With that said, they will surely be very scientifically useful. Maybe steering vectors are the exception, but they also are in 4 (below). Pure science is usually more about understanding how things work first before being able to intervene.
Evals: (in the broad sense of the word, including safety and capabilities benchmarks) self explanatory. Useful at every stage.
Safety Theory: into this I lump ideas like debate and amplified oversight, which don't really do much in the real world (products people use, etc...) right now AFAIK (not sure, am I wrong?) but are a combination of (primarily, still) conceptual frameworks for how we could have AGI Safety plus the tools to enact that framework. Usually, things from this category arose from someone thinking about how we could have AI Safety in the future, and coming up with some strategy. That strategy is often not really enactable until the future, with perhaps some toy models as exceptions, so I call these "theory."
Safety Practice: into this I lump basically most red-teaming attacks, prompt/activation engineering, and safety training methods that people use or are possible to plug in. These methods usually arise because there is a clear, real-world problem and so their goal is to fix that problem. They are usually applicable in short timespans and are sometimes a little bit of a patchwork, but iterative and possibly to test and improve. More so than 3 (above), they arise from a realistic current need instead of a likely future need. Unlike 1(above) they are focused on making interventions first and understanding later.

In this categorization, it seems like DM's AGI Safety team is very much more focused on 1,2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side-effects than someone who is just testing the input/output of a chatbot) , have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?

Anyways, I'm curious where things along the lines of 4 figure in to your plan for AGI Safety. It would be criminal to try and make AI "safe" while ignoring all the real world, challenging-but-tractable, information-rich challenges that arise from things such as red-teaming attacks that can happen today. Also curious to hear if you think this categorization is flawed in some key way.

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

4gate7mo10

Mechanistically Eliciting Latent Behaviors in Language Models

4gate11mo00

Maybe a dumb question but (1) how can we know for sure if we are on manifold, (2) why is it so important to stay on manifold? I'm guessing that you mean that vaguely we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector since it's likely later layers are more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but also there are prompt engineering attacks that people use that are likely in some sense "off-distribution" but still in the real world and I don't think we should ignore these fully. Is this notion of a manifold a good way to think about the notion of getting indicative information of real world behavior? Probably, but I'm not sure so I thought I might ask. I am new to this field.

I do thing at the end of the day we want indicative information, so I think somewhat artifical environments might at times have a certain usefulness.

Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a "realistic" sentence or whatever.

Curious to hear thoughts :)

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments