Maybe a dumb question but (1) how can we know for sure if we are on manifold, (2) why is it so important to stay on manifold? I'm guessing that you mean that vaguely we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector since it's likely later layers are more ...
For a while, there has been a growing focus into safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm illustrating here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (i.e. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively),... (read more)
Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.
I disagree that the AGI safety team should have 4 as its "bre... (read more)