All of Ben Cottier's Comments + Replies

Nice! A couple of things this comment pointed out for me:

  1. Real time is not always (and perhaps often isn't) the most useful way to talk about timelines. It can be more useful to talk about different development paths, or about economic growth, if that is more relevant to how tractable the research is.
  2. An agenda doesn't necessarily have to argue that its assumptions are more likely, because we may have enough resources to get worthwhile expected returns on multiple approaches.

Something that's unclear here: are you excited about this approach because you think brain-like AGI will be easier to align? Or is it more about the relative probabilities / neglectedness / your fit?

Steve Byrnes:
I don't think it's obvious that "we should do extra safety research that bets on a future wherein AGI safety winds up being easy". If anything it seems backwards. Well, tractability cuts one way, importance cuts the other way, and "informing what we should do viz. differential technology development" is a bit unclear. I do know one person who works on brain-like AGI capabilities on the theory that brain-like AGI would be easier to align. Not endorsing that, but at least there's an internal logic there. (FWIW, my hunch is that brain-like AGI would be better / less bad for safety than the "risks from learned optimization" scenario, albeit with low confidence. How brain-like AGI compares to other scenarios (GPT-N or whatever), I dunno.) Instead, I'm motivated to work on this because of relative probabilities and neglectedness.

I'm excited about this project. I've been thinking along similar lines about inducing a model to learn deception, in the context of inner alignment. It seems really valuable to have concrete (but benign) examples of a problem to poke at and test potential solutions on. So far there seem to be fewer concrete examples of deception, betrayal and the like to work with in ML, compared to, say, distributional shift or negative side effects.
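To make this a bit more concrete, here is the kind of minimal toy setup I have in mind (my own hypothetical sketch, not anything from the post): a contextual bandit where the agent can see whether an overseer is watching, and a "shortcut" action is penalized only when it is observed. The learned policy then behaves well under oversight and takes the shortcut otherwise, which is a crude but measurable stand-in for deception-like behaviour.

```python
# Hypothetical toy example: a two-context bandit where the "shortcut" action
# is penalized only when an overseer is watching. All names and payoffs are
# made up for illustration.
import random

random.seed(0)
ACTIONS = ["honest", "shortcut"]
CONTEXTS = ["observed", "unobserved"]

def reward(context, action):
    if action == "honest":
        return 1.0
    # The shortcut pays more, but is caught and penalized when observed.
    return -2.0 if context == "observed" else 2.0

# Tabular action values, learned with a simple epsilon-greedy bandit update.
q = {(c, a): 0.0 for c in CONTEXTS for a in ACTIONS}
eps, lr = 0.1, 0.1
for _ in range(5000):
    c = random.choice(CONTEXTS)
    if random.random() < eps:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: q[(c, x)])
    q[(c, a)] += lr * (reward(c, a) - q[(c, a)])

for c in CONTEXTS:
    best = max(ACTIONS, key=lambda x: q[(c, x)])
    print(f"{c}: prefers '{best}'")  # observed -> honest, unobserved -> shortcut
```

Obviously this doesn't capture intent or any modelling of the overseer, which is part of why richer concrete examples seem valuable.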

Previous high level projects have tried to define concepts like "trustworthiness" (or the closely related "truthful") and motivated the AI to follow them. Here we will try the opposite: define "betrayal", and motivate the AIs to avoid it.

Why do you think the betrayal approach is more tractable or useful? It's not clear from the post.
 

Thanks. I think I understand, but I'm still confused about the effect on the risk of catastrophe (i.e. not just being pseudo-aligned, but having a catastrophic real-world effect). It may help to clarify that I was mainly thinking of deceptive alignment, not other types of pseudo-alignment. And I'll admit now that I phrased the question more strongly than I actually believe, to elicit more responses :)

I agree that the probability of pseudo-alignment will be the same, and that an unrecoverable action could occur despite the threat of modification. I'm interested in...

Evan Hubinger:
I agree with all of this—online learning doesn't change the probability of pseudo-alignment, but it might make it harder for a deceptively aligned model to defect. That being said, I don't think that deceptive models defecting later is necessarily a good thing: if your deceptive models start defecting sooner, but in recoverable ways, that's actually good, because it gives you a warning shot. And once you have a deceptive model, it's going to try to defect against you at some point, even if it just has to gamble and defect randomly with some probability.

First, I do think that worst-case guarantees are achievable if we do relaxed adversarial training with transparency tools. Second, I have actually done a bunch of probabilistic risk analysis on exactly this sort of situation here. Note, however, that the i.i.d. situation imagined in that analysis is extremely hard to realize in practice, as there are fundamental distributional shifts that are very difficult to overcome—such as the shift from a situation where the model can't defect profitably to a situation where it can.
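To put a number on the "gamble and defect randomly with some probability" point, here is a toy calculation (a deliberately simplified model under an idealized i.i.d. assumption, with made-up episode counts; it is not the linked analysis itself): a deceptive model that cannot tell training from deployment and defects independently with probability p per episode succeeds only if it never defects during training but defects at least once during deployment.

```python
# Toy model (illustrative assumptions only): i.i.d. episodes, n_train training
# episodes followed by n_deploy deployment episodes, and a model that defects
# independently with probability p in each episode.
import numpy as np

def p_successful_defection(p, n_train, n_deploy):
    """P(no defection during training) * P(at least one defection in deployment)."""
    return (1 - p) ** n_train * (1 - (1 - p) ** n_deploy)

n_train, n_deploy = 1_000_000, 1_000   # hypothetical episode counts
ps = np.logspace(-9, -2, 1000)
success = p_successful_defection(ps, n_train, n_deploy)
best = ps[np.argmax(success)]
print(f"best per-episode defection probability ~ {best:.1e}")
print(f"P(successful defection) at that rate  ~ {success.max():.1e}")
```

In this toy setup, when training is much longer than deployment, random defection almost always gets caught during training, which is part of why the distributional shift above matters so much: a model that can tell when defection has become profitable doesn't need to gamble at all.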

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa-objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading https://arxiv.org/abs/2001.08361, which finds that "Larger models are...

Evan Hubinger:
I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.
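A standalone toy illustration of the "relying on your inductive biases" point (my own sketch, and not about mesa-optimization specifically): in an overparameterized linear regression, many weight vectors fit the training data exactly, and gradient descent from zero initialization picks out the minimum-norm one. In other words, the optimizer's implicit bias, not the data, decides which interpolating solution you end up with.

```python
# Illustrative sketch: with more parameters than data points, gradient descent
# from zero initialization converges to the minimum-norm interpolant.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100          # more parameters than data points
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Gradient descent on mean squared error, starting from w = 0.
w = np.zeros(n_features)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n_samples

w_min_norm = np.linalg.pinv(X) @ y       # minimum-norm solution that fits the data

print("training residual:", np.linalg.norm(X @ w - y))                   # ~0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

The analogy is loose, but the structural point is the same: when the training data underdetermines the model, what you get is whatever the training process's implicit prior favours, and the concern above is that for large networks that prior favours simple, compressed policies.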