All of Andrew McKnight's Comments + Replies

I mean, it's mostly semantics, but I think of mechanistic interpretability as "inner" but not alignment, and personally I think it's clearer that way, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link, but it's a bit too much for me to wade into atm.

Either way, it's clear how to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?

Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:

  1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
  2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
  3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is i
...

Great post. I'm happy to see these plans coming out, following OpenAI's lead.

It seems like all the safety strategies are targeted at outer alignment and interpretability. None of the recent OpenAI, DeepMind, Anthropic, or Conjecture plans seem to target inner alignment, iirc, even though this seems to me like the biggest challenge.

Is Anthropic mostly leaving inner alignment untouched, for now?

It seems like all the safety strategies are targeted at outer alignment and interpretability.

None of the recent OpenAI, DeepMind, Anthropic, or Conjecture plans seem to target inner alignment

???


Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point then I find it hard to believe it's generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles?

On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

Koen Holtman

Yes, I address this; see for example the part about "The possibility of learned self-knowledge" in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals.

What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer: you can decide whether you want to design an agent that indeed uses this latent ability to 'look at itself'. Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable.

Wikipedia has the following definition of AGI:

Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it.

Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation P(X|Y,Z) between some well-defined observables, and this relation is definitely not Q'.