I don't really get how the regularizer is supposed to be different from these two cases -- perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing?

To clarify, the difference was that in the case of the regularizer, the impact penalty would gradually be discarded as the AI learned more. In the case of an influence limiter, the AI would always be restricted to low impact actions, until it is retired.

Reply

[-]Rohin Shah6y10

In that case, why isn't this equivalent to impact as a safety protocol? The period during which we use the regularizer is "training".

(Perhaps the impact as a safety protocol point was meant to be about mesa optimization specifically?)

Reply

[-]Matthew Barnett6y20

The period during which we use the regularizer is "training".

This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered "training." To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there's the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.

In the case of "Impact as a regularizer" I meant it in the sense of a baby growing up to be an adult. Whereas the safety protocol was meant to imply safety over the regime of developing brains to begin with.

Reply

[-]Matthew Barnett6y10

Also to note: I think I could have been substantially more clear in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note on the post saying "I don't currently think about impact measures this way anymore."

Reply

[-]Matthew Barnett6y10

[EDIT: See my other comment which explains my reply much better] You're right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The case of regularizer is one where we have given it autonomy and it is no longer in a regime for us to perform tests on.

Reply

[-]Pattern6y10

Neither classical adversarial training nor training on a version of ImageNet designed to reduce the reliance on texture helps a lot, but modifying the network architecture can increase the accuracy on ImageNet-A from around 5% to 15%.

(Section link.)

Wow, 15% sounds really low. How well do people perform on said dataset?

This reminds me of:

https://www.lesswrong.com/posts/s4mqFdgTfsjfwGFiQ/who-s-an-unusual-thinker-that-you-recommend-following#9m26yMR9TtbxCK7m5

David Ha's

most recent paper, Weight Agnostic Neural Networks looks at what happens when you do architecture search over neural nets initialized with random weights to try and better understand how much work structure is doing in neural nets.

Reply