User Comment Replies — AI Alignment Forum

1Max H2y

I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder. (Caveat: I'm not super familiar with the literature on goal mis-generalization and Langosco et al.; what follows is just based on my reading of this post and the previous ones in the sequence.) From the previous post: Concretely, I'm predicting that, if there were training examples where the cheese was located in, say, the bottom-right corner, you probably wouldn't end up with an agent that sometimes goes to the top-right, sometimes to the bottom-right, and sometimes to the cheese, or even an agent that learns a "going right" shard (as a combination of the top-right and bottom-right shards), and a cheese-finding shard. The agent would just always, or nearly always, find the cheese in the test environment. Or, if you want to make sure the training -> test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right. Put differently, the behaviors in Statistically informed impressions seem relevant only in the regime where P(cheese acquired) is not close to 1. That seems like a pretty fragile / narrow / artificial condition, at least for maze-solving. I'm looking forward to seeing more follow up work on this though. I do think there are a lot of interesting interpretability questions this kind of experimentation can answer. What happens if you subtract the cheese vector from a perfect cheese finder, for example?

Behavioural statistics for a maze-solving agent

peligrietzer2y10

I'd maybe point the finger more at the simplicity of the training task than at the size of the network? I'm not sure there's strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.

AI ALIGNMENT FORUM
AF

All of peligrietzer's Comments + Replies