It's not clear what the ratio of capabilities/alignment progress is for interpretability. There is not empirical track record[^1] of interpretability feeding back into improvements of any kind.
A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it's basically general purpose for dealing with ML models.
[1]: known to the author
Cool work! I was going to post about how "effect cancellation" is already known and was written in the original post but, astonishingly to me, it is not! I guess I mis-remembered.
There's one detail that I'm curious about. CaSc usually compares abs(E[loss] - E[scrubbed loss]), and that of course leads to ignoring hypotheses which lead the model to do better in some examples and worse in others.
If we compare E[abs(loss - scrubbed loss)] does this problem go away? I imagine that it doesn't quite if there are exactly-opposing causes for each example, but that ...
Why focus on the fullest set of swaps? An obvious alternative to “evaluate the hypothesis using the fullest set of swaps” is “evaluate the hypothesis by choosing the set of swaps allowed by H which make it look worse”.
I just now have realized that this is AFACIT equivalent to constructing your CaSc hypothesis adversarially--that is, given a hypothesis H, allowing an adversary to choose some other hypothesis H’, and then you run the CaSc experiment on join(H, H’).
One thing that is not equivalent to joins, which you might also want to do, is to choose ...
Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)
- Describe how the trained policy might generalize from the
5x5
top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
It probably has hardcoded “go up and to the right” as an initial heuristic so I’d be surprised if it gets cheeses in the other two quadrants more than 30% of the time (uniformly at random selected locations from there).
...
- Given a fixed trained policy, what
First of all, I really like the images, they made things easier to understand and are pretty. Good work with that!
My biggest problem with this is the unclear applicability of this to alignment. Why do we want to predict scaling laws? Doesn't that mostly promote AI capabilities, and not alignment very much?
Second, I feel like there's a confusion over several probability distributions and potential functions going on
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example, the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too.
Of course, this correlatio...
But the designers can't tell that. Can SGD tell that?
No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution.
But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on data points; do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifie...
Strongly agree with this in particular:
Some people want to apply selection arguments because they believe that selection arguments bypass the need to understand mechanistic details to draw strong conclusions. I think this is mistaken, and that selection arguments often prove too much, and to understand why, you have to know something about the mechanisms.
(emphasis mine). I think it's an application of the no free lunch razor
It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generali...
I agree with the title as stated but not with the rest of the post. RLHF implies that RL will be used, which completely defuses alignment plans that hope that language models will be friendly, because they're not agents. (It may be true that supervised-learning (SL) models are safer, but the moment you get a SL technique, people are going to jam it into RL.)
The central problem with RL isn't that it is vulnerable to wireheading (the "obvious problem"), or that it's going to make a very detailed model of the world. Wireheading on its own (with e.g. a myopic ...
I'm curious what you mean, but I don't entirely understand. If you give me a text representation of the level I'll run it! :) Or you can do so yourself
Here's the text representation for level 53