Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.
Trying to summarize my current understanding of what you're saying:
Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.
As naive examples that probably don't work (similar to the ones from my original comment):
The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length.
I'm confused by this claim and some related ones; sorry if this comment is correspondingly confused and rambly.
It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to...
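To make my worry concrete, here's roughly the accounting I have in mind (a minimal sketch in PyTorch; the module, the bit-counting convention, and all the names are my own naive choices, not anything from your post). The L1 penalty does shrink the per-datapoint code, but the dictionary itself and the reconstruction error also have to be paid for somewhere, and it's the total that I don't see obviously going down:

```python
# A minimal sketch of the accounting I have in mind; all names and the bit-counting
# convention are my own naive choices. The L1 term pushes toward fewer active features
# per datapoint, which shrinks the per-datapoint code, but the dictionary grows with
# n_features and the reconstruction error isn't free either.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations (sparse after training)
        return self.decoder(f), f         # reconstruction, activations

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()  # the term that trains for fewer active features
    return recon + l1_coeff * sparsity

def naive_description_length(f, d_model, n_features, bits_per_value=16.0):
    # Per-datapoint code: an (index, value) pair for each active feature.
    active = (f != 0).float().sum(dim=-1).mean()
    code_bits = active * (torch.log2(torch.tensor(float(n_features))) + bits_per_value)
    # The dictionary itself also has to be described (amortized over the dataset),
    # and a full account would charge for residual reconstruction error too.
    dictionary_bits = 2 * d_model * n_features * bits_per_value
    return code_bits, dictionary_bits
```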
Thanks! Mostly agree with your comments.
I actually think this is reasonably relevant, and is related to treeification.
I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different ...
ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).
To me, the methods seem similar in much more than just the problem they're tackling. I...
Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]
Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involves only a few vari...
I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:
No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.
Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the corr...
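In code, the toy example looks something like this (the correlation value 0.8 is just an arbitrary illustrative choice):

```python
# The toy example in numpy: two positively correlated Gaussians, best-of-1000
# selection on X, and the Y that comes along with it. The correlation value 0.8
# is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 1000

cov = np.array([[1.0, rho],
                [rho, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
x, y = samples[:, 0], samples[:, 1]

best = np.argmax(x)   # pick the pair with the highest X-value
print(f"selected X = {x[best]:.2f}, accompanying Y = {y[best]:.2f}")
# For jointly Gaussian variables E[Y | X = x] = rho * x, so conditioning on an
# unusually high X still gives an unusually high Y in expectation.
```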
As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this; I'm much less sure now (and more confused).
Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correl...
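For reference, the result I mean, stated as I remember it (so this is my gloss, not something from the post): maximizing expected reward with a KL penalty toward the initial policy $\pi_0$ has the closed-form optimum

$$\pi^*(x) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big), \qquad \pi^* = \arg\max_\pi \;\mathbb{E}_{x\sim\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi\,\|\,\pi_0),$$

i.e. the KL-regularized optimum looks like a Bayesian posterior with prior $\pi_0$ and likelihood proportional to $\exp(r(x)/\beta)$, which is part of why I'm now less sure the intervention picture above is the right frame.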
Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:
In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful
This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because the...
It's not clear to me that 3. and 4. can both be true, assuming we want the same level of output quality (as measured by our proxy) in both cases. Sufficiently strong filtering can also destroy correlations via Extremal Goodhart (e.g. this toy example). So I'm wondering whether the perception of filtering being safer just comes from the fact that people basically never filter strongly enough to get a model that raters would be as happy with as a fine-tuned one (I think such strong filtering is probably just computationally intractable?).
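To gesture at what I mean (my own rough illustration, not the toy example linked above): if the proxy is "true value plus heavy-tailed error", then past some point, harder selection on the proxy stops buying more true value:

```python
# My own rough illustration (not the linked toy example): proxy = value + error.
# With Gaussian error, stronger filtering keeps buying more true value; with
# heavy-tailed error, the most extreme proxy scores are mostly error, so the
# gains flatten out or even reverse.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
value = rng.normal(size=n)

for label, error in [("gaussian error", rng.normal(size=n)),
                     ("heavy-tailed error", rng.standard_t(df=2, size=n))]:
    proxy = value + error
    for top_frac in (1e-1, 1e-2, 1e-3, 1e-4):
        k = int(n * top_frac)
        top = np.argpartition(proxy, -k)[-k:]   # indices of the top-k proxy scores
        print(f"{label:>18s} | keep top {top_frac:.0e} | mean true value = {value[top].mean():.2f}")
```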
Maybe there is some more...
Thanks, the clarification that computing J isn't part of step 1 helps clear things up.
I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences from the real world that allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.
Similar to my comment in the other...
I see, that makes much more sense than my guess, thanks!
I'm pretty confused as to how some of the details of this post are meant to be interpreted, so I'll focus on my two main questions, whose answers would probably clear up the rest.
Reward Specification: Finding a policy-scoring function such that (nearly-)optimal policies for that scoring function are desirable.
If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, "reward s...
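In type-signature terms, this is how I'm currently reading it (my own sketch with placeholder names, so please correct me if it's not what you mean):

```python
# A type-signature sketch of my reading, with placeholder names of my own.
from typing import Callable, Mapping

Observation = str    # purely illustrative stand-ins
Action = str
Policy = Callable[[Observation], Mapping[Action, float]]  # obs -> distribution over actions

def J(policy: Policy) -> float:
    """Policy-scoring function: it depends on what the policy would do on *every*
    input, including off-distribution and very rare ones, not just its behavior
    on some training distribution."""
    ...
```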
I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.
Interesting points, I agree that our response to part C doesn't address this well.
AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I'm not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we...
Thanks for the interesting comments!
Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").
Yeah, I didn't understand Katja's post as arguing (1); otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my...
Thanks for the comments!
One can define deception as a type of distributional shift. [...]
I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.
I directionally agree with this (and think it's good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
I th...
I agree, and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there's some under way, and I'm excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)