All of Erik Jenner's Comments + Replies

I directionally agree with this (and think it's good to write about this more, strongly upvoted!)

For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:

  1. "ensuring that if the AIs are not aligned [...], then you are still OK" (which I think is the main meaning of "AI control")
  2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood's existing work on control).

I th... (read more)

I agree and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there's some underway, and I'm excited for more).

We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.

I plan to write more about white box control soon (but I plan to write about a lot of stuff…)

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.

Trying to summarize my current understanding of what you're saying:

  • SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
  • As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
  • Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I
... (read more)
1Lee Sharkey
Yes, all four sound right to me. To avoid any confusion, I'd just add an emphasis that the descriptions are mathematical, as opposed to semantic. I too am keen to converge on a format in terms of Turing machines or Kolmogorov complexity or something else more formal. But I don't feel very well placed to do that, unfortunately, since thinking in those terms isn't very natural to me yet.

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.

As naive examples that probably don't work (similar to the ones from my original comment):

  • We could consider any Turing machine that approximately outputs (network, dat
... (read more)
1Lee Sharkey
I'll register that I prefer using 'description' instead of 'explanation' in most places. The reason is that 'explanation' invokes a notion of understanding, which requires both a mathematical description and a semantic description. So I regret using the word explanation in the comment above (although not completely wrong to use it - but it did risk confusion). I'll edit to replace it with 'description' and strikethrough 'explanation'.

"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion. But formalizing what an explanation is seems like a high bar. If it's helpful, a mathematical description is just a statement of what the network is in terms of particular kinds of mathematical objects.

"mathematical description length of an explanation": (Note: Mathematical descriptions are of networks, not of explanations.) It's just the set of objects used to describe the network. Maybe helpful to think in terms of maps between different descriptions: E.g. there is a many-to-one map between a description of a neural network in terms of polytopes and in terms of neurons. There are ~exponentially many more polytopes. Hence the mathematical description of the network in terms of individual polytopes is much larger.

I would not. So: I think that the confusion might again be from using 'explanation' rather than description. SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset).

What's a mathematical description of the (network, dataset), then? It's just what you get when you pass the dataset through the network; this datum interacts with this weight to produce this activation, that datum interacts with this weight to produce that activation, and so on. A mathematical description of the (network, dataset) in terms of SAEs are: th
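To make the polytopes-vs-neurons size comparison concrete, here is a minimal sketch (layer and dataset sizes are made up for illustration, not taken from the thread) counting how many distinct ReLU activation patterns, i.e. polytopes, a random one-layer network assigns to random data; the number of polytopes the data lands in far exceeds the number of neurons, which is the sense in which a polytope-level description is much larger.

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
rng = np.random.default_rng(0)
d_in, n_neurons, n_data = 10, 32, 100_000

W = rng.normal(size=(d_in, n_neurons))   # one random ReLU layer (no bias)
X = rng.normal(size=(n_data, d_in))      # random "dataset"

# Each datapoint's on/off pattern over the 32 ReLUs identifies the polytope it lies in.
patterns = (X @ W > 0)
n_polytopes_hit = len({row.tobytes() for row in patterns})

print(f"{n_neurons} neurons, but the data lands in {n_polytopes_hit} distinct polytopes")
# A description indexed by polytopes therefore has far more parts than one indexed
# by neurons, even before counting the combinatorially many polytopes the data
# never touches.
```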

The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length

I'm confused by this claim and some related ones, sorry if this comment is correspondingly confused and rambly.

It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to... (read more)

2Lee Sharkey
Thanks Erik :) And I'm glad you raised this.

One of the things that many researchers I've talked to don't appreciate is that, if we accept networks can do computation in superposition, then we also have to accept that we can't just understand the network alone. We want to understand the network's behaviour on a dataset, where the dataset contains potentially lots of features. And depending on the features that are active in a given datum, the network can do different computations in superposition (unlike in a linear network that can't do superposition). The combined object '(network, dataset)' is much larger than the network itself. ~~Explanations~~ Descriptions of the (network, dataset) object can actually be compressions despite potentially being larger than the network.

So, you can have one feature active for each datapoint, but now we've got an ~~explanation~~ description of the (network, dataset) that scales linearly in the size of the dataset, which sucks! Instead, if we look for regularities (opportunities for compression) in how the network treats data, then we have a better chance at ~~explanations~~ descriptions that scale better with dataset size. Suppose a datum consists of a novel combination of previously ~~explained~~ described circuits. Then our ~~explanation~~ description of the (network, dataset) is much smaller than if we ~~explained~~ described every datapoint anew.

In light of that, you can understand my disagreement with "in that case, I could also reduce the description length by training a smaller model." No! Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the ~~explanation~~ description of the (network, dataset) is basically unchanged.
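To make the scaling argument concrete, here is a crude back-of-the-envelope sketch. All numbers are hypothetical, and "description length" is operationalized very naively as a count of recorded values (my gloss, not a definition from the thread): both descriptions of (network, dataset) grow linearly with the dataset, but the per-datapoint cost of the SAE-based description is set by the number of active features rather than by the full activation dimension.

```python
# Naive description-length proxy: how many numbers you'd write down to describe
# the network's behaviour on the whole dataset. All sizes below are hypothetical.
n_data     = 1_000_000          # datapoints
d_model    = 768                # activation dimension being described
n_features = 16 * d_model       # SAE dictionary size (assumed expansion factor)
avg_l0     = 30                 # average active SAE features per datapoint (assumed)

# Option A: record every activation dimension for every datapoint.
dense_cost = n_data * d_model

# Option B: record the SAE dictionary once, then only the (index, value) pairs
# of active features for each datapoint.
sae_cost = n_features * d_model + n_data * avg_l0 * 2

print(f"dense description:     {dense_cost:,} values")
print(f"SAE-based description: {sae_cost:,} values")
# Both scale linearly in n_data, but the per-datapoint coefficient is
# 2 * avg_l0 = 60 vs d_model = 768; finding reusable features is what buys
# the compression described above.
```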

Thanks! Mostly agree with your comments.

I actually think this is reasonably relevant, and is related to treeification.

I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different ... (read more)

1Jenny Nitishinskaya
Not sure if I'm fully responding to your q but... This sounds right to me, and overall I mostly think of treeification as just a kind of extensional rewrite (plus adding more inputs). I think of the underlying graph as providing some combination of 1) causal relationships, and 2) smaller pieces to help with search/reasoning, rather than being an object we inherently care about. (It's possibly useful to think of hypotheses more as making predictions about the behavior but idk.) I do agree that in some applications you might want to restrict which rewrites (including treeification!) are allowed. e.g., in MAD for ELK we might want to make use of the fact that there is a single "diamond" (which may be ~distributed, but not ~duplicated) upstream of all the sensors.

ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and

I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).


To me, the methods seem similar in much more than just the problem they're tackling. I... (read more)

4Nora Belrose
FWIW it appears that out of the 4 differences you cited here, only one of them (the relaxation of the restriction that the scrubbed output must be the same) still holds as of this January paper from Geiger's group https://arxiv.org/abs/2301.04709. So the methods are even more similar than you thought.
1Lawrence Chan
Strongly upvoted for a clear explanation!

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few vari... (read more)

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive---mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception,
... (read more)
2janus
These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points. Here are some notes I made some time ago about JD's idea. There's some overlap with the things you listed.

* Hypotheses / cruxes
  * (1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
    * Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier" with different validation loss
    * Note: This is self-supervised learning with the exact same data. I think it's even more evident that you'll get different generalization strategies in RL runs with the same reward model, because even the training samples are not deterministic.
  * (1A) These generalization strategies correspond to differences we care about, like in the limit deceptive vs honest policies
  * (2) Generalization basins are stable across scale (and architectures?)
    * If so, we can scope out the basins of smaller models and then detect/choose basins in larger models
    * We should definitely see if this is true for current scales. AFAIK basin analysis has only been done for models very small compared to SOTA
    * If we find that basins are stable across existing scales that's very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. "Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box." Mesaoptimizers inside GPT-3 probably are not explicitly reasoning about being in a box at all, except maybe on the ephemeral simulacra level.
      * But that is no reason not to attempt any of this.
      * And I think stable basins at exist
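As an aside on hypothesis (1): a common way to operationalize "same vs different basin" is linear mode connectivity, i.e. whether the loss stays low along the straight line between two trained models' weights. A minimal PyTorch-style sketch of that check (function and argument names are my own, and it assumes both models share an architecture with float-only state dicts):

```python
import copy
import torch

@torch.no_grad()
def loss_along_interpolation(model_a, model_b, loss_fn, loader, n_points=11):
    """Loss on the straight line between two trained models' weights.

    A roughly flat path is the usual sign that the two solutions are linearly
    mode-connected (same "basin"); a large barrier in the middle suggests they
    are not. Assumes the models share an architecture and that every
    state-dict entry is a float tensor.
    """
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(n_points):
        alpha = i / (n_points - 1)
        probe.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
        total, count = 0.0, 0
        for inputs, targets in loader:
            total += loss_fn(probe(inputs), targets).item() * len(inputs)
            count += len(inputs)
        losses.append(total / count)
    return losses  # e.g. compare the midpoint loss against the two endpoints
```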

No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.

Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the corr... (read more)
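For what it's worth, this toy example is easy to simulate; here is a minimal sketch (the correlation, sample count, and trial count are arbitrary values I picked) showing that selecting the best of 1000 pairs by X also yields an unusually high Y on average:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_samples, n_trials = 0.8, 1_000, 2_000   # arbitrary illustration values

cov = [[1.0, rho], [rho, 1.0]]
pairs = rng.multivariate_normal([0.0, 0.0], cov, size=(n_trials, n_samples))
x, y = pairs[..., 0], pairs[..., 1]

best = x.argmax(axis=1)                        # best-of-n selection on the "proxy" X
y_at_best = y[np.arange(n_trials), best]

print(f"mean X of selected pairs: {x.max(axis=1).mean():.2f}")   # ~3.2 for n = 1000
print(f"mean Y of selected pairs: {y_at_best.mean():.2f}")       # ~rho * 3.2 ≈ 2.6
```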

1Lawrence Chan
Cool, I don't think we disagree here.

As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this, I'm much less sure now (and more confused).

Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correl... (read more)

1Lawrence Chan
I think your claim is something like: As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sampling, you'll find the RL trajectories.

This is obviously a super pedantic point: RL finds sentences with cross entropy of 30+ nats wrt the base distribution all the time, while you'll never do Best-of-exp(30)~=1e13. And there's an empirical question of how much performance you get versus how far your new policy is from the old one, e.g. if you look at Leo Gao's recent RLHF paper, you'll see that RL is more off distribution than BoN at equal proxy rewards.

That being said, I do think you need to make more points than just "RL can result in incredibly implausible trajectories" in order to claim that BoN is safer than RL, since I claim that Best-of-exp(30) is not clearly safe either!
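As a reference point for the numbers above: best-of-n sampling has a standard closed-form expression for its KL divergence from the base policy, KL = log n − (n − 1)/n nats (used e.g. in the reward-model overoptimization literature), so the optimization pressure from pure rejection sampling grows only logarithmically in the number of samples. A quick sketch:

```python
import math

def best_of_n_kl_nats(n: float) -> float:
    """KL(best-of-n || base) = log(n) - (n - 1) / n nats."""
    return math.log(n) - (n - 1) / n

for n in (16, 1_000, 1e13):
    print(f"best-of-{n:g}: ~{best_of_n_kl_nats(n):.1f} nats of optimization")
# Reaching ~30 nats by rejection sampling alone needs n on the order of
# e^31 ≈ 3e13 samples, which is why Best-of-exp(30) is infeasible in practice,
# while RL routinely moves the policy that far from the base distribution.
```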

Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because the... (read more)

1Lawrence Chan
Can you explain why RLHF is worse from a Causal Goodhart perspective?

It's not clear to me that 3. and 4. can both be true assuming we want the same level of output quality as measured by our proxy in both cases. Sufficiently strong filtering can also destroy correlations via Extremal Goodhart (e.g. this toy example). So I'm wondering whether the perception of filtering being safer just comes from the fact that people basically never filter strongly enough to get a model that raters would be as happy with as a fine-tuned one (I think such strong filtering is probably just computationally intractable?)

Maybe there is some more... (read more)

4davidad (David A. Dalrymple)
Extremal Goodhart relies on a feasibility boundary in U,V-space that lacks orthogonality, in such a way that maximal U logically implies non-maximal V. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won't come up in either near-term or long-term applications. Near-term, it's Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.

Extremal Goodhart might come into play if, for example, there are some truths about what's useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren't true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of "ambitious value learning that we defer the whole lightcone to.")

For the record, I'm not very optimistic about filtering as an alignment scheme either, but in the setting of "let's have some near-term assistance with alignment research", I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.

Thanks, computing J not being part of step 1 helps clear things up.

I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences to the real world which allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.

Similar to my comment in the other... (read more)

2davidad (David A. Dalrymple)
To the final question, for what it’s worth to contextualize my perspective, I think my inside-view is simultaneously:
* unusually optimistic about formal verification
* unusually optimistic about learning interpretable world-models
* unusually pessimistic about learning interpretable end-to-end policies
1davidad (David A. Dalrymple)
I agree, if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in J, that would be a vulnerability in the shape of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don’t have to get an astronomically perfect model of reality to have any hope of its not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
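One standard fact points in the direction of the TV-distance claim above: for any event A (for instance, "the deployed policy hits an environment-behavior that the model covers badly and then does something catastrophic"),

$$\lvert P_{\text{model}}(A) - P_{\text{reality}}(A) \rvert \;\le\; d_{\mathrm{TV}}(P_{\text{model}}, P_{\text{reality}}),$$

so if the bad event has probability at most ε under the environment model encoded in J, its probability in reality is at most ε plus the TV distance; a 1% TV gap adds at most about one percentage point of failure probability through this channel. (This is only the simplest version of the "true theorem vaguely along these lines" mentioned above.)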

I see, that makes much more sense than my guess, thanks!

I'm pretty confused as to how some of the details of this post are meant to be interpreted, I'll focus on my two main questions that would probably clear up the rest.

Reward Specification: Finding a policy-scoring function J such that (nearly-)optimal policies for that scoring function are desirable.

If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, "reward s... (read more)

2davidad (David A. Dalrymple)
To the second point, I meant something very different—I edited this sentence and hopefully it is more clear now. I did not mean that T should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies). Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.
2davidad (David A. Dalrymple)
Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret.

To the first point, yes, J evaluates π on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way.

I claim that J does not need to “detect deceptive misalignment” in any special way, and I’m not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector. Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

If there’s a sneaky hard part of Reward Specification beyond the obvious hard part of defining what’s good and bad, it would be “realistically defining the environment.” (That’s where purely predictive models come in.)

I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.

Interesting points, I agree that our response to part C doesn't address this well.

AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we... (read more)

2AdamGleave
I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
0Luna Rimar
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.

Thanks for the interesting comments!

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable".  I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my... (read more)

2David Scott Krueger
Responding in order:

1) Yeah, I wasn't saying it's what her post is about. But I think you can get to more interesting cruxy stuff by interpreting it that way.

2) Yep, it's just a caveat I mentioned for completeness.

3) Your spontaneous reasoning doesn't say that we/it get(/s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell if things are going wrong, but people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.......

Thanks for the comments!

One can define deception as a type of distributional shift. [...]

I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

2Johannes Treutlein
Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.) I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but instead would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation? I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.