All of Rohin Shah's Comments + Replies

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition."

Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).

Is that right?

Yes, that's broadly accurate, though one clarification:

This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly

That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA comes at the point where problems arise, MONA will in fact be bad; whereas if you have iterated on it a bunch previously (and in particular know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.

I think this will

... (read more)

If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point

That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization when you don't use MONA, and also that if you actually use MONA, it will measurably improve quality.

For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ... (read more)

Fabien Roger
To rephrase, here is what I think you are saying about situations where work on MONA is very helpful:

* By default people get bitten by long-term optimization. They notice issues in prod because it's hard to catch everything. They patch individual failures when they come up, but don't notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don't try MONA-like techniques because that option is not very salient when you are trying to fix individual failures, and it does not pass cost-benefit as a fix for individual failures.
* If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches, and that the alignment tax is not too large. This is not obvious, because trying it out and measuring the effectiveness of MONA is somewhat costly, and because people don't by default think of the individual failures they've seen in prod as symptoms of their long-term optimization; but your empirical work pushes them over the line, and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk - though given competitive pressures, that might not be the main factor driving decisions, so you don't have to make an ironclad case for MONA reducing catastrophic risk).

Is that right?

I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training-technique transparency that it is possible for people outside of labs to notice when issues plausibly stem from long-term optimization?

I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").

Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

I have some credence in all three of those bullet points.

For MONA it's a relatively even mixture of the first and second points.

(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)

For debate it's mostly the first point, and to some extent the third point.

Fabien Roger
That's right. If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it. In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won't have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if:

* the alignment tax of MONA were tiny
* there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA

I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don't disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
Ryan Greenblatt
I also have the "that's obvious" reaction, but possibly I'm missing some details. I also think it won't perform well enough in practice to pencil out, given other better places to allocate safety budget (if it does trade off, which is unclear).

Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings

I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza... (read more)

Fabien Roger
My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don't show up in feedback quality evals. I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.

I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm... (read more)

Fabien Roger
Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:

* on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won't use them by default?
* on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
* on the path to speeding up safety-critical domains?
* (valuable for some other reason?)
Fabien Roger
That is not my claim. By "I would have guessed that methods work on toy domains and math will be useless" I meant "I would have guessed that if a lab decided to do process-based feedback, it would be better off not doing a detailed literature review of the methods in MONA and its followups on toy domains, and instead just doing the process-based supervision that makes sense in the real domain it is now looking at. The only part of the method section of MONA papers that matters might be 'we did process-based supervision'." I did not say "methods that work on toy domains will be useless" (my sentence was easy to misread).

I almost have the opposite stance; I am closer to "it's so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. Process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don't increase the probability that people working on capabilities implement myopic objectives."

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the "capability researcher does what looks easiest and sensible to them" baseline)?

Not a crux. My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low-confidence guess is that it's not high priority for people working on x-risk safety to speed up that work. But I am excited about debate work that is not just about improving feedback quality. For example, I am interested in debate vs schemers, or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality

I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.

Rohin Shah

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat... (read more)

Fabien Roger
I think additional non-moonshot work in these domains will have a very hard time helping. [low confidence]

My high-level concern is that non-moonshot work in these clusters may be the sort of thing labs will use anyway (with or without a safety push) if it helps with capabilities, because the techniques are easy to find, and won't use if it doesn't help with capabilities, because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters. [edit: I was at least somewhat wrong, see comment threads below]

Here are things that would change my mind:

* If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won't bother using them by default, but would bother using them if there was existing work on how to make them work.
* (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs not to bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
* If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
* (In the different domain of CoT-faithfulness I think there is a lot of value in demonstrating the risk of opaque CoT well enough that
Buck Shlegeris
I don't know what this means, do you have any examples?

I think we should just all give up on the term "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation" instead.

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis... (read more)

In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology

This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ... (read more)

Rohin Shah

I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i... (read more)

Buck Shlegeris
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people. What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments. I think you're overfixating on the experience of Google, which has more complicated production systems than most.

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ... (read more)

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section ... (read more)
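To make the distinction concrete, here is a minimal sketch (toy code with hypothetical names, not the paper's implementation) of the axis being varied: whether identical per-step feedback is summed into a discounted return or used directly as each step's training target.

```python
# Toy contrast between ordinary RL credit assignment and MONA-style myopic
# optimization. Hypothetical names; the real experiments run RL on LLM agents.

def ordinary_rl_targets(rewards, gamma=0.99):
    """Each step's training signal includes all later rewards (discounted):
    rewards propagate backwards in time."""
    returns, future = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

def mona_targets(approvals):
    """Each step's training signal is only the overseer's (possibly nonmyopic)
    approval of that step; nothing from later steps is propagated back."""
    return list(approvals)

# Even with identical per-step feedback, the training targets differ:
feedback = [0.1, 0.0, 1.0]
print(ordinary_rl_targets(feedback))  # early steps get credit for the final 1.0
print(mona_targets(feedback))         # each step is judged only on itself
```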

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when book... (read more)

Daniel Kokotajlo
Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Rohin Shah

I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:

What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]

The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]

Layer 1

... (read more)

That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but more "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is a threshold significantly lower than a 100% chance of catastrophe.

Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not

... (read more)

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

  1. You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
  2. You do train on synthetically generated dangerous actio
... (read more)

You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.

Buck Shlegeris
I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model. You might want to use something debate-like in the synthetic input generation process, but that's structurally different.

I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).

I'm not in full agreement with your comments on the theories of change:

  1. I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
  2. I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surp
... (read more)
Buck Shlegeris
I don’t understand your last sentence, can you rephrase?

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.

Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight... (read more)

Knight Lee
It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, so 165% is still within the ballpark. I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here). Thank you for your alignment work :)
Stephen Casper
Thanks for the comment. My probabilities sum to 165%, which translates to me expecting, on average, the next paper to do 1.65 things from the list - which to me doesn't seem too crazy. I think that this DOES match my expectations. But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain
... (read more)
Ryan Greenblatt
Agreed. (And I made a comment to this effect on Mark's draft of this short form.) I don't think it is that productive to classify research directions; the central thing we're trying to point at is a classification of reasons why you might think that you're fine, where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).

I also think that white-box control should typically be considered pretty separately from black-box control, and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective; e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe's reliability and other important aspects of the probe. For instance, the probe can take advantage of black-box strategies, and how safe we are depends on how many actions AIs need to cause huge problems.)

The claims I'd like to make are:

* There should be more research specifically targeting a black-box control story and operating from a black-box control perspective. (I'd guess around 20% of the overall AI safety portfolio should be this, and maybe 30% of safety work at labs. This is not including work like "get AIs to be more useful at alignment research and get better at DC eval methodology", which should also be subsidized from a control perspective but seems non-central.)
* When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable-oversight-style research), they should sometimes explicitly think about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment-style story they think about while doing their research, they should be rou
Ryan Greenblatt
You're potentially thinking about footnote 4 in our post on control. This footnote is a bit confusingly worded, but the situation we were trying to describe is: "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control." This is different from research in the run-up.

I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions, and some interventions don't nicely fit into these buckets; e.g., you can do work on differentially making AIs more useful for alignment work, which isn't well classified as either, and you can work on high-level interpretability that aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't very directly help with reducing the chance that scheming arises, but could help with a bunch of stuff).)

Fwiw the thing that scares me is the combination of three things:

  1. Producing an accurate estimate requires "understanding" the input
  2. Our standard for the quality of estimates will be very high
  3. The approach is purely algorithmic (rather than e.g. using the flexible intelligence of LLMs / humans)

You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:

  1. Underestimates of risk: These are cases where the presumption of independence neglects correlati
... (read more)
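As a toy illustration of the "underestimates of risk" direction (my own example, not from the post): when two failures share a common cause, multiplying their marginal probabilities can be off by orders of magnitude.

```python
import random

# Two "safeguards" that each fail ~1% of the time. A presumption of
# independence gives P(both fail) = 0.01 * 0.01 = 1e-4; a shared common
# cause makes the true probability ~90x larger.
random.seed(0)
trials = 1_000_000
both = 0
for _ in range(trials):
    common_cause = random.random() < 0.009      # e.g. one unusual input
    fail_a = common_cause or random.random() < 0.001
    fail_b = common_cause or random.random() < 0.001
    both += fail_a and fail_b

print(0.01 * 0.01)    # independence-based estimate: 1e-4
print(both / trials)  # empirical: ~9e-3
```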
Mark Xu
Yes, I agree with what you have written, and do think it's overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn't work out, e.g. we're open to learning some stuff empirically and adding an "algorithmic cherry on top" to produce the estimate.

A few questions:

  • The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
  • Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure h
... (read more)
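For readers unfamiliar with the term: interval propagation (as used in certified defenses) pushes elementwise input bounds through each layer, taking the worst case at every step. A minimal sketch (my own toy code, not any particular paper's) of why the resulting estimates are so weak:

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    """Propagate elementwise bounds [lo, hi] through x -> W @ x + b."""
    mid, rad = (lo + hi) / 2, (hi - lo) / 2
    new_mid = W @ mid + b
    new_rad = np.abs(W) @ rad   # worst case in every coordinate independently
    return new_mid - new_rad, new_mid + new_rad

def ibp_relu(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=16)
lo, hi = x - 0.01, x + 0.01     # epsilon-ball around a single input
for _ in range(4):              # a small random MLP
    W, b = rng.normal(size=(16, 16)) / 4, np.zeros(16)
    lo, hi = ibp_relu(*ibp_affine(lo, hi, W, b))

# The bounds are sound but loose, and the looseness compounds with depth --
# one reason certified estimates tend to be very weak for deep networks.
print((hi - lo).mean())
```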
Mark Xu
I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don't think formally verified estimates are tractable to produce in general).

The main hope comes from the fact that we're using a "best guess" estimate, instead of trying to certify that the model won't produce catastrophic actions. For example, Method 1 can be thought of as running a single example with a Gaussian blob around it through the model, while also tracking the "1st order" contributions that come from the Gaussian blob. If we wanted to bound the potential contributions from the Gaussian blob, our estimates would get really broad really fast, as you tend to see with interval propagation. Although, this also comes with the opposite issue of how to know whether the estimates are at all reasonable, especially when you train against them.

I think fundamentally we just need our estimates to "not get that much worse" as things get deeper/more complicated. The main hope for why we can achieve this is that the underlying model itself will not get worse as it gets deeper/the chain of thought gets longer. This implies that there is some sort of stabilization going on, so we will need to capture the effect of this stabilization. It does seem like in order to do this, we will have to model only high-level properties of this distribution, instead of trying to model things on the level of activations.

In other words, one issue with interval propagation is that it makes an assumption that can only become less true as you propagate through the model. After a few layers, you're (perhaps only implicitly) putting high probability on activations that the model will never produce. But as long as your "activation model" is behaving reasonably, then hopefully it will only become
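To illustrate the kind of estimate being described, here is a linearized toy of "a Gaussian blob plus 1st-order contributions" - my own sketch of the general idea (essentially the delta method), not ARC's actual Method 1:

```python
import numpy as np

def propagate_gaussian(mu, Sigma, W):
    """Push N(mu, Sigma) through y = tanh(W @ x): propagate the mean directly,
    and push the covariance through the layer's Jacobian (first order only)."""
    pre = W @ mu
    J = np.diag(1.0 - np.tanh(pre) ** 2) @ W   # Jacobian of tanh(Wx) at mu
    return np.tanh(pre), J @ Sigma @ J.T

rng = np.random.default_rng(0)
mu, Sigma = rng.normal(size=8), 0.01 * np.eye(8)
for _ in range(4):
    W = rng.normal(size=(8, 8)) / np.sqrt(8)
    mu, Sigma = propagate_gaussian(mu, Sigma, W)

# A "best guess" spread rather than a sound bound: it can be wrong, but it
# doesn't blow up the way worst-case interval propagation does.
print(np.trace(Sigma))
```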

Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?

For example, here's one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset -- however, different decoder vectors have different frequencies of firing in the token dataset, so uniform... (read more)
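A hedged sketch of the contrast (toy code of mine; real SAEs and meta-SAEs use more machinery, e.g. BatchTopK): the two models can share an architecture, but they see different training distributions.

```python
import torch

class TinySAE(torch.nn.Module):
    """Minimal SAE: x is reconstructed as decode(relu(encode(x))),
    trained with an L1 sparsity penalty on the latents."""
    def __init__(self, d_in, d_dict):
        super().__init__()
        self.enc = torch.nn.Linear(d_in, d_dict)
        self.dec = torch.nn.Linear(d_dict, d_in, bias=False)

    def loss(self, x, l1_coef=1e-3):
        f = torch.relu(self.enc(x))
        recon_err = ((self.dec(f) - x) ** 2).sum(-1).mean()
        return recon_err + l1_coef * f.abs().sum(-1).mean()

d_model = 64
sae = TinySAE(d_model, d_dict=2304)       # regular SAE
meta_sae = TinySAE(d_model, d_dict=256)   # meta-SAE over the SAE's decoder

# Regular SAE: trained on activations sampled from the token dataset, so
# frequently-firing directions dominate the objective.
acts = torch.randn(4096, d_model)         # stand-in for real activations
loss_sae = sae.loss(acts)

# Meta-SAE: trained on the SAE's decoder vectors, one row per latent -- by
# default every latent is weighted equally, however often it fires on data.
decoder_vecs = sae.dec.weight.T.detach()  # shape (2304, d_model)
loss_meta = meta_sae.loss(decoder_vecs)
```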

Neel Nanda
Interesting thought! I expect there are systematic differences, though it's not quite obvious what they are. Your example seems pretty plausible to me. Meta-SAEs are also more incentivized to learn features which tend to split a lot, I think, as then they're useful for predicting many latents. Though ones that don't split may be useful, as they entirely explain a latent that's otherwise hard to explain.

Anyway, we haven't checked yet, but I expect many of the results in this post would look similar for e.g. sparse linear regression over a smaller SAE's decoder. Re why meta-SAEs are interesting at all: they're much cheaper to train than a smaller SAE, and BatchTopK gives you more control over the L0 than you could easily get with sparse linear regression, which are some mild advantages; but you may have a small SAE lying around anyway. I see the interesting point of this post more as "SAE latents are not atomic, as shown by one method, but probably other methods would work well too".

Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.

I disagree that the AGI safety team should have 4 as its "bre... (read more)

I'm not going to repeat all of the literature on debate here, but as brief pointers:

  • Factored cognition discusses intuitively why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than arguments between people)
  • AI safety via debate makes the same argument for debate (by showing that a polynomial time judge can supervise PSPACE -- PSPACE-complete problems typically involve exponential-sized trees)
  • Cross-examination is discussed here
  • This paper discusses the experiments you'd do to figure out what the human jud
... (read more)

I do, but more importantly, I want to disallow the judge understanding all the concepts here.

I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of... (read more)

bhauth
You don't? But this is a major problem in arguments between people. The variation within humans is already more than enough for this! There's a gap like that every 35 IQ points or so. I don't understand why you're confident this isn't an issue. I guess we've found our main disagreement, at least?

Now you're training for multiple objectives:

1. You want the debater AI to argue for proposition A or not-A according to its role, and to convince human judges of that.
2. You want it to not change its position on sub-arguments.

But (2) is ill-defined. Can sub-arguments be combined for less weighting? Are they all worth the same? What if you have several sub-arguments that all depend on a single sub-2-argument? Good arguments for A or not-A should have lots of disagreements - or do you want to train an AI that makes all the same sub-arguments for A or not-A and then says "this implies A / not-A"? I don't think this works.

In response to the linked "HCH" post: Yes, an agent past some threshold can theoretically make a more-intelligent agent. But that doesn't say anything about alignment; the supposed "question-answering machine" would be subject to instrumental convergence and mesaoptimizer issues, and you'd get value drift with each HCH stage, just as you would with RSI schemes.

Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)

I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the ... (read more)

bhauth
I do, but more importantly, I want to disallow the judge understanding all the concepts here. Suppose the judge says to #1: "What is energy?" or "What is conservation?" and it can't be explained to them - what then? Also, argument 1 isn't actually correct; E=mc^2 and so on.

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

There are several different outs to this example:

  • You should at least be able to argue that the evidence does not support the conclusion, and that the boss should have substantial probability on "the box can make some electricity but not infinitely much".
  • You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.
    • This does not mean that the boss then understands why perpetua
... (read more)
bhauth
Ah, I don't think you can. Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.) This seems like a core of our disagreement here.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

Yes, "doubly efficient debate".

Strongly agree on the first challenge; on the theory workstream we're thinking about how to deal with this problem. Some past work (not from us) is here and here.

Though to be clear, I don't think the empirical evidence clearly rules out "just making neural networks explainable". Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven't been tried very much yet. I don't think "asking an AI what another AI is doing and doing RLHF on the response" is nearly as good; that is much more likely t... (read more)

bhauth
To clarify the 2nd point, here's an example. Suppose someone presents you with a large box that supposedly produces electricity endlessly. Your boss thinks it works, and you're debating the inventor in front of your boss. "Perpetual motion machines are known to be impossible" you say, but your boss isn't familiar with that conceptual class or the reasoning involved. The inventor says, "Here, let's plug in a thing, we can see that the box does in fact produce a little electricity." Your boss finds this very convincing. The process proposed in the paper is something like, "let's randomly sample every possible machine to see if it does perpetual motion". So the inventor points to the sun and says, "that thing has been making energy continuously and never stops for as long as we've been able to tell". They point to some stars and say the same thing. The sampling and evaluation is dependent on a conceptual framework that isn't agreed on, and waiting for the sun and stars to burn out isn't very practical.

It depends fairly significantly on how you draw the boundaries; I think anywhere between 30 and 50 is defensible. (For the growth numbers I chose one specific but arbitrary way of drawing the boundaries, I expect you'd get similar numbers using other methods of drawing the boundaries.) Note this does not include everyone working on safety, e.g. it doesn't include the people working on present day safety or adversarial robustness.

Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.

I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.

Joe Collman
That's fair. I agree that we're not likely to resolve much by continuing this discussion. (But thanks for engaging - I do think I understand your position somewhat better now.)

What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and of any [plan to use such techniques to get useful alignment work done]. I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than hours, but it seems worth it (whether or not I'm correct - I expect that [the understanding required to clearly demonstrate to me that I'm wrong] would be useful in a bunch of other ways).

I expect that, absent impressive levels of international coordination, we're screwed.

This is the sort of thing that makes it hard for me to distinguish your argument from "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".

I agree that, conditional on believing that we're screwed absent huge levels of coordination regardless of technical work, the... (read more)

Joe Collman
I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom". I'll clarify more below.

Sure, but I mostly don't buy p(doom) reduction here, other than through [highlight near-misses] - so an approach that hides symptoms of fundamental problems is probably net negative. In the free-for-all world, I think doom is overdetermined, absent miracles[1] - and [significantly improved debate setup] does not strike me as a likely miracle, even after I condition on [a miracle occurred].

Factors that push in the other direction:

* I can imagine techniques that reduce near-term widespread low-stakes failures.
  * This may be instrumentally positive if e.g. AI is much better for collective sensemaking than it otherwise would be (even if that's only [the negative impact isn't as severe]).
  * Similarly, I can imagine such techniques mitigating the near-term impact of [we get what we measure] failures. This too seems instrumentally useful.
* I do accept that technical work I'm not too keen on may avoid some early foolish/embarrassing ways to fail catastrophically.
  * I mostly don't think this helps significantly, since we'll consistently hit doom later without a change in strategy.
  * Nonetheless, [don't be dead yet] is instrumentally useful if we want more time to change strategy, so avoiding early catastrophe is a plus.
* [probably other things along similar lines that I'm missing]

But I suppose that on the [usefulness of debate (/scalable oversight techniques generally) research], I'm mainly thinking: [more clearly understanding how and when this may fail catastrophically, and how we'd robustly predict this] seems positive, whereas [show that versions of this technique get higher scores on some benchmarks] probably doesn't. Even if I'm wrong about the latter, the former seems more important. Granted, i

Not going to respond to everything, sorry, but a few notes:

It fits the pattern of [lower perceived risk] --> [actions that increase risk].

My claim is that for the things you call "actions that increase risk" that I call "opportunity cost", this causal arrow is very weak, and so you shouldn't think of it as risk compensation.

E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect ... (read more)

Joe Collman
No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.

Agreed, but I think this is too coarse-grained a view. I expect that, absent impressive levels of international coordination, we're screwed. I'm not expecting [higher perceived risk] --> [actions that decrease risk] to operate successfully on the "move fast and break things" crowd. I'm considering:

* What kinds of people are making/influencing key decisions in worlds where we're likely to survive?
* How do we get those people this influence? (or influential people to acquire these qualities)
* What kinds of situation / process increase the probability that these people make risk-reducing decisions?

I think some kind of analysis along these lines makes sense - though clearly it's hard to know where to draw the line between [it's unrealistic to expect decision-makers/influencers this principled] and [it's unrealistic to think things may go well with decision-makers this poor]. I don't think conditioning on the status-quo free-for-all makes sense, since I don't think that's a world where our actions have much influence on our odds of success.

Agreed (I think your links make good points). However, I'd point out that it can be true both that:

1. Most first-principles reasoning about x is terrible.
2. First-principles reasoning is required in order to make any useful prediction of x.

(for most x, I don't think this holds)

I think almost everything comes down to [perceived level of risk] sometimes dropping hugely more than [actual risk] in the case of AI. So it's about the magnitude of the input.

* We understand AI much less well.
* We'll underestimate a bunch of risks, due to lack of understanding.
* We may also over-estimate a bunch, but the errors don't cancel: being over-cautious around fire doesn't stop us from drowning.
* Certain types of research will address [some risks we understand], but fail to address [some risks we don't s

Main points:

There'll always be some risk of existential failure.

I am saying "we might get doom"

I am not saying "we should not do safety work"

I'm on board with these.

I'm saying "risk compensation needs to be a large factor in deciding which safety work to do"

I still don't see why you believe this. Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk? (E.g. I haven't looked into it but I bet people didn't spend a bunch of time thinking about risk compensation when deciding ... (read more)

[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]

Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?

Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).

 

[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]

It seems like your argument here, and in other parts of your comment, is something like "we could d

... (read more)

I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence.

(Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we'll be able to reduce risk further in the future with more research.)

I note here that this isn't a fully-general counterargument, but rather a general consideration.

I don't see why this isn't a fully general counterargument to alignment work. Your arg... (read more)

Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]

I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".

(Thanks for this too. I don't ... (read more)

It's interesting to look back at this question 4 years later; I think it's a great example of the difficulty of choosing the right question to forecast in the first place.

I think it is still pretty unlikely that the criterion I outlined is met -- Q2 on my survey still seems like a bottleneck. I doubt that AGI researchers would talk about instrumental convergence in the kind of conversation I outlined. But reading the motivation for the question, it sure seems like a question that reflected the motivation well would have resolved yes by now (probably some t... (read more)

Because I don't think this is realistically useful, I don't think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.

Maybe the groundedness you're talking about comes from the fact that you're doing interp on a domain of practical importance?

??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that doma... (read more)

??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain".

Oh okay, you're saying the core point is that this project was less streetlighty because the topic you investigated was determined by the field's interest rather than cherrypicking. I actually hadn't understood that this is what you were saying. I agree that this makes the results slightly better.

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?

Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant ... (read more)
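One cheap way to probe the "highest at initialization" part of this hypothesis (a sketch under my own assumptions about the setup, i.e. a modular-arithmetic transformer with vocabulary size p): measure the per-frequency Fourier power of the randomly initialized embedding, then check whether the trained network's key frequencies are the high-power ones.

```python
import numpy as np

p, d_model = 113, 128
rng = np.random.default_rng(0)
W_E = rng.normal(size=(p, d_model))   # embedding at initialization (toy)

# Power of each frequency along the token dimension, aggregated over dims.
fourier_power = np.abs(np.fft.rfft(W_E, axis=0)) ** 2
power_per_freq = fourier_power.sum(axis=1)

top_at_init = np.argsort(power_per_freq[1:])[::-1][:5] + 1  # skip DC term
print(top_at_init)  # hypothesis: the grokked network's key frequencies
                    # disproportionately come from this high-power set
```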

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(πgated(x)), so you might think of ReLU(πgated(x)) as "tainted", and want to use it as little as possible. T... (read more)

Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.

This was actually the key motivation for building this metric in the first place, instead of just looking at the norm ratio ||^x(x)|| / ||x||. Looking at the rescaling factor γ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regulariza... (read more)
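Concretely, the loss-optimal rescaling has a closed form, and it separates shrinkage from the pure-angle effect described above (my own illustration, not the paper's exact metric):

```python
import numpy as np

def optimal_rescale(x_hat, x):
    """gamma* = argmin_gamma ||gamma * x_hat - x||^2 = <x_hat, x> / ||x_hat||^2.
    gamma* > 1 indicates the reconstruction is systematically shrunk."""
    return float(x_hat @ x / (x_hat @ x_hat))

x = np.array([1.0, 0.0])
x_hat_shrunk = 0.8 * x                                # pure shrinkage
x_hat_rotated = np.array([np.cos(0.3), np.sin(0.3)])  # same norm, nonzero angle

print(optimal_rescale(x_hat_shrunk, x))   # 1.25: undoing genuine shrinkage
print(optimal_rescale(x_hat_rotated, x))  # ~0.96: an unshrunk-but-rotated
                                          # reconstruction still "wants" gamma < 1
```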

Possibly I'm missing something, but if you don't have Laux, then the only gradients to Wgate and bgate come from the sparsity penalty on ReLU(πgated(x)) (the binarizing Heaviside activation function kills gradients from the reconstruction loss), and so πgated(x) would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".)

(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with ... (read more)
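A tiny demonstration of the gradient-blocking point (my own toy, in PyTorch; pi_gated and f_mag are stand-in tensors rather than the paper's full parameterization):

```python
import torch

# The binarizing gate blocks gradients: d(Heaviside(pi))/d(pi) = 0 almost
# everywhere, so the reconstruction path gives no signal to the gate.
pi_gated = torch.tensor([0.5, -0.2], requires_grad=True)  # gate pre-activations
f_mag = torch.tensor([2.0, 3.0], requires_grad=True)      # magnitude path

gate = (pi_gated > 0).float()       # Heaviside: constant w.r.t. pi_gated
recon_proxy = (gate * f_mag).sum()  # stand-in for the reconstruction loss
recon_proxy.backward()
print(pi_gated.grad, f_mag.grad)    # None for pi_gated; a gradient for f_mag

l1 = torch.relu(pi_gated).sum()     # the sparsity penalty *does* reach it...
l1.backward()
print(pi_gated.grad)                # ...but only ever pushes pi_gated downward
```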

Sam Marks
Ah thanks, you're totally right -- that mostly resolves my confusion. I'm still a little bit dissatisfied, though, because the Laux term is optimizing for something that we don't especially want (i.e. for ^x(ReLU(πgated(x))) to do a good job of reconstructing x). But I do see how you do need some sort of reconstruction-esque term that actually allows gradients to pass through to the gated network.

Is it accurate to summarize the headline result as follows?

  • Train a Transformer to predict next tokens on a distribution generated from an HMM.
  • One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in, and perform Bayesian updating on each new token. That is, it maintains P(HMM state | tokens seen so far).
  • Key result: A linear probe on the residual stream is able to reconstruct P(HMM state | tokens seen so far).

(I don't know what Computational Mechanics or MSPs are so this could be totally off.)

EDIT: Looks like yes. From this... (read more)
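A hedged sketch of that pipeline (toy numpy version; the specific 3-state HMM and the random stand-in activations are placeholders of my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state HMM: T[i, j] = P(next state j | state i);
# E[j, o] = P(emit token o | state j), with 2 possible tokens.
T = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])

def belief_states(tokens):
    """Bayesian filtering: the optimal predictor's P(state | tokens so far)."""
    b, out = np.ones(3) / 3, []
    for o in tokens:
        b = (b @ T) * E[:, o]   # predict the next state, then condition on o
        b = b / b.sum()
        out.append(b)
    return np.array(out)

tokens = rng.integers(0, 2, size=500)
beliefs = belief_states(tokens)       # probe targets, shape (500, 3)

# Stand-in for residual-stream activations at each position; in the real
# result these come from a Transformer trained on HMM-generated tokens.
acts = rng.normal(size=(500, 64))

probe, *_ = np.linalg.lstsq(acts, beliefs, rcond=None)  # linear probe
predicted_beliefs = acts @ probe      # the claim: this closely matches
                                      # `beliefs` for the trained model
```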

That is a fair summary.

I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g.

Misaligned Incentives

In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demogr

... (read more)
Stephen Casper
Thanks. I agree that the points apply to individual researchers. But I don't think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs'. This is me stressing the "when put under great optimization pressure" clause of Goodhart's Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.

Cool, that all roughly makes sense to me :)

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.

Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)

Suppose today I gave you a dev... (read more)

Ryan Greenblatt
Probably yes for realistic values of N? Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency like asking the results to be plotted in a different way.) My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.

I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that would mean the AI labor lets you get a 30x improvement in compute needed beyond what humans would do by default (though the 30x can include e.g. improving utilization; it's not limited just to making individual experiments take less time).

I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained m... (read more)

Ryan Greenblatt
Overall, this has updated me to some extent, and it now seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high-level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)

The situation I was imagining is where most experiments use some combination of:

* A relatively small amount of finetuning/inference on the biggest models (including for the actual AI researcher)
* Larger (possibly full) training runs, but at much smaller scale (e.g. GPT-3-level performance models)

Then, we can in total afford ~training-dataset-sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3-performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability. So, the implicit claim is that compute costs scale much less than quadratically. It's certainly not obvious ML research can be progressed fast enough with this little compute.

I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once, each of which runs 8x faster). This doesn't feel that crazy to me; I already do a moderate amount of multi-tasking. Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don't think this changes the picture very much from what you're describing, because most people now aren't the "ideas" people.)
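For intuition on the inference availability argument, a rough back-of-the-envelope using the standard approximations (training ≈ 6ND FLOPs, inference ≈ 2N FLOPs per token, for N parameters and D training tokens; the particular N and D below are placeholders):

```python
N = 1e12   # hypothetical parameter count (placeholder, not from the thread)
D = 2e13   # hypothetical number of training tokens (placeholder)

train_flops = 6 * N * D              # standard training-cost approximation
inference_flops_per_token = 2 * N    # standard inference-cost approximation

# Cost of running inference over a full training-dataset-sized corpus:
inference_flops = inference_flops_per_token * D
print(inference_flops / train_flops)  # = 1/3, independent of N and D
```

So dataset-sized amounts of inference on the biggest model cost about a third of the training compute, whatever the scale.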

I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)

Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.

Ryan Greenblatt
I guess I'm not that sold that compute will actually be that much of a key bottleneck for experiments in the future, in a way that can't be overcome with 2x additional labor and/or a 2x slowdown. In many cases you can spend additional labor to reduce the compute usage of experiments (e.g., first run the experiments on smaller models). And we're conditioning on having really powerful AIs, which correlates with a high baseline level of compute, and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availability if huge amounts of compute are used for training. This might depend on what type of ML research we're talking about.