Is that right?
Yes, that's broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
...I think this will
If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization when you don't use MONA, and also that if you actually use MONA, it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ...
I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").
Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it's a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it's mostly the first point, and to some extent the third point.
Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza...
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm...
I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.
I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat...
I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis...
In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ...
I also agree with Zac; maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i...
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ...
Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"
And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section ...
Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.
But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.
Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when book...
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
...What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1
That's fair. I think the more accurate way of phrasing this is less "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.
...Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement with your comments on the theories of change:
Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.
Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight...
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:
Fwiw the thing that scares me is the combination of three things:
You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:
A few questions:
Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?
For example, here's one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset -- however, different decoder vectors have different frequencies of firing in the token dataset, so uniform...
Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.
I disagree that the AGI safety team should have 4 as its "bre...
I'm not going to repeat all of the literature on debate here, but as brief pointers:
I do, but more importantly, I want to disallow the judge understanding all the concepts here.
I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of...
Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)
I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the ...
There are several different outs to this example:
Strongly agree on the first challenge; on the theory workstream we're thinking about how to deal with this problem. Some past work (not from us) is here and here.
Though to be clear, I don't think the empirical evidence clearly rules out "just making neural networks explainable". Imo, if you wanted to do that, you would do things in the style of debate and prover-verifier games. These ideas just haven't been tried very much yet. I don't think "asking an AI what another AI is doing and doing RLHF on the response" is nearly as good; that is much more likely t...
It depends fairly significantly on how you draw the boundaries; I think anywhere between 30 and 50 is defensible. (For the growth numbers I chose one specific but arbitrary way of drawing the boundaries, I expect you'd get similar numbers using other methods of drawing the boundaries.) Note this does not include everyone working on safety, e.g. it doesn't include the people working on present day safety or adversarial robustness.
Okay, I think it's pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.
I expect that, absent impressive levels of international coordination, we're screwed.
This is the sort of thing that makes it hard for me to distinguish your argument from "[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
I agree that, conditional on believing that we're screwed absent huge levels of coordination regardless of technical work, the...
Not going to respond to everything, sorry, but a few notes:
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
My claim is that for the things you call "actions that increase risk" that I call "opportunity cost", this causal arrow is very weak, and so you shouldn't think of it as risk compensation.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect ...
Main points:
There'll always be some risk of existential failure.
I am saying "we might get doom"
I am not saying "we should not do safety work"
I'm on board with these.
I'm saying "risk compensation needs to be a large factor in deciding which safety work to do"
I still don't see why you believe this. Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk? (E.g. I haven't looked into it but I bet people didn't spend a bunch of time thinking about risk compensation when deciding ...
[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]
Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?
Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).
[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]
...It seems like your argument here, and in other parts of your comment, is something like "we could d
I like both of the theories of change you listed, though for (1) I usually think about scaling till human obsolescence rather than superintelligence.
(Though imo this broad class of schemes plausibly scales to superintelligence if you eventually hand off the judge role to powerful AI systems. Though I expect we'll be able to reduce risk further in the future with more research.)
I note here that this isn't a fully-general counterargument, but rather a general consideration.
I don't see why this isn't a fully general counterargument to alignment work. Your arg...
Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]
I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
(Thanks for this too. I don't ...
It's interesting to look back at this question 4 years later; I think it's a great example of the difficulty of choosing the right question to forecast in the first place.
I think it is still pretty unlikely that the criterion I outlined is met -- Q2 on my survey still seems like a bottleneck. I doubt that AGI researchers would talk about instrumental convergence in the kind of conversation I outlined. But reading the motivation for the question, it sure seems like a question that reflected the motivation well would have resolved yes by now (probably some t...
Because I don't think this is realistically useful, I don't think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.
Maybe the groundedness you're talking about comes from the fact that you're doing interp on a domain of practical importance?
??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that doma...
??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain".
Oh okay, you're saying the core point is that this project was less streetlighty because the topic you investigated was determined by the field's interest rather than cherrypicking. I actually hadn't understood that this is what you were saying. I agree that this makes the results slightly better.
Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?
Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?
Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.
Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.
My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant ...
This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.
The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(π_gate(x)), so you might think of π_gate as "tainted", and want to use it as little as possible. T...
Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.
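To spell out that geometric claim with a quick derivation (mine, not anything from the paper): for equal-length vectors x and x̂ at a nonzero angle θ, the rescaling of x̂ that best matches x is a shrink.

```latex
\[
  \gamma^{*} \;=\; \arg\min_{\gamma}\,\lVert \mathbf{x} - \gamma\,\hat{\mathbf{x}} \rVert_2^2
  \;=\; \frac{\langle \mathbf{x}, \hat{\mathbf{x}} \rangle}{\lVert \hat{\mathbf{x}} \rVert_2^2}
  \;=\; \cos\theta \;<\; 1
  \quad\text{when } \lVert \mathbf{x} \rVert_2 = \lVert \hat{\mathbf{x}} \rVert_2 \text{ and } \theta \neq 0.
\]
```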
This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio ||x̂|| / ||x||. Looking at the γ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regulariza...
Possibly I'm missing something, but if you don't have L_aux, then the only gradients to W_gate and b_gate come from L_sparsity (the binarizing Heaviside activation function kills gradients from L_reconstruct), and so π_gate would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".)
(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with ...
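To make the gradient-flow point concrete, here is a minimal sketch of a gated SAE loss in the spirit of the paper (parameter names, the weight-sharing via r_mag, and other details are my shorthand and may not match the actual implementation); the comments mark where the Heaviside blocks gradients and where L_aux restores a training signal to the gate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAESketch(nn.Module):
    def __init__(self, d_model: int, d_sae: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_sae))   # W_mag = W_gate * exp(r_mag) (assumed weight sharing)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.l1_coeff = l1_coeff

    def forward(self, x):
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate + self.b_gate
        # Heaviside gate: its gradient w.r.t. pi_gate is zero, so the
        # reconstruction term alone provides no signal to W_gate / b_gate.
        f_gate = (pi_gate > 0).float()
        pi_mag = x_cent @ (self.W_gate * torch.exp(self.r_mag)) + self.b_mag
        f = f_gate * F.relu(pi_mag)
        x_hat = f @ self.W_dec + self.b_dec

        L_reconstruct = (x - x_hat).pow(2).sum(-1).mean()
        # Sparsity penalty on ReLU(pi_gate): with only this term, the degenerate
        # optimum is to push pi_gate negative everywhere ("no features active").
        L_sparsity = self.l1_coeff * F.relu(pi_gate).sum(-1).mean()
        # Auxiliary term: reconstruct from ReLU(pi_gate) through a frozen copy of
        # the decoder, giving the gate a reconstruction-based gradient that
        # counteracts the "turn everything off" incentive.
        x_hat_aux = F.relu(pi_gate) @ self.W_dec.detach() + self.b_dec.detach()
        L_aux = (x - x_hat_aux).pow(2).sum(-1).mean()
        return x_hat, L_reconstruct + L_sparsity + L_aux
```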
Is it accurate to summarize the headline result as follows?
(I don't know what Computational Mechanics or MSPs are so this could be totally off.)
EDIT: Looks like yes. From this...
That is a fair summary.
I feel like a lot of these arguments could be pretty easily made about individual AI safety researchers. E.g.
...Misaligned Incentives
In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demogr
Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).
EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.
Cool, that all roughly makes sense to me :)
I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.
Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)
Suppose today I gave you a dev...
I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you to get a 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it's not limited just to making individual experiments take less time).
I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained m...
I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)
Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.
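To spell out the arithmetic behind this worry (a back-of-envelope bound of my own, not a claim about actual lab workflows): if a fraction c of end-to-end project time is compute-bound experiment runtime and AI speeds up only the remaining fraction by a factor s, the overall speedup is capped at 1/c, so a 30x speedup already requires the compute-bound fraction to be roughly 3% or less.

```latex
\[
  \text{speedup}(c, s) \;=\; \frac{1}{\,c + \frac{1 - c}{s}\,} \;\le\; \frac{1}{c},
  \qquad\text{so } \text{speedup} \ge 30 \text{ requires } c \le \tfrac{1}{30} \approx 3.3\%.
\]
```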
Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).