I'm a researcher on the technical governance team at MIRI.
Views expressed are my own, and should not be taken to represent official MIRI positions. Similarly, views within the technical governance team do vary.
Previously:
Quick takes on the above:
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.
It seems to me that there's no fixed notion of "active" that works for both paragraphs here.
If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent's actions], then the "But some methods can only be active" claim doesn't hold.
There are two dimensions here:
In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
That's fair. I agree that we're not likely to resolve much by continuing this discussion. (but thanks for engaging - I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models.
Presumably this will take months rather than hours, but it seems worth it (whether or not I'm correct - I expect that [the understanding required to clearly demonstrate to me that I'm wrong] would be useful in a bunch of other ways).
"[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...
I'm claiming something more like "[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we're highly likely to get doom".
I'll clarify more below.
Otherwise even in a "free-for-all" world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
Sure, but I mostly don't buy p(doom) reduction here, other than through [highlight near-misses] - so that an approach that hides symptoms of fundamental problems is probably net negative.
In the free-for-all world, I think doom is overdetermined, absent miracles [1] - and [significantly improved debate setup] does not strike me as a likely miracle, even after I condition on [a miracle occurred].
Factors that push in the other direction:
But I suppose that on the [usefulness of debate (/scalable oversight techniques generally) research], I'm mainly thinking: [more clearly understanding how and when this may fail catastrophically, and how we'd robustly predict this] seems positive, whereas [show that versions of this technique get higher scores on some benchmarks] probably doesn't.
Even if I'm wrong about the latter, the former seems more important.
Granted, it also seems harder - but I think that having a bunch of researchers focus on it and fail to come up with any principled case is useful too (at least for them).
If we ignore the misunderstanding part then I'm at << 1% probability on "we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future".
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Agreed. This is why my main hope on this routes through [work on level 6/7 specifications clarifies the depth and severity of the problem] and [more-formally-specified 6/7 specifications give us something to point to in regulation].
(on the level 7, I'm assuming "in all contexts" must be an overstatement; in particular, we only need something like "...in all contexts plausibly reachable from the current state, given that all powerful AIs developed by us or our AIs follow this specification or this-specification-endorsed specifications")
Clarifications I'd make on my [doom seems likely, but not inevitable; some technical work seems net negative] position:
Not going to respond to everything
No worries at all - I was aiming for [Rohin better understands where I'm coming from]. My response was over-long.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
Agreed, but I think this is too coarse-grained a view.
I expect that, absent impressive levels of international coordination, we're screwed. I'm not expecting [higher perceived risk] --> [actions that decrease risk] to operate successfully on the "move fast and break things" crowd.
I'm considering:
I think some kind of analysis along these lines makes sense - though clearly it's hard to know where to draw the line between [it's unrealistic to expect decision-makers/influencers this principled] and [it's unrealistic to think things may go well with decision-makers this poor].
I don't think conditioning on the status-quo free-for-all makes sense, since I don't think that's a world where our actions have much influence on our odds of success.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
Agreed (I think your links make good points). However, I'd point out that it can be true both that:
You've listed a bunch of claims about AI, but haven't spelled out why they should make us expect large risk compensation effects
I think almost everything comes down to [perceived level of risk] sometimes dropping hugely more than [actual risk] in the case of AI. So it's about the magnitude of the input.
It depends hugely on the specific stronger safety measure you talk about. E.g. I'd be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I'm hesitant around such small probabilities on any social claim.
That's useful, thanks. (these numbers don't seem foolish to me - I think we disagree mainly on [how necessary are the stronger measures] rather than [how likely are they])
Hmm, then I don't understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper).
Oh sorry, I should have been more specific - I'm only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I'm only keen on the framework conditional on meeting an extremely high bar for the specification.
If that part gets ignored on the basis that it's hard (which it obviously is), then it's not clear to me that the framework is worth much.
I suppose I'm also influenced by the way some of the researchers talk about it - I'm not clear how much focus Davidad is currently putting on level 6/7 specifications, but he seems clear that they'll be necessary.
[apologies for writing so much; you might want to skip/skim the spoilered bit, since it seems largely a statement of the obvious]
Do you agree that in many other safety fields, safety work mostly didn't think about risk compensation, and still drove down absolute risk?
Agreed (I imagine there are exceptions, but I'd be shocked if this weren't usually true).
[I'm responding to this part next, since I think it may resolve some of our mutual misunderstanding]
It seems like your argument here, and in other parts of your comment, is something like "we could do this more costly thing that increases safety even more". This seems like a pretty different argument; it's not about risk compensation (i.e. when you introduce safety measures, people do more risky things), but rather about opportunity cost (i.e. when you introduce weak safety measures, you reduce the will to have stronger safety measures). This is fine, but I want to note the explicit change in argument; my earlier comment and the discussion above was not trying to address this argument.
Perhaps we've been somewhat talking at cross purposes then, since I certainly consider [not acting to bring about stronger safety measures] to be within the category of [doing more risky things].
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
For clarity, risky actions I'm thinking about would include:
Some of these might be both [opportunity cost] and [risk compensation].
I.e.:
If you do agree with that, what makes AI different from those cases? (The arguments you give seem like very general considerations that apply to other fields as well.)
Mostly for other readers, I'm going to spoiler my answer to this: my primary claim is that more people need to think about and answer these questions themselves, so I'd suggest that readers take a little time to do this. I'm significantly more confident that the question is important, than that my answer is correct or near-complete.
I agree that the considerations I'm pointing to are general.
I think the conclusions differ, since AI is different.
First a few clarifications:
How I view AI risk being different:
I'm sure I'm missing various relevant factors.
Overall, I think straightforward reference classes for [safety impact of risk compensation] tell us approximately nothing - too much is different.
I'm all for looking for a variety of reference classes: we need all the evidence we can get.
However, I think people are too ready to fall back on the best reference classes they can find - even when they're terrible.
Briefly on opportunity cost arguments, the key factors are (a) how much will is there to pay large costs for safety, (b) how much time remains to do the necessary research and implement it, and (c) how feasible is the stronger safety measure.
This seems to miss the obvious: (d) how much safety do the weak vs strong measures get us?
I expect that you believe work on debate (and similar weak measures) gets us significantly more than I do. (and hopefully you're correct!)
Anyway for now let's just say that I've thought about these three factors and think it isn't especially realistic to expect that we can get stronger safety measures, and as a result I don't see opportunity cost as a big reason not to do the safety work we currently do.
Given that this seems highly significant, can you:
I'd also note that it's not only [more progress on weak measures] that matters here, but also the signal sent by pursuing these research directions.
If various people with influence on government see [a bunch of lab safety teams pursuing x safety directions] I expect that most will conclude: "There seems to be a decent consensus within the field that x will get us acceptable safety levels", rather than "Probably x is inadequate, but these researchers don't see much hope in getting stronger-than-x measures adopted".
I assume that you personally must have some constraints on the communication strategy you're able to pursue. However, it does seem highly important that if the safety teams at labs are pursuing agendas based on [this isn't great, but it's probably the best we can get], this is clearly and loudly communicated.
Similarly, the theory of change you cite for your examples seems to be "discovers or clarifies problems that shows that we don't have a solution" (including for Guaranteed Safe AI and ARC theory, even though in principle those could be about building safe AI systems). So as far as I can tell, the disagreement is really that you think current work that tries to provide a specific recipe for building safe AI systems is net negative, and I think it is net positive.
This characterization is a little confusing to me: all of these approaches (ARC / Guaranteed Safe AI / Debate) involve identifying problems, and, if possible, solving/mitigating them.
To the extent that the problems can be solved, the approach contributes to [building safe AI systems]; to the extent that they cannot be solved, the approach contributes to [clarifying that we don't have a solution].
The reason I prefer GSA / ARC is that I expect these approaches to notice more fundamental problems. I then largely expect them to contribute to [clarifying that we don't have a solution], since solving the problems probably won't be practical, and they'll realize that the AI systems they could build won't be safe-with-high-probability.
I expect scalable oversight (alone) to notice a smaller set of less fundamental problems - which I expect to be easier to fix/mitigate. I expect the impact to be [building plausibly-safe AI systems that aren't safe (in any robustly scalable sense)].
Of course I may be wrong, if the fundamental problems I believe exist are mirages (very surprising to me) - or if the indirect [help with alignment research] approach turns out to be more effective than I expect (fairly surprising, but I certainly don't want to dismiss this).
I do still agree that there's a significant difference in that GSA/ARC are taking a worst-case-assumptions approach - so in principle they could be too conservative. In practice, I think the worst-case-assumption approach is the principled way to do things given our lack of understanding.
I think [the existence of this uncertainty is] an important issue to notice when considering research directions
Why? It doesn't seem especially action guiding, if we've agreed that it's not high value to try to resolve the uncertainty (which is what I take away from your (1)).
(it's not obvious to me that it's not high value to try to resolve this uncertainty, but it's plausible based on your prior comments on this issue, and I'm stipulating that here)
Here I meant that it's important to notice that:
I'm saying that, unless people think carefully and deliberately about this, the mind will tend to conclude [my current estimate is pretty accurate] and/or [this variable changing wouldn't be highly significant] - both of which may be false.
To the extent that these are believed, they're likely to impact other beliefs that are decision-relevant. (e.g. believing that my estimate of x is accurate tells me something about the character of x and my understanding of it)
Thanks for the response. I realize this kind of conversation can be annoying (but I think it's important).
[I've included various links below, but they're largely intended for readers-that-aren't-you]
I don't see why this isn't a fully general counterargument to alignment work. Your argument sounds to me like "there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down".
(Thanks for this too. I don't endorse that description, but it's genuinely useful to know what impression I'm creating - particularly when it's not what I intended)
I'd rephrase my position as:
Again, I don't think it's implausible that debate-like approaches come out better than various alternatives after carefully considering risk compensation. I do think it's a serious error not to spend quite a bit of time and effort understanding and reducing uncertainty on this.
Possibly many researchers do this, but don't have any clean, legible way to express their process/conclusions. I don't get that impression: my impression is that arguments along the lines I'm making tend to be perceived as fully general counter-arguments and dismissed (whether they come from outside, or from the researchers themselves).
What's an example of alignment work that you think is net positive with the theory of change "this is a better way to build future powerful AI systems"?
I'm not sure how direct you intend "this is a better way..." (vs e.g. "this will build the foundation of better ways..."). I guess I'd want to reframe it as "This is a better process by which to build future powerful AI systems", so as to avoid baking in a level of concreteness before looking at the problem.
That said, the following seem good to me:
(I'm probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don't go anywhere useful.)
Not a problem. However, I'd want to highlight the distinction between:
Both are reasonable explanations for not putting much further effort into such discussions.
However, I worry that the system-1s of researchers don't distinguish between (1) and (2) here - so that the impact of concluding (1) will often be to act as if the uncertainty isn't important (and as if the implicit corollaries hold).
I don't see an easy way to fix this - but I think it's an important issue to notice when considering research directions. Not only [consider this uncertainty when picking a research direction], but also [consider that your system 1 is probably underestimating the importance of this uncertainty (and corollaries), since you haven't been paying much attention to it (for understandable reasons)].
Sure, linking to that seems useful, thanks.
That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?].
For something like the latter, it's not clear to me that it's not useful.
Mainly my pessimism is about:
Of course little of this is specific to debate. Nor is it clear to me that debate is worse than alternatives in these respects - I just haven't seen an argument that it's better (on what assumptions; in which contexts).
I understand that it's hard to answer the questions I'd want answered.
I also expect that working on debate isn't the way to answer them - so I think it's fine to say [I currently expect debate to be a safer approach than most because ... and hope that research directions x and y will shed more light on this]. But I'm not clear on people's rationale for the first part - why does it seem safer?
Do you have a [link to] / [summary of] your argument/intuitions for [this kind of research on debate makes us safer in expectation]? (e.g. is Geoffrey Irving's AXRP appearance a good summary of the rationale?)
To me it seems likely to lead to [approach that appears to work to many, but fails catastrophically] before it leads to [approach that works]. (This needn't be direct)
I.e. currently I'd expect this direction to make things worse both for:
I tend to assume the idea is more (2) than (1). (presumably [debate fails in the limit] isn't controversial)
I can see a case that it's possible debate helps with some (2).
I don't currently see the case for [net positive in expectation].
More specifically, the upside from [direction/protocol x eliminates various failure modes, so reducing risk] needs to be balanced against the downside from [our (over?)confidence in direction/protocol x leads us to take risks we otherwise wouldn't].
I note here that this isn't a fully-general counterargument, but rather a general consideration.
When I consider this, I currently think "On balance, this seems to make things worse in worlds where we might plausibly have succeeded" (of course I hope to be wrong somewhere! - and, if I'm wrong, I'd like to know).
This may be downstream of different threat models.
Alternatively, it may be that you have in mind some strategy I haven't considered (at some level of abstraction).
It'd be useful to have more clarity here.
What strikes you as a plausible [debate turns out to be helpful in reducing x-risk] story? (concreteness would be nice, but mostly I'm interested in whatever story you have)
I'd enjoy being more constructive and getting into specifics on the paper/direction.
However, from my point of view that'd make things worse (at 'best' I'd be thinning out the red flags).
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)
I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].