Is your perspective something like:
Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology). Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.
I'm pretty uncertain about whether, if something like that ended up looking norm-violating, it'd be norm-violating like Uber was[2], or more like super-persuasion. That seems very contingent on empirical questions that I think we don't have much insight into right now.
I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.
I didn't mean to claim that there's a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably only need to resort to obviously-illegal methods to end the acute risk period if it were relatively low on the "smarter than humans" scale.
(Responding in a consolidated way just to this comment.)
Ok, got it. I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan. It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think it, as an institution, is capable of executing on that kind of plan, and I don't see why that will change in the next five years.
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI.
I think I disagree with the framing ("crazy weapons/defenses") but it does seem like you need some kind of qualitatively new technology. This could very well be social technology, rather than something more material.
Building insane weapons/defenses requires US government consent (unless you're committing massive crimes, which seems like a bad idea).
I don't think this is actually true, except in the trivial sense where we have a legal system that allows the government to decide approximately arbitrary behaviors are post-facto illegal if it feels strongly enough about it. Most new things are not explicitly illegal. But even putting that aside[1], I think this is ignoring the legal routes that a qualitatively superhuman TAI might find to end the Acute Risk Period, if it were so motivated.
(A reminder that I am not claiming this is Anthropic's plan, nor would I endorse someone trying to build ASI to execute on this kind of plan.)
TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes.
I think there's a very large difference between plans that involve nominal US government signoff on private actors doing things, in order to avoid committing massive crimes (or to avoid the appearance of doing so); plans that involve the US government mostly just slowing things down or stopping people from doing things; and plans that involve the US government actually being the entity that makes high-context decisions about e.g. what values to optimize for, given a slot into which to put values.
I agree that stories which require building things that look very obviously like "insane weapons/defenses" seem bad, both for obvious deontological reasons, but also because I wouldn't expect them to work well enough to be worth it even under a "naive" consequentialist analysis.
I agree with large parts of this comment, but am confused by this:
I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.
While I don't endorse it due to disagreeing with some (stated and unstated) premises, I think there's a locally valid line of reasoning that goes something like this:
This seems like the pretty straightforward argument for racing, and if you have a pretty specific combination of beliefs about alignment difficulty, coordination difficulty, capability profiles, etc., I think it basically checks out.
I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place. (In particular, how does this end up with the world in a stable equilibrium that doesn't immediately get knocked over by the second actor to reach TAI?)
Here the alignment concern is that we aren’t, actually, able to exert adequate selection pressure in this manner. But this, to me, seems like a notably open empirical question.
I think the usual concern is not whether this is possible in principle, but whether we're likely to make it happen the first time we develop an AI that is both motivated to attempt and likely to succeed at takeover. (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment in the relevant sense is impossible and not merely very difficult to achieve under the relevant constraints, and arguments against that view look very different from arguments about the empirical difficulty of value alignment, likelihood of various default outcomes, etc.)
I agree that it's useful to model AI's incentives for takeover in worlds where it's not sufficiently superhuman to have a very high likelihood of success. I've tried to do some of that, though I didn't attend to questions about how likely it is that we'd be able to "block off" the (hopefully much smaller number of) plausible routes to takeover for AIs whose level of capabilities doesn't imply an overdetermined success.
I think I am more pessimistic than you are about how much such AIs would value the "best benign alternatives" - my guess is very close to zero, since I expect ~no overlap in values and that we won't be able to successfully engage in schemes like pre-committing to sharing the future value of the Lightcone conditional on the AI being cooperative[1]. Separately, I expect that if we attempt to maneuver such AIs into positions where their highest-EV plan is something we'd consider to have benign long-run consequences, we will instead end up in situations where their plans are optimized to hit the Pareto frontier of "look benign" and "tilt the playing field further in the AI's favor". (This is part of what the Control agenda is trying to address.)
Credit-assignment actually doesn't seem like the hard part, conditional on reaching aligned ASI. I'm skeptical of the part where we have a sufficiently capable AI that its help is useful in us reaching an aligned ASI, but it still prefers to help us because it thinks that its estimated odds of a successful takeover imply less future utility for itself than a fair post-facto credit assignment would give it for its help. Having that calculation come out in our favor feels pretty doomed to me, if you've got the AI as a core part of your loop for developing future AIs, since it relies on some kind of scalable verification scheme and none of the existing proposals make me very optimistic.
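To spell out the calculation being gestured at above (a rough sketch; the symbols here are mine, not anything load-bearing from the original discussion): let $p$ be the AI's own estimate of the probability that an attempted takeover succeeds, $U_{\text{takeover}}$ its expected utility given a successful takeover, $U_{\text{caught}}$ its expected utility given a failed attempt, and $U_{\text{credit}}$ the utility it expects from a fair post-facto credit assignment for its help. Cooperating is only the AI's preferred option if

$$U_{\text{credit}} > p \, U_{\text{takeover}} + (1 - p) \, U_{\text{caught}}$$

and the point above is that I don't expect this inequality to hold for an AI capable enough to sit in the development loop, partly because $p$ and the utilities are the AI's own estimates, and making the left-hand side credible to it seems to require a scalable verification scheme we don't currently have.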
I think you tried to embed images hosted on some Google product, which our editor should've tried to re-upload to our own image host if you pasted them in as images but might not have if you inserted the images by URL. Hotlinking to images on Google domains often fails, unfortunately.
I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness".
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you're importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic's Sleeper Agents. Would a situationally aware model use a provided scratchpad to think about how it's in training and needs to pretend to be helpful? No, and neither does the model "understand" your intentions in a way that generalizes out of distribution the way you might expect a human's "understanding" to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the "right" responses during RLHF is not anything like human reasoning.
I'd prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
Are you asking for a capabilities threshold, beyond which I'd be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is "can it replace humans at all economically valuable tasks", which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we'll be able to train models capable of doing a lot of economically useful work, but which don't actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
GPT-4 seems like a "generic system" that essentially "understands our intentions"
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding". Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that's not the modality I'm talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions".
That is known to us right now; possibly one exists and could be derived.
I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]
Like Ryan, I'm interested in how much of this claim is conditional on "just keep scaling up networks" being insufficient to produce relevantly-superhuman systems (i.e. systems capable of doing scientific R&D better and faster than humans, without humans in the intellectual part of the loop). If it's "most of it", then my guess is that accounts for a good chunk of the disagreement.
Curated.
The reasons I like this post:
"That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.")
Arguably missing is a line or two that backtracks from "we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training" to (my claim) "it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment, and we don't actually know when we're going to hit that threshold", but that might be out of scope.
Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios? Most of my models for how we might go extinct in the next decade from loss-of-control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather than a deliberate or accidental use of technology that ends up killing all humans pretty quickly.
Separately, I don't understand how encouraging human-specific industries is supposed to work in practice. Do you have a model for maintaining "regulatory capture" in a sustained way, despite having no economic, political, or military power by which to enforce it? (Also, even if we do succeed at that, it doesn't sound like we get more than the Earth as a retirement home, but I'm confused enough about the proposed equilibrium that I'm not sure that's the intended implication.)