All of RobertM's Comments + Replies

I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds.  The evidence you present in each case is outputs generated by LLMs.

The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14.  I'm unsure about 5.

Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much?  Those seem like very... (read more)

The evidence you present in each case is outputs generated by LLMs.

The total evidence I have (and that everyone has) is more than behavioral. It includes

a) the transformer architecture, in particular the attention module,

b) the training corpus of human writing,

c) the means of execution (autoregressive decoding: recursively calling the model on its own outputs while keeping a cached history of the QKV vector representations of earlier outputs; see the sketch below),

d) as you say, the model's behavior, and

e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.

When I think about how... (read more)
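To make (a) and (c) concrete, here is a minimal sketch of a single attention head decoding autoregressively with a KV cache. The shapes, weight matrices, and names are illustrative assumptions for this toy example, not any particular model's code.

```python
# Toy sketch: one attention head, decoded autoregressively with a KV cache.
# Illustrates (c) above: the model is recursively called on its own outputs,
# while the K/V representations of earlier tokens persist across steps.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy embedding width (assumed)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

def attend_step(x_t, k_cache, v_cache):
    """One decode step: project the new token, append to the cache, attend over history."""
    q = x_t @ W_q                                # query for the current token
    k_cache.append(x_t @ W_k)                    # cached key for this token
    v_cache.append(x_t @ W_v)                    # cached value for this token
    K, V = np.stack(k_cache), np.stack(v_cache)  # (t, d_model)
    scores = K @ q / np.sqrt(d_model)            # scaled dot-product attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                           # context vector for the next layer

# Recursive execution: each "output" is fed back in as the next input.
k_cache, v_cache = [], []
x = rng.standard_normal(d_model)                 # stand-in for an embedded prompt token
for _ in range(5):
    ctx = attend_step(x, k_cache, v_cache)
    x = np.tanh(ctx)                             # stand-in for the rest of the network + sampling
print("cached keys/values after 5 steps:", len(k_cache))
```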

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?  Most of my models for how we might go extinct in the next decade from loss of control scenarios require the kinds of technological advancement which make "industrial dehumanization" redundant, with highly unfavorable offense/defense balances, so I don't see how industrial dehumanization itself ends up being the cause of human extinction if we (nominally) solve the control problem, rather th... (read more)

Do you have a mostly disjoint view of AI capabilities between the "extinction from loss of control" scenarios and "extinction by industrial dehumanization" scenarios?

a) If we go extinct from a loss of control event, I count that as extinction from a loss of control event, accounting for the 35% probability mentioned in the post.

b) If we don't have a loss of control event but still go extinct from industrial dehumanization, I count that as extinction caused by industrial dehumanization caused by successionism, accounting for the additional 50% probabilit... (read more)
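Reading (a) and (b) as disjoint pathways, the accounting sums as below; the 85% total is my arithmetic under that disjointness reading, not a figure quoted in the excerpt.

\[
\begin{aligned}
P(\text{extinction via loss of control}) &\approx 0.35,\\
P(\text{extinction via industrial dehumanization} \wedge \text{no loss of control}) &\approx 0.50,\\
P(\text{extinction via either pathway}) &\approx 0.35 + 0.50 = 0.85.
\end{aligned}
\]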

Is your perspective something like:

Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology).  Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.

I'm pretty uncertain about, if something like that ended... (read more)

(Responding in a consolidated way just to this comment.)

Ok, got it.  I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan.  It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think as an institution it's capable of executing on that kind of plan and don't see w... (read more)

4 Ryan Greenblatt
I'm basically imagining being able to use controlled AIs which aren't qualitatively smarter than humans for whatever R&D purposes we want. (Though not applications like (e.g.) using smart AIs to pilot drone armies live.) Some of these applications will be riskier than others, but I think this can be done while managing risk to a moderate degree. Bootstrapping to some extent should also be possible where you use the first controlled AIs to improve the safety of later deployments (both improving control and possibly alignment).
3 Ryan Greenblatt
Is your perspective something like: I think that even having the ability to easily take over the world as a private actor is pretty norms violating. I'm unsure about the claim that, if you put this aside, there is a way to end the acute risk period (edit: without US government collaboration and) without needing truly insanely smart AIs.

I suppose that if you go smart enough this is possible, though pre-existing norms also just get more confusing in the regime where you can steer the world to whatever outcome you want. So overall, I'm not sure I disagree with this perspective exactly.

I think the overriding consideration for me is that this seems like a crazy and risky proposal at multiple levels. To be clear, you are explicitly not endorsing this as a plan nor claiming this is Anthropic's plan.

I agree with large parts of this comment, but am confused by this:

I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.

While I don't endorse it due to disagreeing with some (stated and unstated) premises, I think there's a locally valid line of reasoning that goes something like this:

  • if Anthropic fin
... (read more)

I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place.

AIs which aren't qualitatively much smarter than humans seem plausible to use reasonably effectively while keeping risk decently low (though still unacceptably risky in objective/absolute terms). Keeping risk low seems to require substantial effort, though it seems maybe achievable. Even with token effort, I think risk is "only" around 25% with such AIs because de... (read more)

6 Ryan Greenblatt
My proposal would roughly be that the US government (in collaboration with allies, etc.) enforces that no one builds AIs which are qualitatively smarter than humans, and that this should be the default plan. (This might be doable without government support via coordination between multiple labs, but I basically doubt it.) There could be multiple AI projects backed by the US+allies or just one; either could be workable in principle, though multiple seems tricky.

Here the alignment concern is that we aren’t, actually, able to exert adequate selection pressure in this manner. But this, to me, seems like a notably open empirical question.

I think the usual concern is not whether this is possible in principle, but whether we're likely to make it happen the first time we develop an AI that is both motivated to attempt takeover and likely to succeed at it.  (My guess is that you understand this, based on your previous writing addressing the idea of first critical tries, but there does exist a niche view that alignment ... (read more)

I think you tried to embed images hosted on some Google product. Our editor should have tried to re-upload those to our own image host if you pasted them in as images, but might not have if you inserted the images by URL.  Hotlinking to images on Google domains often fails, unfortunately.

I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness".

But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us the desired behavior in the future.

My point re: introspection was trying to provide evidence for the claim that model outputs are not a ... (read more)

GPT-4 seems like a "generic system" that essentially "understands our intentions"

I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.

In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something ve... (read more)

1 Matthew Barnett
I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement.

When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5's, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?

I agree that GPT-4 does not understand the world in the same way humans understand the world, but I'm not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.

I'm similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one's own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don't see how that fact bears much on the question of whether you understand human intentions. It's possible there's some connection here, but I'm not seeing it.

I'd claim:

1. Current systems have limited situational awareness. It's above zero, but I agree it's below human level.
2. Current systems don't have stable preferences over time. But I think this is a point in favor of the model I'm providing here. I'm claiming that it's plausibly easy to create smart, corr

I am not covering training setups where we purposefully train an AI to be agentic and autonomous. I just think it's not plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer.[2]

Like Ryan, I'm interested in how much of this claim is conditional on "just keep scaling up networks" being insufficient to produce relevantly-superhuman systems (i.e. systems capable of doing scientific R&D better and faster than humans, without humans in the intellectual part of the loop).  If it's "most of it", then my ... (read more)

3 Alex Turner
I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff I expect 75% that something like the current paradigm will be sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)

Curated.

The reasons I like this post:

"That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.")

  • it has direct, practical implications for e.g. regulatory proposals
  • it points out the critical fact that we're missing the ability to evaluate for alignment given current techniques

Arguably missing is a line or two th... (read more)

3 Evan Hubinger
I mean, like I say in the post, if you have some strong reason to believe that there's no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).