All of Rob Bensinger's Comments + Replies

In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:

example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)

i'd also still be impressed by simple theories of aimable co

[...]

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me [...]

using GOFAI methods

"Nope" to this part. I otherwise like this comment a lot!

The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values. 

The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value speci

[...]
Matthew Barnett

If you don't think MIRI ever considered coming up with an "explicit function that reflects the human value function with high fidelity" to be "an important part of the alignment problem", can you explain this passage from the Arbital page on The problem of fully updated deference?

Eliezer (who I assume is the author) appears to say in the first paragraph that solving the problem of value identification for superintelligences would "probably [solve] the whole problem", and by "whole problem" I assume he's probably referring to what he saw as an important part of the alignment problem (maybe not though?) He referred to the problem of value identification as getting "some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory."

This seems to be very similar to my definition, albeit with the caveat that my definition isn't about revealing "V in all its glory" but rather, is more about revealing V at the level that an ordinary human is capable of revealing V. Unless the sole problem here is that we absolutely need our function that reveals V to be ~perfect, then I think this quote from the Arbital page directly supports my interpretation, and overall supports the thesis in my post pretty strongly (even if I'm wrong about a few minor details).

Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)

However, I distinctly recall

[...]

Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn't understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.

The main thing I'm claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects [...]

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.

To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to lev [...]

But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!

Quoting myself in April:

"MIRI's argument for AI risk depended on AIs being bad at n

[...]

That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.

We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?

Matthew Barnett
I don't think the critical point of contention here is about whether par-human moral reasoning will help with alignment. It could, but I'm not making that argument. I'm primarily making the argument that specifying the human value function, or getting an AI to reflect back (and not merely passively understand) the human value function, seems easier than many past comments from MIRI people suggest. This problem is one aspect of the alignment problem, although by no means all of it, and I think it's important to point out that we seem to be approaching an adequate solution.

Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.

I don't think this is the crux. E.g., I'd wager the number of bits you need to get in [...]

My own suggestion would be to use a variety of different phrasings here, including both "capabilities" and "intelligence", and also "cognitive ability", "general problem-solving ability", "ability to reason about the world", "planning and inference abilities", etc. Using different phrases encourages people to think about the substance behind the terminology -- e.g., they're more likely to notice their confusion if the stuff you're saying makes sense to them under one of the phrasings you're using, but doesn't make sense to them under another of the phrasin [...]

Does "par-human reasoning" mean at the level of an individual human or at the level of all of humanity combined?

If it's the former, what human should we compare it against? 50th percentile? 99.999th percentile?

I partly answered that here, and I'll edit some of this into the post:

By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but t

[...]

Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.

Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because [...]

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I think the misuse vs. accident dichotomy is clearer when you don't focus exclusively on "AGI kills every human" risks. (E.g., global totalitarianism risks strike me as small but non-negligible if we solve the alignment problem. Larger are risks that fall short of totalitarianism but still involve non-morally-humble developers dam [...]

David Scott Krueger
By "intend" do you mean that they sought that outcome / selected for it? Or merely that it was a known or predictable outcome of their behavior? I think "unintentional" would already probably be a better term in most cases.

FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:

[2:21 PM]

It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons

[...]
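The information-bottleneck arithmetic in the quote is easy to check directly. A quick sketch; the 90%-junk and 10%-wiring fractions are the quote's own rough assumptions, not established figures:

```python
# Reproducing the information-bottleneck arithmetic from the quote above.
base_pairs = 3_000_000_000   # human genome length
bits_per_base = 2            # 4 possibilities per base pair = 2 bits
genome_bytes = base_pairs * bits_per_base // 8
assert genome_bytes == 750_000_000  # 750 megabytes

non_junk = genome_bytes // 10       # keep 10% after discarding 90% junk DNA
wiring_bytes = non_junk // 10       # 10% of what's left is wiring algorithms
assert wiring_bytes == 7_500_000    # ~7.5 megabytes of brain-wiring code
```

So the quoted 750 MB and 7.5 MB figures are internally consistent given those assumptions.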
Ben Pace
That makes more sense.

I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.

[...]

I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent th

[...]

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" sa

[...]

The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.

FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for a pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have [...]

The definitions given in the post are:

  • ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.

[...]

  • misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.

I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some su [...]

Lone Pine
Isn't "misaligned AI" by definition a bad thing and "ASI-boosted humans" by definition a good thing? You're basically asking "How likely is <good outcome> given that we have <a machine that creates good outcomes>?"

My example with the 100 million referred to question 1.

Yeah, I'm also talking about question 1.

I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.

Seems obviously false as a description of my values (and, I'd guess, just about every human's).

Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.

If I [...]

Vanessa Kosoy

P.S. I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction.

That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with a fixed non-zero amount of paying attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows, s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.

[1] I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Of course this means we are talking about "logical" rather than "physical" causality.

I'm curious what is the evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed] tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that
[...]

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if [...]

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what you mean by "good". The questions that I unders [...]

(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)

Also from Ronny: 

There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.

That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.


Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:

briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load

(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a twee

[...]
Alex Turner

Nate's B) currently seems confused. I read a connotation "we need the AGI's learned concepts to be safe under extreme optimization pressure, such that, when extremized, they yield reasonable results (e.g. human faces from maximizing the AI-faceishness-concept-activation of an image)."

But I think agents will not maximize their own concept activations, when choosing plans. An agent's values will optimize the world; the values don't optimize themselves. For example, I think that I am not looking for a romantic relationship which maximally activates my "awesome relationship" concept, if that's a thing I have. It's true that conditional on such a plan being considered, my relationship-shard might bid for that plan with strength monotonically increasing on "predicted activation of awesome-relationship". And conditional on such a plan getting considered, where that concept activation is maximized, I would therefore be very inclined to pursue that plan.

But I think it's not true that my relationship-shard is optimizing its own future activations by extremizing future concept activations. I think that this plan won't get found, and the agent won't want to find this plan. Values are not the optimization target. (This point explained in more detail: Alignment allows "nonrobust" decision-influences and doesn't require robust grading)

Could someone clarify the relevance of ribosomes?


Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & [...]
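For what it's worth, the narrow claim about Gaussian latents above is checkable without any actual generator: the all-zeros vector is the unique maximum-density point of a standard Gaussian prior. A stdlib-only sketch, with an arbitrary latent dimension:

```python
import math
import random

d = 512  # latent dimensionality; arbitrary choice for illustration
random.seed(0)

def log_density(z):
    # Log-density of a d-dimensional standard Gaussian at point z.
    return -0.5 * sum(x * x for x in z) - 0.5 * d * math.log(2 * math.pi)

z_mode = [0.0] * d
samples = [[random.gauss(0, 1) for _ in range(d)] for _ in range(100)]

# The zero vector has strictly higher prior density than typical samples,
# which concentrate on a thin shell of radius ~sqrt(d).
assert all(log_density(z_mode) > log_density(z) for z in samples)
```

Note the separate point this illustrates: the mode of a high-dimensional Gaussian is *not* a typical sample — typical samples live far from it — which is part of why "sample at the mode" and "sample normally" can behave very differently.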

e.g. by trying to apply standards of epistemic uncertainty to the state of this essence? 

I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their int [...]

Charlie Steiner

When I think about the rules of chess, I basically treat them as having some external essence that I have epistemic uncertainty about. What this means mechanistically is:

  • When I'm unsure about the rules of chess, this raises the value of certain information-gathering actions, like checking the FIDE website, asking a friend, reading a book.
  • If I knew the outcomes of all those actions, that would resolve my uncertainty.
  • I have probabilities associated with my uncertainty, and updates to those probabilities based on evidence should follow Bayesian logic.
  • Decision-making under uncertainty should linearly aggregate the different possibilities that I'm uncertain over, weighted by their probability.

So the rules of chess are basically just a pattern out in the world that I can go look at. When I say I'm uncertain about the rules of chess, this is epistemic uncertainty that I manage the same as if I'm uncertain about anything else out there in the world. The "rules of Morality" are not like this.

  • When I'm unsure about whether I care about fish suffering, this does raise the value of certain information-gathering actions like learning more about fish.
  • But if I knew the outcomes of all those actions, this wouldn't resolve all my uncertainty.
  • I can put probabilities to various possibilities, and can update them on evidence using Bayesian logic - that part still works.
  • Decision-making under the remaining-after-evidence part of the uncertainty doesn't have to look like linear aggregation. In fact it shouldn't - I have meta-preferences like "conservatism," which says that I should trust models differently depending on whether they seem to be inside their domain of validity or not.

So there's a lot of my uncertainty about morality that doesn't stem from being unaware about facts. Where does it come from? One source is self-modeling uncertainty - how do I take the empirical facts about me and the world, and use that to construct a model of myself in whi
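A toy contrast between the two aggregation rules described above. The models, options, and numbers here are hypothetical illustrations of mine, not anyone's actual values:

```python
def linear_aggregate(options, models, probs):
    """Epistemic-style uncertainty: score each option by its
    probability-weighted (expected) value across candidate models."""
    return {o: sum(p * m[o] for p, m in zip(probs, models)) for o in options}

def conservative_aggregate(options, models):
    """A 'conservatism' meta-preference: score each option by its
    worst case across the models that haven't been ruled out."""
    return {o: min(m[o] for m in models) for o in options}

options = ["farm fish", "don't farm fish"]
cares = {"farm fish": -10, "don't farm fish": 0}   # model: fish suffering matters
ignores = {"farm fish": 5, "don't farm fish": 0}   # model: it doesn't

linear = linear_aggregate(options, [cares, ignores], [0.5, 0.5])
robust = conservative_aggregate(options, [cares, ignores])

# Linear aggregation nets the two models out; the conservative rule doesn't.
assert linear["farm fish"] == -2.5
assert robust["farm fish"] == -10
```

The point of the contrast: both rules use the same probabilities and evidence, but they recommend different actions under the residual, evidence-proof part of the uncertainty.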

I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation

The thing I'd say in favor of this position is that I think it better fits the evidence. I think the problem with the opposing view is that it's wrong, not that it's more confident. E.g., if I learned that Nate assigns probability .9 to "a pivotal act is necessary" (for some operationalization of "necessary") while Critch ass [...]

"The goal should be to cause the future to be great on its own terms"

What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations?

The rest of the quote explains what this means:

The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.

(If you can't s

[...]

The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".

The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn't care. It's not the type of [...]

The wisest moves we've made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn't particularly look like "worldwide collaborations".

I think Nate might've been thinking of things like:

  • Having all AGI research occur in one place is good (ceteris paribus), because then the AGI project can take as much time as it needs to figure out alignment, without worrying that some competitor will destroy the world with AGI if they go too slowly.
  • This is even truer if the global coordination is strong enough to prevent other x-risks (e.g., bio-
[...]
Alex Flint

Yeah I also have the sense that we mostly agree here. I have the sense that CEV stands for, very roughly, "what such-and-such a person would do if they became extremely wise", and the hope (which I think is a reasonable hope) is that there is a direction called "wisdom" such that if you move a person far enough in that direction then they become both intelligent and benevolent, and that this eventually doesn't depend super much on where you started.

The tricky part is that we are in this time where we have the option of making some moves that might be quite disruptive, and we don't yet have direct access to the wisdom that we would ideally use to guide our most significant decisions. And the key question is really: what do you do if you come into a position of really significant influence, at a time when you don't yet have the tools to access the CEV-level wisdom that you might later get?

And some people say it's flat-out antisocial to even contemplate taking any disruptive actions, while others say that given the particular configuration of the world right now and the particular problems we face, it actually seems plausible that a person in such a position of influence ought to seriously consider disruptive actions. I really agree with the latter, and I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation. The other side of the argument seems to be saying that no no no it's definitely better not to do anything like that in anything like the current world situation.

I'd guess Nate might say one of:

  • Current SotA systems are very opaque — we more-or-less can't inspect or intervene on their thoughts — and it isn't clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
  • Much more generally: we don't have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn, given inevitable p [...]
Noosphere89
Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.

Thanks for the update, Ajeya! I found the details here super interesting.

I already thought that timelines disagreements within EA weren't very cruxy, and this is another small update in that direction: I see you and various MIRI people and Metaculans give very different arguments about how to think about timelines, and then the actual median year I tend to hear is quite similar.

(And also, all of the stated arguments on all sides continue to seem weak/inconclusive to me! So IMO there's not much disagreement, and it would be very easy for all of us to be wro [...]

Ajeya Cotra
Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a la
[...]

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems.

Why do you think it assumes that?

This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.

If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.

From A central AI alignment problem: capabilities generalization, and the sharp left turn:

Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply

[...]
Lauro Langosco
Thanks!

(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)

When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.

Why would there not be a solution?

To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment). 

Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent: 

First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans

[...]

On my model, the point of ass numbers isn't to demand perfection of your gut (e.g., of the sort that would be needed to avoid multiple-stage fallacies when trying to conditionalize a lot), but to:

  1. Communicate with more precision than English-language words like 'likely' or 'unlikely' allow. Even very vague or uncertain numbers will, at least some of the time, be a better guide than natural-language terms that weren't designed to cover the space of probabilities (and that can vary somewhat in meaning from person to person).
  2. At least very vaguely and roughly b
... (read more)

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them be... (read more)

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato

That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma... (read more)

In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes". 

My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.

My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socia... (read more)

Jan_Kulveit
With the last point: I think I can roughly pass your ITT - we can try that, if you are interested. So, here is what I believe are your beliefs:

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and an analogy with evolution)
  • My claim is roughly that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect a sharp left turn due to discontinuity in "architectures" dimensions (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you expect the majority of hope in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view your (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution. (my guess is one or

An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:

Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.

Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align)... (read more)

Jan_Kulveit
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As often, one of the actual cruxes is in continuity assumptions, where basically you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right".

Second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, where people who are a few bits more optimistic may be much less excited about them, in particular if all plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."

Ronny Fernandez on Twitter:

I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.

Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.

Also

... (read more)

From an Eliezer comment:

Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]

If Iceland did this, it would plausibly need... (read more)

I kind of like the analogous idea of an alignment target as a repeller cone / dome.

Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.

Vladimir Nesov
In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it's again an attractor. Corrigibility is not a goal, a corrigible system doesn't necessarily have any well-defined goals, traditional goal-directed agents can't be corrigible in a robust way, and it should be possible to use it for corrigibility towards corrigibility, making this aspect stronger if that's what the operators work towards happening.

More generally, non-agentic aspects of behavior can systematically reinforce the non-agentic character of each other, preventing any opposing convergent drives (including the drive towards agency) from manifesting, if they've been set up to do so. Sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems, even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.

A lot of models of what can or can't work in AI alignment depends on intuitions about whether to expect "true discontinuities" or just "steep bits".

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018:

On my model, the key point

... (read more)
Jan_Kulveit
Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter.

As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing".

I would paraphrase this as "assuming discontinuities at every level" - both one-system training, and the more macroscopic exploration in the "space of learning systems" - but stating that the key disagreement is about the discontinuities in the space of model architectures, rather than in the jumpiness of single-model training. Personally, I don't think the distinction between 'movement by learning of a single model', 'movement by scaling', and 'movement by architectural changes' will necessarily be big.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference? My overall impression is that Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" is a much broader category than Robin's views.

In my view continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step. I would guess probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g. the question of whether we can learn a lot of important alignment stuff on non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.

I'm not Eliezer, but my high-level attempt at this:

[...] The things I'd mainly recommend are interventions that:

  • Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
  • Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
  • Help us understand and resolve major disagreements. (Especially current disagreements
... (read more)