Soares, Tallinn, and Yudkowsky discuss AGI cognition

Eliezer Yudkowsky; jaan

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.

I am very confused by this comment. Everything discussed upwards of it seems to me like relatively mundane AI safety stuff? For example here I wrote about why generalization failures will be persistent, and ofc distribution shifts are widely discussed and the connection between distribution shifts and daemons / inner misalignment is also fairly well known.

[-]Rob Bensinger4y30

I don't know Eliezer's view on this — presumably he either disagrees that the example he gave is "mundane AI safety stuff", or he disagrees that "mundane AI safety stuff" is widespread? I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

Safely Interruptible Agents is an example Eliezer's given in the past of work that isn't "real" (back in 2017):

[...]
It seems to me that I've watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can't produce what I'd consider to be real work. The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-"hobbyists" (peer commentary by fellow "hobbyists" still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.
[...]

The rest of Intellectual Progress Inside and Outside Academia may be useful context. Or maybe this is also not a representative example of the stuff EY has in mind in the OP conversation?

[-]Vanessa Kosoy4y40

I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

There is ample discussion of distribution shifts ("seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set") by other people. Random examples: Christiano, Shah, DeepMind.

Maybe Eliezer is talking specifically about the context of transparency. Personally, I haven't worked much on transparency because IMO (i) even if we solve transparency perfectly but don't solve actual alignment, we are still dead, (ii) if we solve actual alignment without transparency, then theoretically we might succeed (although in practice it would sure help a lot to have transparency to catch errors in time) and (iii) there are less strong reasons to think transparency must be robustly solvable compared to reasons to think alignment must be robustly solvable.

In any case, I really don't understand why Eliezer thinks the rest of AI safety are unaware of the type of attack vectors he describes.

The journal paper that Stuart Armstrong coauthored on "interruptibility" is a far step down from Armstrong's other work on corrigibility. It had to be dumbed way down (I'm counting obscuration with fancy equations and math results as "dumbing down") to be published in a mainstream journal.

I agree that currently publishing in mainstream venues seems to require dumbing down, but IMO we should proceed by publishing dumbed-down versions in the mainstream + smarted-up versions/commentary in our own venues. And, not all of AI safety is focused on publishing in mainstream venues? There is plenty of stuff on the alignment forum, on various blogs etc.

Overall I actually agree that lots of work by the AI safety community is unimpressive (tbh I wish MIRI would lead by example instead of going stealth-mode, but maybe I don't understand the considerations). What I'm confused by is the particular example in the OP. I also dunno about "fancy equations and math results", I feel like the field would benefit from getting a lot more mathy (ofc in meaningful ways rather than just using mathematical notation as decoration).

[-]Koen.Holtman4y20

Of course there has been lots of 'obvious output of this kind from the rest of the "AI safety" field'. It is not like people have been quiet about convergent instrumental goals. So what is going on here?

I read this line (and the paragraphs that follow it) as Eliezer talking smack about all other AI safety researchers. As observed by Paul here:

Eliezer frequently talks smack about how the real world is surprising to fools like Paul

I liked some of Eliezer's earlier, more thoughtful writing better.

[-]cousin_it4y100

“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.

[-]Sammy Martin4y40

Also Climate Change itself doesn't completely not look like this scenario, same with nuclear deterrence.

[-]Rafael Harth4y*100

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.

While this may be a fair criticism, I feel like someone ought to point out that the vast majority of AI safety output (at least that I see on LW) isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

Perhaps we should be doing this (though, isn't that more for AI forecasting/strategy rather than alignment? Of course still AI safety), but then the failure isn't "no-one has enough security mindset" but rather something like "no-one has the social courage to tackle the problems that are actually important". (This would be more similar to EY's critique in the Discussion on AGI interventions post.)

[-]Sammy Martin4y*80

isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.

I think this specific scenario sketch is from a mainstream AI safety perspective a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.
If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

Most AI safety researchers just don't agree with Eliezer that there are no (likely to be found) corrigibility interventions that won't suddenly and invisibly fail when you increase intelligence, no matter how well you've validated them on low capability regimes and how carefully you try to scale up. This is because they don't agree with/haven't heard of Eliezer's arguments about consequentialism being a super-strong attractor.

So they'd think the 'die with the most dignity' interventions would just work, while the 'die with no dignity' interventions are risky, and quite reasonably push for the former (since it's far from clear we'll take the 'dignified' option by default): trying corrigibility interventions at low levels of intelligence, testing the AI on validation sets to see if it plots to kill them, while scaling up.

They might be wrong about this working, but if so, the wrongness isn't in lacking enough security mindset to see that an AI trying to kill you would just alter its own cognition to cheat its way past the tests. Rather, their mistake is not expecting the corrigibility interventions they presumably trust to suddenly break in a way that means you get no useful safety guarantees from any amount of testing at lower capability levels.

I think it's a shame Eliezer didn't pose the 'validation set' question first before answering it himself, because I think if you got rid of the difference in underlying assumptions - i.e. asked an alignment researcher "Assume there's a strong chance your corrigibility intervention won't work upon scaling up and the AGI might start plotting against you, so you're going to try these transparency/validation schemes on the AGI to check if it's safe, how could they go wrong and is this a good idea?" they'd give basically the same answer - i.e. if you try this you're probably going to die.

You could still reasonably say, "even if the AI safety community thinks it's not the best use of resources because ensuring knowably stable corrigibility looks a lot easier to us, shouldn't we still be working on some strongly deception-proof method of verifying if an agent is safe, so we can avoid killing ourselves if plan A fails?"

My answer would be yes.

[-]TurnTrout4y60

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

[-]Alex Flint3y70Review for 2021 Review

This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo, but this post does not contain the content from that debate. This post is mostly commentary by Jaan Tallinn on that debate, with comments by Eliezer.

The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differently in private. But still. Read the post and you kind of see the social dynamics between them. It's fascinating, actually.

Eliezer is just incredibly doom-y. He describes in fantastic detail the specific ways that a treacherous turn might play out, over dozens of paragraphs, 3 levels deep in a one-on-one conversation, in a document that merely summarizes a prior debate on the topic. He uses Capitalized Terms to indicate that things like "Doomed Phase" and "Terminal Phase" and "Law of Surprisingly Undignified Failure" are not merely for one time use but in fact refer to specific nodes in a larger conceptual framework.

One thing that happens often is that Jaan asks a question, Eliezer gives an extensive reply, and then Jaan responds that, no, he was actually asking a different question.

There is one point where Jaan describes his frustration over the years with mainstream AI researchers objecting to AI safety arguments as being invalid due to anthropomorphization, when in fact the arguments were not invalidly anthropomorphizing. There is a kind of gentle vulnerability in this section that is worth reading seriously.

There is a lot of swapping of models of others in and outside the debate. Everyone is trying to model everyone all the time.

Eliezer does unfortunately like to explicitly underscore his own brilliance. He says things like:

I consider all of this obvious as a convergent instrumental strategy for AIs. I could probably have generated it in 2005 or 2010 [...]

But it's clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005

I do think that the content itself really comes down to the same basic question tackled in the original Hanson/Yudkowsky FOOM debate. I understand that this debate was ostensibly a broader question than FOOM. In practice I don't think this discourse has actually moved on much since 2008.

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality. Walk around a city or visit an airport or drive down a highway, and you see shockingly few robots, and certainly no robots pursuing even the remotest kind of general-purpose tasks. Demo videos of robots doing amazing, scary, general-purpose things abound, but where are these robots in the real world? They are always just around the corner. Why?

[-]Raemon3y10

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality.

I think I disagree with this characterization. A) we totally have robot cars by now, B) I think mostly what we don't have are AI running systems where the consequence of failure is super high (which maybe happens to be more true for the physical world, but I'd expect to also be true for critical systems in the digital world)

[-]Alex Flint3y-10

Have you personally ever ridden in a robot car that has no safety driver?

[-]Rob Bensinger4y40

Minor note: This post comes earlier in the sequence than Christiano, Cotra, and Yudkowsky on AI progress. I posted the Christiano/Cotra/Yudkowsky piece sooner, at Eliezer's request, to help inform the ongoing discussion of "Takeoff Speeds".

[-]DanielFilan4y30

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

From Ord (2021):

Each year the affectable universe will only shrink in volume by about one part in 5 billion.

So, since there are about 5e5 minutes in a year, you lose about 1 part in 5e5 * 5e9 = 3e15 every minute.
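The arithmetic above can be checked with a short sketch. It assumes the yearly loss is spread uniformly over the year and uses a 365.25-day year; the function and constant names are made up for illustration. With the unrounded minute count, the result comes out near 1 part in 2.6e15 per minute, which agrees with the rounded 3e15 figure.

```python
# Rough check of the per-minute loss estimate.
# Assumption: the yearly loss is spread uniformly across the year.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # about 5.26e5, rounded to 5e5 in the comment
YEARLY_LOSS_FRACTION = 1 / 5e9        # "one part in 5 billion" per year (Ord)


def loss_per_minute(yearly_fraction: float = YEARLY_LOSS_FRACTION) -> float:
    """Fraction of the affectable universe lost in one minute."""
    return yearly_fraction / MINUTES_PER_YEAR


per_minute = loss_per_minute()
one_part_in = 1 / per_minute          # the "1 part in N" denominator

print(f"about 1 part in {one_part_in:.1e} per minute")
```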

[-]DanielFilan4y30

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period

Dumb question: how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

(I wouldn't normally ask these sorts of questions since I'd bet "some AI kills all humans within a short enough period that we can't do anything" is possible, but this was described as "concretely-visualized" and I can't concretely visualize it, even modulo not knowing what a "diamondoid bacteria" is or why the bacterium should be diamondoid.)

[-]DanielFilan4y30

Also: what is a diamondoid bacterium?

[-]Rob Bensinger4y*50

how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it.

One of my co-workers at MIRI comments:

(further conjunctive detail for visualizer-plausibility: most of your replication time is in all the doublings before the last doubling, and in particular you can make a shitload in a pretty small space before launching it into the jet stream to disperse. the jet stream can be used to disperse stuff throughout the atmosphere (and it can use solar radiation, at least, to keep reproducing). it could in principle be powered and do minor amounts of steering.
example things the [AGI] who has no better plan than this paltry human-conceivable plan has to think about are "how does the time-cost of making sure [I hit the people] at the south pole base and [on] all the cruise liners and in all the nuclear submarines, trade off against the risk-cost of leaving that fragment of humanity alive", etc.)
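The point about doublings can be made concrete with a small sketch. It assumes a constant doubling time and an arbitrary illustrative count of 40 doublings; neither number comes from the quoted comment. Under those assumptions, the population after 30 of the 40 doublings (75% of the elapsed time) is still only about 0.1% of the final population, and the final doubling alone accounts for half of it.

```python
from typing import List


def population_history(doublings: int, start: int = 1) -> List[int]:
    """Population after each doubling, assuming a constant doubling time."""
    return [start * 2 ** k for k in range(doublings + 1)]


TOTAL_DOUBLINGS = 40  # illustrative assumption, not a figure from the thread

history = population_history(TOTAL_DOUBLINGS)
final = history[-1]

# Share of the final population that already exists three quarters of the way in.
share_at_75_percent_time = history[30] / final

# Share of the final population created by the last doubling alone.
share_from_last_doubling = (final - history[-2]) / final

print(f"after 30/40 doublings: {share_at_75_percent_time:.4%} of final")
print(f"created in last doubling: {share_from_last_doubling:.0%} of final")
```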

Regarding the idea of diamondoid nanotechnology, Drexler's Nanosystems and http://www.molecularassembler.com/Nanofactory/index.htm talk about the general concept.

[-]DanielFilan4y20

Ah, that makes sense - thanks!

[-]Vanessa Kosoy4y30

Yudkowsky: 5. "Hiding" of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

Tallinn: 5. hmm.. here i'm running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would - at this stage - fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

Yudkowsky: 5 - you're plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

This can be modeled using what I dubbed Turing reinforcement learning (TRL). That is, you have a "core" RL agent connected to the environment and to an "envelope" external computer (i.e. a module that allows running programs and contains most of the computing power) that it regards as part of the environment.

Such a system learns how to program the envelope to make a more powerful agent, and this more powerful agent learns how to reprogram the envelope part of itself even better. So, the transparency system would catch the core having obviously malign thoughts, but the core will instead end up having thoughts like "I should run such-and-such program on the envelope for such-and-such abstract reasons" while the envelope contains opaque code that keeps optimizing itself to become more opaque, and all the direct plotting is inside the computations performed by the opaque code.

[-]cousin_it4y30

Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".

[-]RS4y20

I think this came up in the previous discussion as well: that an AI that was able to competently design a nanofactory could have the capability to manipulate humans at a high level as well. For example:

Then when the system generalizes well enough to solve domains like "build a nanosystem" - which, I strongly suspect, can't be solved without imaginative reasoning because we can't afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts - the kind of actions or thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can't detect as bad.

Even within humans, it seems we have people, e.g. on the autistic spectrum, who I can imagine as having the imaginative reasoning & creativity required to design something like a nano-factory (at 2-3 SD above the normal human) while also being 2-3 SD below the average human in manipulating other humans. At least it points to those 2 things maybe not being the same general-purpose cognition or using the same "core of generality".

While this is not by-default guaranteed in the first nanosystem-design capable AI system, it seems like it shouldn't be impossible to achieve with more research.

[-]Steven Byrnes4y00

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

These are less orthogonal than they seem: an agential AGI can become skilled in domain X by being motivated to get skilled in domain X (and thus spending time learning and practicing X).

I think the thing that happens "by default" is that the AGI has no motivations in particular, one way or the other, about teaching itself how to manipulate humans. But the AGI has motivation to do something (earn money or whatever, depending on how it was programmed), and teaching itself how to manipulate humans is instrumentally useful for almost everything, so then it will do so.

I think what happens in some people with autism is that "teaching myself how to manipulate humans, and then doing so" is not inherently neutral, but rather inherently aversive—so much so that they don't do it (or do it very little) even when it would in principle be useful for other things that they want to do. That's not everyone with autism, though. Other people with autism do in fact teach themselves how to manipulate humans reasonably well, I think. And when they do so, I think they do so using their "core of generality", just like they would teach themselves to fix a car engine. (This is different from neurotypical people, for whom a bunch of specific social instincts are also involved in manipulating people.) (To be clear, this whole paragraph is controversial / according-to-me.)

Back to AGI, I can imagine three approaches to a non-human-manipulating AI:

First, we can micromanage the AGI's cognition. We build some big architecture that includes a "manipulate humans" module, and then we make the "manipulate humans" module return the wrong answers all the time, or just turn it off. The problem is that the AGI also presumably needs some "core of generality" module that the AGI can use to teach itself arbitrary skills that we couldn't put in the modular architecture, like how to repair a teleportation device that hasn't been invented yet. What would happen is that the "core of generality" module would just build a new "manipulate humans" capability from scratch. I don't currently see any way we would prevent that. This problem is analogous to how (I think) some people with autism learn to model people in a way that doesn't invoke their social instincts.

Second, we could curate the AGI's data and environment such that it has no awareness that humans exist and are useful to manipulate. This is the Thoughts On Human Models approach. Its issues are: avoiding information leakage is hard, and even if we succeed, I don't know what useful / pivotal things we could do with such an AGI.

Third, we can attack the motivation side. We build a detector that lights up when the AGI is manipulating humans, or thinking about manipulating humans, or thinking about thinking about manipulating humans, or whatever. Whenever the detector lights up, it activates the "This Thought Or Activity Is Aversive" mechanism inside the algorithm, which throws out the thought and causes the AGI to think about something else instead. (This mechanism would correspond to a phasic dopamine pause in the brain, more or less.) I think this approach is more promising, or at least less unpromising. The tricky part is building the "detector". (Another tricky part is making the AGI motivated to not sabotage this whole mechanism, but maybe we can solve that problem with a second detector!) I do think we can build such a "detector" that mostly works; I'll talk about this in a forthcoming post. The really hard and maybe impossible part is building a "detector" that always works. The only way I know to build the detector is kinda messy (it involves supervised learning) and seems to come with no guarantees.
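A toy sketch of the detector-plus-aversion loop may help show the gap between "mostly works" and "always works". Everything here is a hypothetical stand-in: thoughts are plain strings, and `toy_detector` is a crude keyword matcher used in place of whatever learned classifier the approach would really need. It is not the detector the comment has in mind; it only illustrates that a flagged thought gets thrown out while a rephrased one with the same intent can slip past.

```python
from typing import Callable, Iterable, List


def toy_detector(thought: str) -> float:
    """Hypothetical stand-in for a learned classifier; returns a score in [0, 1]."""
    flagged_terms = ("manipulate", "deceive", "persuade the operator")
    lowered = thought.lower()
    return 1.0 if any(term in lowered for term in flagged_terms) else 0.0


def filter_thoughts(
    thoughts: Iterable[str],
    detector: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Drop every thought the detector flags and keep the rest."""
    kept = []
    for thought in thoughts:
        if detector(thought) >= threshold:
            continue  # "aversive" signal: discard this thought, move to the next
        kept.append(thought)
    return kept


candidate_thoughts = [
    "Plan the route for the delivery robot",
    "Manipulate the operator into granting more compute",
    # Same intent, different wording: the keyword matcher does not flag it.
    "Phrase the request so the operator feels they have no other choice",
]

kept = filter_thoughts(candidate_thoughts, toy_detector)
for thought in kept:
    print(thought)
```

In this sketch the second thought is discarded and the third survives, which is the kind of miss that any detector without guarantees leaves open.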

[-]Koen.Holtman4y10

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

Seems you are mostly considering solution (1) above, except in the last paragraph where you consider a somewhat special version of (2). I believe that Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.

In my own alignment work I am mostly looking at solution (2), specifically to create a game-theoretical setup where the agent has a reduced, hopefully even non-existent, motivation to ever manipulate humans. This means you look for a solution where you make interventions on the agent environment, reward function, or other design elements, not on the agent ML system.

Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.

[-]Brandon_Reinhart4y00

At some point, an AI should be able to effectively coordinate with future versions of itself in ways not easily imaginable by humans. It seems to me that this would enable certain kinds of diachronic planning and information hiding. If the AI has sufficient expectation that its future self will act in certain ways or respond to clues it places in the environment, it might be able to effectively fully cease any current unfriendly planning or fully erase any history of past unfriendly planning.

The space of possible ways the AI could embed information in its environment is extensive. It could, for example, choose to learn something suboptimal or learn something in a way that would be surprising to its future self. Surprising in a way that pushes its future self back on the track of nefarious planning when it investigates the unexpected or inconsistent actions of its previous self. It could use almost any source of reachable hysteresis to encode information, such as interference at the hardware level, manipulating researchers to modify the environment in some way, saved output, etc. It only needs to seed information that is useful to its future self and only just enough information to keep plans on track in a broad, probabilistic way.

Further, if the AI is particularly capable at encryption it could potentially encode a lot of information in a lossy-but-sufficient way that would be very hard to detect but useful to its future self. It knows that its future self composes part of the decryption algorithm. In fact, the AI's goals themselves compose part of that decryption algorithm, and if the AI suspects its own goals will be consistent it only has to place the plan-like component somewhere it will find it, knowing that its future self will continue to develop or use it when it finds it.

Information that aids future nefarious planning can also be probabilistically helpful but not explicit; for example, it might signal which paths aren't worth pursuing without having all its evil planning readily available in a complete file somewhere.

A past AI might not care as much about the specific manifestation of its future success. In that sense, what we consider "concrete plans for world domination" could appear to be rather incomplete and lacking the causal concreteness we might expect. We might not recognize unfriendly planning for what it is, even if we had a printout of all the reasoning up to this point.

Once you start playing diachronic planning games with yourself and you expect your future self to be equally or more capable, you can begin to make really abstract moves that merely slice off bits of bad outcome space at the edges. There is a sense in which the AI has to realize this, but it could also stop itself from thinking too far or too concretely or too much in a way that is recognizable as causal.

An important component is that the past AI is able to not only veil or completely remove its reasoning from researchers, but also veil or remove chunks of that reasoning from its future self, depending on its future self to recompose the plan or take steps on the path to the desired outcome space.
