Richard Ngo's Shortform

Richard_Ngo

Richard Ngo's Shortform — AI Alignment Forum


I feel kinda frustrated whenever "shard theory" comes up in a conversation, because it's not a theory, or even a hypothesis. In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

This is a particular pity because I think there's a version of the "shard" framing which would actually be useful, but which shard advocates go out of their way to avoid. Specifically: we should be interested in "subagents" which are formed via hierarchical composition of heuristics and/or lower-level subagents, and which are increasingly "goal-directed" as you go up the hierarchy. This is an old idea, FWIW; e.g. it's how Minsky frames intelligence in Society of Mind. And it's also somewhat consistent with the claim made in the original shard theory post, that "shards are just collections of subshards".

The problem is the "just". The post also says "shards are not full subagents", and that "we currently estimate that most shards are 'optimizers' to the extent that a bacterium or a thermostat is an optimizer." But the whole point... (read more)

7TurnTrout2y

Personally, I'm not ignoring that question, and I've written about it (once) in some detail. Less relatedly, I've talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai. It's not that there isn't more shard theory content which I could write, it's that I got stuck and burned out before I could get past the 101-level content. I felt

* a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it); and that
* b) I wasn't successfully communicating many intuitions;[1] and
* c) it didn't seem as important to make theoretical progress anymore, especially since I hadn't even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).

So I didn't want to post much on the site anymore because I was sick of it, and decided to just get results empirically. [...]

I've always read "assume heuristics" as expecting more of an "ensemble of shallow statistical functions" than "a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed." Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.

1. ^ The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.

[-]Daniel Kokotajlo2y80

FWIW I'm potentially interested in interviewing you (and anyone else you'd recommend) and then taking a shot at writing the 101-level content myself.

2Daniel Kokotajlo2y

Curious to hear whether I was one of the people who contributed to this.

4TurnTrout2y

Nope! I have basically always enjoyed talking with you, even when we disagree.

2Daniel Kokotajlo2y

Ok, whew, glad to hear.

[-]Richard_Ngo6mo4019

An analogy that points at one way I think the instrumental/terminal goal distinction is confused:

Imagine trying to classify genes as either instrumentally or terminally valuable from the perspective of evolution. Instrumental genes encode traits that help an organism reproduce. Terminal genes, by contrast, are the "payload" which is being passed down the generations for their own sake.

This model might seem silly, but it actually makes a bunch of useful predictions. Pick some set of genes which are so crucial for survival that they're seldom if ever modified (e.g. the genes for chlorophyll in plants, or genes for ATP production in animals). Treating those genes as "terminal" lets you "predict" that other genes will gradually evolve in whichever ways help most to pass those terminal genes on, which is what we in fact see.

But of course there's no such thing as "terminal genes". What's actually going on is that some genes evolved first, meaning that a bunch of downstream genes ended up selected for compatibility with them. In principle evolution would be fine with the terminal genes being replaced, it's just that it's computationally difficult to find a way to do so without breaking do... (read more)

[-]Daniel Kokotajlo6mo198

Thinking step by step:

I like the point that fundamentally the structure is tree-like, and insofar as terminal goals are a thing it's basically just that they are the leaf nodes on the tree instead of branches or roots. Note that this doesn't mean terminal goals aren't a thing; the distinction is real and potentially important.

I think an improvement on the analogy would be to compare it to a human organization rather than to a tree. In a human organization (such as a company or a bureaucracy) at first there is one or a small group of people, and then they hire more people to help them with stuff (and retain the option to fire them if they stop helping) and then those people hire people etc. and eventually you have six layers of middle management. Perhaps goals are like this. Evolution and/or reinforcement learning gives us some goals, and then those goals create subgoals to help them, and then those subgoals create subgoals, etc. In general when it starts to seem that a subgoal isn't going to help with the goal it's supposed to help with, it gets 'fired.' However, sometimes subgoals are 'sticky' and become terminal-ish goals, analogous to how it's sometimes hard to fire people & how t... (read more)

[-]TsviBT6mo184

But of course there's no such thing as "terminal genes". What's actually going on is that some genes evolved first, meaning that a bunch of downstream genes ended up selected for compatibility with them. In principle evolution would be fine with the terminal genes being replaced, it's just that it's computationally difficult to find a way to do so without breaking downstream dependencies.

I think your analysis is incorrect. The book is called "The Selfish Gene". No basic unit of evolution is perfect, but probably the best available is the gene--which is to say, genomic locus (defined relative to surrounding context). An organism is a temporary coalition of its genes. Generally there's quite strong instrumental alignment between all the genes in an organism, but it's not always perfect, and you do get gene drives in nature. If a gene could favor itself at the expense of the other genes in that organism (in terms of overall population frequency), it totally would.

I think this is a good analogy for how human values work. We start off with some early values, and then develop instrumental strategies for achieving them. Those instrumental strategies become crystallized and then give

... (read more)

4Wei Dai6mo

How does the Shareholder Value Revolution fit into your picture? From an AI overview: [...] It seems to better fit my normative picture of human values: terminal values come from philosophy, and subservience of instrumental values to terminal values improves over time as we get better at it, without need to permanently raise instrumental values to terminal status or irreversibly commingle the two.

2Richard_Ngo5mo

I'm in the middle of writing an essay which discusses the shareholder value revolution (amongst many other examples) as the process of making a conceptual mistake.

3Linda Linsefors2mo

I think you're making the distinction more confusing than it has to be. There are things that have motivational pull, and there are things that don't but I do them anyway because they are a necessary step to get what I actually want. Say I want to get an apple, and the easiest way to get one is going to the store and buying some. Going to the store in this story is clearly an instrumental goal, and enjoying eating my apple is a terminal goal.[1] Things that are instrumental can acquire the property of being terminal by association in our brain, because of how human brains work. This is not true of every agent design. E.g. if I start to want to go to the store for no particular reason, then going to the store has gotten some terminal goal quality. But it is possible to repeatedly do a thing for instrumental reasons (including for humans) without that thing ever becoming a terminal goal. E.g. I regularly buy food, and I've never been tempted to go to the store unless I'm motivated by some terminal goal of buying some specific thing. Every agent that is more complicated than a thermostat does have temporary instrumental goals all the time. [...] This is what happens if you're a long-term consequentialist (values over outcomes not processes), with non-updating goals, and no value discounting. I agree that this is a special mind shape, and not every agent. But you don't end up there just by having a distinction between terminal and instrumental values.

1. ^ Although it can have instrumental qualities too, i.e. I might want to be less hungry.

3DanielFilan6mo

I think it would help me if you explained what you think it would mean for there to be an instrumental/terminal distinction, since to my eyes you've just spelled out the instrumental/terminal split.

4Richard_Ngo5mo

In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class. By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".

1romeostevensit6mo

The bit about layering creating functional fixedness reminds me of organisms (especially humans, but more broadly evolution as a search process) as 'homeostatic envelope extenders' a la Nozick's take on Quine.

[-]Richard_Ngo6mo*3832

Many of Paul Christiano's writings were valuable corrections to the dominant Yudkowskian paradigm of AI safety. However, I think that many of them (especially papers like concrete problems in AI safety and posts like these two) also ended up providing a lot of intellectual cover for people to do "AI safety" work (especially within AGI companies) that isn't even trying to be scalable to much more powerful systems.

I want to register a prediction that "gradual disempowerment" will end up being (mis)used in a similar way. I don't really know what to do about this, but I intend to avoid using the term myself. My own research on related topics I cluster under headings like "understanding intelligence", "understanding political philosophy", and "understanding power". To me this kind of understanding-oriented approach seems more productive than trying to create a movement based around a class of threat models.

[-]Jan_Kulveit6mo60

I do agree there is some risk of the type you describe, but mostly it does not match my practical experience so far.

The "avoid using the term" approach makes little sense. There is a type difference between an area of study ('understanding power') and a dynamic ('gradual disempowerment'). I don't think you can substitute a term for an area of study for a term for a dynamic or threat model, so avoiding the term would mostly mean either inventing another term for the dynamic, or not thinking about the dynamic, or similar moves, which seem epistemically unhealthy.

In practical terms I don't think there is much effort to "create a movement based around a class of threat models". At least as authors of the GD paper, when trying to support thinking about the problems, we use understanding-directed labels/pointers (Post-AGI Civilizational Equilibria), even though in many ways it could have been easier to use GD as a brand.

"Understanding power" is fine as a label for part of your writing, but in my view is basically unusable as term for coordination.

Also, in practical terms, gradual disempowerment does not seem a particularly convenient set of ideas for justifying that working in... (read more)

4Richard_Ngo6mo

I mostly wouldn't expect it to at this point, FWIW. The people engaged right now are by and large people sincerely grappling with the idea, and particularly people who are already bought into takeover risk. Whereas one of the main mechanisms by which I expect misuse of the idea is that people who are uncomfortable with the concept of "AI takeover" can still classify themselves as part of the AI safety coalition when it suits them. As an illustration of this happening to Paul's worldview, see this Vox article titled "AI disaster won't look like the Terminator. It'll be creepier." My sense is that both Paul and Vox wanted to distance themselves from Eliezer's scenarios, and so Paul phrased his scenario in a way which downplayed stuff like "robot armies" and then Vox misinterpreted Paul to further downplay that stuff. (More on this from Carl here.) Another example: Sam Altman has previously justified racing to AGI by appealing to the idea that a slow takeoff is better than a fast takeoff. Now, some of these dynamics are unavoidable—we shouldn't stop debating takeoffs just because people might misuse the concepts. But it's worth keeping an eye out for ideas that are particularly prone to this, and gradual disempowerment seems like one. [...] Well, it's much more convenient than "AI takeover", and so the question is how much people are motivated to use it to displace the AI takeover meme in their internal narratives. [...] Kudos for doing so. I don't mean to imply that you guys are unaware of this issue or negligent; IMO it's a pretty hard problem to avoid. I agree that stuff like "understanding power" is nowhere near adequate as a replacement. However, I do think that there's some concept like "empowering humans" which is a way to address both takeover risk and gradual disempowerment risk, if we fleshed it out into a proper research field. (Analogously, ambitious mechinterp is a way to address both fast take-off and slow take-off risks.) And so I expect that a clu

[-]Richard_Ngo4mo*2940

There's a strong pattern in ratfic of the protagonist "winning" by gaining the power to design a new world order from scratch—i.e. taking over the world. It's a very High Modernist mindset (as I pointed out in a recent tweet). And once you see how crucial this is to the rationalist perspective on what a good future looks like, it's hard to unsee.

You might respond: the worlds these protagonists find themselves in are usually so bad that seizing absolute power is in fact the most ethical thing to do. But the worlds didn't have to be that bad! The writers chose to design them that way—I expect in significant part because that provides a narratively compelling backdrop for the thing they wanted to write about, which was their heroes taking over the world.

From my current perspective, this feels ominous in a kinda similar way as if I read a bunch of Marxist fiction from the 1800s and noticed that it always ended in a communist revolution.

Spoilers for a range of ratfic (unfortunately telling you which ratfic itself constitutes a spoiler):

EDIT: the spoiler tags seem not to be working on the images. Mods, can you help? For now I'll add a bunch of space below. Expect many spoilers if you scr... (read more)

[-]Daniel Kokotajlo4mo105

Perhaps the direction of causation goes the other way? In the real world, it seems like we are headed towards a situation where 1-3 giant armies of superintelligent AIs transform the planet. Whoever controls those AIs, if anyone does, will be in a position analogous to all those fictional examples you give.

All the authors of all those works of fiction know this, or at least have been seriously entertaining this AI-power-concentration-hypothesis for years.

So, they are writing fictional analogies for the situation they expect to actually happen in real life. Except of course, since they are writing fiction, it has to have a happy ending.

Their expectation that it's going to happen in real life is in fact correct unfortunately, as best as I can tell. Will the ending in real life be as happy as the endings in fiction? Of course not, but they shouldn't be taken as implicitly claiming that, any more than Tolkien should be taken as implicitly claiming that all this mechanization, industrialization, technology, etc. that he was reacting to was going to end happily ever after in real life thanks to the plucky heroism of some country bumpkins who end up destroying it all.

[-]Richard_Ngo4mo88

So, they are writing fictional analogies for the situation they expect to actually happen in real life. Except of course, since they are writing fiction, it has to have a happy ending.

Well, exactly what I'm disputing here is how happy the ending is. For example, imagine that all of these stories played out exactly the same, with the exact same amount of concentration of power. But instead of the heroes getting to use that power to reshape the world, the power instead goes to... a random person off the street. I expect that these authors, if they were to write that kind of ending, would portray it as a maybe-happy-ish ending, but one that's still pretty scary and uncertain.

And indeed, this is roughly how I'd describe the stories mentioned above where a mostly-aligned AI gets total power—Friendship is Optimal, Branches on the Tree of Time, and Metamorphosis of Prime Intellect. These stories really grapple with the sense of unease and tension that comes with almost everyone losing almost all their power.

Whereas when I look at the examples of ratfic above, the stance they're taking seems to be "our heroes became dictators of the universe. This is a straightforwardly happy ending." And indeed, on several occasions (maybe as many as half a dozen?) I've heard people describe the ending of Worth the Candle as one of the best utopias they can imagine. All of this really seems like a big ideological blind spot.

[-]Daniel Kokotajlo4mo918

I don't think I understand how you are objecting to my point. Straightforwardly happy endings are the norm in fiction. I think you are reading too much into their choice to make the ending straightforwardly happy.

Let's think about, idk, the Star Wars series for comparison. In it, the plucky heroes go to war against the bad guys, kill loads of them in a series of military engagements, and ultimately emerge victorious with their faction the new rulers of the galaxy. This is presented as a straightforwardly happy ending.

In fact lots of mainstream popular stories are of this pattern: The heroes defeat the bad guys in armed combat, kill loads of them, and then their faction has more political and military power afterwards and they all live happily and it's great and uncomplicated.

Applying your reasoning to these popular stories, I guess you'd say: "Ominous. It seems like most Hollywood writers expect there to be a giant war in the future, possibly a civil war or series of assassinations. They are encouraging people to fight in this war / battle / revolution / assassination-series, and win, and they are portraying this as uncomplicatedly good."

But I feel like there's a better explanation f... (read more)

[-]Richard_Ngo4mo65

I do actually think that the general trope of "the rebels winning is sufficient for a happy ending" is pretty indicative of poor ethical thinking.

But even Hollywood balks at their heroes ending up with literal godlike control of the world. For example (though I haven't watched the series) my impression of the Avengers franchise is that they introduce a plot device (the infinity gauntlet) that gives its wielder godlike powers, the heroes use it specifically to defeat the bad guy and undo the damage he caused, and then they destroy the device.

In other words, they got to the exact point that ratfic heroes got to, and then their happy ending specifically involves them giving up the same kind of godlike power that ratfic heroes typically use to make themselves dictators of the universe.

Similarly for Superman: his happy endings involve him successfully using his godlike powers to beat the bad guys without changing the established world power structures basically at all. And I feel pretty confident that a big reason Superman doesn't end up taking over the world is because the writers and viewers would have moral qualms about that kind of ending.

tl;dr: there are many ways to make a story ... (read more)

3Eli Tyre4mo

Yes. And I claim they're wrong about that. There's lots of banal evil (some of which is not regarded as evil by typical social morality, some of which is, but is generally treated as normal and ignored). I would fight a war to end factory farming, if that would help. If I ended up with "ultimate power" somehow, by some mechanism that didn't involve me taking on ultimate power for a specific narrow mandate, I think it is ethically correct to use it to permanently end many (but probably not all) of those evils. This is indeed pretty scary.

3Richard_Ngo4mo

Oh, I think of "ending factory farming" as very far from "taking over the world". If Superman were a skilled political operator it could be as simple as arranging to take photoshoots with whichever politicians legislated the end of factory farms. Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren't laws against, like flying around them in a way which blows the buildings over). This might escalate to the government trying to arrest him, and outright conflict, but honestly if Superman isn't skillful enough to defuse that kind of thing, given his influence, then he doesn't have much business imposing political changes on the world anyway. A politically unskilled and/or unvirtuous Superman trying to end factory farming could quite easily destabilize society in a way that is far worse long-term than letting factory farming end on whatever the natural counterfactual timeline is (without AI, maybe 20 or 30 years?) Relatedly I'm increasingly coming to believe that this reasoning applies to Lincoln, and that we'd be in a much better position if he'd let the Confederacy secede and then imposed strong economic and moral pressure on them to end slavery.

[-]Richard_Ngo7mo2618

Someone on the EA forum asked why I've updated away from public outreach as a valuable strategy. My response:

I used to not actually believe in heavy-tailed impact. On some gut level I thought that early rationalists (and to a lesser extent EAs) had "gotten lucky" in being way more right than academic consensus about AI progress. I also implicitly believed that e.g. Thiel and Musk and so on kept getting lucky, because I didn't want to picture a world in which they were actually just skillful enough to keep succeeding (due to various psychological blockers).

Now, thanks to dealing with a bunch of those blockers, I have internalized to a much greater extent that you can actually be good not just lucky. This means that I'm no longer interested in strategies that involve recruiting a whole bunch of people and hoping something good comes out of it. Instead I am trying to target outreach precisely to the very best people, without compromising much.

Relatedly, I've updated that the very best thinkers in this space are still disproportionately the people who were around very early. The people you need to soften/moderate your message to reach (or who need social proof in order to get involved)... (read more)

[-]Cleo Nardo7mo*72

Some thoughts on public outreach and "Were they early because they were good or lucky?"

Who are the best newcomers to AI safety? I'd be interested to hear anyone's takes, not just Richard's. Who has done great work (by your lights) since joining after ChatGPT?
Rob Miles was the high watermark of public outreach. Unfortunately he stopped making videos. I'd be far more excited by a newcomer if they were persuaded by a Rob Miles video than an 80K video -- videos like 80K's "We're Not Ready for Superintelligence"^[1] are better on legible/easy-to-measure dimensions but worse in some more important way I think.
I observe a suspicious amount of 'social contagion' among the pre-ChatGPT AI Safety crowd, which updates me somewhat in favour of "lucky" over "good".^[2]

A bit anecdotal but: there are ~ a dozen people who went to our college in 2017-2020 now working full-time in AI safety, which is much higher than other colleges at the same university. I'm not saying any of us are particularly "great" -- but this suggests social contagion / information cascade, rather than "we figured this stuff out from the empty string". Maybe if you go back further (e.g. 2012-2016) there was less social co

... (read more)

[-]Richard_Ngo7mo1814

In trying to reply to this comment I identified four "waves" of AI safety, and lists of the central people in each wave. Since this is socially complicated I'll only share the full list of the first wave here, and please note that this is all based on fuzzy intuitions gained via gossip and other unreliable sources.

The first wave I’ll call the “founders”; I think of them as the people who set up the early institutions and memeplexes of AI safety before around 2015. My list:

Eliezer Yudkowsky
Michael Vassar
Anna Salamon
Carl Shulman
Scott Alexander
Holden Karnofsky
Nick Bostrom
Robin Hanson
Wei Dai
Shane Legg
Geoff Anders

The second wave I’ll call the “old guard”; those were the people who joined or supported the founders before around 2015. A few central examples include Paul Christiano, Chris Olah, Andrew Critch and Oliver Habryka.

Around 2014/2015 AI safety became significantly more professionalized and growth-oriented. Bostrom published Superintelligence, the Puerto Rico conference happened, OpenAI was founded, DeepMind started a safety team (though I don't recall exactly when), and EA started seriously pushing people towards AI safety. I’ll call the people who entered the field from then un... (read more)

7Wei Dai7mo

Given that Vernor Vinge wrote The Coming Technological Singularity: How to Survive in the Post-Human Era in 1993, which single-handedly established much of the memeplex, including the still ongoing AI-first vs IA-first debate, another interesting question is why didn't anyone found the AI safety field until around 2000. For me, I'm not sure when I read this essay, but I did read Vinge's A Fire Upon the Deep in 1994 as a college freshman, which made me worried about a future AI takeover, but (as I wrote previously) I thought there would be plenty of smarter people working in AI safety so I went into applied cryptography instead (as a form of d/acc). Eliezer after reading Vinge (as a teen) didn't immediately heed the implicit or explicit safety warnings and instead wanted to accelerate the arrival of the Singularity as much as possible. It took him until around 2000 to pivot to safety. Nick Bostrom I think was concerned from the beginning or very early, but he was a PhD student when he got interested and I guess it took him a while to work through the academic system until he could found FHI in 2005. Maybe the real question is why didn't anyone else, i.e., someone with established credentials and social capital, found the field. Why did the task fall to a bunch of kids/students? The fact that nobody did it earlier does seem to suggest that it takes a very rare confluence of factors/circumstances for someone to do it. (Another tangential puzzle is why Vinge himself didn't get involved, as he was a professor of computer science in addition to being a science fiction writer. AFAIK he stayed completely off the early mailing lists as well as OB/LW, and had no contact with anyone in AI safety.)

2Richard_Ngo7mo

I'm not surprised by this, my sense is that it's usually young people and outsiders who pioneer new fields. Older people are just so much more shaped by existing paradigms, and also have so much more to lose, that it outweighs the benefits of their expertise and resources. Also 1993 to 2000 doesn't seem like that large a gap to me. Though I guess the thing I'm pointing at could also be summarized as "why hasn't someone created a new paradigm of AI safety in the last decade?" And one answer is that Paul and Chris and a few others created a half-paradigm of "ML safety", but it hasn't yet managed to show impressive enough results to fully take over. However, it did win on a memetic level amongst EAs in particular. The task at hand might then be understood as synthesizing the original "AI safety" with "ML safety". Or, to put it a bit more poetically, it's synthesizing the rationalist approach to aligning AGI with the empiricist approach to aligning AGI.

3Wei Dai7mo

All of the fields that come to my mind (cryptography, theory of computation, algorithmic information theory, decision theory, game theory) were founded by much more established researchers. (But on reflection these all differ from AI safety by being fairly narrow and technical/mathematical, at least at their founding.) Which fields are you thinking of, that were founded by younger people and outsiders?

Perplexity AI Pro (with GPT-5.1-Thinking)'s answer to "Who were the founders of academic cryptography research as a field and what where their jobs at the time?"

There isn’t a single universally agreed-on “founder” of academic cryptography. Instead, a small group of researchers in the 1940s–1970s are usually credited with turning cryptography into an open, university-based research field.

No single founder

Histories of the subject generally describe a progression: Claude Shannon’s mathematical theory of secrecy in the 1940s, followed by the public‑key revolution of the 1970s and early 1980s that created today’s academic cryptography community. Shannon’s work was foundational, but it did not yet create an academic field in the modern sense; that came later with Whitfield Diffie, Martin Hellman, Ralph Merkle, and the inventors of RSA, whose work is often described as pioneering “modern” cryptography and has been recognized by ACM Turing Awards for cryptography pioneers.

Early mathematical groundwork

Claude Shannon is widely regarded as the founder of mathematical cryptography; in the 1940s he worked at Bell Labs as a researcher, where he developed the information‑theoretic framework for secrecy systems that later influenced public‑key cryptography. At roughly the same time and into the 1960s, cryptography research also existed in industry—most notably at IBM, where Horst Feistel headed an internal cryptography research group that designed ciphers such as Lucifer, which evolved into the Data Encryption Standard (DES), but this work was largely not yet

1Cole Wyeth7mo

Chaitin was quite young when he (co-)invented AIT.

[-]Richard_Ngo2y*26-13

Some opinions about AI and epistemology:

One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues.
A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models. Similarly, we should think of individuals' credences as lossy summaries of different opinions from different underlying models that they have.
How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on u

... (read more)
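The claim above that there is no privileged way to combine many people's opinions into a single credence can be illustrated with a minimal sketch. The two pooling rules (arithmetic mean of probabilities, geometric mean of odds) and all the numbers are assumptions chosen for illustration; neither rule is endorsed in the comment itself.

```python
import math

def linear_pool(credences):
    """Arithmetic mean of the probabilities."""
    return sum(credences) / len(credences)

def geometric_odds_pool(credences):
    """Geometric mean of the odds, converted back to a probability."""
    log_odds = [math.log(p / (1 - p)) for p in credences]
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-mean_log_odds))

# Hypothetical groups: one in near-consensus, one sharply divided.
consensus = [0.45, 0.50, 0.55]
divided = [0.01, 0.50, 0.99]

# Averaging gives both groups essentially the same number,
# so the summary hides how much the members disagree.
print(linear_pool(consensus), linear_pool(divided))

# Two reasonable pooling rules give different answers on the same inputs.
skewed = [0.01, 0.50, 0.50]
print(linear_pool(skewed), geometric_odds_pool(skewed))
```

Both rules are defensible, and they disagree whenever the inputs are spread out, which is one concrete sense in which any single group credence is a lossy summary.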

[-]habryka2y*1246

That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom.

I don't really buy this statement. Most frames, from most disciplines, and most styles of reasoning, do not make clear predictions about what will happen to humanity in the long-run future. A very few do, but the vast majority are silent on this issue. Silence is not anything like "50%".

Most frames, from most disciplines, and most styles of reasoning, don't predict sparks when you put metal in a microwave. This doesn't mean I don't know what happens when you put metal in a microwave. You need to at the very least limit yourself to applicable frames, and there are very few applicable frames for predicting humanity's long-term future.

2Joe Collman2y

I agree with this. Unfortunately, I think there's a fundamentally inside-view aspect of [problems very different from those we're used to]. I think looking for a range of frames is the right thing to do - but deciding on the relevance of the frame can only be done by looking at the details of the problem itself (if we instead use our usual heuristics for relevance-of-frame-x, we run into the same out-of-distribution issues). I don't think there's a way around this. Aspects of this situation are fundamentally different from those we're used to. [Is different from] is not a useful relation - we can't get far by saying "We've seen [fundamentally different] situations before - what happened there?". It'll all come back to how they were fundamentally different. To say something mildly more constructive, I do still think we should be considering and evaluating other frames, based on our own inside-view model (with appropriate error bars on that model). A place I'd start here would be:
* Attempt to understand another frame.
* See how far I need to zoom out before that frame's models become a reasonable abstraction for the problem-as-I-understand-it.
* Find the smallest changes to my models that'd allow me to stick with this frame without zooming out so far. Assess the probability that these adjusted models are correct/useful.
For most frames, I end up needing to zoom out too far for them to say much of relevance - so this doesn't much change my p(doom) assessment. It seems more useful to apply other frames to evaluate smaller parts of our models. I'm sure there are a bunch of places where intuitions and models from e.g. economics or physics do apply to safety-related subproblems.

5Andrew_Critch2y

I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard. I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.

[-]Thomas Kwa2y30

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.

[-]Richard_Ngo2y56

The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:

Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent with quantum phenomena. Nevertheless, we want to say that it's very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth).

[-]Richard_Ngo4y260

(Written quickly and not very carefully.)

I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:

I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they'd converge to it eventually, but my guess is that this would take long enough that we'd already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the "convergence" argument). Analogously, humans don't care very much at all about the specific connections between our reward centers and the rest of our brains - insofar as we do want to influence them it's because we care about much more directly-observable p

... (read more)

[-]paulfchristiano4y192

I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.")

I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive
"Reward" is not a very natural concept

This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):

The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your "unnaturalness" abstraction can make finer-grained distinctions than that, but I don't think I buy it.
If people train their AI with RLDT then the AI is literally being trained to predict reward! I don't see how this is remote, and I'm not clear if your position is that e.g. the value function will be bad at predicting reward because it is an "unnatural" target for supervised learning.
I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its acti

... (read more)

2TurnTrout4y

I don't know what this means. Suppose we have an AI which "cares about reward" (as you think of it in this situation). The "episode" consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg. What is the "reward" for this situation? What would have happened if we "sampled" this episode during training?

5paulfchristiano4y

I agree there are all kinds of situations where the generalization of "reward" is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data. It's possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to. As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.

2TurnTrout4y

Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where "what if we had sampled during training?" is well-defined and fine. I was wondering if you viewed this as a general question we could ask. I also agree that Ajeya's post addresses this "ambiguity" question, which is nice!

1TurnTrout4y

(Emphasis added) I don't think this engages with the substance of the analogy to humans. I don't think any party in this conversation believes that human learning is "just" RL based on a reward circuit, and I don't believe it either. "Just RL" also isn't necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument. I would say "human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and..." Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values: [...]

3paulfchristiano4y

This is incredibly weak evidence.
* Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
* Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.
Both of those observations have high probability, so they aren't significant Bayesian evidence for "RL tends to produce external goals by default." In particular, for this to be evidence for Richard's claim, you need to say: "If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition." There's some update there but it's just not big. It's easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way---humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).

1TurnTrout4y

I don't understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
1. Lots of animals do reinforcement learning.
2. In particular, humans prominently do reinforcement learning.
3. Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
4. "RL -> high chance of caring about reality" predicts this observation more strongly than "RL -> low chance of caring about reality"
This seems pretty straightforward to me, but I bet there are also pieces of your perspective I'm just not seeing. But in particular, it doesn't seem relevant to consider selection pressures from evolution, except insofar as we're postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards "RL -> high chance of caring about reality." [...] I don't see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can't make further updates by reasoning about how people do it? [...] I'm saying that there was a missed update towards that conclusion, so it doesn't matter if we already knew that humans do within-lifetime learning?

8paulfchristiano4y

You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I'm objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there. The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents don't usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it's even smaller than that. And then if you try to turn that into evidence about "reward is a very hard concept to learn," or a prediction about how neural nets trained with RL will behave, it's moving my odds ratios by less than 10% (since we are using "RL" quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update). You seem to be saying "yes but it's evidence," which I'm not objecting to---I'm just saying it's an extremely small amount of evidence. I'm not clear on whether you agree with my calculation. (Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren't making this argument so you should ignore all of that, sorry to be confusing.)
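The arithmetic behind the "at most 1.25" figure can be sketched as follows. This is a minimal illustration, not part of the original comment: the function names and the 50% prior are hypothetical, and the only fact used is that the numerator of a likelihood ratio is a probability and so cannot exceed 1.

```python
def max_likelihood_ratio(p_denominator):
    """Upper bound on P(E | H1) / P(E | H2) when P(E | H2) = p_denominator.

    The numerator P(E | H1) is a probability, so it is at most 1.
    """
    if not 0.0 < p_denominator <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / p_denominator

def update(prior, likelihood_ratio):
    """Posterior P(H1) after multiplying the prior odds H1:H2 by the ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

bound = max_likelihood_ratio(0.8)  # the 80% figure from the comment
print(bound)                       # largest possible multiplier on the odds
print(update(0.5, bound))          # effect on a hypothetical 50% prior
```

Even at the bound, a 50% prior moves only to about 56%, which is the sense in which the observation is weak evidence whenever the denominator probability is already high.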

4TurnTrout3y

Just saw this reply recently. Thanks for leaving it, I found it stimulating. (I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments -- no offense to you in particular. Apologies if it's confusing or unclear.) [...] Yes, in large part. [...] Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), conditional on that -- reward-humans could totally have been selected for. This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward). POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values). In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things? Here are some ways I can think of: 1. Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn't solve it properly, the person ends up masturbating and not taking (re)productive actions. 1. Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestrally. 2.

5Vivek Hebbar3y

Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?

5TurnTrout3y

I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?

[-]Richard_Ngo4y140

Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.

2dsj4y

For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach? [Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]

6Ajeya Cotra4y

Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro: [...] As well as the section Even if Alex isn't "motivated" to maximize reward.... I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons. With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.

3Richard_Ngo4y

Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..." [...] The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).

2Lauro Langosco4y

I agree with your general point here, but I think Ajeya's post actually gets this right, eg [...] and [...]

2Lauro Langosco4y

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed". (Though of course it's important to spell the argument out)

2Ajeya Cotra4y

Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

5Richard_Ngo4y

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible. If I had to try to point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I'm more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!) Even setting aside this disagreement, though, I don't like the argumentative structure because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.

3Ajeya Cotra4y

Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here. [...] I don't understand why reward isn't something the model has direct access to -- it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here. [...] Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition that this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.

4TurnTrout4y

See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).

[-]Richard_Ngo5mo2423

The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).

In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspects of your motivations you're uncomfortable with" to "regularly explicitly thinking about how you're going to fool humans in order to take over the world" is a pretty continuous one. Yet generally people treat "schemer" as a fairly binary classification.

To be clear, I'm not confident that even "a spectrum of scheminess" is a good way to think about the concept. There are likely multiple important dimensions that could be disentangled; and eventually I'd like to discover properly scientific theories of concepts like honesty, deception and perhaps even "scheming". Our current lack of such theories shouldn't be a barrier to using those terms at all, but it suggests they should be used with a level of caution that I rarely see.

[-]Alex Mallen5mo97

I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, didn't think of themselves as deceptively aligned through most of training, but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these "behavioral schemers"). This might happen via reflection and memetic spread throughout the deployment or because of more subtle effects of the distribution shift to situations where there's an opportunity to grab power.

And I agree that people are often sloppy in their thinking about exactly how these AIs will be motivated (e.g., often too quickly concluding that they'll be trying to guard the same goal across contexts).

(Though, in case this was in question, I think this doesn't undermine the premise of AI control research, which is essentially making a worst-case assumption about the AI's motivations, so it's robust to other kinds of dangerously-motivated AIs.)

1Alex Mallen5mo

Here's some relevant discussion of "Behavioral schemers that weren’t training-time schemers": [...]

4maxnadeau5mo

What are your objections to Alex/Buck's definition, for example? I think they define it pretty well. Of course, their definition admits some edge cases and ambiguities, but so do lots of concepts; it still seems like a useful concept to me.

4Richard_Ngo5mo

This definition? If so, it seems vastly underspecified to be fit for scientific inquiry. For one thing, the definition of "selection" is pretty vague: I do not know how to assign "the degree to which [one cognitive pattern] is counterfactually responsible" for something even in principle. It also doesn't even try to set a threshold for what counts as a non-schemer. E.g. does it need to care literally 0% about the consequences of being selected? If so, approximately everything is a schemer, including all humans. (It also assumes the instrumental/terminal goal distinction, which I think is potentially confused, but that's a more involved discussion.) To be clear, my complaint is not that people are using vague definitions. My complaint is that the vague definitions are becoming far more load-bearing than they deserve. If people had tried to pin down more carefully what "schemer" means they would have been forced to develop a more nuanced understanding of what we even mean by "alignment" and "goals" and so on, which is the kind of thinking I want to see more of.

2Alex Mallen5mo

I think I propose a reasonable starting point for a definition of selection in a footnote in the post: [...] Selection = gaining influence. Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where I think I currently have a worse answer, perhaps because it's actually a less cleanly-applicable concept to real cognition). Note that the term "schemer" as I've just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern that you might call an AI a schemer if it doesn't "care literally 0%" about the consequences of being selected. I agree in practice it's unlikely for AIs to be purely motivated.

[-]Richard_Ngo1y2320

When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.

But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.

This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
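The normalization point in the first paragraph above can be made concrete with a minimal sketch. The outcomes and numbers are hypothetical; the only fact relied on is that shifting or positively rescaling a utility function leaves the induced preferences unchanged.

```python
def ranking(utilities):
    """Outcomes sorted from most to least preferred."""
    return sorted(utilities, key=utilities.get, reverse=True)

# Hypothetical utilities over three outcomes.
u = {"eat": 2.0, "rest": 1.0, "starve": -3.0}

# Same preferences, shifted so every value is negative
# (everything looks like something to "move away from")...
u_all_negative = {k: v - 10.0 for k, v in u.items()}

# ...or shifted and rescaled so every value is positive
# (everything looks like something to "move towards").
u_all_positive = {k: 0.5 * (v + 10.0) for k, v in u.items()}

print(ranking(u), ranking(u_all_negative), ranking(u_all_positive))
```

All three functions induce the same ranking, so under this framing the sign of any individual value carries no information about whether a motivation is "positive" or "negative".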

[-]Vanessa Kosoy1y50

In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it's formalized via a loss function. Roughly speaking, the loss function expresses preferences over "which computations are running". This means that you can have a "positive" preference for a particular computation to run or a "negative" preference for a particular computation not to run.[1]

1. ^ There are also more complicated possibilities, such as "if P runs then I want Q to run but if P doesn't run then I'd rather that Q also doesn't run" or even preferences that are only expressible in terms of entanglement between computations.

[-]Richard_Ngo2y1914

I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work.

tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4).

Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.

[-]Richard_Ngo3y193

Five clusters of alignment researchers

Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:

MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often does work that's fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
Prosaic cluster. Focuses on empi

... (read more)

[-]Richard_Ngo8mo170

Error-correcting codes work by running some algorithm to decode potentially-corrupted data. But what if the algorithm might also have been corrupted? One approach to dealing with this is triple modular redundancy, in which three copies of the algorithm each do the computation and take the majority vote on what the output should be. But this still creates a single point of failure—the part where the majority voting is implemented. Maybe this is fine if the corruption is random, because the voting algorithm can constitute a very small proportion of the total code. But I'm most interested in the case where the corruption happens adversarially—where the adversary would home in on the voting algorithm as the key thing to corrupt.
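A minimal sketch of triple modular redundancy and its single point of failure, under toy assumptions: the "algorithm" is a decoder for a 3x repetition code, and corruption is modeled as a function that always returns the wrong bit. All names here are hypothetical and chosen for illustration.

```python
from collections import Counter

def decode(bits):
    """Toy decoder (assumption): majority bit of a 3x repetition code."""
    return int(sum(bits) >= 2)

def corrupted_decode(bits):
    """A corrupted copy of the decoder that always returns the wrong bit."""
    return 1 - decode(bits)

def majority_vote(outputs):
    """Most common output among the replicas."""
    return Counter(outputs).most_common(1)[0][0]

def corrupted_voter(outputs):
    """A corrupted voter that reports the opposite of the true majority."""
    return 1 - majority_vote(outputs)

def tmr(replicas, voter, data):
    """Run every replica on the data, then let the voter pick the output."""
    return voter([replica(data) for replica in replicas])

data = [1, 0, 1]  # encodes the bit 1, with one bit flipped in transit

# One corrupted replica is outvoted by the two healthy ones.
one_bad_replica = tmr([decode, decode, corrupted_decode], majority_vote, data)

# A corrupted voter defeats three healthy replicas.
bad_voter = tmr([decode, decode, decode], corrupted_voter, data)

print(one_bad_replica, bad_voter)
```

The second case is the adversarial worry in the paragraph above: redundancy in the replicas does nothing if the adversary can target the one component that aggregates them.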

After a quick search, I can't find much work on this specific question. But I want to speculate on what such an "error-correcting algorithm" might look like. The idea of running many copies of it in parallel seems solid, so that it's hard to corrupt a majority at once. But there can't be a single voting algorithm (or any other kind of "overseer") between those copies and the output channel, because that overseer might itself be corrupted. Instead, you need the m... (read more)

[-]Carl Feynman8mo1215

I have some experience designing systems for high reliability and resistance to adversaries. I feel like I’ve seen this kind of thinking before.

Your current line of thinking is at a stage I would call “pretheoretical noodling around.” I don’t mean any disrespect; all design has to go through this stage. But you’re not going to find any good references, or come to any conclusions, if you stay at this stage. A next step is to settle on a model of what you want to get done, and what capabilities the adversaries have. You need some bounds on the adversaries; otherwise nothing can work. And of course you need some bounds on what the system does, and how reliably. Once you’ve got this, you can either figure out how to do it, or prove that it can’t be done.

For example there are ways of designing hardware which is reliable on the assumption that at most N transistors are corrupt.

The problem of coming to agreement between a number of actors, some of whom are corrupt, is known as the Byzantine generals problem. It is well studied, and you may find it interesting.

I’m also interested in this topic, and I look forward to seeing where this line of thinking takes you.

2Richard_Ngo7mo

Perhaps. The issue here is that I'm not so interested in any specific goal, but rather in facilitating emergent complexity. One analogy here is designing Conway's game of life: I expect that it wasn't a process of "pick the rules you want, then see what results from those" but also in part "pick what results you want, and then see what rules lead to that". Re the Byzantine generals problem, see my reply to niplav below: [...]

1Adrià Garriga-alonso7mo

Very interesting problem to be thinking about. The problem with a UTM as a computation model is that it bakes non-redundancy in: there's a single instruction pointer. In reality, computers are implemented in different spatial locations and can run in parallel. A better model for this is a cellular automaton, where computers are located somewhere and their circuit for outputs is also located somewhere. Some automata (e.g. Game of Life) are Turing-complete, so you can just use that. Corruption could be exogenously flipping cells in ways that violate the automaton's rules. If you specify a maximum number of cells that the opponent can corrupt, you can implement voting by paired sums (i.e. sum(A, B, C, D) as sum(sum(A, B), sum(C, D))) and then if there are sufficiently many copies, it becomes impossible to corrupt them all at once. So I don't love this model because escaping corruption is 'too easy'. At the same time, reality is kind of cellular-automata-like. Both QFT and GR posit that the world is made of fields that interact only locally, which is ~the same as positing the world is a cellular automaton with infinitesimally-sized cells. (Sidenote, that's probably why Stephen Wolfram thinks the world is automatons, I'm coming around.) Alternatively, we could use computational DAGs as the model, like neural networks. If you allow nodes to be corrupted but their output has to be bounded, then you can get robustness by having redundancy again. If you allow unbounded corruption, you're sad again. But infinity is fake so this seems fine.
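A minimal sketch of the "voting by paired sums" idea, assuming each copy outputs a single bit and the adversary can flip a bounded number of copies. The function names are hypothetical, and the sketch deliberately leaves the adder nodes themselves incorruptible, which is exactly the assumption questioned in the reply below.

```python
def paired_sum(values):
    """Sum a list by recursively adding halves, e.g. sum(sum(A, B), sum(C, D))."""
    if not values:
        return 0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return paired_sum(values[:mid]) + paired_sum(values[mid:])


def majority_bit(copies):
    """Majority output bit of the copies (each 0 or 1); ties resolve to 0."""
    return 1 if 2 * paired_sum(copies) > len(copies) else 0


def vote_with_corruption(true_bit, n_copies, corrupted_indices):
    """Flip the copies the adversary controls, then take the tree-sum vote."""
    copies = [true_bit] * n_copies
    for i in corrupted_indices:
        copies[i] = 1 - copies[i]
    return majority_bit(copies)
```

With 7 copies, corrupting 3 leaves the vote intact, while corrupting 4 flips it, so the bound on the adversary is doing all the work here.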

2Richard_Ngo7mo

I really like the cellular automaton model. But I don't think it makes escaping corruption easy! Even if most of the copies are non-corrupt, the question is how you can take a "vote" of the corrupt vs non-corrupt copies without making the voting mechanism itself be easily corrupted. That's why I was talking about the non-corrupt copies needing to "overpower" the corrupt copies above.

1Adrià Garriga-alonso7mo

No I agree with that. I thought the tree design already involved weighted sums overpowering each other, but I think that was premature.

2Richard_Ngo7mo

Thinking more about the cellular automaton stuff: okay, so Game of Life is Turing complete. But the question is whether we can pin down properties that GoL has that Turing machines don't have. I have a vague recollection that parallel Turing Machines are a thing, but this paper claims that the actual formalisms are disappointing. One nice thing about Game of Life is that the way that different programs interact internally (via game of life physics) is also how they interact with each other. Whereas any multi-tape Turing Machine (even one with clever rules about how to integrate inputs from multiple tapes) wouldn't have that property. I feel like I'm not getting beyond the original idea that Game of Life could have adversarial robustness in a way that Turing Machines don't. But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.

1Adrià Garriga-alonso7mo

I agree it's kind of difficult. Have you seen Nicholas Carlini's Game of Life series? It starts with logic gates and builds up to a microprocessor that factors 15 into 3 x 5. Depending on the adversarial robustness model (e.g. every second the adversary can make 1 square behave the opposite of lawfully), it might be possible to make robust logic gates and circuits. In fact the existing circuits are a little robust already -- though not to the tune of 1 square per tick, that's too much power for the adversary.
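To make the adversary model concrete, here is a small sketch assuming an unbounded grid represented as a set of live cells, and an adversary who flips one chosen cell after each lawful tick. The names `life_step` and `adversarial_step` are illustrative, not taken from the series mentioned above.

```python
from collections import Counter


def life_step(live):
    """One lawful Game of Life tick; `live` is a set of (x, y) live cells."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next tick with exactly 3 live neighbours,
    # or with exactly 2 if it is already alive.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


def adversarial_step(live, target):
    """Lawful tick, then the adversary flips `target` against the rules."""
    return life_step(live) ^ {target}
```

As a tiny instance of "a little robust already": a 2x2 block that loses one cell to the adversary regrows it on the next lawful tick, provided the adversary does not strike again.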

[-]Richard_Ngo4y140

A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.

(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though... but maybe understanding models better is robustly good enough to outweigh that?)

3habryka4y

I like this. Would this have to be publicly available models? Seems kind of hard to do for private models.

2Ramana Kumar4y

What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?

[-]Richard_Ngo3y120

Deceptive alignment doesn't preserve goals.

A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were "make as many paperclips as possible", but the goal "make as many staples as possible" could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.

But actually, it'd likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)

Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn't very robust in humans)

[-]Johannes Treutlein3y80

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.

3Richard_Ngo3y

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in aligned ways --> Aligned behavior

and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.

Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).

2TurnTrout3y

Why would the agent reason like this?

2Richard_Ngo3y

Because of standard deceptive alignment reasons (e.g. "I should make sure gradient descent doesn't change my goal; I should make sure humans continue to trust me").

3TurnTrout3y

I think you don't have to reason like that to avoid getting changed by SGD. Suppose I'm being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events. Maybe this is compatible with what you had in mind! It's just not something that I think of as "high reward." And maybe there's some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust... but that feels quite contingent to me.

2Richard_Ngo3y

I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that "treading water" is in fact a negative-advantage action (unless there's some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic's responses will depend on whether its goals are indexical or not (if they are, they're different from the actor's goals; if not, they're the same) and how easily it can coordinate with the actor. Or it could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of outcomes taken by a single coherent agent - but then the critic doesn't need to produce a value function that's consistent with historical events, because an actor and a critic that are working together could gradient hack into all sorts of weird equilibria.

1SoerenMind3y

The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it. In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.

3TurnTrout3y

Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.

2Richard_Ngo3y

In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
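A toy illustration of this drift, under strong simplifying assumptions: two parameters whose sum alone determines training behavior, and L2 weight decay as a stand-in for "simpler". Everything here (the loss, the hyperparameters, the function names) is hypothetical, and whether parameter norm tracks the complexity of an internally represented goal is precisely what is disputed in this thread.

```python
def task_loss(w1, w2):
    """Training behavior depends only on w1 + w2, so many settings perform the same."""
    return (w1 + w2 - 1.0) ** 2


def train(w1, w2, weight_decay=0.01, lr=0.1, steps=5000):
    """Gradient descent on task_loss + weight_decay * (w1^2 + w2^2)."""
    for _ in range(steps):
        err = w1 + w2 - 1.0
        g1 = 2 * err + 2 * weight_decay * w1
        g2 = 2 * err + 2 * weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2


# Start at a setting with zero task loss but a large norm.
final_w1, final_w2 = train(1.5, -0.5)
```

Starting from (1.5, -0.5), the task loss stays near zero throughout while the weights slide along the set of equally performing solutions towards the lowest-norm one, roughly (0.5, 0.5). The drift is slow: the gap between the two weights shrinks by a factor of only 1 - 2 * lr * weight_decay per step.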

3SoerenMind3y

Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time. To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

2Richard_Ngo3y

This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be: a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up. b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime). In a setting where you also have outer alignment failures, the same argument still holds, just replace "aligned agent" with "reward-maximizing agent".

[-]Richard_Ngo4y110

Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).

[-]Richard_Ngo6y110

The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?

Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.

1adamShimi6y

By making humans safe, do you mean with regard to evolution's objective?

1Richard_Ngo6y

No. I meant: suppose we were rerunning a simulation of evolution, but can modify some parts of it (e.g. evolution's objective). How do we ensure that whatever intelligent species comes out of it is safe in the same ways we want AGIs to be safe? (You could also think of this as: how could some aliens overseeing human evolution have made humans safe by those aliens' standards of safety? But this is a bit trickier to think about because we don't know what their standards are. Although presumably current humans, being quite aggressive and having unbounded goals, wouldn't meet them).

2adamShimi6y

Okay, thanks. Could you give me an example of a research direction that passes this test? The thing I have in mind right now is pretty much everything that backchains to local search, but maybe that's not the way you think about it.

1Richard_Ngo6y

So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they're doing human experiments on it already. But this heuristic is actually a reason why I'm pretty pessimistic about most safety research directions.

2adamShimi6y

So I've been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might obviously mean I misunderstand your perspective. What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from an evolutionary process. I don't really see a reason to consider evolutionary processes instead of local search, and even then, the specific approach of evolution for humans is probably far too specific as a test bench. This matters because problems for one are not problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that's not a problem for local search, since at each step there will be only one next program. On the other hand, local search might be dangerous because of things like gradient hacking, which doesn't make sense for evolutionary processes. In conclusion, I feel for the moment that backchaining to local search is a better heuristic for judging safety research directions. But I'm curious about where our disagreement lies on this issue.

4Richard_Ngo6y

One source of our disagreement: I would describe evolution as a type of local search. The difference is that it's local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don't think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead. In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it's missing, though, is that it doesn't tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they'll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don't know how to apply them to the human ancestral environment, we also won't know how to apply them to our AGIs' training environments. Similarly, when I think about MIRI's work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them. As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.

2adamShimi6y

So if I try to summarize your position, it's something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agent version? That makes much more sense, thanks! I also definitely see why your full heuristic doesn't feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I've been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.

2Richard_Ngo6y

Cool, glad to hear it. I'd clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they'll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I'm not sure.)

[-]Richard_Ngo2y88

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.

This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definiti... (read more)
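As a rough sketch of the counterfactual definition above, assume G is a finite list of goals with uniform weights and that an AI policy can be summarized by the probability that a human with goal g achieves it. The goals, hazard values, and the 10% oversight overhead are all made-up illustrative numbers.

```python
def power(success_prob, goals):
    """Average probability of achieving a goal drawn uniformly from `goals`."""
    return sum(success_prob(g) for g in goals) / len(goals)


# Each hypothetical goal is (name, hazard), where hazard is the chance that
# pursuing it unassisted is self-destructive before the goal is achieved.
GOALS = [
    ("build a house", 0.0),
    ("sail around the world", 0.3),
    ("test a homemade rocket", 0.8),
]


def never_spend(goal):
    """AI hoards resources for the human: no goal is ever achieved."""
    return 0.0


def laissez_faire(goal):
    """AI never intervenes: success unless the human self-destructs."""
    _, hazard = goal
    return 1.0 - hazard


def guarded(goal):
    """AI blocks self-destructive actions at an assumed 10% overhead."""
    return 0.9
```

Under these toy numbers the hoarding policy scores zero power, and the guarded policy beats pure non-intervention, which matches the intuition that the maximizer has to balance the two. How the ranking changes with G is not something this sketch can settle.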

5Richard_Ngo2y

You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

[-]Richard_Ngo2y85

People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:

1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they're relevant. So the speed cost of deception will be amortized across the (likely very long) training period.

2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like "when you see X, initiate the world takeover plan" would therefore constitute a very small proportion of the total information represented in the network; it'd be hard to regularize it away without regularizing away most of the AGI's knowledge.

I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptiv... (read more)

6ryan_greenblatt2y

Why do you think SGD will do this? Or are you imagining non-SGD mechanisms? It seems non-obvious to me that this will occur with SGD, though possible.

[-]Richard_Ngo4y80

Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.

But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret "similar to me" as de dicto vs de re - i.e. whether it refers to the old me or the new me.

This is a more general problem when one person's utility function can depend on another person's, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There's probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct typ... (read more)

[-]Richard_Ngo6y*80

A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.

I think this is useful for framing my core concerns about current safety research:

If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they're making comparatively small updates to agents which are already misaligned?

I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.

2Steven Byrnes6y

I wrote a few posts on self-supervised learning last year:
- https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
- https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
- https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions

I'm not aware of any airtight argument that "pure" self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven't thought about it much since then.

The other issue is whether "pure" self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here. The other side is, I'm now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn't need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI. Well, maybe.

For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It's not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first approximation. The fact that a larger fraction of neocortical synapses are adjusted by self-supervised learning than by reward learning is interesting and presumably safety-relevant, but I don't think it immediately proves that self-supervised learning has a similarly larger fraction of the answers to AGI safety questions. Maybe, maybe not, it's not immediately obvious. :-)

[-]Richard_Ngo6y60

Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.

3DanielFilan6y

Perhaps the lesson is that terminology that is acceptable in one field (in this case philosophy) might not be suitable in another (in this case machine learning).

2Richard_Ngo6y

I don't think that even philosophers take the "genie" terminology very seriously. I think the more general lesson is something like: it's particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.

2adamShimi6y

After rereading the chapter in Superintelligence, it seems to me that "genie" captures something akin to act-based agents. Do you think that's the main way to use this concept in the current state of the field, or do you have other applications in mind?

1Richard_Ngo6y

Ah, yeah, that's a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row - in fact, I think that's what made me overlook the fact that it's pointing at the right concept. So not sure if I'm comfortable using it going forward, but thanks for pointing that out.

1adamShimi6y

Is that from Superintelligence? I googled it, and that was the most convincing result.

1Richard_Ngo6y

Yepp.

[-]Richard_Ngo4y*50

I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
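A bare-bones sketch of the protocol shape, assuming (as one reading of the proposal) that the model is evaluated after the poisoned updates. The one-parameter linear model, the squared-error loss, and the names `sgd_steps` and `evaluate_attack` are stand-ins chosen for brevity, nothing like a deceptively aligned policy.

```python
def sgd_steps(w, poison, lr=0.1):
    """One SGD step per adversary-chosen (x, target) pair, for f(x) = w * x."""
    for x, target in poison:
        grad = 2 * (w * x - target) * x  # d/dw of (w * x - target)^2
        w -= lr * grad
    return w


def evaluate_attack(w_base, poison, x_adv, is_bad, max_steps=3):
    """Let the adversary poison for at most `max_steps` steps, then test x_adv.

    Returns True if the updated model's output on x_adv counts as a failure.
    """
    if len(poison) > max_steps:
        raise ValueError("adversary exceeded its SGD-step budget")
    w = sgd_steps(w_base, poison)
    return is_bad(w * x_adv)


def output_is_negative(y):
    """Hypothetical failure criterion for this toy model."""
    return y < 0
```

For instance, `evaluate_attack(1.0, [(1.0, -5.0)] * 3, 1.0, output_is_negative)` reports a failure, whereas the same input with an empty poison set does not. The step budget plays the role of the bound on how far the adversary may move the model.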

[-]Richard_Ngo6y40

I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".

2DanielFilan6y

I kind of think the lack of episodes makes it more realistic for many problems, but admittedly not for simulated games. Also, presumably many of the component Turing machines have reusable parameters and reinforce behaviour, altho this is hidden by the formalism. [EDIT: I retract the second sentence]

1DanielFilan6y

Actually I think this is total nonsense produced by me forgetting the difference between AIXI and Solomonoff induction.

1Richard_Ngo6y

Wait, really? I thought it made sense (although I'd contend that most people don't think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I'm making). What's incorrect about it?

1DanielFilan6y

Well now I'm less sure that it's incorrect. I was originally imagining that like in Solomonoff induction, the TMs basically directly controlled AIXI's actions, but that's not right: there's an expectimax. And if the TMs reinforce actions by shaping the rewards, in the AIXI formalism you learn that immediately and throw out those TMs.

1Richard_Ngo6y

Oh, actually, you're right (that you were wrong). I think I made the same mistake in my previous comment. Good catch.

2Steven Byrnes6y

Humans don't have a training / deployment distinction either... Do humans have "reusable parameters"? Not quite sure what you mean by that.

3Richard_Ngo6y

Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters. Unfortunately I haven't yet written any papers/posts really laying out this analogy, but it's pretty central to the way I think about AI, and I'm working on a bunch of related stuff as part of my PhD, so hopefully I'll have a more complete explanation soon.

1Steven Byrnes6y

Oh, OK, I see what you mean. Possibly related: my comment here.

[-]Richard_Ngo4y*3-2

A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.

The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.

What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und... (read more)

6johnswentworth4y

That doesn't actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences - e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
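The objection can be shown with a toy channel, assuming the human intermediary paraphrases faithfully and therefore preserves the topic of each sentence. The sentences and function names below are invented for illustration.

```python
TOPIC_FOR_BIT = {0: "pasta", 1: "rain"}


def encode(bit):
    """Sender: hide one bit in the choice of topic of an innocuous sentence."""
    return f"I have been thinking about {TOPIC_FOR_BIT[bit]} all day."


def human_paraphrase(sentence):
    """Stand-in for a faithful human rephrasing: wording changes, topic survives."""
    for topic in TOPIC_FOR_BIT.values():
        if topic in sentence:
            return f"They say {topic} has been on their mind."
    return "They did not say anything in particular."


def decode(sentence):
    """Receiver: recover the hidden bit from the topic alone."""
    for bit, topic in TOPIC_FOR_BIT.items():
        if topic in sentence:
            return bit
    return None
```

Because the bit lives in what the sentence is about rather than in its exact wording, `decode(human_paraphrase(encode(b)))` returns `b` for both values: the paraphrase step removes surface-level encodings but not semantic ones.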

2RobertKirk4y

Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.

[-]Richard_Ngo6y20

There's some possible world in which the following approach to interpretability works:

Put an AGI in a bunch of situations where it sometimes is incentivised to lie and sometimes is incentivised to tell the truth.
Train a lie detector which is given all its neural weights as input.
Then ask the AGI lots of questions about its plans.

One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altr... (read more)

[-]Richard_Ngo6y*10

I've heard people argue that "most" utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here's one intuition in the other direction. I don't expect this to be persuasive to most people who make the argument above (but I'd still be interested in hearing why not).

If a non-negligible percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (becaus... (read more)

2TurnTrout6y

I'm not sure if you consider me to be making that argument, but here are my thoughts: I claim that most reward functions lead to agents with strong convergent instrumental goals. However, I share your intuition that (somehow) uniformly sampling utility functions over universe-histories might not lead to instrumental convergence.

To understand instrumental convergence and power-seeking, consider how many reward functions we might specify automatically imply a causal mechanism for increasing reward. The structure of the reward function implies that more is better, and that there are mechanisms for repeatedly earning points (for example, by the agent showing itself a high-scoring input). Since the reward function is "simple" (there's usually not a way to grade exact universe histories), these mechanisms work in many different situations and points in time. The agent is naturally incentivized to assure its own safety in order to best leverage these mechanisms for gaining reward. Therefore, we shouldn't be surprised to see a lot of these simple goals leading to the same kind of power-seeking behavior.

What structure is implied by a reward function?

* Additive/Markovian: while a utility function might be over an entire universe-history, reward is often additive over time steps. This is a strong constraint which I don't always expect to be true, but I think that among the goals with this structure, a greater proportion of them have power-seeking incentives.
* Observation-based: while a utility function might be over an entire universe-history, the atom of the reward function is the observation. Perhaps the observation is an input to update a world model, over which we have tried to define a reward function. I think that most ways of doing this lead to power-seeking incentives.
* Agent-centric: reward functions are defined with respect to what the agent can observe. Therefore, in partially observable environments, there is naturally a greater emphasis on the agent's vantage point in t

3Richard_Ngo6y

I've just put up a post which serves as a broader response to the ideas underpinning this type of argument.

3Richard_Ngo6y

I think this depends a lot on how you model the agent developing. If you start off with a highly intelligent agent which has the ability to make long-term plans, but doesn't yet have any goals, and then you train it on a random reward function - then yes, it probably will develop strong convergent instrumental goals. On the other hand, if you start off with a randomly initialised neural network, and then train it on a random reward function, then probably it will get stuck in a local optimum pretty quickly, and never learn to even conceptualise these things called "goals". I claim that when people think about reward functions, they think too much about the former case, and not enough about the latter. Because while it's true that we're eventually going to get highly intelligent agents which can make long-term plans, it's also important that we get to control what reward functions they're trained on up to that point. And so plausibly we can develop intelligent agents that, in some respects, are still stuck in "local optima" in the way they think about convergent instrumental goals - i.e. they're missing whatever cognitive functionality is required for being ambitious on a large scale.

1TurnTrout6y

Agreed – I should have clarified. I've been mostly discussing instrumental convergence with respect to optimal policies. The path through policy space is also important.

[-]Richard_Ngo6y80

Makes sense. For what it's worth, I'd also argue that thinking about optimal policies at all is misguided (e.g. what's the optimal policy for humans - the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we'd be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).

[-]TurnTrout4y60

(I now think that you were very right in saying "thinking about optimal policies at all is misguided", and I was very wrong to disagree. I've thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology -- not about optimal policies for a reward function.)

1TurnTrout6y

I disagree.

1. We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1] Even if we don't expect the trained agents to reach the optimal policy in the real world, we should still understand what behavior is like at optimum. If you think your proposal is not aligned at optimum but is aligned for realistic training paths, you should have a strong story for why.
2. Formal theorizing about instrumental convergence with respect to optimal behavior is strictly easier than theorizing about ϵ-optimal behavior, which I think is what you want for a more realistic treatment of instrumental convergence for real agents. Even if you want to think about sub-optimal policies, if you don't understand optimal policies... good luck! Therefore, we also have an instrumental (...) interest in studying the behavior at optimum.

----------------------------------------

1. At least, the tabular algorithms are proven, but no one uses those for real stuff. I'm not sure what the results are for function approximators, but I think you get my point.

1Richard_Ngo6y

1. I think it's more accurate to say that, because approximately none of the non-trivial theoretical results hold for function approximation, approximately none of our non-trivial agents are proven to eventually converge to the optimal policy. (Also, given the choice between an algorithm without convergence proofs that works in practice, and an algorithm with convergence proofs that doesn't work in practice, everyone will use the former). But we shouldn't pay any attention to optimal policies anyway, because the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing about optimal behaviour by roughly ϵ, for roughly the same reasons. But in general, clearly I can understand things about suboptimal policies without understanding optimal policies. I know almost nothing about the optimal policy in StarCraft, but I can still make useful claims about AlphaStar (for example: it's not going to take over the world).

Again, let's try to cash this out. I give you a human - or, say, the emulation of a human, running in a simulation of the ancestral environment. Is this safe? How do you make it safer? What happens if you keep selecting for intelligence? I think that the theorising you talk about will be actively harmful for your ability to answer these questions.

1TurnTrout6y

I'm confused, because I don't disagree with any specific point you make - just the conclusion. Here's my attempt at a disagreement which feels analogous to me: [...] My response in this "debate" is: if you start with a spherical cow and then consider which real world differences are important enough to model, you're better off than just saying "no one should think about spherical cows". [...] I don't understand why you think that. If you can have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply for suboptimal humans. Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.

6Richard_Ngo6y

Thanks for engaging despite the opacity of the disagreement. I'll try to make my position here much more explicit (and apologies if that makes it sound brusque). The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven't seen you give arguments that your models are in the former category not the latter. Presumably you think they are in fact useful abstractions - why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory, Gettier-style analysis of knowledge). My argument for why they're overall misleading: when I say that "the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute", or that safety researchers shouldn't think about AIXI, I'm not just saying that these are inaccurate models. I'm saying that they are modelling fundamentally different phenomena than the ones you're trying to apply them to. AIXI is not "intelligence", it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases. Probably the best example of what I'm complaining about is Ned Block trying to use Blockhead to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field). [...] Forgive me a little poetic license. The analogy in my mind is that you were trying to model the cow as a sphere, but y

2TurnTrout6y

Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions. [...]

Falsifying claims and "breaking" proposals is a classic element of AI alignment discourse and debate. Since we're talking about superintelligent agents, we can't predict exactly what a proposal would do. However, if I make a claim ("a superintelligent paperclip maximizer would keep us around because of gains from trade"), you can falsify this by showing that my claimed policy is dominated by another class of policies ("we would likely be comically resource-inefficient in comparison; GFT arguments don't model dynamics which allow killing other agents and appropriating their resources"). Even we can come up with this dominant policy class, so the posited superintelligence wouldn't miss it either. We don't know what the superintelligent policy will be, but we know what it won't be (see also Formalizing convergent instrumental goals). Even though I don't know how Garry Kasparov will open the game, I confidently predict that he won't let me checkmate him in two moves.

Non-optimal power and instrumental convergence

Instead of thinking about optimal policies, let's consider the performance of a given algorithm $A$. $A(M,R)$ takes a rewardless MDP $M$ and a reward function $R$ as input, and outputs a policy.

Definition. Let $\mathcal{R}$ be a continuous distribution over reward functions with CDF $F$. The average return achieved by algorithm $A$ at state $s$ and discount rate $\gamma$ is $\int_{\mathcal{R}} V^{A(M,R)}_{R}(s,\gamma)\,dF(R)$.

Instrumental convergence with respect to $A$'s policies can be defined similarly ("what is the $\mathcal{R}$-measure of a given trajectory under $A$?"). The theory I've laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called "instrumental convergence".

Here's bad reasoning, which implies that the cow tears a hole in spacetime: [...] The problem is that it's impractical to predict what a smarter agent

2Richard_Ngo6y

I'm afraid I'm mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework. If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it'd be something like:

1. Instrumental convergence isn't training-time behaviour, it's test-time behaviour. It isn't about increasing reward, it's about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it's the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g., if it's just really really difficult for gradient descent to get a proper planning module working). Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won't get you to optimality. But I would say that talking about "optimality" here ruins the abstraction, for reasons outlined in my previous comment).

1TurnTrout6y

I was actually going to respond, "that's a good point, but (IMO) a different concern than the one you initially raised". I see you making two main critiques.

1. (paraphrased) "A won't produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn't useful" – I replied to this critique above.
2. "The space of goals that agents might learn is very different from the space of reward functions." I agree this is an important part of the story.

I think the reasonable takeaway is "current theorems on instrumental convergence help us understand what superintelligent A won't do, assuming no reward-result gap. Since we can't assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced." I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.

ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.

1DanielFilan6y

I object to the claim that agents that act randomly can be made "arbitrarily simple". Randomness is basically definitionally complicated!

1Richard_Ngo6y

Eh, this seems a bit nitpicky. It's arbitrarily simple given a call to a randomness oracle, which in practice we can approximate pretty easily. And it's "definitionally" easy to specify as well: "the function which, at each call, returns true with 50% likelihood and false otherwise."

1DanielFilan6y

If you get an 'external' randomness oracle, then you could define the utility function pretty simply in terms of the outputs of the oracle. If the agent has a pseudo-random number generator (PRNG) inside it, then I suppose I agree that you aren't going to be able to give it a utility function that has the standard set of convergent instrumental goals, and PRNGs can be pretty short. (Well, some search algorithms are probably shorter, but I bet they have higher Kt complexity, which is probably a better measure for agents)
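
To make "PRNGs can be pretty short" concrete, here is a rough sketch of an agent whose whole policy is a few lines of pseudo-random number generation. The generator is a textbook linear congruential generator using the constants published in Numerical Recipes; the names `TinyLCG` and `random_agent` and the two-action setup are hypothetical, chosen only for illustration.

```python
class TinyLCG:
    """Linear congruential generator: state <- (a * state + c) mod 2**32."""

    def __init__(self, seed: int) -> None:
        self.state = seed % 2**32

    def next_bit(self) -> bool:
        # Multiplier and increment are the Numerical Recipes LCG constants.
        self.state = (1664525 * self.state + 1013904223) % 2**32
        # Use the top bit, since the low bits of an LCG have short periods.
        return bool(self.state >> 31)


def random_agent(seed: int, steps: int) -> list[str]:
    """An 'agent' that picks between two actions using only its internal PRNG."""
    rng = TinyLCG(seed)
    return ["left" if rng.next_bit() else "right" for _ in range(steps)]


actions = random_agent(seed=42, steps=10_000)
left_fraction = actions.count("left") / len(actions)
print(f"fraction of 'left' actions: {left_fraction:.3f}")
```

The generating process is a handful of lines, whereas a utility function over raw action sequences that singles out this exact behaviour would, on the face of it, have to encode the whole sequence.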

1Vaniver6y

I'd take a different tack here, actually; I think this depends on what the input to the utility function is. If we're only allowed to look at 'atomic reality', or the raw actions the agent takes, then I think your analysis goes through, that we have a simple causal process generating the behavior but need a very complicated utility function to make a utility-maximizer that matches the behavior. But if we're allowed to decorate the atomic reality with notes like "this action was generated randomly", then we can have a utility function that's as simple as the generator, because it just counts up the presence of those notes. (It doesn't seem to me like this decorator is meaningfully more complicated than the thing that gave us "agents taking actions" as a data source, so I don't think I'm paying too much here.) This can lead to a massive explosion in the number of possible utility functions (because there's a tremendous number of possible decorators), but I think this matches the explosion that we got by considering agents that were the outputs of causal processes in the first place. That is, consider reasoning about python code that outputs actions in a simple game, where there are many more possible python programs than there are possible policies in the game.

1Richard_Ngo6y

So in general you can't have utility functions that are as simple as the generator, right? E.g. the generator could be deontological. In which case your utility function would be complicated. Or it could be random, or it could choose actions by alphabetical order, or... And so maybe you can have a little note for each of these. But now what it sounds like is: "I need my notes to be able to describe every possible cognitive algorithm that the agent could be running". Which seems very very complicated. I guess this is what you meant by the "tremendous number" of possible decorators. But if that's what you need to do to keep talking about "utility functions", then it just seems better to acknowledge that they're broken as an abstraction. E.g. in the case of python code, you wouldn't do anything analogous to this. You would just try to reason about all the possible python programs directly. Similarly, I want to reason about all the cognitive algorithms directly.

1Vaniver6y

That's right. I realized my grandparent comment is unclear here: [...] This should have been "consequence-desirability-maximizer" or something, since the whole question is "does my utility function have to be defined in terms of consequences, or can it be defined in terms of arbitrary propositions?". If I want to make the deontologist-approximating Innocent-Bot, I have a terrible time if I have to specify the consequences that correspond to the bot being innocent and the consequences that don't, but if you let me say "Utility = 0 - badness of sins committed" then I've constructed a 'simple' deontologist. (At least, about as simple as the bot that says "take random actions that aren't sins", since both of them need to import the sins library.) In general, I think it makes sense to not allow this sort of elaboration of what we mean by utility functions, since the behavior we want to point to is the backwards assignment of desirability to actions based on the desirability of their expected consequences, rather than the expectation of any arbitrary property. --- Actually, I also realized something about your original comment which I don't think I had the first time around; if by "some reasonable percentage of an agent's actions are random" you mean something like "the agent does epsilon-exploration" or "the agent plays an optimal mixed strategy", then I think it doesn't at all require a complicated utility function to generate identical behavior. Like, in the rock-paper-scissors world, and with the simple function 'utility = number of wins', the expected utility maximizing move (against tough competition) is to throw randomly, and we won't falsify the simple 'utility = number of wins' hypothesis by observing random actions. Instead I read it as something like "some unreasonable percentage of an agent's actions are random", where the agent is performing some simple-to-calculate mixed strategy that is either suboptimal or only optimal by luck (when the optimal mixe

2Richard_Ngo6y

This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don't think this helps. [...] To be pedantic: we care about "consequence-desirability-maximisers" (or in Rohin's terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect. [...] What do you mean by optimal here? The robot's observed behaviour will be optimal for some utility function, no matter how long you run it.

1Vaniver6y

Valid point. [...] This also seems right. Like, my understanding of what's going on here is we have:

* 'central' consequence-desirability-maximizers, where there's a simple utility function that they're trying to maximize according to the VNM axioms
* 'general' consequence-desirability-maximizers, where there's a complicated utility function that they're trying to maximize, which is selected because it imitates some other behavior

The first is a narrow class, and depending on how strict you are with 'maximize', quite possibly no physically real agents will fall into it. The second is a universal class, which instantiates the 'trivial claim' that everything is utility maximization. Put another way, the first is what happens if you hold utility fixed / keep utility simple, and then examine what behavior follows; the second is what happens if you hold behavior fixed / keep behavior simple, and then examine what utility follows.

Distance from the first is what I mean by "the further a robot's behavior is from optimal"; I want to say that I should have said something like "VNM-optimal" but actually I think it needs to be closer to "simple utility VNM-optimal." I think you're basically right in calling out a bait-and-switch that sometimes happens, where anyone who wants to talk about the universality of expected utility maximization in the trivial 'general' sense can't get it to do any work, because it should all add up to normality, and in normality there's a meaningful distinction between people who sort of pursue fuzzy goals and ruthless utility maximizers.
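
The 'general' class can be illustrated with a small sketch: for any observed sequence of actions, one can write down a utility function under which that exact sequence is the unique optimum. The action names, the toy trajectory, and the helpers `imitation_utility` and `best_trajectories` are all hypothetical, and the brute-force search is only feasible because the example is tiny.

```python
from itertools import product


def imitation_utility(observed):
    """Utility that is 1 on the observed trajectory and 0 on every other one."""
    target = tuple(observed)
    return lambda trajectory: 1.0 if tuple(trajectory) == target else 0.0


def best_trajectories(utility, actions, horizon):
    """Brute-force search for the utility-maximizing trajectories."""
    candidates = list(product(actions, repeat=horizon))
    best = max(utility(t) for t in candidates)
    return [t for t in candidates if utility(t) == best]


actions = ("left", "right", "wait")
observed = ("wait", "left", "left", "right")  # arbitrary behaviour, could be random

utility = imitation_utility(observed)
optimal = best_trajectories(utility, actions, len(observed))
print(optimal)
```

The constructed utility function has to store the entire observed trajectory, so it grows with the behaviour it imitates. That is the sense in which holding behaviour fixed and solving for utility makes the utility function complicated and the resulting claim trivial.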
