Ngo's view on alignment difficulty

Richard_Ngo; Eliezer Yudkowsky

This post features a write-up by Richard Ngo on his views, with inline comments.

13. Follow-ups to the Ngo/Yudkowsky conversation

13.1. Alignment difficulty debate: Richard Ngo's case

[Ngo][9:31] (Sep. 25)

As promised, here's a write-up of some thoughts from my end. In particular, since I've spent a lot of the debate poking Eliezer about his views, I've tried here to put forward more positive beliefs of my own in this doc (along with some more specific claims): [GDocs link]

[Soares: ✨]

[Ngo] (Sep. 25 Google Doc)

We take as a starting observation that a number of “grand challenges” in AI have been solved by AIs that are very far from the level of generality which people expected would be needed. Chess, once considered to be the pinnacle of human reasoning, was solved by an algorithm that’s essentially useless for real-world tasks. Go required more flexible learning algorithms, but policies which beat human performance are still nowhere near generalising to anything else; the same for StarCraft, DOTA, and the protein folding problem. Now it seems very plausible that AIs will even be able to pass (many versions of) the Turing Test while still being a long way from AGI.

[Yudkowsky][11:26] (Sep. 25 comment)

Now it seems very plausible that AIs will even be able to pass (many versions of) the Turing Test while still being a long way from AGI.

I remark: Restricted versions of the Turing Test. Unrestricted passing of the Turing Test happens after the world ends. Consider how smart you'd have to be to pose as an AGI to an AGI; you'd need all the cognitive powers of an AGI as well as all of your human powers.

[Ngo][11:24] (Sep. 29 comment)

Perhaps we can quantify the Turing test by asking something like:

What percentile of competence is the judge?
What percentile of competence are the humans who the AI is meant to pass as?
How much effort does the judge put in (measured in, say, hours of strategic preparation)?

Does this framing seem reasonable to you? And if so, what are the highest numbers for each of these metrics that correspond to a Turing test which an AI could plausibly pass before the world ends?

[Ngo] (Sep. 25 Google Doc)

I expect this trend to continue until after we have AIs which are superhuman at mathematical theorem-proving, programming, many other white-collar jobs, and many types of scientific research. It seems like Eliezer doesn't. I’ll highlight two specific disagreements which seem to play into this.

[Yudkowsky][11:28] (Sep. 25 comment)

doesn't

Eh? I'm pretty fine with something proving the Riemann Hypothesis before the world ends. It came up during my recent debate with Paul, in fact.

Not so fine with something designing nanomachinery that can be built by factories built by proteins. They're legitimately different orders of problem, and it's no coincidence that the second one has a path to pivotal impact, and the first does not.

[Ngo] (Sep. 25 Google Doc)

A first disagreement is related to Eliezer’s characterisation of GPT-3 as a shallow pattern-memoriser. I think there’s a continuous spectrum between pattern-memorisation and general intelligence. In order to memorise more and more patterns, you need to start understanding them at a high level of abstraction, draw inferences about parts of the patterns based on other parts, and so on. When those patterns are drawn from the real world, then this process leads to the gradual development of a world-model.

This position seems more consistent with the success of deep learning so far than Eliezer’s position (although my advocacy of it loses points for being post-hoc; I was closer to Eliezer’s position before the GPTs). It also predicts that deep learning will lead to agents which can reason about the world in increasingly impressive ways (although I don’t have a strong position on the extent to which new architectures and algorithms will be required for that). I think that the spectrum from less to more intelligent animals (excluding humans) is a good example of what it looks like to gradually move from pattern-memorisation to increasingly sophisticated world-models and abstraction capabilities.

[Yudkowsky][11:30] (Sep. 25 comment)

In order to memorise more and more patterns, you need to start understanding them at a high level of abstraction, draw inferences about parts of the patterns based on other parts, and so on.

Correct. You can believe this and not believe that exactly GPT-like architectures can keep going deeper until their overlap of a greater number of patterns achieves the same level of depth and generalization as human depth and generalization from fewer patterns, just like pre-transformer architectures ran into trouble in memorizing deeper patterns than the shallower ones those earlier systems could memorize.

[Ngo] (Sep. 25 Google Doc)

I expect that Eliezer won’t claim that pattern-memorisation is unrelated to general intelligence, but will claim that a pattern-memoriser needs to undergo a sharp transition in its cognitive algorithms before it can reason reliably about novel domains (like open scientific problems) - with his main argument for that being the example of the sharp transition undergone by humans.

However, it seems unlikely to me that humans underwent a major transition in our underlying cognitive algorithms since diverging from chimpanzees, because our brains are so similar to those of chimps, and because our evolution from chimps didn’t take very long. This evidence suggests that we should favour explanations for our success which don't need to appeal to big algorithmic changes, if we have any such explanations; and I think we do. More specifically, I’d characterise the three key differences between humans and chimps as:

Humans have bigger brains.
Humans have a range of small adaptations primarily related to motivation and attention, such as infant focus on language and mimicry, that make us much better at cultural learning.
Humans grow up in a rich cultural environment.

[Ngo][9:13] (Sep. 23 comment on earlier draft)

bigger brains

I recall a 3-4x difference; but this paper says 5-6x for frontal cortex: https://www.nature.com/articles/nn814

[Tallinn][3:24] (Sep. 26 comment)

language and mimicry

“apes are unable to ape sounds” claims david deutsch in “the beginning of infinity”

[Barnes][8:09] (Sep. 23 comment on earlier draft)

[Humans grow up in a rich cultural environment.]

much richer cultural environment including deliberate teaching

[Ngo] (Sep. 25 Google Doc)

I claim that the discontinuity between the capabilities of humans and chimps is mainly explained by the general intelligence of chimps not being aimed in the direction of learning the skills required for economically valuable tasks, which in turn is mainly due to chimps lacking the “range of small adaptations” mentioned above.

My argument is a more specific version of Paul’s claim that chimp evolution was not primarily selecting for doing things like technological development. In particular, it was not selecting for them because no cumulative cultural environment existed while chimps were evolving, and selection for the application of general intelligence to technological development is much stronger in a cultural environment. (I claim that the cultural environment was so limited before humans mainly because cultural accumulation is very sensitive to transmission fidelity.)

By contrast, AIs will be trained in a cultural environment (including extensive language use) from the beginning, so this won't be a source of large gains for later systems.

[Ngo][6:01] (Sep. 22 comment on earlier draft)

more specific version of Paul’s claim

Based on some of Paul's recent comments, this may be what he intended all along; though I don't recall his original writings on takeoff speeds making this specific argument.

[Shulman][14:23] (Sep. 25 comment)

(I claim that the cultural environment was so limited before humans mainly because cultural accumulation is very sensitive to transmission fidelity.)

There can be other areas with superlinear effects from repeated application of a skill. There's reason to think that the most productive complex industries tend to have that character.

Making individual minds able to correctly execute long chains of reasoning by reducing per-step error rate could plausibly have very superlinear effects in programming, engineering, management, strategy, persuasion, etc. And you could have new forms of 'super-culture' that don't work with humans.

https://ideas.repec.org/a/eee/jeborg/v85y2013icp1-10.html
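
The per-step error rate point can be made concrete with a little arithmetic. The sketch below is a minimal illustration, assuming each step in a chain of reasoning fails independently with the same probability (an assumption not made in the comment itself). The function names and the example error rates are hypothetical.

```python
def chain_success_probability(per_step_error: float, steps: int) -> float:
    """Probability that every step of a chain succeeds.

    Assumes (hypothetically) that steps fail independently with the same rate.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


def expected_correct_steps(per_step_error: float) -> float:
    """Expected number of correct steps before the first error.

    This is the mean of a geometric distribution: (1 - e) / e.
    """
    if not 0.0 < per_step_error <= 1.0:
        raise ValueError("per_step_error must be in (0, 1]")
    return (1.0 - per_step_error) / per_step_error


if __name__ == "__main__":
    # Illustrative error rates only; they are not taken from the discussion.
    for error_rate in (0.10, 0.01):
        p = chain_success_probability(error_rate, steps=100)
        n = expected_correct_steps(error_rate)
        print(f"error={error_rate:.2f}  P(100 steps ok)={p:.6f}  mean run={n:.1f}")
```

Under these assumptions, cutting the per-step error rate from 10% to 1% raises the chance of completing a 100-step chain from roughly 0.003% to roughly 37%, a gain far larger than the 10x improvement in the input.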

[Ngo] (Sep. 25 Google Doc)

If true, this argument would weigh against Eliezer’s claims about agents which possess a core of general intelligence being able to easily apply that intelligence to a wide range of tasks. And I don’t think that Eliezer has a compelling alternative explanation of the key cognitive differences between chimps and humans (the closest I’ve seen in his writings is the brainstorming at the end of this post).

If this is the case, I notice an analogy between Eliezer’s argument against Kurzweil, and my argument against Eliezer. Eliezer attempted to put microfoundations underneath the trend line of Moore’s law, which led to a different prediction than Kurzweil’s straightforward extrapolation. Similarly, my proposed microfoundational explanation of the chimp-human gap gives rise to a different prediction than Eliezer’s more straightforward, non-microfoundational extrapolation.

[Yudkowsky][11:39] (Sep. 25 comment)

Similarly, my proposed microfoundational explanation of the chimp-human gap gives rise to a different prediction than Eliezer’s more straightforward, non-microfoundational extrapolation.

Eliezer does not use "non-microfoundational extrapolations" for very much of anything, but there are obvious reasons why the greater Earth does not benefit from me winning debates through convincingly and correctly listing all the particular capabilities you need to add over and above what GPToid architectures can achieve, in order to achieve AGI. Nobody else with a good model of larger reality will publicly describe such things in a way they believe is correct. I prefer not to argue convincingly but wrongly. But, no, it is not Eliezer's way to sound confident about anything unless he thinks he has a more detailed picture of the microfoundations than the one you are currently using yourself.

[Ngo][11:40] (Sep. 29 comment)

Good to know; apologies for the incorrect inference.

Given that this seems like a big sticking point in the debate overall, do you have any ideas about how to move forward while avoiding infohazards?

[Ngo] (Sep. 25 Google Doc)

My position makes some predictions about hypothetical cases:

If chimpanzees had the same motivational and attention-guiding adaptations towards cultural learning and cooperation that humans do, and were raised in equally culturally-rich environments, then they could become economically productive workers in a range of jobs (primarily as manual laborers, but plausibly also for operating machinery, etc).
1. Results from chimps raised in human families, like Washoe, seem moderately impressive, although still very uncertain. There’s probably a lot of bias towards positive findings - but on the other hand, it’s only been done a handful of times, and I expect that more practice at it would lead to much better results.
2. Comparisons between humans and chimps which aren’t raised in similar ways to humans are massively biased towards humans. For the purposes of evaluating general intelligence, comparisons between chimpanzees and feral children seem fairer (although it’s very hard to know how much the latter were affected by non-linguistic childhoods as opposed to abuse or pre-existing disabilities).
Consider a hypothetical species which has the same level of “general intelligence” that chimpanzees currently have, but is as well-adapted to the domains of abstract reasoning and technological development as chimpanzee behaviour is to the domain of physical survival (e.g. because they evolved in an artificial environment where their fitness was primarily determined by their intellectual contributions). I claim that this species would have superhuman scientific research capabilities, and would be able to make progress in novel areas of science (analogously to how chimpanzees can currently learn to navigate novel physical landscapes).
1. Insofar as Eliezer doubts this, but does believe that this species could outperform a society of village idiots at scientific research, then he needs to explain why the village-idiot-to-Einstein gap is so significant in this context but not in others.
2. However, this is a pretty weird thought experiment, and maybe doesn’t add much to our existing intuitions about AIs. My main intention here is to point at how animal behaviour is really really well-adapted to physical environments, in a way which makes people wonder what it would be like to be really really well-adapted to intellectual environments.
~~I claim that the difficulty of human-level oracle AGIs matching humans~~ Consider an AI which has been trained only to answer questions, and is now human-level at doing so. I claim that the difficulty of this AI matching humans at a range of real-world tasks (without being specifically trained to do so) would be much closer to the difficulty of teaching chimps to do science, than the difficulty of teaching adult humans to do abstract reasoning about a new domain.
1. The analogy here is: chimps have reasonably general intelligence, but it’s hard for them to apply it to science because they weren’t trained to apply intelligence to that. Likewise, human-level oracle AGIs have general intelligence, but it’ll be hard for them to apply it to influencing the world because they weren’t trained to apply intelligence to that.

[Barnes][8:21] (Sep. 23 comment on earlier draft)

village-idiot-to-Einstein gap

I wonder to what extent you can model within-species intelligence differences partly just as something like hyperparameter search - if you have a billion humans with random variation in their neural/cognitive traits, the top human will be a lot better than average. Then you could say something like:

humans are the dumbest species you could have where the distribution of intelligence in each generation is sufficient for cultural accumulation
that by itself might not imply a big gap from chimps
but human society has much larger population, so the smartest individuals are much smarter
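
The "hyperparameter search" framing can be sketched numerically. This is a minimal illustration, assuming the relevant trait is normally distributed across individuals (an assumption not stated in the comment). It uses the 1 - 1/N quantile as a rough stand-in for the best individual in a population of N. The function name and the smaller population size are hypothetical; only the figure of a billion comes from the comment.

```python
from statistics import NormalDist


def top_individual_z(population: int) -> float:
    """Approximate z-score of the best individual among `population` draws.

    Uses the 1 - 1/N quantile of a standard normal as a rough proxy for the
    expected maximum. The normal distribution is an illustrative assumption.
    """
    if population < 2:
        raise ValueError("population must be at least 2")
    # By symmetry, the (1 - 1/N) quantile equals minus the (1/N) quantile,
    # which avoids computing 1 - tiny_number in floating point.
    return -NormalDist().inv_cdf(1.0 / population)


if __name__ == "__main__":
    # 10**5 is a hypothetical small population; 10**9 is the "billion humans".
    for n in (10**5, 10**9):
        print(f"N={n:>13,}  top individual is about {top_individual_z(n):.2f} SD above the mean")
```

Under the normal assumption the top individual in a billion sits at roughly six standard deviations above the mean. Heavier-tailed variation would widen the gap between populations of different sizes, and thinner-tailed variation would narrow it.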

[Ngo][9:05] (Sep. 23 comment on earlier draft)

I think Eliezer's response (which I'd agree with) would be that the cognitive difference between the best humans and normal humans is strongly constrained by the fact that we're all one species who can interbreed with each other. And so our cognitive variation can't be very big compared with inter-species variation (at the top end at least; although it could at the bottom end via things breaking).

[Barnes][9:35] (Sep. 23 comment on earlier draft)

I think that's not obviously true - it's definitely possible that there's a lot of random variation due to developmental variation etc. If that's the case then population size could create large within-species differences

[Yudkowsky][11:46] (Sep. 25 comment)

oracle AGIs

Remind me of what this is? Surely you don't just mean the AI that produces plans it doesn't implement itself, because that AI becomes an agent by adding an external switch that routes its outputs to a motor; it can hardly be much cognitively different from an agent. Then what do you mean, "oracle AGI"?

(People tend to produce shallow specs of what they mean by "oracle" that make no sense in my microfoundations, a la "Just drive red cars but not blue cars!", leading to my frequent reply, "Sorry, still AGI-complete in terms of the machinery you have to build to do that.")

[Ngo][11:44] (Sep. 29 comment)

Edited to clarify what I meant in this context (and remove the word "oracle" altogether).

[Yudkowsky][12:01] (Sep. 29 comment)

My reply holds just as much for "AIs that answer questions"; what restricted question set do you imagine suffices to save the world without dangerously generalizing internal engines?

[Barnes][8:15] (Sep. 23 comment on earlier draft)

The analogy here is: chimps have reasonably general intelligence, but it’s hard for them to apply it to science because they weren’t trained to apply intelligence to that. Likewise, human-level oracle AGIs have general intelligence, but it’ll be hard for them to apply it to influencing the world because they weren’t trained to apply intelligence to that.

this is not intuitive to me; it seems pretty plausible that the subtasks of predicting the world and of influencing the world are much more similar than the subtasks of surviving in a chimp society are to the subtasks of doing science

[Ngo][8:59] (Sep. 23 comment on earlier draft)

I think Eliezer's position is that all of these tasks are fairly similar if you have general intelligence. E.g. he argued that the difference between very good theorem-proving and influencing the world is significantly smaller than people expect. So even if you're right, I think his position is too strong for your claim to help him. (I expect him to say that I'm significantly overestimating the extent to which chimps are running general cognitive algorithms).

[Barnes][9:33] (Sep. 23 comment on earlier draft)

I wasn't trying to defend his position, just disagreeing with you :P

[Ngo] (Sep. 25 Google Doc)

More specific details

Here are three training regimes which I expect to contribute to AGI:

Self-supervised training - e.g. on internet text, code, books, videos, etc.
Task-based RL - agents are rewarded (likely via human feedback, and some version of iterated amplification) for doing well on bounded tasks.
Open-ended RL - agents are rewarded for achieving long-term goals in rich environments.

[Yudkowsky][11:56] (Sep. 25 comment)

bounded tasks

There's an interpretation of this I'd agree with, but all of the work is being carried by the boundedness of the tasks, little or none via the "human feedback" part which I shrug at, and none by the "iterated amplification" part since I consider that tech unlikely to exist before the world ends.

[Ngo] (Sep. 25 Google Doc)

Most of my probability of catastrophe comes from AGIs trained primarily via open-ended RL. Although IA makes these scenarios less likely by making task-based RL more powerful, it doesn’t seem to me that IA tackles the hardest case (of aligning agents trained via open-ended RL) head-on. But disaster from open-ended RL also seems a long way away - mainly because getting long-term real-world feedback is very slow, and I expect it to be hard to create sufficiently rich artificial environments. By that point I do expect the strategic landscape to be significantly different, because of the impact of task-based RL.

[Yudkowsky][11:57] (Sep. 25 comment)

a long way away

Oh, definitely, at the present rates of progress we've got years, plural.

The history of futurism says that even saying that tends to be unreliable in the general case (people keep saying it right up until the Big Thing actually happens) and also that it's rather a difficult form of knowledge to obtain more than a few years out.

[Yudkowsky][12:01] (Sep. 25 comment)

hard to create sufficiently rich artificial environments

Disagree; I don't think that making environments more difficult in a way that challenges the AI inside will prove to be a significant AI development bottleneck. Making simulations easy enough for current AIs to do interesting things in them, but hard enough that the things they do are not completely trivial, takes some work relevant to current levels of AI intelligence. I think that making those environments more tractably challenging for smarter AIs is not likely to be nearly a bottleneck in progress, compared to making the AIs smarter and able to solve the environment. It's a one-way-hash, P-vs-NP style thing - not literally, just that general relationship between it taking a lower amount of effort to pose a problem such that solving it requires a higher amount of effort.
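
The pose-versus-solve asymmetry can be shown with a toy example. The sketch below is only an analogy, in the same "not literally" spirit as the comment. It assumes a hash preimage puzzle over a small search space, with hypothetical function names and a hypothetical secret value. Posing the problem costs one hash evaluation, while solving it by brute force costs up to one evaluation per candidate.

```python
import hashlib
from typing import Tuple


def pose_problem(secret: int) -> str:
    """Pose a puzzle: publish the SHA-256 digest of a secret (one hash call)."""
    return hashlib.sha256(str(secret).encode()).hexdigest()


def solve_problem(target_digest: str, search_space: int) -> Tuple[int, int]:
    """Solve the puzzle by brute force.

    Returns (recovered_secret, hashes_evaluated). Raises ValueError if no
    candidate in range(search_space) matches the digest.
    """
    for attempts, candidate in enumerate(range(search_space), start=1):
        if hashlib.sha256(str(candidate).encode()).hexdigest() == target_digest:
            return candidate, attempts
    raise ValueError("no preimage found in the search space")


if __name__ == "__main__":
    # Hypothetical secret and search space, chosen so the demo runs quickly.
    digest = pose_problem(54321)                      # poser's cost: 1 hash
    secret, attempts = solve_problem(digest, 2**16)   # solver's cost: many hashes
    print(f"recovered {secret} after {attempts} hash evaluations")
```

Doubling the search space doubles the solver's worst-case work while leaving the poser's cost unchanged, which is the general relationship the comment points at.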

[Ngo] (Sep. 25 Google Doc)

Perhaps the best way to pin down disagreements in our expectations about the effects of the strategic landscape is to identify some measures that could help to reduce AGI risk, and ask how seriously key decision-makers would need to take AGI risk for each measure to be plausible, and how powerful and competent they would need to be for that measure to make a significant difference. Actually, let’s lump these metrics together into a measure of “amount of competent power applied”. Some benchmarks, roughly in order (and focusing on the effort applied by the US):

Banning chemical/biological weapons
COVID
- Key points: mRNA vaccines, lockdowns, mask mandates
Nuclear non-proliferation
- Key points: Nunn-Lugar Act, Stuxnet, various treaties
The International Space Station
- Cost to US: ~$75 billion
Climate change
- US expenditure: >$154 billion (but not very effectively)
Project Apollo
- Wikipedia says that Project Apollo “was the largest commitment of resources ($156 billion in 2019 US dollars) ever made by any nation in peacetime. At its peak, the Apollo program employed 400,000 people and required the support of over 20,000 industrial firms and universities.”
WW1
WW2

[Yudkowsky][12:02] (Sep. 25 comment)

WW2

This level of effort starts to buy significant amounts of time. This level will not be reached, nor approached, before the world ends.

[Ngo] (Sep. 25 Google Doc)

Here are some wild speculations (I just came up with this framework, and haven’t thought about these claims very much):

The US and China preventing any other country from becoming a leader in AI requires about as much competent power as banning chemical/biological weapons.
The US and China enforcing a ban on AIs above a certain level of autonomy requires about as much competent power as the fight against climate change.
1. In this scenario, all the standard forces which make other types of technological development illegal have pushed towards making autonomous AGI illegal too.
Launching a good-faith joint US-China AGI project requires about as much competent power as launching Project Apollo.
1. According to this article, Kennedy (and later Johnson) made several offers (some of which were public) of a joint US-USSR Moon mission, which Khrushchev reportedly came close to accepting. Of course this is a long way from actually doing a joint project (and it’s not clear how reliable the source is), but it still surprised me a lot, given that I viewed the “space race” as basically a zero-sum prestige project. If your model predicted this, I’d be interested to hear why.

[Yudkowsky][12:07] (Sep. 25 comment)

The US and China preventing any other country from becoming a leader in AI requires about as much competent power as banning chemical/biological weapons.

I believe this is wholly false. On my model it requires closer to WW1 levels of effort. I don't think you're going to get it without credible threats of military action leveled at previously allied countries.

AI is easier and more profitable to build than chemical / biological weapons, and correspondingly harder to ban. Existing GPU factories need to be shut down and existing GPU clusters need to be banned and no duplicate of them can be allowed to arise, across many profiting countries that were previously military allies of the United States, which - barring some vast shift in world popular and elite opinion against AI, which is also not going to happen - those countries would be extremely disinclined to sign, especially if the treaty terms permitted the USA and China to forge ahead.

The reason why chem weapons bans were much easier was that people did not like chem weapons. They were awful. There was a perceived common public interest in nobody having chem weapons. It was understood popularly and by elites to be a Prisoner's Dilemma situation requiring enforcement to get to the Pareto optimum. Nobody was profiting tons off the infrastructure that private parties could use to make chem weapons.

An AI ban is about as easy as banning advanced metal-forging techniques in current use so nobody can get ahead of the USA and China in making airplanes. That would be HARD and likewise require credible threats of military action against former allies.

"AI ban is as easy as a chem weapons ban" seems to me like politically crazy talk. I'd expect a more politically habited person to confirm this.

[Shulman][14:32] (Sep. 25 comment)

AI ban much, much harder than chemical weapons ban. Indeed chemical weapons were low military utility, that was central to the deal, and they have still been used subsequently.

An AI ban is about as easy as banning advanced metal-forging techniques in current use so nobody can get ahead of the USA and China in making airplanes. That would be HARD and likewise require credible threats of military action against former allies.

If large amounts of compute relative to today are needed (and presumably Eliezer rejects this), the fact that there is only a single global leading node chip supply chain makes it vastly easier than metal forging, which exists throughout the world and is vastly cheaper.

Sharing with allies (and at least embedding allies to monitor US compliance) also reduces the conflict side.

OTOH, if compute requirements were super low then it gets a lot worse.

And the biological weapons ban failed completely: the Soviets built an enormous bioweapons program, the largest ever, after agreeing to the ban, and the US couldn't even tell for sure they were doing so.

[Yudkowsky][18:15] (Oct. 4 comment)

I've updated somewhat off of Carl Shulman's argument that there's only one chip supply chain which goes through eg a single manufacturer of lithography machines (ASML), which could maybe make a lock on AI chips possible with only WW1 levels of cooperation instead of WW2.

That said, I worry that, barring WW2 levels, this might not last very long if other countries started duplicating the supply chain, even if they had to go back one or two process nodes on the chips? There's a difference between the proposition "ASML has a lock on the lithography market right now" and "if aliens landed and seized ASML, Earth would forever after be unable to build another lithography plant". I mean, maybe that's just true because we lost technology and can't rebuild old bridges either, but it's at least less obvious.

Launching Tomahawk cruise missiles at any attempt anywhere to build a new ASML, is getting back into "military threats against former military allies" territory and hence what I termed WW2 levels of cooperation.

[Shulman][18:30] (Oct. 4 comment)

China has been trying for some time to build its own and has failed with tens of billions of dollars (but has captured some lagging node share), but would be substantially more likely to succeed with a trillion dollar investment. That said, it is hard to throw money at these things and the tons of tacit knowledge/culture/supply chain networks are tough to replicate. Also many ripoffs of the semiconductor subsidies have occurred. Getting more NASA/Boeing and less SpaceX is a plausible outcome even with huge investment.

They are trying to hire people away from the existing supply chain to take its expertise, and are building domestic skills with the lagging nodes.

[Yudkowsky][19:14] (Oct. 4 comment)

Does that same theory predict that if aliens land and grab some but not all of the current ASML personnel, Earth is thereby successfully taken hostage for years, because Earth has trouble rebuilding ASML, which had the irreproducible lineage of masters and apprentices dating back to the era of Lost Civilization? Or would Earth be much better at this than China, on your model?

[Shulman][19:31] (Oct. 4 comment)

I'll read that as including the many suppliers of ASML (one EUV machine has over 100,000 parts, many incredibly fancy or unique). It's just a matter of how many years it takes. I think Earth fails to rebuild that capacity in 2 years but succeeds in 10.

"A study this spring by Boston Consulting Group and the Semiconductor Industry Association estimated that creating a self-sufficient chip supply chain would take at least $1 trillion and sharply increase prices for chips and products made with them...The situation underscores the crucial role played by ASML, a once obscure company whose market value now exceeds $285 billion. It is “the most important company you never heard of,” said C.J. Muse, an analyst at Evercore ISI."

https://www.nytimes.com/2021/07/04/technology/tech-cold-war-chips.html

[Yudkowsky][19:59] (Oct. 4 comment)

No in 2 years, yes in 10 years sounds reasonable to me for this hypothetical scenario, as far as I know in my limited knowledge.

[Yudkowsky][12:10] (Sep. 25 comment)

Launching a good-faith joint US-China AGI project requires about as much competent power as launching Project Apollo.

It's really weird, relative to my own model, that you put the item that the US and China can bilaterally decide to do all by themselves, without threats of military action against their former allies, as more difficult than the items that require conditions imposed on other developed countries that don't want them.

Political coordination is hard. No, seriously, it's hard. It comes with a difficulty penalty that scales with the number of countries, how complete the buy-in has to be, and how much their elites and population don't want to do what you want them to do relative to how much elites and population agree that it needs doing (where this very rapidly goes to "impossible" or "WW1/WW2" as they don't particularly want to do your thing).

[Ngo] (Sep. 25 Google Doc)

So far I haven’t talked about how much competent power I actually expect people to apply to AI governance. I don’t think it’s useful for Eliezer and me to debate this directly, since it’s largely downstream from most of the other disagreements we’ve had. In particular, I model him as believing that there’ll be very little competent power applied to prevent AI risk from governments and wider society, partly because he expects a faster takeoff than I do, and partly because he has a lower opinion of governmental competence than I do. But for the record, it seems likely to me that there’ll be as much competent effort put into reducing AI risk by governments and wider society as there has been into fighting COVID; and plausibly (but not likely) as much as fighting climate change.

One key factor is my expectation that arguments about the importance of alignment will become much stronger as we discover more compelling examples of misalignment. I don’t currently have strong opinions about how compelling the worst examples of misalignment before catastrophe are likely to be; but identifying and publicising them seems like a particularly effective form of advocacy, and one which we should prepare for in advance.

The predictable accumulation of easily-accessible evidence that AI risk is important is one example of a more general principle: that it’s much easier to understand, publicise, and solve problems as those problems get closer and more concrete. This seems like a strong effect to me, and a key reason why so many predictions of doom throughout history have failed to come true, even when they seemed compelling at the time they were made.

Upon reflection, however, I think that even taking this effect into account, the levels of competent power required for the interventions mentioned above are too high to justify the level of optimism about AI governance that I started our debate with. On the other hand, I found Eliezer’s arguments about consequentialism less convincing than I expected. Overall I’ve updated that AI risk is higher than I previously believed; though I expect my views to be quite unsettled while I think more, and talk to more people, about specific governance interventions and scenarios.

[Wei Dai] (comment)

One hope I have in this vein is that human genes don't contain any "metaphilosophical secret sauce" (instead all the secret sauce is in the culture) so we are able to build a competent philosopher just by doing (something like) fine-tuning GPT-n with a bunch of philosophy papers and/or human feedback. Then we use the artificial (black box) philosopher as part of an aligned AI or to help solve alignment problems.

Unfortunately, I expect that even in the scenario where this ends up working, the artificial philosophers will probably end up writing thousands of increasingly hard-to-follow papers on each philosophical problem exploring all the possible arguments/counterarguments, before reaching some consensus among themselves, and because we won't have a white-box understanding of metaphilosophy, we will just have to hope that they learned to do philosophy the "right way" whatever that actually is.