I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.
In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a result, AI takeover risk never looks much more obvious than it does now.
Concealed problems look like no problems, so there will in general be economic incentives to train in ways which conceal problems. The most-successful-looking systems, at any given time, will be systems trained in ways which incentivize hidden problems over visible problems.
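To make that selection effect concrete, here's a toy numerical sketch. The distributions and numbers are invented purely for illustration (this is not a model of any real training pipeline); the only point is that when "looking good" can come either from having fewer problems or from hiding them, selecting on appearance rewards concealment too:

```python
# Toy sketch with invented numbers: appearance-based selection rewards
# concealment as well as genuine fixes.
import random

random.seed(0)

def sample_system():
    problems = random.uniform(0, 1)      # how misaligned the system actually is
    concealment = random.uniform(0, 1)   # how well its problems are hidden
    visible = problems * (1 - concealment)  # what evaluators and customers see
    return {"problems": problems, "concealment": concealment, "visible": visible}

def mean(xs):
    return sum(xs) / len(xs)

systems = [sample_system() for _ in range(100_000)]

# Keep the 1% most-successful-looking systems (fewest visible problems).
best_looking = sorted(systems, key=lambda s: s["visible"])[:1_000]

print(f"all systems:     problems {mean([s['problems'] for s in systems]):.2f}, "
      f"concealment {mean([s['concealment'] for s in systems]):.2f}")
print(f"best-looking 1%: problems {mean([s['problems'] for s in best_looking]):.2f}, "
      f"concealment {mean([s['concealment'] for s in best_looking]):.2f}")
```

In this toy model the best-looking systems do have fewer real problems than average, but they are also strongly selected for concealment, and since their visible problems are near zero, essentially all the problems they still have are hidden ones.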
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if such systems are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI catastrophes will also probably make takeover risk more obvious.
I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though, five years ago, they wouldn’t have said that ML systems could never solve those problems. Seeing things actually happen is often pretty persuasive to people.
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which is not an economically stable state of affairs, so shortly thereafter Facebook switches to a different metric which is less click-centric. (IIRC this actually happened a few years ago.)
On the other hand, sometimes Facebook's newsfeed algorithm is bad in ways which are not visible to individual customers. Like, maybe there's an echo chamber problem: people only see things they agree with. But from an individual customer's perspective, that's exactly what they (think they) want to see; they don't know that there's anything wrong with the information they're receiving. This sort of problem does not actually look like a problem from the perspective of any one person looking at their own feed; it looks good. So that's a much more economically stable state; Facebook is less eager to switch to a new metric.
... but even that isn't a real example of a problem which is properly invisible. It's still obvious that the echo-chamber-newsfeed is bad for other people, and therefore it will still be noticed, and Facebook will still be pressured to change their metrics. (Indeed that is what happened.) The real problems are problems people don't notice at all, or don't know to attribute to the newsfeed algorithm at all. We don't have a widely-recognized example of such a thing and probably won't any time soon, precisely because most people do not notice it. Yet I'd be surprised if Facebook's newsfeed algorithm didn't have some such subtle negative effects, and I very much doubt that the subtle problems will go away as the visible problems are iterated on.
If anything, I'd expect iterating on visible problems to produce additional subtle problems - for instance, in order to address misinformation problems, Facebook started promoting an Official Narrative which is itself often wrong. But that's much harder to detect, because it's wrong in a way which the vast majority of Official Sources also endorse. To put it another way: if most of the population can be dragged into a single echo chamber, all echoing the same wrong information, that doesn't make the echo chamber problem less bad, but it does make the echo chamber problem less visible.
Anyway, zooming out: solve for the equilibrium, as Cowen would say. If the problems are visible to customers, that's not a stable state. Organizations will be incentivized to iterate until problems stop being visible. They will not, however, be incentivized to iterate away the problems which aren't visible.
I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is super valuable right now, even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)
I think this point is really crucial, and I was correct to make it, and it continues to explain a lot of disagreements about AI safety.
I think this point is incredibly important and quite underrated, and safety researchers often do way dumber work because they don't think about it enough.
In contrast, in a slow takeoff world, many aspects of the AI alignment problem will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.
Let's consider the opposite. Imagine you are programming a self-driving car in a simulated environment. You notice it Goodharting your metrics, so you tweak them and try again. You build up a list of 1001 ad hoc patches that make your self-driving car behave reasonably most of the time.
The object-level patches only really apply to self-driving cars; they include things like a small intrinsic preference towards looking at street signs. The meta-level strategy of patching it until it works isn't very relevant either.
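To make this concrete, here's a minimal sketch of what such a patched objective might look like; the `CarState` fields and the specific patches are made up for illustration rather than taken from any real system:

```python
# Hypothetical illustration of the "1001 ad hoc patches" objective described
# above. Every field and patch is invented; each one encodes car-specific
# domain knowledge and transfers to nothing else.
from dataclasses import dataclass

@dataclass
class CarState:
    progress: float                # metres travelled toward the goal
    collisions: int                # collisions so far this episode
    looking_at_street_sign: bool   # is the perception system attending to a sign?
    speed_over_limit: float        # m/s above the posted limit

def base_reward(state: CarState) -> float:
    # The original metric: make progress, don't crash.
    return state.progress - 10.0 * state.collisions

def patched_reward(state: CarState) -> float:
    """Base metric plus hand-tuned patches, each added after one observed
    instance of the car Goodharting the previous version."""
    reward = base_reward(state)
    # Patch #17: the car ignored signage, so add a small intrinsic
    # preference towards looking at street signs.
    if state.looking_at_street_sign:
        reward += 0.1
    # Patch #243: the car learned to "make progress" by speeding.
    reward -= 0.5 * max(state.speed_over_limit, 0.0)
    # ... plus the other ~999 patches, each fixing one visible misbehaviour
    # in one simulated driving environment.
    return reward

# Example: a state that triggers both of the patches shown.
state = CarState(progress=120.0, collisions=0,
                 looking_at_street_sign=True, speed_over_limit=2.0)
print(round(patched_reward(state), 2))  # 120.0 + 0.1 - 1.0 = 119.1
```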
Imagine a world with many AIs like this, all with ad hoc kludges of hard-coded utility functions. AI is becoming increasingly economically important and getting close to AGI. Slow takeoff. All the industrial work is useless.
In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens.
I strongly disagree with that, and I don't think it follows from the premise. I think that by most reasonable definitions of alignment, it is already the case that most of the research is not done by x-risk-motivated people.
Furthermore, I think it reflects poorly on this community that this sort of sentiment seems to be common.
It's possible that a lot of our disagreement is due to different definitions of "research on alignment", where you would only count things that (e.g.) 1) are specifically about alignment that likely scales to superintelligent systems, or 2) are motivated by x-risk concerns.
To push back on that a little bit...
RE (1): It's not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic.
RE (2): This is basically circular.
I disagree, so I'm curious: what are, for you, good examples of alignment research that is not done by x-risk-motivated people? (Not being dismissive; I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all.)
Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [the type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non-x-risk-motivated people.
I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.
The slow takeoff hypothesis predicts that AGI emerges in a world where powerful but non-AGI AI is already a really big deal. Whether AI is a big deal right before the emergence of AGI determines many super basic things about what we should think our current job is. I hadn’t fully appreciated the size of this effect until a few days ago.
In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens. In contrast, in a slow takeoff world, many aspects of the AI alignment problem will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.
In the fast takeoff world, we’re much more like a normal research field–we want some technical problem to eventually get solved, so we try to solve it. But in the slow takeoff world, we’re basically in a weird collaboration across time with the more numerous, non-longtermist AI researchers who will be in charge of aligning their powerful AI systems but who we fear won’t be cautious enough in some ways or won’t plan ahead in some other ways. Doing technical research in the fast takeoff world basically just requires answering technical questions, while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when.
I think that these two perspectives are extremely different, and I think I’ve historically sometimes had trouble communicating with people who held the slow takeoff perspective because I didn’t realize we disagreed on basic questions about the conceptualization of the question. (These miscommunications persisted even after I was mostly persuaded of slow takeoffs, because I hadn’t realized the extent to which I was implicitly assuming fast takeoffs in my picture of how AGI was going to happen.)
As an example of this, I think I was quite confused about what genre of work various prosaic alignment researchers think they’re doing when they talk about alignment schemes. To quote a recent AF shortform post of mine:
I think the slightly exaggerated slogan for this update of mine is “IDA is futurism, not a proposal”.
My current favorite example of the thinking-on-the-margin version of alignment research strategy is in this comment by Paul Christiano.