I'm worried about the approach of "making decisionmakers realize stuff". In the past couple years I've switched to a more conflict-theoretic view: the main problem to me is that the people building AI don't want to build aligned AI. Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it.
This is maybe easiest to see by looking at present harms. An actually aligned AI would politely decline to do such things as putting lots of people out of jobs or filling the internet with slop. So companies making AI for the market have to make it misaligned in at least these ways, otherwise it'll fail in the market. Extrapolating into the future, even if we do lots of good alignment research, markets and governments will pick out only those bits that contribute to market-aligned or government-aligned AI. Which (as I've been saying over and over) will be really bad for most people, because markets and governments don't necessarily need most people.
So this isn't really a comment on the list of problems (which I think is great), but more about the "theory of change" behind it. I no longer have any faith in making decisionmakers understand something it's not profitable for them to understand. I think we need a different plan.
When it specifically comes to loss-of-control risks killing or sidelining all of humanity, I don't believe Sam or Dario or Demis or Elon want that to happen, because it would happen to them too. (Larry Page is different on that count, of course.) You do have conflict theory over the fact that some of them would like ASI to make them god-emperor of the universe, but all of them would definitely take a solution to "loss of control" if it were handed to them on a silver platter.
I'm uncertain between conflict theory and mistake theory, and think it partly depends on metaethics, and therefore it's impossible to be sure which is correct in the foreseeable future - e.g., if everyone ultimately should converge to the same values, then all of our current conflicts are really mistakes. Note that I do often acknowledge conflict theory, like in this list I have "Value differences/conflicts between humans". It's also quite possible that it's really a mix of both, that some of the conflicts are mistakes and others aren't.
In practice I tend to focus more on mistake-theoretic ideas/actions. Some thoughts on this:
(I think this is probably the first time I've explicitly written down the reasoning in 4.)
I think we need a different plan.
Do you have any ideas in mind that you want to talk about?
I'm pretty slow to realize these things, and I think other people are also slow, so the window is already almost closed. But in any case, my current thinking is that we need to start pushing on the big actors from outside, try to reduce their power. Trying to make them see the light is no longer enough.
What it means in practical terms:
- Make it clear that we frown on people who choose to work for AI labs, even on alignment. This social pressure (on LW and related forums maybe) might already do some good.
- Make it clear that we're allied with the relatively poor majority of people outside the labs, and in particular those who are already harmed by present harms. Make amends with folks on the left who have been saying such things for years.
- Support protests against labs, support court cases against them having to do with e.g. web scraping, copyright infringement, misinformation, suicides. Some altruist money in this might go a long way.
- Think more seriously about building organizations that will make AI power more spread out. Open source, open research, open training. Maybe some GPL-like scheme to guarantee that things don't get captured. We need to reduce concentration of power in the near term, enable more people to pose a challenge to the big actors. I understand it increases other risks, but in my opinion it's worth it.
even on alignment
I see a disagreement vote on this, but I think it does make sense. Alignment work at the AI labs will almost by definition be work on legible problems, but we should make exceptions for people who can give reasons for why their work is not legible (or otherwise still positive EV), or who are trying to make illegible problems more legible for others at the labs.
Think more seriously about building organizations that will make AI power more spread out.
I start to disagree from here, as this approach would make almost all of the items on my list worse, and I'm not sure which ones it would make better. You started this thread by saying "Even if we solved metaethics and metaphilosophy tomorrow, and gave them the solution on a plate, they wouldn't take it." which I'm definitely very worried about, but how does making AI power more spread out help with this? Is the average human (or humanity collectively) more likely to be concerned about metaethics and metaphilosophy than a typical AI lab leader, or easier to make concerned? I think the opposite is more likely to be true?
I think on the level of individual people, there's a mix of moral and self-interested actions. People sometimes choose to do the right thing (even if the right thing is as complicated as taking metaethics and metaphilosophy into account), or can be convinced to do so. But with corporations it's another matter: they choose the profit motive pretty much every time.
Making an AI lab do the right thing is much harder than making its leader concerned. A lab leader who's concerned enough to slow down will be pressured by investors to speed back up, or get replaced, or get outcompeted. Really you need to convince the whole lab and its investors. And you need to be more convincing than the magic of the market! Recall that in many of these labs, the leaders / investors / early employees started out very concerned about AI safety and were reading LW. Then the magic of the market happened and now the labs are racing at full speed. Do you think our convincing abilities can be stronger than the thing that did that? The profit motive, again. In my first comment there was a phrase about things being not profitable to understand.
What it adds up to is, even with our uncertainty about ethics and metaethics, it seems to me that concentration of power is itself a force against morality. The incentives around concentrated power are all wrong. Spreading out power is a good thing that enables other good things, enables individuals to sometimes choose what's right. I'm not absolutely certain but that's my current best guess.
A lab leader who’s concerned enough to slow down will be pressured by investors to speed back up, or get replaced, or get outcompeted. Really you need to convince the whole lab and its investors. And you need to be more convincing than the magic of the market!
This seems to imply that lab leaders would be easier to convince if there were no investors and no markets, in other words if they had more concentrated power.
If you spread out the power of AI more, won't all those decentralized nodes of spread out AI power still have to compete with each other in markets? If market pressures are the core problem, how does decentralization solve that?
I'm concerned that your proposed solution attacks "concentration of power" when the real problem you've identified is more like market dynamics. If so, it could fail to solve the problem or make it even worse.
My own perspective is that markets are a definite problem, and concentration of power per se is more ambiguous (I'm not sure if it's good or bad). To solve AI x-safety we basically have to bypass or override markets somehow, e.g., through international agreements and government regulations/bans.
I think AI offers a chance of getting huge power over others, so it would create competitive pressure in any case. In case of a market economy it's market pressure, but in case of countries it would be a military arms race instead. And even if the labs didn't get any investors and raced secretly, I think they'd still feel under a lot of pressure. The chance of getting huge power is what creates the problem, that's why I think spreading out power is a good idea. There would still be competition of course, but it would be normal economic levels of competition, and people would have some room to do the right things.
Has anyone else, or anyone outside the tight MIRI cluster, made progress on any of the problems you've tried to legibilize for them?
To give a direct answer, not a lot comes to mind outside of the MIRI cluster. I think the Center on Long-Term Risk cluster did a bunch of work on decision theory and acausal trade, but it was mostly after I had moved on to other topics, so I'm not sure how much of it constituted progress. Christiano acknowledged some of the problems I pointed out with IDA and came up with some attempted solutions, which I'm not convinced really work.
However, in my previous post, Legible vs. Illegible AI Safety Problems, I explained my latest thinking that the most important motivation for legibilizing AI safety problems isn't to induce faster progress on them as object-level problems, but instead to decrease the probability that AGI/ASI is developed or deployed while key decision makers (e.g., company leaders, government officials, voters) are not even aware of or don't understand the importance of some such problems. So a better metric for measuring the success of this strategy is how much increased legibility has been effected in this wider audience, assuming "how successful has it been" is the main motivation behind your question.
On that front, I think the main weakness of my approach has been its limited reach beyond LW. If someone with better public communications skills were convinced of the value of legibilizing these lesser known problems, that could potentially greatly boost the effectiveness of this strategy.
(Of course, if I've inferred a wrong motivation for your question, please let me know!)
Re "can AI advisors help?"
A major thread of my thoughts these days is "can we make AI more philosophically competent relative to their own overall capability growth?". I'm not sure if it's doable because the things you'd need to be good at philosophy are pretty central capabilities-ish-things (i.e. ability to reason precisely, notice confusion, convert confusion into useful questions, etc).
Curious if you have any thoughts on that.
I agree this is a major risk. (Another one is that it's just infeasible to significantly increase AI philosophical competence in the relevant time frame. Another one is that it's much easier to make it appear like the AI is more philosophically competent, giving us false security.) So I continue to think that pausing/stopping AI should be plan A (which legibilizing the problem of AI philosophical competence can contribute to), with actually improving AI philosophical competence as (part of) plan B. Having said that, 2 reasons this risk might not bear out:
To conclude, I'm quite worried about the risks/downsides of trying to increase AI philosophical competence, but it seems to be a problem that has to be solved eventually. "The only way out is through" but we can certainly choose to do it at a more opportune time, when humans are much smarter on average and have made a lot more progress in metaphilosophy (understanding the nature of philosophy and philosophical reasoning).
FYI, normally when I'm thinking about this, it's through the lens "how do we help the researchers working on illegible problems", more so than "how do we communicate illegibleness?".
This post happened to ask the question "can AI advisers help with the latter" so I was replying about that, but, for completeness, normally when I think about this problem I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, that aren't particularly bad from a capabilities standpoint".
normally when I think about this problem I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, that aren't particularly bad from a capabilities standpoint".
Do you have any writings about this, e.g., examples of what this line of thought led to?
Mostly this has only been a sidequest I periodically mull over in the background. (I expect to someday focus more explicitly on it, although it might be more in the form of making sure someone else is tackling the problem intelligently).
But I did previously pose this as a kind of open question re "What are important UI-shaped problems that Lightcone could tackle?" and "JargonBot Beta Test" (this notably didn't really work; I have hopes of trying again with a different tack). Thane Ruthenis replied with some ideas that were in this space (about making it easier to move between representations-of-a-problem).
https://www.lesswrong.com/posts/t46PYSvHHtJLxmrxn/what-are-important-ui-shaped-problems-that-lightcone-could
I think of many Wentworth posts as relevant background:
My personal work so far has been building a mix of exobrain tools that are more, like, for rapid prototyping of complex prompts in general. (This has mostly been a side project I'm not primarily focused on atm)
Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.
Having written all this down in one place, it's hard not to feel some hopelessness that all of these problems can be made legible to the relevant people, even with a maximum plausible effort. Perhaps one source of hope is that they can be made legible to future AI advisors. As many of these problems are philosophical in nature, this seems to come back to the issue of AI philosophical competence that I've often talked about recently, which itself seems largely still illegible and hence neglected.
Perhaps it's worth concluding on a point from a discussion between @WillPetillo and myself under the previous post: a potentially more impactful approach (compared to trying to make illegible problems more legible) is to make key decisionmakers realize that important safety problems illegible to themselves (and even to their advisors) probably exist, and that it's therefore very risky to make highly consequential decisions (such as about AI development or deployment) based only on the status of legible safety problems.