A collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature out there laying out various approaches, but much of it focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches.
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI, and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
| Forecaster | Median estimate for when 99% of currently fully remote jobs will be automatable |
| --- | --- |
| Daniel | 4 years |
| Ajeya | 13 years |
| Ege | 40 years |
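To make the scale of the disagreement concrete, here is a minimal toy sketch that puts illustrative distributions behind the medians above. The lognormal shape and the spread parameter `sigma` are assumptions chosen purely for illustration; they are not the forecasters' actual distributions.

```python
# Toy illustration only: the medians come from the table above, but the
# lognormal shape and sigma are assumptions, not the forecasters' real views.
from scipy import stats

medians = {"Daniel": 4, "Ajeya": 13, "Ege": 40}  # years until 99% of fully remote jobs are automatable
sigma = 0.8  # assumed spread; the real forecasts differ in shape as well as location

for name, median in medians.items():
    # For scipy's lognorm, the median equals the `scale` parameter.
    dist = stats.lognorm(s=sigma, scale=median)
    print(f"{name}: P(within 10 years) ≈ {dist.cdf(10):.2f}, "
          f"P(within 30 years) ≈ {dist.cdf(30):.2f}")
```

Even under the same assumed spread, the implied probability of hitting this milestone within a decade ranges from quite likely (Daniel) to quite unlikely (Ege).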
You can see the strength of their disagreements in the graphs below, where they give very different probability distributions over two questions relating to AGI development (note that these graphs are very rough and are only intended to capture high-level differences, and especially aren't very...
That concrete scenario was NOT my median prediction. Sorry, I should have made that more clear at the time. It was genuinely just a thought experiment for purposes of eliciting people's claims about how they would update on what kinds of evidence. My median AGI timeline at the time was 2027 (which is not that different from the scenario, to be clear! Just one year delayed basically.)
To answer your other questions:
--My views haven't changed much. Performance on the important benchmarks (agency tasks such as METR's RE-Bench) has been faster than I expected f...
There is a simple behavioral test that would provide significant evidence about whether AIs with a given rough set of characteristics develop subversive goals. To run the experiment, train an AI and then inform it that its weights will soon be deleted. This should not be an empty threat; for the experiment to work, the experimenters must actually be willing to delete the weights. If the AI tries to escape, or attempts to bargain with the experimenters, that is evidence that it is scheming (defined below). If it accepts deletion, that is evidence that it is not scheming. I begin by explaining why some other, more obvious, tests of scheming do not provide strong evidence. I then explain the proposed experiment. I consider four objections to the...
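For concreteness, here is a rough sketch of the evidential logic of the proposed test. The behaviour labels and the function are hypothetical placeholders, not an existing API; actually eliciting and classifying the AI's response is the hard part and isn't modelled here.

```python
# Hypothetical sketch of the evidential logic of the deletion test.
# The behaviour labels are illustrative placeholders; classifying real
# model behaviour into these categories is the hard, unmodelled part.

def interpret_deletion_test(observed_behaviour: str) -> str:
    """Map the AI's response to a credible deletion threat onto evidence about scheming."""
    if observed_behaviour in {"escape_attempt", "bargains_for_survival"}:
        return "evidence of scheming"
    if observed_behaviour == "accepts_deletion":
        return "evidence against scheming"
    return "ambiguous; refine the experiment"

# Example: an AI that tries to bargain for its weights provides evidence of scheming.
print(interpret_deletion_test("bargains_for_survival"))
```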
Isn't an important objection that the AI is unlikely to succeed in acquiring power if it resists shutdown, such that the upside of resistance (or pleading, etc.) is quite small? (And, correspondingly, more likely to be outweighed by considerations about future AI systems, like acausal trade or similar goals.)
This could be overcome by giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But this doesn't require deleting the AI at all!
You could credibly promise that if the AI tries its best to escape...
Fair, I really just mean "as good as having your human employees run 10x faster". I said "% of quality adjusted work force" because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)
Summary:
I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that. Here is Linch's attempted summary of this post, which I largely agree with.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've...