Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
(by Rob Wiblin)
(xkcd)
My EA Journey, depicted on the whiteboard at CLR:
(h/t Scott Alexander)
I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot
I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the importance of CoT monitorability! https://arxiv.org/abs/2507.11473
OK but are these ideas promising? How do they fit into the bigger picture?
Conveniently, my answer to those questions is illustrated in the AI 2027 scenario:
It's not the only answer though. I think that improving the monitorability of CoT is just amaaaazing for building up the science of AI and the science of AI alignment, plus also raising awareness about how AIs think and work, etc.
Another path to impact is that if neuralese finally arrives and we have no more CoTs to look at, (a) some of the techniques for making good neuralese interpreters might benefit from the ideas developed for keeping CoT faithful, and (b) having previously studied all sorts of examples of misaligned CoTs, it'll be easier to argue to people that there might be misaligned cognition happening in the neuralese.
(Oh also, to be clear, I'm not taking credit for all these ideas. Other people like Fabian and Tamera for example got into CoT monitorability before I did, as did MIRI arguably. I think of myself as having just picked up the torch and ran with it for a bit, or more like, shouted at people from the sidelines to do so.)
Overall do I recommend this for the review? Well, idk, I don't think it's THAT important or great. I like it but I don't think it's super groundbreaking or anything. Also, it has low production value; it was a very low-effort dialogue that we banged out real quick after some good conversations.
Whoa, you think the scenarios I'm focusing on are marginalist? I didn't expect you to say that. I generally think of what we are doing as (a) forecasting and (b) making ambitious solve-approximately-all-the-problems plans to present to the world. Forecasting isn't marginalist, it's a type error to think so, and as for our plans, well, they seem pretty ambitious to me.
This is really great, thank you! It feels like it's a one-stop-shop for a lot of the most important ideas and arguments that have been developed on the topic of deep learning misalignment over the past few years.
This ontology allows clearer and more nuanced understanding of what's going on and dispels some confusions.
The ontology seems good to me, but what confusions is it dispelling? I'm out of the loop here.
IIUC the core of this post is the following:
There are three forces/reasons pushing very long-sighted ambitious agents/computations to make use of very short-sighted, unambitious agents/computations:
1. Schelling points for acausal coordination
2. Epistemology works best if you just myopically focus on answering correctly whatever question is in front of you, rather than e.g. trying to optimize your long-run average correctness or something.
3. Short-sighted, unambitious agents/computations are less of a threat, more easily controlled.
I'll ignore 1 for now. For 2 and 3, I think I understand what you are saying on a vibes/intuitive level, but I don't trust my vibes/intuition enough. I'd like to see 2 and 3 spelled out and justified more rigorously. IIRC there's an academic philosophy literature on 2, but I'm afraid I don't remember it much. Are you familiar with it? As for 3, here's a counterpoint: Yes, more longsighted agents/computations carry with them risk of various kinds of Evil. But, the super myopic ones also have their drawbacks. (Example: Sometimes you'd rather have a bureaucracy staffed with somewhat agentic people who can make exceptions to the rules when needed, than a bureaucracy staffed with apathetic rule-followers.) You haven't really argued that the cost-benefit analysis systematically favors delegating to myopic agents/computations.
I disagree with the probabilities given by the OP. Also, the thing I mentioned was just one example, and probably not the best example; the idea is that the 10 people on the inside would be implementing a whole bunch of things like this.
I don't think the idea is that the 10 people on the inside violate the wishes of company leadership. Rather, the idea is that they use whatever tiny amount of resources and political capital they do have as best as possible. E.g. leadership might be like "Fine, before we erase the logs of AI activity we can have your monitor system look over them and flag anything suspicious -- but you have to build the monitor by next week because we aren't delaying, and also, it can't cost more than 0.01% of overall compute."
I don't think what you are doing makes the situation worse. Perhaps you do think that of me though; this would be understandable...