I think this section of the post slightly overstates the opportunity cost of doing a PhD. PhD students typically spend most of their time on research, so ideally they should be doing AI safety research during the PhD (e.g. like Stephen Casper). If the PhD is in an unrelated field or is done mainly for upskilling, then there is a more significant opportunity cost relative to working directly for an AI safety organization.
Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.
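For anyone who wants a concrete handle on what separates PPO from a plain policy gradient, here is a minimal sketch of its clipped surrogate objective (policy loss only; names and shapes are illustrative, and a real RLHF setup would add a value loss and typically a KL penalty against the reference model):

```python
import torch

def ppo_clipped_policy_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Sketch of PPO's clipped surrogate objective (policy loss only).

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio keeps each update
    close to the old policy, which is the main difference from a vanilla
    policy gradient and part of why PPO is stable enough to be used in RLHF.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic objective: take the element-wise minimum, then negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```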
LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.
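To illustrate the general idea (not how the AI Safety Chatbot is actually implemented), a minimal RAG loop over an article database might look something like this, with a toy corpus standing in for aisafety.info and any sentence-embedding model assumed:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model would do

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy corpus standing in for the aisafety.info article database.
articles = [
    "RLHF fine-tunes a language model against a reward model trained on human preferences.",
    "Instrumental convergence: many final goals imply similar subgoals like resource acquisition.",
]
emb = model.encode(articles)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # normalize for cosine similarity

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k articles most similar to the question."""
    q = model.encode([question])[0]
    q = q / np.linalg.norm(q)
    scores = emb @ q
    return [articles[i] for i in np.argsort(-scores)[:k]]

# The retrieved passages are prepended to the prompt so the LLM answers from them.
context = "\n".join(retrieve("How does RLHF work?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How does RLHF work?"
```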
Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.
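For readers who haven't seen the technique before, my rough understanding of the mean-difference steering idea in sketch form (tensor names and shapes are illustrative, not the post's actual code):

```python
import torch

def refusal_direction(acts_harmful: torch.Tensor, acts_harmless: torch.Tensor) -> torch.Tensor:
    """Difference of mean residual-stream activations at one layer.

    acts_harmful / acts_harmless have shape [n_prompts, d_model] and come from
    running the model on harmful and harmless prompt sets respectively.
    """
    direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return direction / direction.norm()

def steer(resid: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the direction to the residual stream during the forward pass:
    alpha > 0 pushes toward refusal, alpha < 0 (or projecting the direction out)
    pushes away from it."""
    return resid + alpha * direction
```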
TL;DR: Private AI companies such as Anthropic, which have revenue-generating products and also invest heavily in AI safety, seem like the best type of organization for doing AI safety research today. This wouldn't be the best option in an ideal world, and it may not be in the future, but right now I think it is.
I appreciate the idealism, and I'm sure there is some possible universe where shutting down these labs would make sense, but I'm quite unsure whether doing so would actually be net-beneficial in our world, and I think there's a good chance it would be net-neg...
Thanks for the post! I think it does a good job of describing key challenges in AI field-building and funding.
The talent gap section describes a lack of positions in industry organizations and independent research groups such as SERI MATS. However, there doesn't seem to be much content on the state of academic AI safety research groups. So I'd like to emphasize the current and potential importance of academia for doing AI safety research and absorbing talent. The 80,000 Hours AI risk page says that there are several academic groups working on AI safety inc...
Some related posts on automating alignment research I discovered recently:
I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.
This model may make having two axes instead of one redundant: if capabilities and alignment difficulty scale together predictably, then high alignment difficulty is associated with high capabilities, and the capabilities axis adds little extra information.
But I think there's value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (...
I highly recommend this interview with Yann LeCun which describes his view on self-driving cars and AGI.
Basically, he thinks that self-driving cars are possible with today's AI but would require immense amounts of engineering (e.g. hard-wired behavior for corner cases) because today's AI (e.g. CNNs) tends to be brittle and lacks an understanding of the world.
My understanding is that Yann thinks we basically need AGI to solve autonomous driving in a reliable and satisfying way because the car would need to understand the world like a human to drive reliably.
The superalignment team is OpenAI’s new team, co-led by Jan Leike and Ilya Sutskever, for solving alignment for superintelligent AI. One of their goals is to create a roughly human-level automated alignment researcher. The idea is that creating an automated alignment researcher is a kind of minimum viable alignment project where aligning it is much easier than aligning a more advanced superintelligent sovereign-style system. If the automated alignment researcher creates a solution to aligning superintelligence, Ope...
That seems like a better split, and there are outliers of course. But I think orgs are more likely to be well-known to grant-makers on average, given that they tend to have higher research output, more marketing, and the ability to organize events. An individual is like an organization with one employee.
But I think orgs are more likely to be well-known to grant-makers on average given that they tend to have a higher research output,
I think you're getting the causality backwards. You need money first, before there is an org. Unless you count informal multi-person collaborations as orgs.
I think people who are more well-known to grant-makers are more likely to start orgs, whereas people who are less well-known are more likely to get funding at all if they aim for a smaller grant, i.e. as an independent researcher.
This sounds more or less correct to me. Open Philanthropy (Open Phil) is the largest AI safety grant-maker and spent over $70 million on AI safety grants in 2022, whereas LTFF only spent ~$5 million. The median Open Phil AI safety grant in 2022 was $239k, whereas the median LTFF AI safety grant was only $19k.
Open Phil and LTFF made 53 and 135 AI safety grants respectively in 2022. This means the average Open Phil AI safety grant in 2022 was ~$1.3 million whereas the average LTFF AI safety grant was only $38k. So the average Open Phil AI safety grant...
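(For what it's worth, a quick back-of-the-envelope check of those averages:)

```python
# Back-of-the-envelope check of the 2022 averages quoted above.
open_phil_total, open_phil_n = 70_000_000, 53
ltff_total, ltff_n = 5_000_000, 135

print(round(open_phil_total / open_phil_n))  # ~1,320,000 (~$1.3 million)
print(round(ltff_total / ltff_n))            # ~37,000 (~$38k, given the totals are rounded)
```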
This is brilliant work, thank you. It's great that someone is working on these topics and they seem highly relevant to AGI alignment.
One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that apparently a similar strategy worked for AI capabilities: the neural network researchers from the 1980s who tried to copy how the brain works using deep learning were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4) and more synthetic approaches (e.g. pure logic) were less successful.
Similarly, we alrea...