How I Formed My Own Views About AI Safety

Neel Nanda

Disclaimer: I work as a researcher at Anthropic, but this post entirely represents my own views, rather than the views of my employer.

Introduction

I’ve spent the past two years getting into the field of AI Safety. One important message I heard as I was entering the field was that I needed to “form an inside view about AI Safety”, that I needed to form my own beliefs and think for myself rather than just working on stuff because people smarter than me cared about it. And this was incredibly stressful! I think the way I interpreted this was pretty unhealthy, caused me a lot of paralysing uncertainty and anxiety, and almost caused me to give up on getting into the field. But I feel like I’ve now reached a point I’m comfortable with, and where I somewhat think I have my own inside views on things and understand how to form them.

In this post, I try to explain the traps I fell into and why, what my journey actually looked like, and my advice for how to think about inside views, now I’ve seen what not to do. This is a complex topic and I think there are a lot of valid perspectives, but hopefully my lens is novel and useful for some people trying to form their own views on confusing topics (AI Safety or otherwise)! (Note: I don’t discuss why I do now think AI Safety is important and worth working on - that’s a topic for a future post!)

The Message of Inside Views

First, some context, to be clear about what I mean by inside views. As I understand it, this is a pretty fuzzily defined concept, but it roughly means “having a clear model and argument in my head, starting from some basic and reasonable beliefs about the world, that gets me to a conclusion like ‘working on AI Safety is important’ without needing to rely on deferring to people”. This feels highly related to the concept of gears-level models. This is in comparison to outside views, or deferring to people, where the main reason I believe something is because smart people I respect believe it. In my opinion, there’s a general vibe in the rationality community that inside views are good and outside views are bad (see Greg Lewis’ In Defence of Epistemic Modesty for a good argument for the importance of outside views and deferring!). Note that this is not the Tetlockian sense of the words, used in forecasting, where outside view means ‘look up a base rate’ and inside view means ‘use my human intuition, which is terribly calibrated’, and where the standard wisdom is outside view > inside view.

Good examples of this kind of reasoning: Buck Shlegeris’ My Personal Cruxes for Working on AI Safety, Richard Ngo’s AGI Safety from First Principles, Joseph Carlsmith’s report on Existential Risk from Power-Seeking AI. Note that, while these are all about the question of ‘is AI Safety a problem at all’, the notion of an inside view also applies well to questions like ‘de-confusion research/reinforcement learning from human feedback/interpretability is the best way to reduce existential risk from AI’, arguing for specific research agendas and directions.

How I Interpreted the Message of Inside Views

I’m generally a pretty anxious person and bad at dealing with uncertainty, and sadly, this message resulted in a pretty unhealthy dynamic in my head. It felt like I had to figure out for myself the conclusive truth of ‘is AI Safety a real problem worth working on’ and which research directions were and were not useful, so I could then work on the optimal one. And that it was my responsibility to do this all myself, that it was bad and low-status to work on something because smart people endorsed it.

This was hard and overwhelming because there are a lot of agendas, and a lot of smart people with different and somewhat contradictory views. So this felt basically impossible. But it also felt like I had to solve this before I actually started any permanent research positions (ie by the time I graduated) in case I screwed up and worked on something sub-optimal. And thus, I had to solve this problem that empirically most smart people must be screwing up, and do it all before I graduated. This seemed basically impossible, and created a big ugh field around exploring AI Safety. Which was already pretty aversive, because it involved re-skilling, deciding between a range of different paths like PhDs vs going straight into industry, and generally didn’t have a clean path into it.

My Journey

So, what actually happened to me? I started taking AI Safety seriously in my final year of undergrad. At the time, I bought the heuristic arguments for AI Safety (like, something smarter than us is scary), but didn’t really know what working in the field looked like beyond ‘people at MIRI prove theorems I guess, and I know there are people at top AI labs doing safety stuff?’ I started talking to lots of people who worked in the field, and gradually got data on what was going on. This was all pretty confusing and stressful, and was competing with going into quant finance - a safe, easy, default path that I already knew I’d enjoy.

After graduating, I realised I had a lot more flexibility than I thought. I took a year out, and managed to finagle my way into doing three back-to-back AI Safety internships. The big update was that I could explore AI Safety without risking too much - I could always go back into finance in a year or two if it didn’t work out. I interned at FHI, DeepMind and CHAI - working on mathematical/theoretical safety work, empirical ML work to do with fairness and bias, and empirical interpretability work respectively. I also did the AGI Fundamentals course, and chatted to a lot of people at the various orgs I worked at and at conferences. I tried to ask all the researchers I met about their theory of change for how their research actually matters. One thing that really helped me was chatting to a researcher at OpenAI who said that, when he started, he didn’t have clear inside views, but that he’d formed them fairly organically over time, and that just spending time thinking and being in a professional research environment was enough.

At the end of the year, I had several offers and ended up joining Anthropic to work on interpretability with Chris Olah. I wasn’t sure this was the best option, but I was really excited about interpretability, and it seemed like the best bet. A few months in, this was clearly a great decision and I’m really excited about the work, but it wouldn’t have been the end of the world if I’d decided the work wasn’t very useful or was a bad fit, and I expect I could have left within a few months without hard feelings. As I’ve done research and talked to Chris + other people here, I’ve started to form clearer views on what’s going on with interpretability and the theory of impact for it and Anthropic’s work, but there are still big holes in my understanding where I’m confused or deferring to people. And this is fine! I don’t think it’s majorly holding me back from having an impact in the short-term, and I’m forming clearer views with time.

My Advice for Thinking About & Forming Inside Views

Why to form them?

I think there are four main reasons to care about forming inside views:

Truth-tracking - having an impact is hard! It’s really important to have true beliefs, and the best way to find them is by trying hard to form your own views and ensuring they correlate with truth. It’s easy to get deferring wrong if you trust the wrong people.

I’m pretty unconvinced by this one - it doesn’t seem that hard to find people smarter than me, who’ve thought about each problem for longer than I have, and just believe whatever they believe. Especially if I average multiple smart people’s beliefs

Eg, I haven’t thought too much about biosecurity, but will happily defer to people like Greg Lewis on the topic!

Ensuring good community epistemic health - Maybe your personal inside view will track the truth less well than the best researchers. But it’s not perfectly correlated! If you try hard to find the truth on your own, you might notice ideas other people are missing, can poke holes in popular arguments, etc. And this will make the community as a whole better off

This one is pretty legit, but doesn’t seem that big a deal. Like, important, sure, but not something I’d dedicate more than 5% of my effort towards max
It seems particularly important to avoid information cascades where I work on something because Alice thinks it matters, and then Bob is a bit skeptical of Alice alone but observes that both Alice and I believe it matters, and works on it even harder, Charlie sees me, Alice and Bob, etc. This is a main reason I try hard to distinguish between what I believe all things considered (including other people’s views) and what I believe by my own lights (according to my own intuitions + models of the world)

Motivation - It’s really hard to work on something you don’t believe in!

I personally overthink things, and this one is really important to me! But people vary - this is much more a fact about personal psychology than an abstract statement about how to have an impact

Research quality - Doing good research involves having good intuitions and research taste, sometimes called an inside view, about why the research matters and what’s really going on. This conceptual framework guides the many small decisions and trade-offs you make on a daily basis as a researcher

I think this is really important, but it’s worth distinguishing this from ‘is this research agenda ultimately useful’. This is still important in eg pure maths research just for doing good research, and there are areas of AI Safety where you can do ‘good research’ without actually reducing the probability of x-risk.

Toy example: Let’s say there are ten good AI Safety researchers in the world, who all believe different things. My all-things-considered view should put 10% credence on each person’s view. But I’ll get much more research done if I randomly pick one person and fully adopt their views and dive into their research agenda. So, even if only one researcher is correct, the latter strategy is much better in expected value.

This is one of the main reasons that mentorship is so key. I have become a way more effective interpretability researcher by having ready access to Chris to ask for advice, intuitions and direction. And one of my top priorities is absorbing as many of his conceptual frameworks as I can

More generally, IMO the point of a research mentor is to lend you their conceptual frameworks to advise you on how to make the right decisions and trade-offs. And you slowly absorb their frameworks by supervised learning, and build on and add to them as you grow as a researcher
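To make the toy example above (ten researchers, one of whom is right) concrete, here is a minimal numeric sketch. It rests on an assumption that is not stated in the post: that research progress has increasing returns to focus, modelled here as progress = effort_fraction ** returns_exponent with an exponent above 1. The function name, the exponent and the "split effort evenly" reading of acting on a 10%-each credence are all hypothetical choices made for illustration.

```python
from typing import Tuple


def expected_useful_progress(n_agendas: int, returns_exponent: float) -> Tuple[float, float]:
    """Compare two strategies when exactly one of n_agendas is correct.

    Assumption (illustrative only): progress on an agenda given a fraction
    of your effort is effort_fraction ** returns_exponent, and only progress
    on the correct agenda counts as useful.
    """
    def progress(effort_fraction: float) -> float:
        return effort_fraction ** returns_exponent

    # Strategy 1: spread effort evenly. The correct agenda is certain to get
    # 1/n of your effort, but you never go deep on it.
    spread = progress(1.0 / n_agendas)

    # Strategy 2: pick one agenda at random and commit fully. You are right
    # with probability 1/n, and get full-depth progress when you are.
    focused = (1.0 / n_agendas) * progress(1.0)

    return spread, focused


if __name__ == "__main__":
    for exponent in (1.0, 2.0):
        spread, focused = expected_useful_progress(10, exponent)
        print(f"exponent={exponent}: spread={spread:.3f}, focused={focused:.3f}")
```

Under this sketch, with linear returns (exponent 1) the two strategies tie, and with an exponent of 2 committing to one agenda comes out roughly ten times better in expectation. So the claim in the toy example depends on depth mattering more than breadth, which is the intuition the post is pointing at.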

These are pretty different, and it’s really important to be clear about which reasons you care about! Personally, I mostly care about motivation > research quality = impact >> community epistemics

How to form them?

Talk to people! Try to absorb their inside views, and make them your own

Importantly, the goal is not to defer to them, it’s to understand what they believe and why.
My main tool for this is to ask lots of questions, and then paraphrase - summarise back my understanding in my own words, and ask what’s wrong or what I’m missing.

My default question is ‘so, why, concretely, does your research direction reduce existential risk from AI?’
Or, ‘what are the biggest ways you disagree with other researchers?’ Or ‘why aren’t you working on X?’

I really, really love paraphrasing! A few reasons it’s great:

It forces you to actively listen and process in the moment
It’s much easier to correct than teach - the other person can easily identify issues in your paraphrase and correct them
It makes it obvious to myself if I’m confused or don’t understand something, or if I’m deferring on any points - it’s awkward to say things that are confused!
Once I get it working, I have now downloaded their mental model into my head and can play around with it
Once you’ve downloaded multiple people’s models, you can compare them, see how they differ, etc

A variant - focus on cruxes: key claims where, if they changed their mind about the claim, they’d change their mind about what to work on.

This is really important - some people work on a direction because they think it’s the most important, other people work on it because eg it’s a good personal fit or they find it fun. These should be completely different conversations

A variant - write a google doc summarising a conversation and send it to them afterwards for comments. This can work great if you find it hard to summarise in the moment, and can produce a good artefact to publish or share - I’d love it if people did this more with me

You have permission to disagree (even with really cool and high-status people)

This was a big update for me! Someone being smart and competent just means they’re right more often, not that they’re always right
It really helps to have a low bar for asking dumb questions - if you poke at everything that might be wrong, 90% of the time they’re right and you learn something, and 10% of the time they missed something
For example, I’ve done research in the past that, in hindsight, I don’t think was particularly useful. And this is totally fine!
Empirically, there’s a lot of smart people who believe different and contradictory things! It’s impossible for all of them to be right, so you must disagree with some of them. Internalising that you can do this is really important for being able to think clearly

Don’t be a monk - you form an inside view by going out in the world and doing things - not just by hiding away and thinking really hard

Eg, just try doing research! Spend 10 hours pursuing something, write up a blog post, fail, succeed, hear criticism, see what you learn and make updates
Talk to lots of people!
Live your life, and see what happens - my thoughts naturally change a lot over time
It’s valuable to spend some time reading and thinking, but if this is all you do I think that’s a mistake

Think from first principles (sometimes)

Concrete exercise: Open a blank google doc, set a one hour timer, and start writing out your case for why AI Safety is the most important problem to work on. Spend the full hour on this, and if you run out of steam, go back through and poke at everything that feels confusing, or off, or dodgy. Write out all the counter-arguments you can think of, and repeat
This definitely isn’t all you should do, but I think this is a really useful exercise for anything confusing!

Don’t just try harder - I have a failure mode I call pushing the Try Harder button where I label something as important and just try to channel a lot of willpower and urgency towards it. Don’t do that! This takes a long time, and a lot will happen naturally as you think, talk to people, and do research.

If you find this really stressful, you have my permission to chill and not make it a priority for a while!
I’ve found my inside views develop a lot over time, fairly organically

Inside vs outside views is a spectrum - there’s no clear division between thinking for yourself and deferring. Forming inside views starts out by deferring, and then slowly forming more and more detailed models of where I’m deferring and why over time

My views have gone fairly organically from naive stories like ‘AGI seems scary because intelligence is important and smart people think this matters’ to more detailed ones like ‘I think one reason AGI is scary is inner misalignment. Because neural networks have the base optimiser of stochastic gradient descent, the network may end up as a mesa-optimiser with a different mesa-objective. And this may create an instrumental incentive for power seeking’. The latter story is way more detailed, but still includes a lot of implicit deferring - eg that we’ll get AGI at all, that it’ll be via deep learning, that mesa-optimisers are a thing at all, that there’s an instrumental incentive for power seeking, etc. But expanding the tree of concepts like this is what progress looks like!
Or, ‘I should work on AI because AGI will happen eventually - if nature did it, so can we’ to ‘AGI is compute constrained. Using the bioanchors method to link to the size of the human brain gives 30-ish year AI timelines for human-level AI. I believe AGI is compute constrained because of some heuristic arguments about empirical trends, and because lots of smart people believe this’
Getting here looks like downloading other people’s gears level models into your head, and slowly combining them, deleting parts you disagree with, adding ideas of your own, etc

Misc

Defer intelligently - Don’t just adopt someone’s opinions as your own because they’re charismatic, high status, or well-credentialed. Think about why you think their opinions track the truth better than your own, and in which areas you’re willing to defer to them. Figure out how hard they’ve thought about this, and whether they’ve taken the belief seriously

One key question is how much feedback they get from the world - would they know if they were wrong? I think some fields score much better on this than others - I’m a lot more comfortable disagreeing with many moral philosophy professors and being a committed consequentialist than I am with eg disagreeing with most algebraic geometers. Mathematicians get feedback re whether their proofs work in a way that, as far as I can tell, moral philosophers don’t
And be domain specific - I’d defer to a Cambridge maths professor about mathematical facts, but not on a topic like ‘how best to teach maths to undergraduates’ - they clearly haven’t done enough experimentation to tell if they’re missing out on vastly better methods

You can act without an inside view

Forming a good inside view takes a really long time! I’ve been doing full-time safety research for the past year and a bit and I’m still very confused

An analogy - a PhD is essentially a training program to give people an inside view for a specific research area. And this takes several years! IMO a question like ‘is AGI an existential risk’ is much harder than most thesis topics, and you don’t have a hope of really understanding it without that much work

You can always change your mind and pivot later! Make the best expected value bet given what you know at the time, and what information you might get in future
Gathering information has costs! Sometimes thinking harder about a problem is analysis paralysis, and it’s worth just running with your best guess
I think it’s good to spend maybe 10% of your time long-term on high-level thinking, strategy, forming inside views, etc - a lot of your time should be spent actually doing stuff!

Though it’s OK to spend a higher percentage early on when you have major decisions like what career path to go down.

You don’t have to form an inside view - Forming inside views that track the truth is hard, and it’s a skill. You might just be bad at it, or find it too stressful. And this is fine! It shouldn’t be low-status or guilt-inducing to just do what people more competent than you recommend

You can be a great research assistant, ops person, engineer etc without having a clear inside view - just find someone smart who you trust, explain your situation, and do what they think is best

I think the main reason this is a bad idea is motivational, not really about truth-tracking. And it’s up to you how much you care about this motivationally!
An analogy: I think basically all AI Safety researchers who have ideas for an agenda should get funded, even if I personally think their agenda is BS. Likewise, I want them all to have enough labour available to execute well on their agenda - picking the agenda you’re the best personal fit for and just deferring is a good way to implement this in practice.

Aim high, but be OK with missing - It’s valuable and important practice to try forming inside views, but it’s also pretty hard! It’s OK to struggle and not make much progress

IMO, trying to think for yourself is great training - it’ll help you think more clearly, be harder to con, become a better researcher, etc.
Outside view: The vast majority of the world thinks AI Safety is nonsense, and puts very few resources towards it. This is worth taking seriously! You shouldn’t throw your life away on a weird and controversial idea without thinking seriously about it first
This is a good way to trade-off between motivation and truth-tracking - so long as I try hard to think for myself, I feel OK motivationally, even if I know that I may not be tracking truth well

In practice, I try hard to form my own views, but then make big decisions by deferring a lot and forming an all-things-considered view, which I expect to track truth better

If you aren’t doing full-time research, it’s much harder to form clear views on things! This is a really hard thing you’re trying to do

Convey mindsets, not inside views - If you’re talking to someone else about this stuff, eg while community building, it’s important to try to convey the spirit and mindset of forming inside views, more so than your actual views. Try to convey all of the gears-level models in your head, but make it clear that they’re just models! Try to convey what other people believe in.

I try hard to be clear about which beliefs I’m confident in, which are controversial, which points I’m deferring on, and which things I’ve thought hard about. I think this is important for avoiding information cascades, and building a healthy community
Relatedly, if you’re mostly doing community building, it’s totally fine to not have inside views on hard technical questions like AI Safety! Your goal is more to help people in your community form their own views on things - having views of your own is helpful but not essential.

Shmi:

Just a small remark

“Open a blank google doc, set a one hour timer, and start writing out your case for why AI Safety is the most important problem to work on”

Not "why", but "whether" is the first step. Otherwise you end up being a clever arguer.

Logan Riggs:

No, "why" is correct. See the rest of the sentence:

“Write out all the counter-arguments you can think of, and repeat”

It's saying assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.

Raemon:

I'm currently working to form my own models here. I'm not sure if this post concretely helped me but it's nice to see other people grappling with it.

One thing I notice is that this post is sort of focused on "developing inside views as a researcher, so you can do research." But an additional lens here is "Have an inside view so you can give advice to other people, or do useful community building, or build useful meta-tools for research, or fund research."

In my case I already feel like I have a solid inside view of "AGI is important", and "timelines might be short-ish", but am lacking in "okay, but what do we do about it?". And the most important problem (for me) is that as a LessWrong admin and curator of various in-person offices/retreats/projects, I'm not sure which specific projects to focus on fostering.

I have a general sense of "all the promising avenues should ideally be getting funding". But attention is still limited. Which things make sense to curate? Which things make sense to crosspost to Alignment Forum? A retreat or office can only have so many people, and it matters not just if the people are "good/promising" in some sense, but that they have good intellectual chemistry.

I also have a sense that, working in a meta-org, some of the best meta-work is to provide concrete object-level-work for people to do.
