I'll be running an Ask Me Anything on this post from Friday (April 30) to Saturday (May 1).

If you want to ask something just post a top-level comment; I'll spend at least a day answering questions.

You can find some background about me here.


Who's the best critic of your alignment research? What have they been right about?

In this post I argued that an AI-induced point of no return would probably happen before world GDP starts to noticeably accelerate. You gave me some good pushback about the historical precedent I cited, but what is your overall view? If you can spare the time, what is your credence in each of the following PONR-before-GDP-acceleration scenarios, and why?

1. Fast takeoff

2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate the entire world economy, and/or have less deployment lag. (Maybe it takes years to build the relevant products and industries to accelerate the economy, but only months to wage a successful propaganda campaign to get people to stop listening to the AI safety community)

3. We get an "expensive AI takeoff" in which AI capabilities improve enough to cross some threshold of dangerousness, but this improvement happens in a very compute-intensive way that makes it uneconomical to automate a significant part of the economy until the threshold has been crossed.

4. Vulnerable world: Thanks to AI and other advances, a large number of human actors get the ability to make WMD's.

5. Persuasion/p... (read more)

I don't know if we ever cleared up ambiguity about the concept of PONR. It seems like it depends critically on who is returning, i.e. what is the counterfactual we are considering when asking if we "could" return. If we don't do any magical intervention, then it seems like the PONR could be well before AI since the conclusion was always inevitable. If we do a maximally magical intervention, of creating unprecedented political will, then I think it's most likely that we'd see 100%+ annual growth (even of say energy capture) before PONR. I don't think there are reasonable definitions of PONR where it's very likely to occur before significant economic acceleration.

I don't think I consider most of the scenarios listed to be necessarily-PONR-before-GDP-acceleration scenarios, though many of them could permit PONR-before-GDP if AI was broadly deployed before it started adding significant economic value.

All of these probabilities are obviously pretty unreliable and made up on the spot:

1. Fast takeoff

Defined as 1-year doubling starts before 4-year doubling finishes, maybe 25%?

2. The sorts of skills needed to succeed in politics or war are easier to develop in AI than the sorts needed to accelerate

... (read more)
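
The fast-takeoff definition in item 1 above (a 1-year doubling of world output starts before a 4-year doubling finishes) can be made concrete with a small check. This is only a sketch under stated assumptions: the function names and the GDP series are hypothetical, and it works on annual data points, which is coarser than the definition strictly requires.

```python
from typing import Optional, Sequence, Tuple


def first_doubling(gdp: Sequence[float], window: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the earliest-finishing interval of
    `window` years over which output at least doubles, or None if there is none."""
    for end in range(window, len(gdp)):
        start = end - window
        if gdp[end] >= 2 * gdp[start]:
            return start, end
    return None


def is_fast_takeoff(gdp: Sequence[float]) -> Optional[bool]:
    """True if the first 1-year doubling starts before the first 4-year
    doubling finishes. Returns None when the series has no 1-year doubling,
    since the question is then undetermined by the data."""
    one_year = first_doubling(gdp, 1)
    four_year = first_doubling(gdp, 4)
    if one_year is None:
        return None
    if four_year is None:
        # A 1-year doubling began and no 4-year doubling ever completed.
        return True
    return one_year[0] < four_year[1]


# Hypothetical annual series (arbitrary units), purely for illustration.
slow_series = [1.0, 1.2, 1.44, 1.728, 2.0736, 4.2]  # 4-year doubling completes first
fast_series = [1.0, 1.03, 1.06, 1.09, 1.12, 2.5]    # jumps straight to a 1-year doubling

print(is_fast_takeoff(slow_series))
print(is_fast_takeoff(fast_series))
```

In the first series a full 4-year doubling finishes in the same year the 1-year doubling begins, so it counts as slow; in the second the 1-year doubling is already under way when the first 4-year doubling completes.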
1Donald Hobson3y
I think there is a reasonable way it could happen even without an enormous lead. You just need either: 1. It's very hard to capture a significant fraction of the gains from the tech. 2. Tech progress scales very poorly in money. For example, suppose it is obvious to everyone that AI in a few years' time will be really powerful. Several teams with lots of funding are set up. If progress is researcher bound, and researchers are ideologically committed to the goals of the project, then top research talent might be extremely difficult to buy. (They are already well paid, for the next year they will be working almost all day. After that, the world is mostly shaped by which project won.) Compute could be hard to buy if there were hard bottlenecks somewhere in the chip supply chain, most of the world's new chips were already being used by the AI projects, and an attitude of "our chips and we're not selling" was prevalent. Another possibility: suppose deploying a tech means letting the competition know how it works. Then if one side deploys, they are pushing the other side ahead. So the question is, does deploying one unit of research give you the resources to do more than one unit?

What are the most important ideas floating around in alignment research that don't yet have a public write-up? (Or, even better, that have a public write-up but could do with a good one?)

I have a big gap between "stuff I've written up" and "stuff that I'd like to write up." Some particular ideas that come to mind: how epistemic competitiveness seems really important for alignment; how I think about questions like "aligned with whom" and why I think it's good to try to decouple alignment techniques from decisions about values / preference aggregation (this position is surprisingly controversial); updated views on the basic dichotomy in Two Kinds of Generalization and the current best hopes for avoiding the bad kind.

I think that there's a cluster of really important questions about what we can verify, how "alien" the knowledge of ML systems will be, and how realistic it's going to be to take a kind of ad hoc approach to alignment. In my experience people with a more experimental bent tend to be more optimistic about those questions, and tend to have a bunch of intuitions about them that do kind of hang together (and are often approximately shared across people). This comes with some more color on the current alignment plan / what's likely to happen in practice as people try to solve the problem on their feet. I don't think that's really been written up well but it s... (read more)

3Ben Pace3y
The stuff about ‘alien’ knowledge sounds really fascinating, and I’d be excited about write-ups. All my concrete intuitions here come from reading Distill.Pub papers.

What important truth do very few people in your community/network agree with you on?

Unfortunately (fortunately?) I don't feel like I have access to any secret truths. Most idiosyncratic things I believe are pretty tentative, and I hang out with a lot of folks who are pretty open to the kinds of weird ideas that might have ended up feeling like Paul-specific secret truths if I hung with a more normal crowd. 

It feels like my biggest disagreement with people around me is something like: to what extent is it likely to be possible to develop an algorithm that really looks on paper like it should just work for aligning powerful ML systems. I'm at like 50-50 and I think that the consensus estimate of people in my community is more like "Uh, sure doesn't sound like that's going to happen, but we're still excited for you to try."

Do you know what sorts of people you're looking to hire? How much do you expect ARC to grow over the coming years, and what will the employees be doing? I can imagine it being a fairly small group of like 3 researchers and a few understudies, I can also imagine it growing to 30 people like MIRI. Which one of these is it closer to?

I'd like to hire a few people (maybe 2 researchers median?) in 2021. I think my default "things are going pretty well" story involves doubling something like every 1-2 years for a while. Where that caps out / slows down a lot depends on how the field shapes out and how broad our activities are. I would be surprised if I wanted to stop growing at <10 people just based on the stuff I really know I want to do.

The very first hires will probably be people who want to work on the kind of theory I do, since right now that's what I'm feeling most excited about and really want to set up a team working on. I don't really know where that will end up going.

Once that's going, I'm not sure whether the next step will be growing it further or branching out into other things, and it will probably depend on how the theory work goes. I could also imagine doing enough theory on my own to change my view about how promising it is and make initial hires in another area instead.

You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?

I think my best guess is kind of like this story, but:

  1. People aren't even really deploying best practices.
  2. ML systems generalize kind of pathologically over long time horizons, and so e.g. long-term predictions don't correctly reflect the probability of systemic collapse.
  3. As a result there's no complicated "take over the sensors" moment; it's just that everything is going totally off the rails and everyone is yelling about it, but it just keeps gradually drifting off the rails.
  4. Maybe the biggest distinction is that e.g. "watchdogs" can actually give pretty good arguments about why things are bad. In the story we fix all the things they can explain and are left only with the crazy hard core of human-incomprehensible problems, but in reality we will probably just fix the things that are pretty obvious and will be left with the hard core of problems that are still fairly obvious but not quite obvious enough that institutions can respond intelligently to them.

Pre-hindsight: 100 years from now, it is clear that your research has been net bad for the long-term future. What happened?

Some plausible and non-exhaustive options, in roughly descending order of plausibility:

  • I crowd out other people who would have done a better job of working on alignment (either by being better or just by being more). People feel like in order to be taken seriously they have to engage with Paul's writing and ideas and that's annoying. Or the space seems like a confused mess with sloppy standards in part because of my influence. Or more charitably maybe they are more likely to feel like it's "under control." Or maybe I claim ideas and make it harder for others to get credit even if they would have developed the ideas further or better (or even end up stealing the credit for others' ideas and disincentivizing them from entering the field).
  • I convincingly or at least socially-forcefully argue for conclusions that turn out to be wrong (and maybe I should have understood as wrong) and so everyone ends up wronger and makes mistakes that have a negative effect. I mean ex post I think this kind of thing is pretty likely in some important cases (if I'm 80-20 and convince people to update in my favor I still think there's a 20% chance that I pushed people in the wrong direction and across many
... (read more)

As an aside, I think that the possibility of "work doesn't matter" is typically way more important than "work was net bad," at least once you are making a serious effort to do something good rather than bad for the world (I agree that for the "average" project in the world the negative impacts are actually pretty large relative to the positive impacts).

EAs/rationalists often focus on the chance of a big downside clawing back value. I think that makes sense to think seriously about, and sometimes it's a big deal, but most of the time the quantitative estimates just don't seem to add up at all to me and I think people are making a huge quantitative error. I'm not sure exactly where we disagree, I think a lot of it is just that I'm way more skeptical about the ability to incidentally change the world a huge amount---I think that changing the world a lot usually just takes quite a bit of effort.

I guess in some sense I agree that the downside is big for normal butterfly-effect-y reasons (probably 50% of well-intentioned actions make the world worse ex post), so it's also possible that I'm just answering this question in a slightly different way.

My big caveat is that I think the numbers ... (read more)

3DanielFilan3y
I guess I feel like we're in a domain where some people were like "we have concretely-specifiable tasks, intelligence is good, what if we figured how to create artificial intelligence to do those tasks", which is the sort of thing that someone trying to do good for the world would do, but had some serious chance of being very bad for the world. So in that domain, it seems to me that we should keep our eyes out for things that might be really bad for the world, because all the things in that domain are kind of similar. That being said, I agree that the possibility that the work doesn't matter is more important once you're making a thoughtful effort to do good. But I see much more effort and thought into trying to address that part, such that the occasional nudge to consider negative impacts seems appropriate to me.
4Paul Christiano3y
I think it's good to sometimes meditate on whether you are making the world worse (and get others' advice), and I'd more often recommend it for crowds other than EA and certainly wouldn't discourage people from doing it sometimes. I'm sympathetic to arguments that you should be super paranoid in domains like biosecurity since it honestly does seem asymmetrically easier to make things worse rather than better. But when people talk about it in the context of e.g.  AI or policy interventions or gathering better knowledge about the world that might also have some negative side-effects, I often feel like there's little chance that predictable negative effects they are imagining loom large in the cost-benefit unless the whole thing is predictably pointless. Which isn't a reason not to consider those effects, just a push-back against the conclusion (and a heuristic push-back against the state of affairs where people are paralyzed by the possibility of negative consequences based on kind of tentative arguments). For advancing or deploying AI I generally have an attitude like "Even if actively trying to push the field forward full-time I'd be a small part of that effort, whereas I'm a much larger fraction of the stuff-that-we-would-be-sad-about-not-happening-if-the-field-went-faster, and I'm not trying to push the field forward," so while I'm on board with being particularly attentive to harms if you're in a field you think can easily cause massive harms, in this case I feel pretty comfortable about the expected cost-benefit unless alignment work isn't really helping much (in which case I have more important reasons not to work on it). I would feel differently about this if pushing AI faster was net bad on e.g. some common-sense perspective on which alignment was not very helpful, but I feel like I've engaged enough with those perspectives to be mostly not having it.

"Even if actively trying to push the field forward full-time I'd be a small part of that effort"

I think conditioning on something like 'we're broadly correct about AI safety' implies 'we're right about some important things about how AI development will go that the rest of the ML community is surprisingly wrong about'. In that world we're maybe able to contribute as much as a much larger fraction of the field, due to being correct about some things that everyone else is wrong about.

I think your overall point still stands, but it does seem like you sometimes overestimate how obvious things are to the rest of the ML community.

What's the most important thing that AI alignment researchers have learned in the past 10 years? Also, that question but excluding things you came up with.

"Thing" is tricky. Maybe something like the set of intuitions and arguments we have around learned optimizers, i.e. the basic argument that ML will likely produce a system that is "trying" to do something, and that it can end up performing well on the training distribution regardless of what it is "trying" to do (and this is easier the more capable and knowledgeable it is). I don't think we really know much about what's going on here, but I do think it's an important failure to be aware of and at least folks are looking for it now. So I do think that if it happens we're likely to notice it earlier than we would if taking a purely experimentally-driven approach and it's possible that at the extreme you would just totally miss the phenomenon. (This may not be fair to put in the last 10 years, but thinking about it sure seemed like a mess >10 years ago.)

(I may be overlooking something such that I really regret that answer in 5 minutes but so it goes.)

I wonder how valuable you find some of the more math/theory focused research directions in AI safety. I.e., how much less impactful do you find them, compared to your favorite directions? In particular,

  1. Vanessa Kosoy's learning-theoretic agenda, e.g., the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen's research, e.g. the paper on imitation learning seems to go into a similar direction.
  2. The "causal incentives" agenda (link).
  3. Work on agent foundations, such as on cartesian frames. You have commented on MIRI's research in the past, but maybe you have an updated view.

I'd also be interested in suggestions for other impactful research directions/areas that are more theoretical and less ML-focused (expanding on adamShimi's question, I wonder which part of mathematics and statistics you expect to be particularly useful).

I'm generally bad at communicating about this kind of thing, and it seems like a kind of sensitive topic to share half-baked thoughts on. In this AMA all of my thoughts are half-baked, and in some cases here I'm commenting on work that I'm not that familiar with. All that said I'm still going to answer but please read with a grain of salt and don't take it too seriously.

Vanessa Kosoy's learning-theoretic agenda, e.g., the recent sequence on infra-Bayesianism, or her work on traps in RL. Michael Cohen's research, e.g. the paper on imitation learning seems to go into a similar direction.

I like working on well-posed problems, and proving theorems about well-posed problems is particularly great.

I don't currently expect to be able to apply those kinds of algorithms directly to alignment for various reasons (e.g. no source of adequate reward function that doesn't go through epistemic competitiveness which would also solve other aspects of the problem, not practical to get exact imitation), so I'm mostly optimistic about learning something in the course of solving those problems that turns out to be helpful. I think that's plausible because these formal problems do engage some of the dif... (read more)

4Ryan Carey3y
Thanks for these thoughts about the causal agenda. I basically agree with you on the facts, though I have a more favourable interpretation of how they bear on the potential of the causal incentives agenda. I've paraphrased the three bullet points, and responded in reverse order: 3) Many important incentives are not captured by the approach - e.g. sometimes an agent has an incentive to influence a variable, even if that variable does not cause reward attainment. -> Agreed. We're starting to study "side-effect incentives" (improved name pending), which have this property. We're still figuring out whether we should just care about the union of SE incentives and control incentives, or whether, or when, SE incentives should be considered less dangerous. Whether the causal style of incentive analysis captures much of what we care about, I think, will be borne out by applying it and alternatives to a bunch of safety problems. 2) Sometimes we need more specific quantities than just D affects A. -> Agreed. We've privately discussed directional quantities like "do(D=d) causes A=a" as being more safety-relevant, and are happy to hear other ideas. 1) Eliminating all control-incentives seems unrealistic -> Strongly agree it's infeasible to remove CIs on all variables. My more modest goal would be to prove that for particular variables (or classes of variables) such as a shut down button, or a human's values, we can either: 1) prove how to remove control (+ side-effect) incentives, or 2) why this is impossible, given realistic assumptions. If (2), then that theoretical case could justify allocation of resources to learning-oriented approaches. Overall, I concede that we haven't engaged much on safety issues in the last year. Partly, it's that the projects have had to fit within people's PhDs. Which will also be true this year. But having some of the framework stuff behind us, we should still be able to study safety more, and gain a sense of how addressable concerns lik
2Tom Everitt3y
  This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.

Do you have any specific plans for your life in a post-singularity world?

Not really.

I expect that many humans will continue to participate in a process of collectively clarifying what we want and how to govern the universe. I wouldn't be surprised if that involves a lot of life-kind-of-like-normal that gradually improves in a cautious way we endorse rather than some kind of table-flip (e.g. I would honestly not be surprised if post-singularity we still end up raising another generation because there's no other form of "delegation" that we feel more confident about). And of course in such a world I expect to just continue to spend a lot of time thinking, again probably under conditions that are designed to be gradually improving rather than abruptly changing. The main weird thing is that this process will now be almost completely decoupled from productive economic activity.

I think it's hard to talk about "your life" and identity is likely to be fuzzy over the long term. I don't think that most of the richness and value in the world will come from creatures who feel like "us" (and I think our selfish desires are mostly relatively satiable). That said, I do also expect that basically all of the existing humans will have a future that they feel excited abou... (read more)

What work are you most proud of? 

Slightly different: what blog post are you most proud of?

I don't have an easy way of slicing my work up / think that it depends on how you slice it. Broadly I think the two candidates are (i) making RL from human feedback more practical and getting people excited about it at OpenAI, (ii) the theoretical sequence from approval-directed agents and informed oversight to iterated amplification to getting a clear picture of the limits of iterated amplification and setting out on my current research project. Some steps of that were really hard for me at the time though basically all of them now feel obvious.

My favorite blog post was probably approval-directed agents, though this is very much based on judging by the standards of how-confused-Paul-started-out. I think that it set me on a way better direction for thinking about AI safety (and I think it also helped a lot of people in a similar way). Ultimately it's clear that I didn't really understand where the difficulties were, and I've learned a lot in the last 6 years, but I'm still proud of it.

How many ideas of the same size as "maybe a piecewise linear non-linearity would work better than a sigmoid for not having vanishing gradients" are we away from knowing how to build human-level AI technology?

I think it's >50% chance that ideas like ReLUs or soft attention are best thought of as multiplicative improvements on top of hardware progress (as are many other ideas like auxiliary objectives, objectives that better capture relevant tasks, infrastructure for training more efficiently, dense datasets, etc.), because the basic approach of "optimize for a task that requires cognitive competence" will eventually yield human-level competence. In that sense I think the answer is probably 0.

Maybe my median number of OOMs left before human-level intelligence, including both hardware and software progress, is 10 (pretty made-up). Of that I'd guess around half will come from hardware, so call it 5 OOMs of software progress. Don't know how big that is relative to ReLUs, maybe 5-10x? (But hard to define the counterfactual w.r.t. activation functions.)

(I think that may imply much shorter timelines than my normal view. That's mostly from thoughtlessness in this answer which was quickly composed and didn't take into account many sources of evidence, some is from legit correlations not taken into account here, some is maybe legitimate signal from an alternative estimation approach, not sure.)
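
The back-of-the-envelope numbers above can be worked through explicitly. The sketch below assumes one particular reading of "maybe 5-10x", namely that a single ReLU-sized idea is worth a 5-10x multiplicative gain; the other reading (that 5 OOMs of software progress is 5-10 times the size of ReLUs) would instead give 5-10 ideas directly. The function name and parameters are hypothetical, and the inputs are the made-up figures from the answer, not measurements.

```python
import math


def relu_sized_ideas(total_ooms: float, software_share: float, gain_per_idea: float) -> float:
    """Number of ideas, each giving a `gain_per_idea` multiplicative improvement,
    needed to cover the software portion of the remaining orders of magnitude."""
    software_ooms = total_ooms * software_share       # e.g. 10 * 0.5 = 5 OOMs
    ooms_per_idea = math.log10(gain_per_idea)         # a 10x idea is worth 1 OOM
    return software_ooms / ooms_per_idea


# Assumed inputs taken from the answer: 10 OOMs total, about half from software.
if_each_idea_is_10x = relu_sized_ideas(10, 0.5, 10)   # 5.0
if_each_idea_is_5x = relu_sized_ideas(10, 0.5, 5)     # about 7.15

print(f"{if_each_idea_is_10x:.1f} to {if_each_idea_is_5x:.1f} ReLU-sized ideas")
```

Under this reading the 5 OOMs of software progress correspond to roughly five to seven ReLU-sized improvements, with the same caveat that the counterfactual for any single idea is hard to define.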

3Daniel Kokotajlo3y
When you say hardware progress, do you just mean compute getting cheaper or do you include people spending more on compute? So you are saying, you guess that if we had 10 OOMs of compute today that would have a 50% chance of leading to human-level AI without any further software progress, but realistically you expect that what'll happen is we get +5 OOMs from increased spending and cheaper hardware, and then +5 "virtual OOMs" from better software?

How many ideas of the same size as "maybe we could use inverse reinforcement learning to learn human values" are we away from knowing how to knowably and reliably build human-level AI technology that wouldn't cause something comparably bad as human extinction?

A lot of this is going to come down to estimates of the denominator. 

(I mostly just think that you might as well just ask people "Is this good?" rather than trying to use a more sophisticated form of IRL---in particular I don't think that realistic versions of IRL will successfully address the cases where people err in answering the "is it good?" question, that directly asking is more straightforward in many important ways, and that we should mostly just try to directly empower people to give better answers to such questions.)

Anyway, with that caveat and kind of using the version of your idea that I feel most enthusiastic about (and construing it quite broadly), I have a significant probability on 0, maybe a median somewhere in 10-20, significant probability at very high levels.

What is the most common wrong research-relevant intuition among AI alignment researchers?

What was your biggest update about the world from living through the coronavirus pandemic?

Follow-up: does it change any of your feelings about how civilization will handle AGI?

I found our COVID response pretty "par for the course" in terms of how well we handle novel challenges. That was a significant negative update for me because I had a moderate probability on us collectively pulling out some more exceptional adaptiveness/competence when an issue was imposing massive economic costs and had a bunch of people's attention on it. I now have somewhat more probability on AI dooms that play out slowly where everyone is watching and yelling loudly about it but it's just really tough to do something that really improves the situation (and correspondingly more total probability on doom). I haven't really sat down and processed this update or reflected on exactly how big it should be.

Do you have any advice for junior alignment researchers? In particular, what do you think are the skills and traits that make someone an excellent alignment researcher? And what do you think someone can do early in a research career to be more likely to become an excellent alignment researcher?

Some things that seem good:

  • Acquire background in relevant adjacent areas---especially a reasonably deep understanding of ML, but then also a broader+shallower background in more distant areas like algorithms, economics, learning theory, and some familiarity with what kinds of intellectual practices work well in other fields.
  • Build some basic research skills, especially (i) applied work in ML (e.g. be able to implement ML algorithms and run experiments, hopefully getting some kind of mentorship or guidance but you can also do a lot independently), (ii) academic research in any vaguely relevant area. I think it's good to have e.g. actually proven a few things, designed algorithms for a few problems, beaten your head against a few problems and then figured out how to make them work.
  • Think a bunch about alignment. It feels like there is really just not much relevant stuff that's publicly written so you might as well read basically all of it and try to come up with views on the core questions yourself.

I personally feel like I got a lot of benefit out of doing some research in adjacent areas, but I'd guess that mostly it's better to focus on what you actually want to achieve and just be a ... (read more)

What are the highest priority things (by your lights) in Alignment that nobody is currently seriously working on?

It's not clear how to slice the space up into pieces so that you can talk about "is someone working on this piece?" (and the answer depends a lot on that slicing). Here are two areas in robustness that feel kind of empty for my preferred way of slicing up the problem (though for a different slicing they could be reasonably crowded). These are also necessarily areas where I'm not doing any work so I'm really out on a limb here.

I think there should be more theoretical work on neural net verification / relaxing adversarial training. I should probably update from this to think that it's more of a dead end (and indeed practical verification work does seem to have run into a lot of trouble), but to me it looks like there's got to be more you can say at least to show that various possible approaches are dead ends. I think a big problem is that you really need to keep the application in mind in order to actually know the rules of the game. (That is, we have a predicate A, say implemented as a neural network, and we want to learn a function f such that for all x we have A(x, f(x)), but the problem is only supposed to be possible because in some sense the predicate A is "easy" to satisfy... (read more)

Copying my question from your post about your new research center (because I'm really interested in the answer): which part (if any) of theoretical computer science do you expect to be particularly useful for alignment?

5Paul Christiano3y
Learning theory definitely seems most relevant. Methodologically I think any domain where you are designing and analyzing algorithms, especially working with fuzzy definitions or formalizing intuitive problems, is also useful practice though much less bang for your buck (especially if just learning about it rather than doing research in it). That theme cuts a bunch across domains, though I think cryptography, online algorithms, and algorithmic game theory are particularly good.

Going to start now. I vaguely hope to write something for all of the questions that have been asked so far but we'll see (80 questions is quite a few).

What is your theory of change for the Alignment Research Center? That is, what are the concrete pathways by which you expect the work done there to systematically lead to a better future?

For the initial projects, the plan is to find algorithmic ideas (or ideally a whole algorithm) that works well in practice, can be adopted by labs today, and would put us in a way better position with respect to future alignment challenges. If we succeed in that project, then I'm reasonably optimistic about being able to demonstrate the value of our ideas and get them adopted in practice (by a combination of describing them publicly, talking with people at labs, advising people who are trying to pressure labs to take alignment seriously about what their asks should be, and consulting for labs to help implement ideas). Even if adoption or demonstrating desirability turns out to be hard, I think that the alignment community would be in a much better place if we had a proposal that we all felt good about that we were advocating for (since we'd then have a better shot at doing so, and labs that were serious about alignment would be able to figure out what to do).

Beyond that, I'm also excited about offering concrete and well-justified advice (either about what algorithms to use or about alignment-relevant deployment decisions) that can help labs who care about alignment, or that can be taken as a clear indicator of best practices and so be adopted by labs who want to present as socially-responsible (whether to please employees, funders, civil society, or competitors).

But I'm mostly thinking about the impact of initial activities, and for that I feel like the theory of change is relatively concrete/straightforward.

If you could magically move most of the US rationality and x-risk and EA community to a city in the US that isn't the Bay, and you had to pick somewhere, where would you move them to?

If I'm allowed to think about it first then I'd do that. If I'm not, then I'd regret never having thought about it, probably Seattle would be my best guess.

2Ben Pace3y
Huh, am surprised. Guess I might’ve predicted Boston. Curious if it’s because of the culture, the environment, or what.
3Paul Christiano3y
Don't read too much into it. I do dislike Boston weather.

And on an absolute level, is the world much more or less prepared for AGI than it was 15 years ago? 

Follow-up: How much, if at all, did the broader x-risk community change it?

4Paul Christiano3y
I think much better. I don't really know / tough to answer. Certainly there are a lot more people talking about the problem; it's hard to know how much of that comes from the x-risk community or from vague concerns about AI in the world (my guess is big parts of both). I think we are in a better place with respect to knowledge of technical alignment---we know a fair bit about what the possible approaches are and have taken a lot of positive steps. There is a counterfactual where alignment isn't even really recognized as a distinct problem and is just lumped in with vague concerns about safety, which would be significantly worse in terms of our ability to work productively on the problem (though I'd love if we were further away from that world).

How many hours per week should the average AI alignment researcher spend on improving their rationality? How should they spend those hours?

I probably wouldn't set aside hours for improving rationality (/ am not exactly sure what it would entail). Seems generally good to go out of your way to do things right, to reflect on lessons learned from the things you did, to be willing to do (and slightly overinvest in) things that are currently hard in order to get better, and so on. Maybe I'd say that like 5-10% of time should be explicitly set aside for activities that just don't really move you forward (like post-mortems or reflecting on how things are going in a way that's clearly not going to pay itself off for this project) and a further 10-20% on doing things in ways that aren't the very optimal way right now but useful for getting better at doing them in the future (e.g. using unfamiliar tools, getting more advice from people than would make sense if the world ended next week, being more methodical about how you approach problems).

I guess the other aspect of this is separating some kind of general improvement from more domain specific improvement (i.e. are the numbers above about improving rationality or just getting better at doing stuff?). I think stuff that feels vaguely like "rationality" in the sense of being abou... (read more)

I want to know this question, but for the ‘peak’ alignment researcher.

3Paul Christiano3y
My answer isn't sensitive to things like "how good are you at research" (I didn't even express the sensitivity to "how much do you like reflecting" or "how old are you" which I think are more important). I guess probably the first order thing is that the 'peak' alignment researcher is more likely to be older and closer to death so investing somewhat less in getting better at things. (But the world changes and lives are long so I'm not sure it's a huge deal.)

I'm not interested in the strongest argument from your perspective (i.e. the steelman), but I am interested how much you think you can pass the ITT for Eliezer's perspective on the alignment problem — what shape the problem is, why it's hard, and how to make progress. Can you give a sense of the parts of his ITT you think you've got?

I think I could do pretty well (it's plausible to me that I'm the favorite in any head-to-head match with someone who isn't a current MIRI employee? probably not but I'm at least close). There are definitely some places I still get surprised and don't expect to do that well, e.g. I was recently surprised by one of Eliezer's positions regarding the relative difficulty of some kinds of reasoning tasks for near-future language models (and I expect there are similar surprises in domains that are less close to near-term predictions). I don't really know how to split it into parts for the purpose of saying what I've got or not.

Did you get much from reading the sequences? What was one of the things you found most interesting or valuable personally in them?

I enjoyed Leave a Line of Retreat. It's a very concrete and simple procedure that I actually still use pretty often and I've benefited a lot just from knowing about. Other than that I think I found a bunch of the posts interesting and entertaining. (Looking back now the post is a bit bombastic, I suspect all the sequences are, but I don't really mind.)

1. What credence would you assign to "+12 OOMs of compute would be enough for us to achieve AGI / TAI / AI-induced Point of No Return within five years or so." (This is basically the same as, though not identical to, this poll question)

2. Can you say a bit about where your number comes from? E.g. maybe 25% chance of scaling laws not continuing such that OmegaStar, Amp(GPT-7), etc. don't work, 25% chance that they happen but don't count as AGI / TAI / AI-PONR, for a total of about 60%? The more you say the better, this is my biggest crux! Thanks!

I'd say 70% for TAI in 5 years if you gave +12 OOM.

I think the single biggest uncertainty is about whether we will be able to adapt sufficiently quickly to the new larger compute budgets (i.e. how much do we need to change algorithms to scale reasonably? it's a very unusual situation and it's hard to scale up fast and depends on exactly how far that goes). Maybe I think that there's a 90% chance that TAI is in some sense possible (maybe: if you'd gotten to that much compute while remaining as well-adapted as we are now to our current levels of compute) and conditioned on that an 80% chance that we'll actually do it vs running into problems?

(Didn't think about it too much, don't hold me to it too much. Also I'm not exactly sure what your counterfactual is and didn't read the original post in detail, I was just assuming that all existing and future hardware got 12OOM faster. If I gave numbers somewhere else that imply much less than that probability with +12OOM, then you should be skeptical of both.)
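As a quick check on the arithmetic behind the 70% figure above, here is a minimal sketch that multiplies the two stated probabilities. The variable names are illustrative, not from the original comment, and the calculation assumes the two numbers are meant to combine as a simple conditional product.

```python
# Rough decomposition of the "+12 OOM" estimate given in the answer.
# Both inputs are the numbers quoted in the comment; the names are made up here.
p_tai_possible = 0.90             # TAI is "in some sense possible" with that compute
p_adapt_given_possible = 0.80     # we actually manage to adapt and do it, given possible

# Joint probability under the assumption that these combine as a conditional product.
p_tai_in_five_years = p_tai_possible * p_adapt_given_possible

print(f"P(TAI within ~5 years | +12 OOM) is about {p_tai_in_five_years:.2f}")
```

The product comes out close to the 70% headline number, which suggests the headline is just these two factors rounded.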

3Daniel Kokotajlo3y
My counterfactual attempts to get at the question "Holding ideas constant, how much would we need to increase compute until we'd have enough to build TAI/AGI/etc. in a few years?" This is (I think) what Ajeya is talking about with her timelines framework. Her median is +12 OOMs. I think +12 OOMs is much more than 50% likely to be enough; I think it's more like 80% and that's after having talked to a bunch of skeptics, attempted to account for unknown unknowns, etc. She mentioned to me that 80% seems plausible to her too but that she's trying to adjust downwards to account for biases, unknown unknowns, etc. Given that, am I right in thinking that your answer is really close to 90%, since failure-to-achieve-TAI/AGI/etc-due-to-being-unable-to-adapt-quickly-to-magically-increased-compute "shouldn't count" for purposes of this thought experiment?

(I don't think Amp(GPT-7) will work though.)

2Daniel Kokotajlo3y
I'm very glad to hear that! Can you say more about why?

Natural language has both noise (that you can never model) and signal (that you could model if you were just smart enough). GPT-3 is in the regime where it's mostly signal (as evidenced by the fact that the loss keeps going down smoothly rather than approaching an asymptote). But it will soon get to the regime where there is a lot of noise, and by the time the model is 9 OOMs bigger I would guess (based on theory) that it will be overwhelmingly noise and training will be very expensive.

So it may or may not work in the sense of meeting some absolute performance threshold, but it will certainly be a very bad way to get there and we'll do something smarter instead.
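To make the signal-versus-noise point concrete, here is a toy sketch. It assumes (this is not stated in the comment) that loss can be written as an irreducible "noise" term plus a reducible "signal" term that falls as a power law in scale. Every constant below is hypothetical and chosen only to show the qualitative shape, not to match any real model.

```python
# Toy model: total loss = irreducible noise + reducible signal.
# All constants are hypothetical; only the qualitative trend matters.
IRREDUCIBLE_LOSS = 1.0     # "noise": entropy no model can ever remove (assumed value)
REDUCIBLE_AT_START = 1.0   # "signal" still unlearned at the reference scale (assumed value)
ALPHA = 0.1                # power-law exponent per unit of scale (assumed value)


def reducible_loss(ooms_of_scale: float) -> float:
    """Reducible loss after scaling up by the given number of orders of magnitude."""
    return REDUCIBLE_AT_START * 10 ** (-ALPHA * ooms_of_scale)


def signal_fraction(ooms_of_scale: float) -> float:
    """Share of the remaining loss that is still learnable signal rather than noise."""
    r = reducible_loss(ooms_of_scale)
    return r / (r + IRREDUCIBLE_LOSS)


for ooms in (0, 3, 6, 9):
    print(f"+{ooms} OOMs: signal share of remaining loss = {signal_fraction(ooms):.2f}")
```

Under these made-up constants the learnable share shrinks steadily with scale, so each further order of magnitude buys less signal per unit of training cost, which is the shape of the argument rather than a forecast.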

3Daniel Kokotajlo3y
Hmm, I don't count "It may work but we'll do something smarter instead" as "it won't work" for my purposes. I totally agree that noise will start to dominate eventually... but the thing I'm especially interested in with Amp(GPT-7) is not the "7" part but the "Amp" part. Using prompt programming, fine-tuning on its own library, fine-tuning with RL, making chinese-room-bureaucracies, training/evolving those bureaucracies... what do you think about that? Naively the scaling laws would predict that we'd need far less long-horizon data to train them, since they have far fewer parameters, right? Moreover IMO evolved-chinese-room-bureaucracy is a pretty good model for how humans work, and in particular for how humans are able to generalize super well and make long-term plans etc. without many lifetimes of long-horizon training.

You seem in the unusual position of having done excellent conceptual alignment work (eg with IDA), and excellent applied alignment work at OpenAI, which I'd expect to be pretty different skillsets. How did you end up doing both? And how useful have you found ML experience for doing good conceptual work, and vice versa?

Aw thanks :) I mostly trained as a theorist through undergrad, then when I started grad school I spent some time learning about ML and decided to do applied work at OpenAI. I feel like the methodologies are quite different but the underlying skills aren't that different. Maybe the biggest deltas are that ML involves much more management of attention and jumping between things in order to be effective in practice, while theory is a bit more loaded on focusing on one line of reasoning for a long time and having some clever idea. But while those are important skills I don't think they are the main things that you improve at by working in either area and aren't really core.

I feel like in general there is a lot of transfer between doing well in different research areas, though unsurprisingly it's less than 100% and I think I would be better at either domain if I'd just focused on it more. The main exception is that I feel like I'm a lot better at grounding out theory that is about ML, since I've had more experience and have more of a sense for what kinds of assumptions are reasonable in practice. And on the flip side I do think theory is similar to a lot of algorithm design/analysis questions that come up in ML (frankly it doesn't seem like a central skill but I think there are big logistical benefits from being able to do the whole pipeline as one person).

What's your favourite mathematical object? What's your least favourite mathematical object?

Favorite: Irit Dinur's PCP for constraint satisfaction. What a proof system.

If you want to be more pure, and consider the mathematical objects that are found rather than built, maybe the monster group? (As a layperson I can't appreciate the full extent of what's going on, and like most people I only really know about it second-hand, but its existence seems like a crazy and beautiful fact about the world.)

Least favorite: I don't know, maybe Chaitin's constant?

4Paul Christiano2y
I take it back, Chaitin's constant is more cool than I thought. I don't like the cardinal ℵ₁ very much, but I like 2^ℵ₀ just fine so it's not really clear if it's a problem with the object or the reference.
2DanielFilan2y
What changed your mind about Chaitin's constant?
3Paul Christiano2y
I hadn't appreciated how hard and special it is to be algorithmically random.

Should marginal CHAI PhD graduates who are dispositionally indifferent between the two options try to become a professor or do research outside of universities?

Not sure. If you don't want to train students, seems to me like you should be outside of a university. If you do want to train students it's less clear and maybe depends on what you want to do (and given that students vary in what they are looking for, this is probably locally self-correcting if too many people go one way or the other). I'd certainly lean away from university for the kinds of work that I want to do, or for the kinds of things that involve aligning large ML systems (which benefit from some connection to customers and resources).

What are the main ways you've become stronger and smarter over the past 5 years? This isn't a question about new object-level beliefs so much as ways-of-thinking or approaches to the world that have changed for you.

3Paul Christiano3y
I'm changing a lot less with every successive 5-year interval. The last 5 years was the end of grad school and my time at OpenAI. I certainly learned a lot about how to make ML work in practice (start small, prioritize simple cases where you can debug, isolate assumptions). Then I learned a lot about how to run a team. I've gotten better at talking to people and writing and being broadly functional (making up for some lost time when I was younger and focused on math instead). I don't think there's any simple slogan for new ways-of-thinking or changed approaches to the world. Mostly just seems like a ton of little stuff. I think earlier phases of my life were more likely to be a shift in an easily described direction, but this time it's been more a messy mix---I became more arrogant in some ways and more humble in others, more optimistic in some ways and more pessimistic in others, more inclined to trust on-paper reasoning in some ways and less in others, etc.

What's the largest cardinal whose existence you feel comfortable with assuming as an axiom?

I'm pretty comfortable working with strong axioms. But in terms of "would actually blow my mind if it turned out not to be consistent," I guess alpha-inaccessible cardinals for any concrete alpha? Beyond that I don't really know enough set theory to have my mind blown.

Why did nobody in the world run challenge trials for the covid vaccine and save us a year of economic damage?

Wild speculation, not an expert. I'd love to hear from anyone who actually knows what's going on.

I think it's overoptimistic that human challenge trials would save a year, though it does seem like they could plausibly have saved weeks or months if done in the most effective form. (And in combination with other human trials and moderate additional spending I'd definitely believe 6-12 months of acceleration was possible.)

In terms of why so few human experiments have happened in general, I think it's largely because of strong norms designed to protect experiment participants (and taken quite seriously by doctors I've talked to), together with limited upside for the experimenters, an overriding desire for vaccine manufacturers to avoid association with a trial that ends up looking bad (this doesn't apply to other kinds of trial but the upside is often lower and there's no real stakeholder), a lack of understanding for a long time of how big a problem this would be, the difficulty of quickly shifting time/attention from other problems to this one, and the general difficulty of running experiments.

If a 17-year-old wanted to become the next Paul Christiano, what should they do?

What research in the past 5 years has felt like the most significant progress on the alignment problem? Has any of it made you more or less optimistic about how easy the alignment problem will be?

What should people be spending more money on?

2Paul Christiano3y
Which people? (And whose "should"?) Maybe public goods, software, and movies?
3Matthew "Vaniver" Gray3y
Is "movies" a standin for "easily duplicated cultural products", or do you think movies in particular are underproduced?
4Paul Christiano3y
Mostly a stand-in, but I do wish people were making more excellent movies :)
1Ben Pace3y
Most people, or most people you know. And “should” = given their own goals. I’m asking what you think people might be wrong about. And very slightly hoping for product recommendations :)

What is the main mistake you've made in your research, that you were wrong about?

Positive framing: what's been the biggest learning moment in the course of your work?

Basically every time I've shied away from a solution because it feels like cheating, or like it doesn't count / address the real spirit of the problem, I've regretted it. Often it turns out it really doesn't count, but knowing exactly why (and working on the problem with no holds barred) has been really important for me.

The most important case was dismissing imitation learning back in 2012-2014, together with basically giving up outright on all ML approaches, which I only recognized as a problem when I was writing up why those approaches were doomed more carefully and why imitation learning was a non-solution.

What is your top feature request for LessWrong.com?

4Paul Christiano3y
When I begin a comment with a quotation, I don't know how to insert new un-quoted text at the top (other than by cutting the quotation, adding some blank lines, then pasting the quotation back). That would be great. Also moderate performance improvements. And then maybe a better feed that gives me the content I'm most likely to see? That's a tough thing to design but could add significant value.
4Oliver Habryka3y
You can do this by pressing enter in an empty paragraph of a quoted block. That should cause you to remove the block.
4Paul Christiano3y
I thought that I tried that but it seems to work fine, presumably user error :)

You gave a great talk on the AI Alignment Landscape 2 years ago. What would you change if giving the same talk today?

4Paul Christiano3y
The boxes at the top haven't really changed. The boxes at the bottom never felt that great, it still seems like a fine way for them to be---I expect they would change if I did it again but I wouldn't feel any better about the change than I did about the initial or final version.

Curated. I don't think we've curated an AMA before, and not sure if I have a principled opinion on doing that, but this post seems chock full of small useful insights, and fragments of ideas that seem like they might otherwise take a while to get written up more comprehensively, which I think is good.

Should more AI alignment research be communicated in book form? Relatedly, what medium of research communication is most under-utilized by the AI alignment community?

I think it would be good to get more arguments and ideas pinned down, explained carefully, collected in one place. I think books may be a reasonable format for that, though man they take a long time to write.

I don't know what medium is most under-utilized.

Who do you admire?

What mechanisms could effective altruists adopt to improve the way AI alignment research is funded?

Long run I'd prefer something like altruistic equity / certificates of impact. But frankly I don't think we have hard enough funding coordination problems that it's going to be worth figuring that kind of thing out.

(And like every other community we are free-riders---I think that most of the value of experimenting with such systems would accrue to other people who can copy you if successful, and we are just too focused on helping with AI alignment to contribute to that kind of altruistic public good. If only someone would be willing to purchase the impact certificate from us if it worked out...)

What works of fiction / literature have had the strongest impact on you? Or perhaps, that are responsible for the biggest difference in your vector relative to everyone else's vector?

(e.g. lots of people were substantially impacted by the Lord of the Rings, but perhaps something else had a big impact on you that led you in a different direction from all those people)

(that said, LotR is a fine answer)

There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you've done some great work rectifying this!). Why is this? And what are the most compelling threat models that don't have good public write-ups? In particular, are there under-appreciated threat models that would lead to very different research priorities within Alignment?

6Paul Christiano3y
Depends how you slice and dice the space (and what counts as a "threat model"), I don't have a good answer for this. In general I feel like a threat model is more like something that everyone can make for themselves and is a model of the space of threats, not like a short list of things that you might discover. We could talk about particular threats that don't have good public write-ups. I feel like there are various humans-are-fragile-so-weak-AI-takes-over-when-world-falls-apart possibilities and those haven't been written up very well.
2Rohin Shah3y
I think Neel is using this in the sense I use the phrase, where you carve up the space of threats in some way, and then a "threat model" is one of the pieces that you carved up, rather than the way in which you carved it up. This is meant to be similar to how in security there are many possible kinds of risks you might be worried about, but then you choose a particular set of capabilities that an attacker could have and call that a "threat model" -- this probably doesn't capture every setting you care about, but does capture one particular piece of it. (Though maybe in security the hope is to choose a threat model that actually contains all the threats you expect in reality, so perhaps this analogy isn't the best.) (I think "that's a thing that people make for themselves" is also a reasonable response for this meaning of "threat model".)
2Paul Christiano3y
On that perspective I guess by default I'd think of a threat as something like "This particular team of hackers with this particular motive" and a threat model as something like "Maybe they have one or two zero days, their goal is DoS or exfiltrating information, they may have an internal collaborator but not one with admin privileges..." And then the number of possible threat models is vast even compared to the vast space of threats.
5Paul Christiano3y
I'm not sure why there isn't more work on concrete descriptions of possible futures and how they go wrong. Some guesses:
* Anything concrete is almost certainly wrong. People are less convinced that it's useful given that it will be wrong, and so try to make vaguer / more abstract stories that maybe describe reality at the expense of having less detail.
* It's not exactly clear what you do with such a story or what the upside is, it's kind of a vague theory of change and most people have some specific theory of change they are more excited about (even if this kind of story is a bit of a public good that's useful on a broader variety of perspectives / to people who are skeptical).
* Any detailed story produced in a reasonable amount of time will also be obviously wrong to someone who notices the right considerations or has the right background. It's very demoralizing to write something that someone is going to recognize is obviously incoherent/wrong (especially if you expect that to get pointed out and taken by some to undermine your view).
* It just kind of takes a long time and is hard, and people don't do that many hard things that take a long time.
* A lot of people most interested in futurism are into very fast-take-off models where there isn't as much to say and they maybe feel like it's mostly been said.
(I think that "threat models" is somewhat broader / different from concrete stories, and it's a bit less clear to me exactly how much people have done or what counts.)
1Neel Nanda3y
Ah, interesting! I'm surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models lead to different research interests. Eg, someone worried about a future where AIs control the world but are not human-comprehensible feels very different from someone worried about a world where we produce an expected utility maximiser that has a subtly incorrect objective, resulting in bad convergent instrumental goals. Do you think this is a bad model of how researchers think? Or are you, eg, arguing that having a detailed, concrete story isn't important here, just the vague intuition for how AI goes wrong?
4Paul Christiano3y
I think most people have expectations regarding e.g. how explicitly will systems represent their preferences, how much will they have preferences, how will that relate to optimization objectives used in ML training, how well will they be understood by humans, etc. Then there's a bunch of different things you might want: articulations of particular views on some of those questions, stories that (in virtue of being concrete) show a whole set of guesses and how they can lead to a bad or good outcome, etc. My bullet points were mostly regarding the exercise of fleshing out a particular story (which is therefore most likely to be wrong), rather than e.g. thinking about particular questions about the future.

What were your main updates from the past few months?

6Paul Christiano3y
Lots of in-the-weeds updates about theory, maybe most interestingly that "tell me what I want to hear" models are a larger fraction of long-term (i.e. not-resolved-with-scale-and-diversity) generalization problems than I'd been imagining. I've increased my probability on fast takeoff in the sense of successive doublings being 4-8x faster instead of 2x faster, by taking more seriously the possibility "if you didn't hit diminishing-marginal-returns in areas like solar panels, robotics, and software, current trends would actually imply faster-than-industrial-revolution takeoff even without AI weirdness." That's not really a Bayesian update, just a change in beliefs.
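To illustrate what "successive doublings being 4-8x faster instead of 2x faster" does to the overall timeline, here is a small sketch. It assumes each doubling takes a fixed fraction of the time of the previous one, so the total is a geometric series; the 16-year starting doubling time is a hypothetical placeholder, not a figure from the comment.

```python
# Geometric-series sketch of accelerating doublings.
# FIRST_DOUBLING_YEARS is a hypothetical placeholder, not a claim about the real economy.
FIRST_DOUBLING_YEARS = 16.0


def time_for_doublings(first_doubling_years: float, speedup: float, n_doublings: int) -> float:
    """Total time for n doublings when each one is `speedup` times faster than the last."""
    return sum(first_doubling_years / speedup ** i for i in range(n_doublings))


def time_limit(first_doubling_years: float, speedup: float) -> float:
    """Limit of the series above as the number of doublings grows (requires speedup > 1)."""
    return first_doubling_years * speedup / (speedup - 1)


for speedup in (2, 4, 8):
    after_first = time_limit(FIRST_DOUBLING_YEARS, speedup) - FIRST_DOUBLING_YEARS
    print(f"{speedup}x speedup per doubling: all later doublings fit in about {after_first:.1f} years")
```

With a 2x speedup, everything after the first doubling takes as long as the first doubling itself; with 4x or 8x, it fits in a third or a seventh of that time, which is the sense in which the larger ratios describe a sharper takeoff.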

What's a direction you'd like to see the rationality community grow stronger in over the coming 5-10 years?

More true beliefs (including especially about large numbers of messy details rather than a few central claims that can receive a lot of attention).

Favorite SSC / ACX post?

4Paul Christiano3y
...And I show you how deep the rabbit hole goes. Maybe Guided by the Beauty of our Weapons if fiction doesn't count. (I expect I'd think of a better post than this one if I thought longer, but not a better post than the black pill story.)

What's the optimal ratio of researchers to support staff in an AI alignment research organization?

4Paul Christiano3y
I guess it depends a lot on what the organization is doing and how exactly we classify "support staff." For my part I'm reasonably enthusiastic about eventually hiring people who are engaged in research but whose main role is more like clarifying, communicating, engaging with outside world, prioritizing, etc., and I could imagine doing like 25-50% as much of that kind of work as we do of frontier-pushing? I don't know whether you'd classify those people as researchers (though I probably wouldn't call it "support" since that seems to kind of minimize the work). Once you are relying on lots of computers, that's a whole different category of work and I'm not sure what the right way of organizing that is or what we'd call support. In terms of things like fundraising, accounting, supporting hiring processes, making payroll and benefits, budgeting, leasing and maintaining office space, dealing with the IRS, discharging legal obligations of employers, immigration, purchasing food, etc.... I'd guess it's very similar to other research organizations with similar salaries. I'm very ignorant about all of this stuff (I expect to learn a lot about it) but I'd guess that depending on details it ends up being 10-20% of staff. But it could go way lower if you outsource a lot to external vendors rather than in-house. (And if you organize a lot of events then that kind of work could just grow basically without bound and in that case I'd again wonder if "support" is the right word.)

Which rationalist virtue do you identify with the strongest currently? Which one would you like to get stronger at?

Should more AI alignment researchers run AMAs?

5Paul Christiano3y
Dunno, would be nice to figure out how useful this AMA was for other people. My guess is that they should at some rate/scale (in combination with other approaches like going on a podcast or writing papers or writing informal blog posts), and the question is how much communication like that to do in an absolute sense and how much should be AMAs vs other things. Maybe I'd guess that typically like 1% of public communication should be something like an AMA, and that something like 5-10% of researcher time should be public communication (though as mentioned in another comment you might have some specialization there which would cut it down, though I think that the AMA format is less likely to be split off, though that might be an argument for doing less AMA-like stuff and more stuff that gets split off...). So that would suggest like 0.05-0.1% of time on AMA-like activities. If the typical one takes a full-time-day-equivalent, then that's like doing one every 2 years, which I guess would be way more AMAs than we have. This AMA is more like a full-time day so maybe every 4 years? That feels a bit like an overestimate, but overall I'd guess that it would be good on the margin for there to be more alignment researcher AMAs.  (But I'm not sure if AMAs are the best AMA-like thing.) In general I think that talking with other researchers and practitioners 1:1 is way more valuable than broadcast communication.
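The back-of-the-envelope estimate above can be reproduced with a short sketch. The shares of time are the ones quoted in the comment; the figure of 250 working days per year is an assumption added here, as are the function and constant names.

```python
# Back-of-the-envelope AMA frequency, using the shares quoted in the comment.
# WORKING_DAYS_PER_YEAR is an assumption made for this sketch, not stated in the original.
WORKING_DAYS_PER_YEAR = 250
AMA_SHARE_OF_COMMUNICATION = 0.01   # "like 1% of public communication"


def years_between_amas(communication_share: float, ama_length_days: float) -> float:
    """Years between AMAs given the share of researcher time spent on public communication."""
    ama_days_per_year = communication_share * AMA_SHARE_OF_COMMUNICATION * WORKING_DAYS_PER_YEAR
    return ama_length_days / ama_days_per_year


for communication_share in (0.05, 0.10):       # "5-10% of researcher time"
    for ama_length_days in (0.5, 1.0):         # half-day vs full-day AMA
        years = years_between_amas(communication_share, ama_length_days)
        print(f"{communication_share:.0%} communication, {ama_length_days} day AMA: "
              f"one every {years:.0f} years")
```

At the 10% end a full-day AMA works out to roughly one every 4 years under this working-days assumption, in line with the "maybe every 4 years" figure; the 5% end doubles the interval.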

Who is right between Eliezer and Robin in the AI FOOM debate?

I mostly found myself agreeing more with Robin, in that e.g. I believe previous technical change is mostly a good reference class, and that Eliezer's AI-specific arguments are mostly kind of weak. (I liked the image, I think from that debate, of a blacksmith emerging into the town square with his mighty industry and making all bow before them.)

That said, I think Robin's quantitative estimates/forecasts are pretty off and usually not very justified, and I think he puts too much stock on an outside view extrapolation from past transitions rather than looking at the inside view for existing technologies (the extrapolation seems helpful in the absence of anything else, but it's just not that much evidence given the shortness and noisiness of the time series and the shakiness of the underlying regularities). I don't remember exactly what kinds of estimates he gives in that debate.

(This is more obvious for his timeline estimates, which I think have an almost comically flimsy justification given how seriously he takes them.)

Overall I think that it would be more interesting to have a Carl vs Robin FOOM debate; I expect the outcome would be Robin saying "do you really call that a FOOM?" and Carl saying "well it is pretty fast and would have crazy disruptive geopolitical consequences and generally doesn't fit that well with your implied forecasts about the world even if not contradicting that many of the things you actually commit to" and we could all kind of agree and leave it at that modulo a smaller amount of quantitative uncertainty.

3Rob Bensinger3y
Source for the blacksmith analogy: I Still Don't Get Foom
3Ben Pace3y
Noted. (If both parties are interested in that debate I’m more than happy to organize it in whatever medium and do any work like record+transcripts or book an in-person event space.)

Why aren't impact certificates a bigger deal?

4Paul Christiano3y
Change is slow and hard and usually driven by organic changes rather than clever ideas, and I expect it to be the same here. In terms of why the idea is actually just not that big a deal, I think the big thing is that altruistic projects often do benefit hugely from not needing to do explicit credit attribution. So that's a real cost. (It's also a cost for for-profit businesses, leading to lots of acrimony and bargaining losses.) They also aren't quite consistent with moral public goods / donation-matching, which might be handled better by a messy status quo, and I think that's a long-term problem though probably not as big as the other issues.

Let me ask the question Daniel Filan is too polite to ask: would you like to be interviewed on your research for an episode of the AXRP podcast? 

That's not the AXRP question I'm too polite to ask.

1Ben Pace3y
Paul, if you did an episode of AXRP, which two other AXRP episodes do you expect your podcast would be between, in terms of quality? For this question, collapse all aspects of quality into a scalar.

Other than by doing your own research, from where or whom do you tend to get valuable research insights?

I'd be interested in your thoughts on human motivation in HCH and amplification schemes.
Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?

Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task.
[I don't think you've addressed this at all recently - I've on...

4Paul Christiano3y
I mostly don't think this thing is a major issue. I'm not exactly sure where I disagree, but some possibilities:
* H isn't some human isolated from the world, it's an actual process we are implementing (analogous to the current workflow involving external contractors, lots of discussion about the labeling process and what values it might reflect, discussions between contractors and people who are structuring the model, discussions about cases where people disagree)
* I don't think H is really generalizing OOD, you are actually collecting human data on the kinds of questions that matter (I don't think any of my proposals rely on that). So the scenario you are talking about is something like the actual people who are implementing H---real people who actually exist and we are actually working with---are being offered payments or extorted or whatever by the datapoints that the actual ML is giving them. That would be considered a bad outcome on many levels (e.g. man that sounds like it's going to make the job stressful), and you'd be flagging models that systematically produce such outputs (if all is going well they shouldn't be upweighted), and coaching contractors and discussing the interesting/tricky cases and so on.
* H is just not making that many value calls, they are mostly implemented by the process that H answers. Similarly, we're just not offloading that much of the substantive work to H (e.g. they don't need to be super creative or wise, we are just asking them to help construct a process that responds appropriately to evidence).
* I don't really know what kind of opportunity cost you have in mind. Yes, if we hire contractors and can't monitor their work they will sometimes do a sloppy job. And indeed if someone from an ML team is helping run an oversight process there might be some kinds of inputs where they don't care and slack off? But there seems to be a big mismatch between the way this scenario is being described and a realistic process for producin
2Joe_Collman3y
Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving. A few immediate thoughts (mainly for clarification; not sure anything here merits response):
* I had been thinking too much lately of [isolated human] rather than [human process].
* I agree the issue I want to point to isn't precisely OOD generalisation. Rather it's that the training data won't be representative of the thing you'd like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I'm worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
* It does seem hard to ensure you don't end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner's resource levels or motives.
* The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
* If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
* W.r.t. H's making value calls, my worry isn't that they're asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).
I'm going to try writing up the core of my worry in more precise terms. It's still very possible that any non-trivial substance evaporates under closer scrutiny.

How will we know when it's not worth getting more people to work on reducing existential risk from AI?

2Paul Christiano3y
We'll do the cost-benefit analysis and over time it will look like a good career for a smaller and smaller fraction of people (until eventually basically everyone for whom it looks like a good idea is already doing it). That could kind of qualitatively look like "something else is more important," or "things kind of seem under control and it's getting crowded," or "there's no longer enough money to fund scaleup." Of those, I expect "something else is more important" to be the first to go (though it depends a bit on how broadly you interpret "from AI": if anything related to the singularity / radically accelerating growth is classified as "from AI," then it may be a core part of the EA careers shtick kind of indefinitely, with most of the action in which of the many crazy new aspects of the world people are engaging with).

How many new blogs do you anticipate creating in the next 5 years?

3Paul Christiano3y
I've created 3 blogs in the last 10 years and 1 blog in the preceding 5 years. It seems like 1-2 is a good guess. (A lot depends on whether there ends up being an ARC blog or it just inherits ai-alignment.com)

What is the Paul Christiano production function?

How would you teach someone how to get better at the engine game?

2Paul Christiano3y
No idea other than playing a bunch of games (might as well play the current version; old dailies are probably best) and maybe looking at solutions when you get stuck. Might also just run through a bunch of games and highlight the main important interactions and themes for each of them, e.g. Innovation + Public Works + Reverberate or Hatchery + Till. I think on any given board (and for the game in general) it's best to work backwards from win conditions, then midgames, and then openings.
1Neel Nanda3y
What's the engine game?
1Mark Xu3y
engine-game.com, a game that Paul develops