All of janleike's Comments + Replies

I'm not entirely sure but here is my understanding:

I think Paul pictures relying heavily on process-based approaches where you trust the output a lot more because you closely held the system's hand through the entire process of producing the output. I expect this will sacrifice some competitiveness, and as long as it's not too much it shouldn't be that much of a problem for automated alignment research (as opposed to having to compete in a market). However, it might require a lot more human supervision time.

Personally I am more bullish on understanding how... (read more)

  1. Yes, we are currently planning to continue pursuing these directions for scalable oversight. My current best guess is that scalable oversight will do a lot of the heavy lifting for aligning roughly human-level alignment research models (by creating very precise training signals), but not all of it. Easy-to-hard generalization, (automated) interpretability, and adversarial training+testing will also be core pieces, but I expect we'll add more over time.

  2. I don't really understand why many people updated so heavily on the obfuscated arguments problem; I don't t

... (read more)
Wei Dai
Thanks for engaging with my questions here. I'll probably have more questions later as I digest the answers and (re)read some of your blog posts. In the meantime, do you know what Paul meant by "it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents" in the other subthread?

Yeah, you could reformulate the question as "how much consequentialist reasoning do you need to do 95% or 99% of the alignment work?" Maybe the crux is in what we mean by consequentialist reasoning. For example, if you build a proof oracle AlphaZero-style, would that be a consequentialist? Since it's trained with RL to successfully prove theorems you can argue it's a consequentialist since it's the distillation of a planning process, but it's also relatively myopic in the sense that it doesn't care about anything that happens after the current theorem is p... (read more)

Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder and I'm not sure that it's always easier than generation. It is much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than to picture technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.

However, I'm skeptical that the list of novel philosophical problems we... (read more)

Wei Dai
Given this, what is your current plan around AI and philosophy? I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning? What are your current credences for A) this ends up working well B) it clearly doesn't work C) it appears to work well but actually doesn't (we end up getting fooled by persuasive but wrong arguments)? What's the backup plan in case of B? I'm concerned about C (for obvious reasons) and that if B is true by the time we learn it we'll have AI that is human-level or better on every capability except philosophy, which seems a terrible position to be in. Given your quoted comment above, it seems that you can't have a very high credence for A, but then how do you justify being overall optimistic? It could be a short list and that's still a big problem, right? I may be missing your point here, but I'm not sure why this matters. If we get some of these questions wrong, we may end up throwing away large fractions of the potential value of the universe, so it seems important to solve them regardless of whether they're categorized under "alignment" or "capabilities". It's possible that we could safely punt these questions for a while but also possible that we end up losing a lot of value that way (if we have to make highly consequential decisions/commitments/bargains during this time). But we can't be sure that the longtermist view is correct, so in the interest of not risking committing a moral atrocity (in case a more neartermist view is correct) we still have to answer these questions. Answering these questions also seems urgent from the perspective that AIs (or humans for that matter) will pretty soon start making persuasive arguments that AIs deserve moral patienthood and hence legal rights, and "let's ignore these arguments for now because of longtermism" probably won't be a socially acceptable answer.
Jan Leike

Thanks for writing this! I'd be very excited to see more critiques of our approach and it's been great reading the comments so far! Thanks to everyone who took the time to write down their thoughts! :)

I've also written up a more detailed post on why I'm optimistic about our approach. I don't expect this to be persuasive to most people here, but it should give a little bit more context and additional surface area to critique what we're doing.

I strongly agree with you that it'll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren't good from ones that look good and are actually good.

There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones or at least not accept any bad ones (precision matters much more than recall here since you can compensate b... (read more)

If it turns out that evaluation of alignment proposals is not easier than generation, we're in pretty big trouble because we'll struggle to convince others that any good alignment proposals humans come up with are worth implementing.

You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we're not sure about those alignment proposals. But then you're still susceptible to the same kinds of problems.

Wei Dai
But this is pretty likely the case though, isn't it? Actually I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this. But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws and so on). Well, how do we know when it's safe to end this process and actually hit the run button?

You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.

  • A system that produces a random 10000-bit string that looks promising to a human reviewe
... (read more)
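To make the counting argument above concrete, here is a minimal sketch of the arithmetic. It assumes two things the comment does not state outright: that every correct proposal is also among the promising-looking ones, and that the baseline system samples uniformly from the promising-looking set. The function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction
import math

# Counts quoted in the comment above.
B = 2**200  # proposals that look promising to a human reviewer
C = 2**100  # proposals that are actually correct

def p_correct_given_promising(promising: int, correct: int) -> Fraction:
    """Chance that a uniformly sampled promising-looking proposal is correct.

    Assumption (not stated in the comment): the correct proposals are a
    subset of the promising-looking ones.
    """
    return Fraction(correct, promising)

def extra_bits_needed(baseline: Fraction, target: Fraction) -> int:
    """Bits of selection, beyond 'looks promising', needed to lift the
    hit rate from `baseline` to at least `target`."""
    ratio = target / baseline
    # math.log2 accepts arbitrarily large Python ints, so work on the
    # numerator and denominator separately.
    return math.ceil(math.log2(ratio.numerator) - math.log2(ratio.denominator))

baseline = p_correct_given_promising(B, C)
bits_for_coin_flip = extra_bits_needed(baseline, Fraction(1, 2))
```

Under these assumptions the baseline hit rate is 2^-100, so getting to "more likely correct than not" takes about 99 further bits of selection that the human reviewer cannot supply.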

Thanks for your question! I suspect there is some confusion going on here with what recursive reward modeling is. The example that you describe sounds like an example from imitating expert reasoning.

In recursive reward modeling, the agent is not decomposing tasks; it is trying to achieve some objective that the user intends for it to perform. A second agent then assists the human in evaluating the first agent's behavior in order to train a reward model. Decomposition only happens on the evaluation of the first agent's task.

For example, the first agent proposes some plan and the second agent proposes the largest weaknes... (read more)

[anonymous]
Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.

This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned—or at least that they realize when their judgement is not reliable in the current situation and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s out of the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom into one particular aspect, how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.

I'm glad to hear that the DeepMind safety team is thinking about this problem and look forward to reading more about your thoughts on it. However, I don't think putting human safety problems outside of "aligning a single agent to a single user" is a natural way to divide up the problem space, because there are likely ways to address human safety problems within that context. (See these two posts which I wrote after posting my comment here.)

I would go further and say that without an understanding of the ways in which humans are unsafe, and how that should be

... (read more)

Good question. The short answer is “I’m not entirely sure.” Other people seem to struggle with understanding Paul Christiano’s agenda as well.

When we developed the ideas around recursive reward modeling, we understood amplification to be quite different (what we ended up calling Imitating expert reasoning in the paper after consulting with Paul Christiano and Andreas Stuhlmüller). I personally find that the clearest expositions for what Paul is trying to do are Iterated Distillation and Amplification and Paul's latest paper, which we compare to in mul... (read more)

Iterated amplification is a very general framework, describing algorithms with two pieces:

  • An amplification procedure that increases an agent's capability. (The main candidates involve decomposing a task into pieces and invoking the agent to solve each piece separately, but there are lots of ways to do that.)
  • A distillation procedure that uses a strong expert to train an efficient agent. (I usually consider semi-supervised RL, as in our paper.)

Given these two pieces, we plug them into each other: the output of distillation becomes the input to amplifica... (read more)
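As a rough illustration of how the two pieces plug into each other, here is a toy sketch. Everything in it is an assumption made for the example: the task (summing a tuple of 0s and 1s), the split-in-half decomposition, and especially the distillation step, which is just a lookup table imitating the expert rather than semi-supervised RL.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Task = Tuple[int, ...]
Agent = Callable[[Task], Optional[int]]  # returns None when it cannot answer

def base_agent(task: Task) -> Optional[int]:
    """Weak starting agent: only handles tasks with at most one number."""
    if len(task) == 0:
        return 0
    if len(task) == 1:
        return task[0]
    return None

def amplify(agent: Agent) -> Agent:
    """Amplification: a slower, stronger expert that splits the task in
    half and invokes the agent on each piece separately."""
    def expert(task: Task) -> Optional[int]:
        direct = agent(task)
        if direct is not None:
            return direct
        mid = len(task) // 2
        left, right = agent(task[:mid]), agent(task[mid:])
        if left is None or right is None:
            return None
        return left + right
    return expert

def distill(expert: Agent, training_tasks: Iterable[Task]) -> Agent:
    """Distillation stand-in: memorise the expert's answers so the new
    agent answers in one step (a lookup table, not a learned model)."""
    table = {}
    for task in training_tasks:
        answer = expert(task)
        if answer is not None:
            table[task] = answer
    return lambda task: table.get(task)

def iterate(rounds: int, max_len: int = 4) -> Agent:
    """Feed the output of distillation back in as the input to amplification."""
    tasks = [t for n in range(max_len + 1) for t in product((0, 1), repeat=n)]
    agent: Agent = base_agent
    for _ in range(rounds):
        agent = distill(amplify(agent), tasks)
    return agent
```

In this toy, each round doubles the task length the fast agent can handle: `iterate(1)` answers tasks of up to two numbers, while `iterate(2)` answers `(1, 0, 1, 1)` directly.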