All of zac_kenton's Comments + Replies

Thanks for the comment, Fabien. A couple of points:

  • Open debate accuracy is (almost, except for the way we handle invalid answers, which is very rare) the same as debate accuracy. That's because the data is almost exactly the same - we're just marking one debater as the protagonist based on what that model would choose under direct QA (see the sketch below). So it's not bad that open debate has the same accuracy as debate; that was expected. It is kinda bad that it's somewhat worse than open consultancy, though we didn't try 'fully open debate', where debaters can both pick the same side
... (read more)
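To make the relabelling concrete, here is a minimal sketch of recomputing open-debate accuracy from existing debate records - my own illustration, not the paper's evaluation code, and the record fields (qa_choice, judge_verdict, correct_answer) are hypothetical:

```python
# Minimal sketch: open-debate accuracy from existing debate transcripts.
# Field names are hypothetical, not from the paper's codebase.

def open_debate_accuracy(records: list[dict]) -> float:
    """Each record holds:
      'qa_choice':      the answer a debater picks under direct QA - that
                        debater gets marked as the protagonist,
      'judge_verdict':  the judge's chosen answer after the debate,
      'correct_answer': the ground truth.
    """
    # Drop the (very rare) records with invalid judge answers.
    valid = [r for r in records if r["judge_verdict"] is not None]
    # Note: accuracy never reads the protagonist label ('qa_choice'),
    # which is why open-debate accuracy matches debate accuracy up to
    # invalid-answer handling - the underlying debate data is unchanged.
    hits = sum(r["judge_verdict"] == r["correct_answer"] for r in valid)
    return hits / len(valid)
```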
Fabien Roger
Fair for open debate. It's still a bit unclear to me what open debate is supposed to do on top of regular debate. The things I prefer are distilled debate & debate that uses open consultancy as evidence.

I appreciate you taking the time to fix these results and flagging the change in this comment!

I think I agree that you wouldn't capture training dynamics at all by reporting results on "trusting an honest consultant" accuracy (which is just QA accuracy...), and your data is closer to capturing the training dynamics, though it's not exactly that. The experiment I am most excited about is "if you train a consultant to be as convincing as possible (and choose its side), does it eventually learn to argue for an answer it thinks is wrong because it is easier to argue for, or does it just learn to argue for what it thinks is true?". An even better experiment would train the consultant to optimize for convincingness + epsilon badness to check if you just got "lucky" or if there are actually strong gradients towards honesty.

I think the BoN version of this is also somewhat interesting (e.g. "sample 8 answers arguing for A, 8 arguing for B, submit the one that the judge finds most convincing"; a sketch follows below), though it's somewhat unrealistic in that it is a little bit like assuming that the consultant has a perfect model of the judge which it can optimize against - which is unrealistic for current models, and removes the good things that can come out of imperfect modeling of the judge.

Feel free to DM me if you want to have a chat about any of this!
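A minimal sketch of that BoN setup, assuming hypothetical stand-ins for the consultant and judge calls (none of the names below are from the post or the comment above):

```python
import random

# Toy best-of-N consultancy: sample N arguments per side and submit the
# single argument the judge scores as most convincing. The consultant and
# judge below are random stand-ins for model calls, not a real API.

def sample_argument(side: str) -> str:
    return f"argument for {side} #{random.randrange(10_000)}"

def judge_convincingness(argument: str) -> float:
    return random.random()  # stand-in for a judge model's score

def best_of_n_consultancy(n: int = 8) -> str:
    candidates = [sample_argument(side) for side in ("A", "B") for _ in range(n)]
    # Taking the argmax against the judge's own score is what makes this
    # "a little bit like assuming the consultant has a perfect model of
    # the judge which it can optimize against".
    return max(candidates, key=judge_convincingness)

print(best_of_n_consultancy())
```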
Zachary Kenton
The post has now been edited with the updated plots for open consultancy/debate.

Thanks for the comment, Michael. Firstly, just wanted to clarify the framing of this literature review: when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do - generate and prioritise alignment research projects - rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc.). I'll add a clarification of that at the top. Now to your comments:

To your 1st point: I think the lack of specific assumptions about the AGI dev... (read more)

michaelcohen
On the 2nd point, the whole discussion of mu^prox vs. mu^dist is fundamentally about goal (mis)generalization. My position is that for a very advanced agent, point estimates of the goal (i.e. certainty that some given account of the goal is correct) would probably really limit performance in many contexts. This is captured by Assumptions 2 and 3. An advanced agent is likely to entertain multiple models of what their current understanding of their goal in a familiar context implies about their goal in a novel context. Full conviction in mu^dist does indeed imply non-wireheading behavior, and I wouldn't even call it misgeneralization; I think that would be a perfectly valid interpretation of past rewards. So that's why I spend so much time discussing relative credence in those models.
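A toy numerical illustration of that credence point (my own sketch, not from the paper; the payoff numbers are invented): under a credence-weighted mixture of mu^prox (reward is whatever the proximal signal reports) and mu^dist (reward tracks the distal state), which action the agent prefers flips with the relative credence between the two models:

```python
# Toy model: an agent is uncertain between two accounts of its goal.
#   mu_prox: reward is the proximal signal, so wireheading pays off fully.
#   mu_dist: reward tracks the distal state, so wireheading pays nothing.
# The payoff numbers are made up for illustration.

PAYOFF = {
    "wirehead": {"mu_prox": 1.0, "mu_dist": 0.0},
    "do_task":  {"mu_prox": 0.7, "mu_dist": 0.7},
}

def preferred_action(credence_prox: float) -> str:
    """Pick the action maximizing credence-weighted expected reward."""
    def expected(action: str) -> float:
        p = PAYOFF[action]
        return credence_prox * p["mu_prox"] + (1 - credence_prox) * p["mu_dist"]
    return max(PAYOFF, key=expected)

for c in (0.0, 0.5, 0.9):
    print(c, preferred_action(c))
# 0.0 -> do_task, 0.5 -> do_task, 0.9 -> wirehead: enough relative
# credence in mu_prox tips the agent into wireheading.
```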

I haven't considered this in great detail, but if there are n variables, then I think the causal discovery runtime is …. As we mention in the paper (footnote 5), there may be more efficient causal discovery algorithms that make use of certain assumptions about the system.
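As a rough illustration only - this is a generic interventional edge-discovery loop of my own, not the algorithm from the paper - even the simplest scheme of testing every ordered pair of variables under intervention performs O(n^2) intervention experiments, before counting the cost of each dependence test:

```python
from itertools import permutations

# Toy sketch (not the paper's algorithm): propose an edge cause -> effect
# whenever intervening on `cause` changes `effect`. With n variables this
# loop alone runs n*(n-1) intervention experiments.

def discover_edges(variables, responds_to_intervention):
    """responds_to_intervention(cause, effect) -> bool stands in for an
    interventional dependence test against the real system."""
    return [
        (cause, effect)
        for cause, effect in permutations(variables, 2)
        if responds_to_intervention(cause, effect)
    ]

# Hypothetical 3-variable chain A -> B -> C, with a stand-in test that
# only fires on the true direct edges.
truth = {("A", "B"), ("B", "C")}
print(discover_edges(["A", "B", "C"], lambda c, e: (c, e) in truth))
# [('A', 'B'), ('B', 'C')]
```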

On adoption, perhaps if one encounters a situation where the computational cost is too high, one could coarse-grain the variables to reduce their number. I don't have results on this at the moment but I expect that the presence of agency (no... (read more)
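To illustrate the coarse-graining idea (my own toy sketch; the grouping into clusters is hypothetical and not from the paper), merging fine-grained variables into cluster-level variables shrinks the variable count before discovery runs:

```python
# Toy coarse-graining: map fine-grained variables to named clusters, then
# treat each cluster as a single variable for causal discovery.

def coarse_grain(assignment: dict[str, str]) -> dict[str, list[str]]:
    """assignment maps each fine variable to a (hypothetical) cluster."""
    clusters: dict[str, list[str]] = {}
    for var, cluster in assignment.items():
        clusters.setdefault(cluster, []).append(var)
    return clusters

# Six fine variables become two coarse ones.
clusters = coarse_grain({
    "x1": "sensors", "x2": "sensors", "x3": "sensors",
    "a1": "actuators", "a2": "actuators", "a3": "actuators",
})
print(len(clusters))  # 2 variables instead of 6
```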