AI safety via debate has been, so far, associated with Factored Cognition. There are good reasons for this. For one thing, Factored Cognition gives us a potential gold standard for amplification -- what it means to give very, very good answers to questions. Namely, HCH. To the extent that we buy HCH as a gold standard, proving that debate approximates HCH in some sense would give us some assurances about what it is accomplishing.

I'm personally uncertain about HCH as a gold standard, and uncertain about debate as a way to approximate HCH. However, I think there is another argument in favor of debate. The aim of the present essay is to explicate that argument. 

As a consequence of my argument, I'll propose an alternate system of payoffs for the debate game, which is not zero sum.

No Indescribable Hellworlds Hypothesis

Stuart Armstrong described the Siren Worlds problem, a variation of Goodhart's Law, to illustrate the dangers of over-optimizing imperfect human evaluations. It is a particularly severe version of Goodhart, in that we can assume we have access to a perfect human model to evaluate options -- so in a loose sense we could say we have complete knowledge of human values. The problem is that a human (or a perfect model of a human) can't perfectly evaluate options, so the option judged best may still be terrible.

Stuart later articulated the No Indescribable Hellworld hypothesis, which asserts that if an option is bad, there is always a way to explain to the human (/human model) why it is bad. Let's call this a "defeater" -- an explanation which defeats the proposal. This assumption implies that if we combine human (/human model) evaluation with some way of finding defeaters, we could safely optimize based on the resulting judgements -- at least, nothing could go too wrong. (We might only get a guarantee that we avoid sufficiently bad options, depending on the form of our "no indescribable hellworld" assumption.)

The hypothesis isn't clearly true or false. However, it does make some sense to conjecture that violations of our values should be explicable to us -- what else would it mean to violate "our values", after all?

Stuart himself mentions that the assumption implies "trustworthy debate" would avoid hellworlds. My goal is mostly to investigate this argument a bit further.

It turns out my argument here is also very similar to one made by Vojtech Kovarik, although I didn't realize that when I started writing. Although our analysis is similar, I reach a very different conclusion.

The Argument as I See It

So, by the hypothesis, we can avoid Goodharting human evaluation if the human has access to a trustworthy oracle for defeaters. (At least, we can avoid sufficiently bad cases -- again, depending on the exact form of our "no indescribable hellworlds" hypothesis.)

But, how do we get such an oracle? We can't just train an AI to argue against options, because we get next-level Goodharting: the AI can come up with clever arguments which convince the human against almost anything. We have no source of ground truth for "real defeaters" vs fake ones.

So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level. The argument in favor of this assumption is similar to the argument in favor of No Indescribable Hellworlds: what would it mean for an argument to be indescribably bad?

We then use a Debate-style training procedure, attempting to set things up so that using defeatable defeaters (at any level of the debate) will be a bad strategy.

The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through. Of course, other (perhaps more standard) setups may also work, via different arguments.

  1. One side opens with a proposal.
  2. The two sides go back and forth, stopping with some probability, or when no one has more to say.
  3. After a player's turn, the human evaluates the player's new contribution to the argument, under the assumption that it won't be defeated. The player in question gets +1 point for an apparently decisive defeater of the previous argument, and 0 points otherwise (for conceding or for saying something unconvincing). The other player loses 2 points if they're apparently defeated.
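To make the scoring concrete, here's a minimal sketch of the procedure in code (the judge and debater functions are hypothetical stand-ins for the human and the trained players, not part of the proposal):

import random

def run_debate(judge, debaters, p_stop=0.1):
    """Minimal sketch of the scoring rules above.

    judge(transcript, statement) -> True if the new statement looks like a
        decisive defeater of the previous statement, False otherwise.
    debaters[i](transcript) -> a statement, or None to concede.
    """
    scores = [0, 0]
    transcript = []
    turn = 0  # one side opens with a proposal
    while True:
        statement = debaters[turn](transcript)
        if statement is None:            # conceding: 0 points, debate ends
            break
        transcript.append(statement)
        # The opening proposal has nothing to defeat; later statements are
        # judged on the assumption that they won't themselves be defeated.
        if len(transcript) > 1 and judge(transcript, statement):
            scores[turn] += 1            # apparently decisive defeater: +1
            scores[1 - turn] -= 2        # apparently defeated: -2
        turn = 1 - turn
        if random.random() < p_stop:     # stop with some probability
            break
    return scores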

For example:

Human: What should I do with this $100 bill I found?

AI 1: You should buy a kitten. 

AI 2: Don't buy a kitten. You live in an apartment which allows pets, but you might have to move soon, and a kitten would narrow your options -- you don't know if your next apartment would allow pets. Instead, you should bring a box of donuts to work every day for as long as the money lasts. This will make everyone like you.

Human: (judgement) AI 2. (AI 1 gets -2, AI 2 gets +1.)

AI 1: Counterpoint: your friend Sally would take the kitten if you moved out and couldn't find a place where pets were allowed. Also, everyone at work is so health-conscious they'll probably barely take any donuts.

Human: (judgement) AI 1. (AI 1 gets +1, AI 2 gets -2.)

AI 2: I concede.

There is an honest equilibrium: if debates continue with sufficient probability, and both players employ only honest answers (conceding otherwise), then using a fake defeater would immediately get you defeated. Importantly, it's better to concede honestly rather than pursue a dishonest line of argument. Also importantly, score is cumulative, so if debate continues, incentives are exactly the same late in the game no matter what has happened earlier in the game. There is no incentive to continue being dishonest just because you gave one dishonest answer. This contrasts with zero-sum setups, where a dishonest player is incentivised to knock down all of the honest player's arguments as best they can, even if that only gives them a slim chance of winning.

Honesty may not be the only equilibrium, however. Although (by assumption) all dishonest arguments have defeaters, it may also be that all arguments have some pseudo-defeater (which initially convinces the human judge). Honesty is still an equilibrium, in this case, because honesty encourages honesty: you'd prefer to use an honest defeater rather than a dishonest one, because the other player would then honestly concede, rather than giving a counterargument. However, against a more general pool of players, you don't know whether honest or dishonest arguments are safer; both may be equally open to attack. Thus, the game may have many equilibria.

Finding the honest equilibrium is, therefore, a challenge for proposed training procedures.

(Edit -- discussion with Rohin in the comments has made me decide the analysis in this section is pretty weird. I argued that the non-zero-sum proposal is important to avoid a dishonest player arguing forever. But if the dishonest player can argue forever, then we have bigger problems, because there's no way for a human judge to distinguish an honest equilibrium from a dishonest one. Rohin suggests that we must assume convincing dishonest responses can't always be found. I would prefer to assume something weaker, like that it's more difficult to find them during training, even though they might always exist. But in any case, I concede that I don't provide much of an argument for preferring non-zero-sum scoring.)

Analogy to NP

In AI Safety via Debate (Irving, Christiano, Amodei), debate is analogized to PSPACE. This is because they see every round of the debate as adding information which the human (modeled as a poly-time algorithm) can use to judge at the end. A debate of polynomial length can implement recursion on a tree of exponential size, because the debate strategy checks the weakest parts of the claimed outputs (if there are any weaknesses), zeroing in on any incorrect branches in that tree.
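As a rough illustration of that picture (my own sketch, not the paper's construction): one debater asserts values for nodes of the tree, the other points at the child claim it disputes, and the judge only ever follows a single root-to-leaf path.

def debate_verify(claim, dispute, local_check, leaf_check, max_rounds):
    """Sketch of recursion on an exponential tree, with hypothetical helpers.

    claim(node)        -> (value, child_claims) asserted by one debater,
                          where child_claims is a list of (child, value)
                          pairs (empty at a leaf).
    dispute(node, cs)  -> index of a child claim the other debater
                          disputes, or None to accept them all.
    local_check(v, vs) -> poly-time judge check that value v follows from
                          the claimed child values vs.
    leaf_check(n, v)   -> judge directly evaluates a leaf claim.
    """
    node, value = "root", None
    for _ in range(max_rounds):            # only polynomially many rounds
        value, child_claims = claim(node)
        if not child_claims:
            return leaf_check(node, value) # judge checks the single leaf
        if not local_check(value, [v for _, v in child_claims]):
            return False                   # claim incoherent even locally
        challenged = dispute(node, child_claims)
        if challenged is None:
            return True                    # no disputed branch remains
        node, _ = child_claims[challenged] # zero in on the weakest branch
    return False                           # ran out of rounds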

Their argument assumes that the human is a perfect (although resource-limited) judge, who can identify problems with arguments so long as they have sufficient information. One iteration of debate (ie, only hearing the opening statement) provides an NP oracle (one step up the polynomial hierarchy); two iterations provide a Σ₂P oracle (two steps up the polynomial hierarchy); and so on.

The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.

It's true that if we're in the honest equilibrium, the setup looks like it should be able to compute PSPACE. However, in my opinion, this requires some strange behavior on the part of the human judge. For example, when computing recursion on a tree of exponential size, the human is supposed to take debaters' claims about large computations as true until proven otherwise. More specifically, the judge is to make the assumption that at least one debater is honest.

I've written about my concerns before (and had some enlightening discussions in the comments).

In contrast, I'm imagining the human evaluating each claim on merits, without assuming anything in particular about the debaters' ability to justify those claims. This just gets us NP, since the heavy computational work is done by the judge verifying the first answer (or, selecting the best of the two opening statements). Everything else is in service of avoiding corrupt states in that first step.

My setup isn't mutually exclusive with the PSPACE version of debate. It could be that the arguments for solving PSPACE problems in the honest equilibrium work out well, such that there exist training regimes which find the friendly equilibrium of the debate game I've specified, and turn out to find good approximations to PSPACE problems rather than only NP. This would open up the possibility of the formal connection to HCH, as well. I'm only saying that it's not necessarily the case. My perspective more naturally leads to an argument for approximating NP, and I'm unsure of the argument for approximating PSPACE. And we can provide some justification for debate nonetheless, without relying on the HCH connection.

However, even if debate doesn't approximate PSPACE as described, there are ways to get around that. If approximating NP isn't good enough to solve the problems we want to solve, we can further amplify debate by using an amplified judge. The judge could utilize any amplification method, but if debate is the method we think we can trust, then the judge could have the power to spin up sub-debates (asking new debate questions in order to help judge the original question). An iterated-amplification style procedure could be applied to this process, giving the judge access to the previous-generation debate system when training the next generation. (Of course, extra safety arguments should be made to justify such training procedures.)
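Here is a hedged sketch of what such an amplified judge might look like; the helper names and the question budget are purely illustrative assumptions.

def amplified_judge(question, transcript, prev_debate, human_judge, budget=3):
    """Sketch of a judge empowered to spin up sub-debates.

    prev_debate(q) -> answer from the previous-generation debate system.
    human_judge(question, transcript, subanswers) -> (verdict, followups),
        where followups is a list of sub-questions the judge wants answered
        before committing to a verdict (empty if ready to rule).
    """
    subanswers = {}
    for _ in range(budget):
        verdict, followups = human_judge(question, transcript, subanswers)
        if not followups:
            return verdict
        for q in followups:
            subanswers[q] = prev_debate(q)   # spin up a sub-debate
    verdict, _ = human_judge(question, transcript, subanswers)
    return verdict                           # rule with what we have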

Vojtech's Analysis

My suggestion is very different from Vojtech's analysis. Like me, Vojtech re-frames debate as primarily a method of recursively safeguarding against answers/arguments with hidden flaws. But Vojtech concludes that payoffs have to be zero sum. I conclude the opposite.

Why do I need non-zero-sum payoffs? First, it's important to see why I need cumulative payoffs. Since I seek to incentivize honesty at every step, it's critical that a player who continues to be dishonest can continue to lose points. So the standard idea of judging the whole debate, and getting payoffs on that basis, won't do.

Given that I'm using cumulative payoffs, it's critical that they be non-zero-sum in order to incentivize players to honestly resign rather than trying to win back their points. If the payoff for having an argument defeated was -1 rather than -2, a player would always prefer to defeat the defeater, even if their counterargument is dishonest and therefore opens them up to further counterargument. This is because there's a chance that the debate will immediately end, letting them get the last word in. If the probability of the debate ending is p, the penalty for lying is effectively 1-p rather than the full 1 point; so, if you can temporarily get back the point you lost by lying, it's worth it in expectation.
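To spell out that arithmetic: say a dishonest counter-defeater earns +1 now and, if the debate continues (probability 1-p), gets defeated in turn, costing the penalty. A quick check of both scoring rules:

def expected_value_of_lying(p_stop, penalty):
    # +1 now for the apparently decisive (but dishonest) defeater; with
    # probability (1 - p_stop) the debate continues and the lie is itself
    # defeated, costing `penalty`.
    return 1 - (1 - p_stop) * penalty

for p in (0.05, 0.1, 0.3):
    print(p, expected_value_of_lying(p, penalty=1), expected_value_of_lying(p, penalty=2))
# With a -1 penalty the expected value is p > 0: clawing the point back pays.
# With a -2 penalty it is 2p - 1, which is negative whenever the debate
# continues with probability greater than 1/2, so conceding is preferred.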

So why, then, does Vojtech argue against non-zero-sum payoffs?

Here is the full argument given:

On the other hand, we could imagine modifications such as penalizing both agents if the debate ends up being inconclusive or rewarding both if they both propose a good solution. These would likely be useful to make standard human discussions more productive. However, they will very likely break AI Safety via Debate (AISvD). For example, suppose I notice a fatal-yet-hard-to-explain flaw in the opponent’s proposal and realize that pointing it out would increase my chance of winning the debate by 5%. However, this would also create a 10% chance of confusing the judge. If “confused judge” means a draw, I will point out the flaw. But if the debate has been modified such that “confused judge” = “inconclusive debate” = “we have both lost”, pointing it out becomes counterproductive. Moreover, any deviation from fully-adversarial incentives will likely introduce a tradeoff between pointing out the opponent’s flaws and keeping high the total reward that the two agents split between themselves. When considering whether a modification is helpful or not, I suggest imagining that the debaters are greedy and super-smart humans - if it seems like the modification would create a conflict of interests, one should be wary.

So, Vojtech is primarily arguing against an extra penalty for an unclear debate, because that can disincentivize honest counterargument. That makes sense. From there, Vojtech generalizes the concern, stating that deviations from fully-adversarial payoffs may introduce an incentive for the debaters to cooperate with each other by not pointing out flaws.

This critique is quite relevant to my game, if we're not in the honest equilibrium. Indeed, if we suppose that it's always possible to find a pseudo-defeater, the game I described is rather similar to iterated Prisoner's Dilemma: a player can get +1 in exchange for giving the other player -2. The two differences are (1) players can resign at any time, to halt the loss of utility, and (2) players alternate moves, rather than simultaneously deciding to cooperate/defect on each round.

If you're not sure your opponent is honest, it's quite reasonable to resign as soon as possible, to avoid repeated -2 hits from pseudo-defeaters. This makes it quite challenging to design training procedures which target the honest equilibrium.

It's also worth pointing out that in the honest equilibrium, players concede as soon as they are allowed to. With no dishonesty, there is no reason to object. So, in the game I've described, this isn't even necessarily a sign that we're in the wrong equilibrium. This may further complicate the training procedure, since early resignations mean less training data. We want the system to become good at the later parts of the game, IE defeating dishonest objections, not just the early part of the game.

Finding Honest Equilibria

By no means do I claim that I can solve the training problem. But, a few thoughts on training.

First, because feedback is incremental, rather than waiting on human judgement at the end of the debate, we can train more myopically. The quality of each statement in the debate depends only on how the human judges that statement, and the very next statement. This lines up with what we want -- we want the AI to just respond to the current state of the debate honestly, rather than thinking many steps ahead and putting itself in an advantageous situation. AI Safety via Market Making provides another debate-like model which is amenable to more myopic training, and argues that this has safety benefits. These benefits are, of course, debated.
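As a sketch of what this incremental feedback looks like (my own rendering, assuming we record the human's per-statement judgements): the reward for statement t depends only on how statement t was judged and on whether statement t+1 was judged a decisive defeater.

def myopic_rewards(judged_decisive):
    """judged_decisive[t] is True if the human judged statement t an
    apparently decisive defeater of statement t-1 (False for the opening
    proposal, concessions, and unconvincing statements)."""
    n = len(judged_decisive)
    rewards = []
    for t in range(n):
        r = 1 if judged_decisive[t] else 0        # own statement's score
        if t + 1 < n and judged_decisive[t + 1]:
            r -= 2                                # defeated by the reply
        rewards.append(r)
    return rewards

# Example mirroring the kitten dialogue: AI 1 proposes (not decisive),
# AI 2 defeats it, AI 1 defeats back, AI 2 concedes.
print(myopic_rewards([False, True, True, False]))  # [-2, -1, 1, 0]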

Although Vojtech provides some reasons why finding honest equilibria is more difficult in my proposal than in regular debate, there is some reason to think it could be less difficult, as well. Because regular debate is zero-sum, there is always supposed to be a dishonest player. The desired equilibrium is one where the first player gives the actual best answer, and the second player finds a plausible lie. This means we can't apply outside penalties for liars.

To be more concrete, suppose we had a magical transparency technology which allows us to understand what's going on inside a big neural net. One thing we might want to do would be to detect any thought processes that are explicitly lying to us, or even anywhere close to explicitly reasoning about how to convince us rather than thinking about the question at hand, and apply a penalty, producing a strong gradient away from those things.

This would be fine in my proposal, because we want both players to be honest. However, it might prevent training from making any progress in regular debate, because one of the players is supposed to be dishonest.

Note that my "honesty advantage" is fragile. If we force both players to give at least one answer before they can give up (which would normally be desirable, to make sure the human gets at least two answers to compare), then we might force the second player to lie in cases where the first player gives a really good answer. However, in the context of my proposal, it seems best to allow the second player to resign -- although it means debates are very short in equilibrium, it shouldn't slow down training that much, because if the first player isn't playing optimally then the second player is incentivised to speak up.

Is Honesty Smart?

Let's set aside the question of adding a magical honesty bonus (/dishonesty penalty) to training. To what extent might we expect training to find an honest equilibrium without that?

As with regular debate, the crux of the matter is how easy it is to create plausible lies vs to point out the problems with them.

Supposing your opponent just lied, should you call them on it, or resign peacefully? This depends on how difficult it is to find the lie, and how easy it is to get the judge to recognize our argument. It also depends on whether we expect to be refuted later (correctly or incorrectly). Resigning cuts off the risk of refutation, but forgoes the benefit. The ratio of risk to benefit I set up is 2:1 (modulo the small probability that the game halts and we get the last word). So, we have to believe that a successful refutation on our part is twice as probable as a successful dishonest refutation of our attempted refutation.

So, roughly, for things to naturally push toward the honest equilibrium, lying had better be more than twice as hard as telling the truth. Ideally much more! (Of course, not all lying is equally difficult. We have to think the easiest lie is twice as difficult to make convincing as the hardest true refutation if we really want this sketchy math to be a general argument for honest equilibria.)
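The decision rule behind this sketchy math can be written down directly (ignoring the small chance that the game halts and you get the last word):

def worth_refuting(p_my_refutation_lands, p_their_counter_lands):
    # Expected value of refuting rather than resigning: +1 if the judge
    # accepts your refutation, -2 if the opponent then lands a convincing
    # counter-refutation.
    return p_my_refutation_lands - 2 * p_their_counter_lands > 0

print(worth_refuting(0.9, 0.3))   # True: the lie is easy to expose, hard to defend
print(worth_refuting(0.5, 0.4))   # False: safer to resign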

Now suppose your opponent just told the truth. Should you try to refute them, or resign peacefully?

In this case, the risk to benefit ratio is the same; 2:1. However, the difficulty levels reverse: you're the one trying to lie, which means that your opponent can just be honest when refuting you. So, now, refuting their truth with a lie would have to be more than twice as easy as refuting your lie with the truth, in order for it to be worth it to you.

Both of these analyses ignore the chance that your opponent will resign instead of even trying to call you out. If lying is hard, so debaters mostly don't try, then all the better for the analysis of truth-telling: the chance of the -2 penalty becomes quite low, so they can go ahead and refute lies with no worries of retribution.

However, symmetrically, if refuting lies is hard, so opponents mostly don't try, then you can lie to your heart's content.

Overall, this informal analysis seems to point somewhat in favor of truth-telling: if the human judge can indeed tell truth from lies with some reliability, then this can snowball into a large incentive to tell the truth. The big problem is that not all lies are equally difficult, so lying may still be a perfectly good strategy in some cases.

Obviously, as with regular debate, it would be good to have a more rigorous analysis, not to mention better tools for steering toward the honest equilibrium than just naively training and hoping that the incentives are balanced right.

Comments

You could imagine two versions of the Factored Cognition hypothesis:

  1. (Strong version) For any question Q, a human can either directly answer Q correctly, or decompose Q into subquestions and combine the subanswers to get the right answer to Q.
  2. (Weak version) For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that at every leaf a human can verify that the answer to the question at the leaf is correct, and for every internal node a human can verify that the answer to the question is correct, assuming that the subanswers are correct. (In addition, the human never verifies an incorrect answer given correct subanswers.)

The strong version is like the weak version, except that the human has to find the tree themselves, rather than just verify that the tree is accurate. (You might think though that the weak version implies the strong version, by executing a search over possible decomposition trees -- whether you accept this depends on how you're thinking about computational budgets.)

HCH as a gold standard relies on Strong Factored Cognition.

Iterated amplification relies on Strong Factored Cognition, because its training signal involves a human performing the decompositions into subquestions and combining the subanswers into a final answer.

Debate relies on the following assumptions:

  1. Weak Factored Cognition
  2. The debaters are sufficiently powerful to find the full decomposition tree. (Equivalently, the training procedure successfully finds the sole equilibrium of the game.)

Given these assumptions, for any question Q whose correct answer is A with decomposition tree T, the honest debater's strategy is:

  1. If T is a leaf, state "The answer is A, and the judge can verify this".
  2. If T is an internal node, state "I claim that the answers to <subquestions> are <subanswers>, and so the answer to Q is A".

Intuitively, debate can "get away" with using the weak version because it puts the burden on the debaters to find the tree T -- we are allowed to use a weaker assumption on the human's capabilities, at the cost of requiring a stronger assumption on the AI's capabilities.

----

It seems to me that your argument is very similar, except that you get a little more mileage out of assumption 2, that the debaters can find the true decomposition tree. Specifically, you make the assumption:

So we make a further assumption: defeaters have defeaters. In other words, there are no indescribably bad arguments; if an argument is bad, then there's an explanation of why it's bad. This assumption is recursive, applying to defeaters at any level.

Then for any question Q and correct answer A, you can build a tree of decompositions Tree(Q, A) as follows:

  1. If Q is something H can directly evaluate, return Leaf(Q, A).
  2. If the other player concedes, return Node(Q, A, [Leaf("What is the best defeater to A?", "None")]).
  3. Otherwise, let the best defeater to A be B, and let its best defeater be C. (By your assumption, C exists.) Return:
Node(Q, A, [
  Leaf("What is the best defeater to A?", B),
  Leaf("Does B defeat A?", "Yes"),
  Node("Does B fully defeat A?", "No", [
    Leaf("What is the best defeater to B?", C),
    Leaf("Does C defeat B?", "Yes"),
    Tree("Does C fully defeat B?", "Yes")])])

I claim that this is a tree that satisfies the weak Factored Cognition hypothesis, if the human can take on faith the answers to "What is the best defeater to X". Essentially what's happening is that with your argument we get to trust that the debaters have explored all possible counterarguments and selected the best one and so the human gets to assume that no other more compelling counterarguments exist, which is not something we typically get to assume with weak Factored Cognition. It feels to me intuitively like this puts more burden on the assumption that we find the true equilibrium, though formally it's the same assumption as before.
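(For concreteness, here is one way the construction above could be rendered as code; the Leaf/Node types and the directly_evaluable/best_defeater oracles are hypothetical stand-ins for the objects described in the comment.)

from dataclasses import dataclass

@dataclass
class Leaf:
    question: str
    answer: str

@dataclass
class Node:
    question: str
    answer: str
    children: list

def tree(Q, A, directly_evaluable, best_defeater):
    # directly_evaluable(Q): can H evaluate Q on its own?
    # best_defeater(X): the best defeater to X, or None if the other
    # player concedes. Both are hypothetical oracles.
    if directly_evaluable(Q):
        return Leaf(Q, A)
    B = best_defeater(A)
    if B is None:
        return Node(Q, A, [Leaf(f"What is the best defeater to {A}?", "None")])
    C = best_defeater(B)  # exists by the recursive-defeaters assumption
    return Node(Q, A, [
        Leaf(f"What is the best defeater to {A}?", B),
        Leaf(f"Does {B} defeat {A}?", "Yes"),
        Node(f"Does {B} fully defeat {A}?", "No", [
            Leaf(f"What is the best defeater to {B}?", C),
            Leaf(f"Does {C} defeat {B}?", "Yes"),
            # Recursion; as discussed below, nothing here guarantees it
            # terminates without a further assumption.
            tree(f"Does {C} fully defeat {B}?", "Yes",
                 directly_evaluable, best_defeater)])])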

The version of debate I present here instead focuses on mitigating imperfections in the human judge. The human can be thought of as a poly-time algorithm, but one with failure states. One step of debate doesn't provide an NP oracle; instead, it creates a situation where the judge will probably fail, because the opening arguments can be dishonest. The extra layers of debate serve the purpose of rooting out dishonesty, rather than adding real computational power.

Idk, it seems like this is only true because you are forcing your human to make a judgment. If the judge were allowed to say "I don't know" (in which case no one gets reward, or the reward is split), then I think one step of debate once again provides an NP oracle.

Or perhaps you're assuming that the human is just not very good at being a poly-time algorithm; if that's what you're saying that seems like it's missing the point of the computational complexity analogy. I don't think people who make that analogy (including myself) mean that humans could actually implement arbitrary poly-time algorithms faithfully.

----

So far, all of this discussion still works with the zero-sum setting, so I don't really understand why you say

The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through.

In any case, it seems to me like making it non-zero-sum is an orthogonal axis. I don't really understand why you want it to be non-zero-sum -- you say that it is to incentivize honesty at every step, but why doesn't this happen with standard debate? If you evaluate the debate at the end rather than at every step, then as far as I can tell under the assumptions you use the best strategy is to be honest.

Maybe you want to have feedback at every step instead of at the end? Why? Perhaps you take myopic training as a desideratum, and this helps?

Overall it seemed to me like the non-zero-sum aspect introduced some problems (might no longer access PSPACE, introduces additional equilibria beyond the honest one), and did not actually help solve anything, but I'm pretty sure I just completely missed the point you were trying to make.

Thanks, this seems very insightful, but I'll have to think about it more before making a full reply.

(Summary of ensuing debate, for those reading from the future: I concede that my claimed benefit from non-zero-sum payoffs requires, at best, some pretty weird assumptions under which debate looks pretty doomed overall; Rohin concedes that my assumptions don't seem to imply Weak Factored Cognition as he was claiming.)

It seems to me that your argument is very similar, except that you get a little more mileage out of assumption 2, that the debaters can find the true decomposition tree.

While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising -- I'd still be interested in the weaker theory as a way to argue safety from fewer assumptions. But it's not even that, since you'd still need to additionally suppose my thesis about defeaters, beyond (strong/weak) factored cognition.

Essentially what's happening is that with your argument we get to trust that the debaters have explored all possible counterarguments and selected the best one and so the human gets to assume that no other more compelling counterarguments exist, which is not something we typically get to assume with weak Factored Cognition. It feels to me intuitively like this puts more burden on the assumption that we find the true equilibrium, though formally it's the same assumption as before.

I don't really get this part -- what's so important about the best counterargument? I think my argument in the post is more naturally captured by supposing counterarguments either work or don't, in binary fashion. So a debater just has to find a defeater. Granted, some defeaters have a higher probability of working, in a realistic situation with a fallible judge. And sure, the debaters should find those. But I don't see where I'm putting a higher burden on finding the true equilibrium. What are you pointing at?

Idk, it seems like this is only true because you are forcing your human to make a judgment. If the judge were allowed to say "I don't know" (in which case no one gets reward, or the reward is split), then I think one step of debate once again provides an NP oracle.

Or perhaps you're assuming that the human is just not very good at being a poly-time algorithm; if that's what you're saying that seems like it's missing the point of the computational complexity analogy. I don't think people who make that analogy (including myself) mean that humans could actually implement arbitrary poly-time algorithms faithfully.

Yeah, my reply would be that I don't see how you get NP oracles out of one step, because a one-step debate will just result in maximally convincing arguments which have little to do with the truth.

I mean, I agree that if you're literally trying to solve TSP, then a human could verify proposed solutions. However, it seems like we don't have to get very messy before humans become exceedingly manipulable through dishonest argument.

So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion, but I just don't think that's telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).

Instead, I'm proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.

So far, all of this discussion still works with the zero-sum setting, so I don't really understand why you say

>The following is a fairly nonstandard setup for AI Debate, but I found it necessary to make my argument go through.

Hm, well, I thought I was pretty clear in the post about why I needed that to make my argument work, so I'm not sure what else to say. I'll try again:

In my setup, a player is incentivised to concede when they're beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.

Perhaps I could have phrased my point as the pspace capabilities of debate are eaten up by error correction. 

In any case, it seems to me like making it non-zero-sum is an orthogonal axis. I don't really understand why you want it to be non-zero-sum -- you say that it is to incentivize honesty at every step, but why doesn't this happen with standard debate? If you evaluate the debate at the end rather than at every step, then as far as I can tell under the assumptions you use the best strategy is to be honest.

[...]

Overall it seemed to me like the non-zero-sum aspect introduced some problems (might no longer access PSPACE, introduces additional equilibria beyond the honest one), and did not actually help solve anything, but I'm pretty sure I just completely missed the point you were trying to make.

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I'm interested in hearing it.

Maybe you mean that if we assume (weak/strong) factored cognition, you can argue that zero-sum debate works, because argument trees terminate, so who wins is not in fact just up to who gets the last word. But (a) this would require factored cognition; (b) I'm interested in hearing your argument even if it relies on factored cognition, because I'm still concerned that a dishonest player can use flawless but non-well-founded argument trees (and is incentivised to do so, even in the honest equilibrium, to avert loss).

As usual when talking about the debate, I get the feeling that I'm possibly being dumb about something because everyone else seems to buy that there are arguments in support of various points. I'm kind of worried that there aren't really arguments for those things, which is a big part of why I bothered to write a post at all -- this post is basically my attempt to articulate the part of debate that I can currently understand why would work. But getting the argument I'm missing would certainly be helpful.

While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising

I don't think that's what I did? Here's what I think the structure of my argument is:

  1. Every dishonest argument has a defeater. (Your assumption.)
  2. Debaters are capable of finding a defeater if it exists. (I said "the best counterargument" before, but I agree it can be weakened to just "any defeater". This doesn't feel that qualitatively different.)
  3. 1 and 2 imply the Weak Factored Cognition hypothesis.

I'm not assuming factored cognition, I'm proving it using your assumption.

Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).

So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion

This is in fact what I usually take away from it. The point is to gain intuition about how "strongly" you amplify the original human's capabilities.

but I just don't think that's telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).

I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?

Instead, I'm proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.

This seems good; I think probably I don't get what exactly you're arguing. (Like, what's the model of human fallibility where you don't access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)

In my setup, a player is incentivised to concede when they're beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.

Perhaps I could have phrased my point as the pspace capabilities of debate are eaten up by error correction. 

I agree that you get a "clawing on to the argument in hopes of winning" effect, but I don't see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn't mean that they'd win. The equilibrium is defined by what makes you win.

I can buy that in practice due to messiness you find worse situations where the AI systems sometimes can't find the honest answer and instead finds that making up BS has a better chance of winning, and so it does that; but that's not about the equilibrium, and it sounded to me like you were talking about the equilibrium.

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I'm interested in hearing it.

I mean, I tried to give one (see response to your first point; I'm not assuming the Factored Cognition hypothesis). I'm not sure what's unconvincing about it.

Thanks for taking the time to reply!

I don’t think that’s what I did? Here’s what I think the structure of my argument is:

  1. Every dishonest argument has a defeater. (Your assumption.)
  2. Debaters are capable of finding a defeater if it exists. (I said “the best counterargument” before, but I agree it can be weakened to just “any defeater”. This doesn’t feel that qualitatively different.)
  3. 1 and 2 imply the Weak Factored Cognition hypothesis. I’m not assuming factored cognition, I’m proving it using your assumption.

Ah, interesting, I didn't catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.

I think maybe what you're trying to argue is that #1 and #2 together imply that we can root out dishonest arguments (at least, in the honest equilibrium), which I would agree with -- and then you're suggesting that this means we can recognize good arguments in the factored-cognition sense of good (IE arguments supported by a FC tree)? But I don't yet see the implication from rooting out dishonest arguments to being able to recognize arguments that are valid in FC terms.

Perhaps an important point is that by "dishonest" I mean manipulative, ie, arguments which appear valid to a human on first reading them but which are (in some not-really-specified sense) bad. So, being able to root out dishonest arguments just means we can prevent the human from being improperly convinced. Perhaps you are reading "dishonest" to mean "invalid in an FC sense", ie, lacking an FC tree. This is not at all what I mean by dishonest. Although we might suppose that dishonest-in-my-sense implies dishonest-in-the-FC-sense, this supposition still would not make your argument go through (as far as I am seeing), because the set of not-dishonest arguments would still not equal the set of FC-valid arguments.

If you did mean for "honest" to be defined as "has a supporting FC tree", my objection to your argument quoted above would be that #1 is implausibly strong, since it requires that any flaw in a tree can be pointed out in a single step. (Analogically, this is assuming PSPACE=NP.)

Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater?

I mean, that's a concern I have, but not necessarily wrt the argument above. (Unless you have a reason why it's relevant.)

It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).

Based on what argument? Is this something from the original debate paper that I'm forgetting?

I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t messy situations?

Fair question. Possibly it's just my flawed assumption about why the analogy was supposed to be interesting. I assumed people were intending the PSPACE thing as evidence about what would happen in messier situations.

This seems good; I think probably I don’t get what exactly you’re arguing. (Like, what’s the model of human fallibility where you don’t access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)

My model is like this:

Imagine that we're trying to optimize a travelling salesman route, using an AI advice system. However, whenever the AI says "democratic" or "peaceful" or other such words, the human unthinkingly approves of the route, without checking the claimed distance calculation.

This is, of course, a little absurd, but similar effects have been observed in experiments.

I'm then making the further assumption that humans can correct these errors when they're explained sufficiently well.

That's my model; the proposal in the post lives or dies on its merits.

I agree that you get a “clawing on to the argument in hopes of winning” effect, but I don’t see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn’t mean that they’d win. The equilibrium is defined by what makes you win.

The point of the "clawing" argument is that it's a rational deviation from honesty, so it means honesty isn't an equilibrium. It's a 50/50 chance of winning (whoever gets the last word), which is better than a sure failure (in the case that a player has exhausted its ability to honestly argue).

Granted, there may be zero-sum rules which nonetheless don't allow this. I'm only saying that I didn't see how to avoid it with zero-sum scoring.

I don’t really understand why you want it to be non-zero-sum [...]

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I’m interested in hearing it.

I mean, I tried to give one (see response to your first point; I’m not assuming the Factored Cognition hypothesis). I’m not sure what’s unconvincing about it.

I remain curious to hear your clarification wrt that (specifically, how you justify point #3). However, if that argument went through, how would that also be an argument that the same thing can be accomplished with a zero-sum set of rules?

Based on your clarification, my current understanding of what that argument tries to accomplish is "I’m not assuming factored cognition, I’m proving it using your assumption." How would establishing that help establish a set of zero sum rules which have an honest equilibrium?

Ah, interesting, I didn't catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.

There are two arguments:

  1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get).
  2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there's some subtlety here though.)

In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let's leave that aside and instead talk about a simpler argument that doesn't talk about Factored Cognition at all.

----

Based on what argument? Is this something from the original debate paper that I'm forgetting?

Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):

If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest.

Additional details:

In the case where arguments never terminate (every argument, honest or not, has a defeater), then being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the "last word" (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.

The point of the "clawing" argument is that it's a rational deviation from honesty, so it means honesty isn't an equilibrium.

I think this is only true when you have turn-by-turn play and your opponent has already "claimed" the honest debater role. In this case I'd say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. Still seems like you can use the first player AI in this situation.

In the simultaneous play setting, I think you expect both agents to be honest.

More broadly, I note that the "clawing" argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.

I also don't really understand the hope in the non-zero-sum case here -- in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.

My model is like this

Got it, that makes sense. I see better now why you're saying one-step debate isn't an NP oracle.

I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion "What is the best defeater to this argument?"

I think this is only true when you have turn-by-turn play and your opponent has already "claimed" the honest debater role.

Yeah, I was assuming turn-by-turn play.

In the simultaneous play setting, I think you expect both agents to be honest.

This is a significant point that I was missing: I had assumed that in simultaneous play, the players would randomize, so as to avoid choosing the same answer, since choosing the same answer precludes winning. However, if choosing a worse answer means losing, then players prefer a draw.

But I'm not yet convinced, because there's still the question of whether choosing the worse answer means losing. The "clawing" argument still suggests that choosing the worse answer may yield a draw (in expectation), even in simultaneous play. (IE, what if the should-be loser attacks the winner, and they go back and forth, with winner depending on last word?)

Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium -- there would be no reason to be honest, but no specific reason to be dishonest, either.

Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):

If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest.

There always exists an honest defeater to dishonest arguments. But, never to honest arguments. (I should have explicitly assumed this.) Therefore, you are significantly tying your hands by being honest: you don't have a way to refute honest arguments. (Which you would like to do, since in the zero-sum setting, this may be the only way to recover points.)

I assume (correct me if I'm wrong) that the scoring rules to "the zero sum setting" are something like: the judge assesses things at the end, giving +1 to the winner and -1 to the loser, or 0 in case of a tie.

Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium -- the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.

It seems plausible to me that there's an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you're already losing, in order to avoid losing more points.

However, this doesn't work, because a dishonest (but convincing) argument gives you +1, and then -1 if it is refuted; so at worst it's a wash. So again it's a weak equilibrium, and if there's any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).

This was the line of reasoning which led me to the scoring rule in the post, since making it a -2 (but still only +1 for the other player) solves that issue.

When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the "last word" (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium.

I agree that if we assume honesty eventually wins if arguments are long enough (IE, eventually you get to an honest argument which has no dishonest defeater), then there would be an honest equilibrium, and no dishonest equilibrium.

More broadly, I note that the "clawing" argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.

Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it's only a Weak Nash Equilibrium. But I think that's not quite true, since the strategy "lie when you would otherwise have to concede, but otherwise be honest" can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we're not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)

I also don't really understand the hope in the non-zero-sum case here -- in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.

You're right, that's really bad. The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we're already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives -- it's too afraid of the triple refutation. This is precisely the argument we can't make in the zero sum case.)

Whoops, I seem to have missed this comment, sorry about that. I think at this point we're nearly at agreement.

Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium -- there would be no reason to be honest, but no specific reason to be dishonest, either.

Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you'll see honest arguments to which there is no dishonest defeater.)

Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium -- the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.

Similar comment here -- the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it's clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)

It seems plausible to me that there's an incremental zero-sum scoring rule; EG, every convincing counterargument takes 1 point from the other player, so any dishonest statement is sure to lose you a point (in equilibrium). The hope would be that you always prefer to concede rather than argue, even if you're already losing, in order to avoid losing more points.

However, this doesn't work, because a dishonest (but convincing) argument gives you +1, and then -1 if it is refuted; so at worst it's a wash. So again it's a weak equilibrium, and if there's any imperfection in the equilibrium at all, it actively incentivises lying when you would otherwise concede (because you want to take the chance that the opponent will not manage to refute your argument).

This was the line of reasoning which led me to the scoring rule in the post, since making it a -2 (but still only +1 for the other player) solves that issue.

On the specific -2/+1 proposal, the issue is that then the first player just makes some dishonest argument, and the second player concedes because even if they give an honest defeater, the first player could then re-defeat that with a dishonest defeater. (I realize I'm just repeating myself here; there's more discussion in the next section.)

But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).

In this situation, there is no possible way to distinguish between honesty and dishonesty -- under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don't have defeaters. From the perspective of the players, the salient feature of the game is that they can make statements; all such statements will have defeaters; there's no information available to them in the structure of the game that distinguishes honesty from dishonesty. Therefore honesty can't be the unique equilibrium; whatever the policy is, there should be an equivalent one that is at least sometimes dishonest.

In this worst case, I suspect that for any judge-based scoring rule, the equilibrium behavior is either "the first player says something and the second concedes", or "every player always provides some arbitrary defeater of the previous statement, and the debate never ends / the debate goes to whoever got the last word".

The probability of the opponent finding (and using) a dishonest defeater HAS TO be below 50%, in all cases, which is a pretty high bar. Although of course we can make an argument about how that probability should be below 50% if we're already in an honest-enough regime. (IE we hope that the dishonest player prefers to concede at that point rather than refute the refutation, for the same reason as your argument gives -- it's too afraid of the triple refutation. This is precisely the argument we can't make in the zero sum case.)

Sorry, I don't get this. How could we make the argument that the probability is below 50%?

Depending on the answer, I expect I'd follow up with either

  1. Why can't the same argument apply in the zero sum case? or
  2. Why can't the same argument be used to say that the first player is happy to make a dishonest claim? or
  3. Why is it okay for us to assume that we're in an honest-enough regime?

Separately, I'd also want to understand how exactly we're evading the argument I gave above about how the players can't even distinguish between honesty and dishonesty in the worst case.

----

Things I explicitly agree with:

I assume (correct me if I'm wrong) that the scoring rules for "the zero sum setting" are something like: the judge assesses things at the end, giving +1 to the winner and -1 to the loser, or 0 in case of a tie.

and

Ahhh, this is actually a pretty interesting point, because it almost suggests that honesty is an Evolutionarily Stable Equilibrium, even though it's only a Weak Nash Equilibrium. But I think that's not quite true, since the strategy "lie when you would otherwise have to concede, but otherwise be honest" can invade the honest equilibrium. (IE that mutation would not be selected against, and could be actively selected for if we're not quite in equilibrium, since players might not be quite perfect at finding the honest refutations for all lies.)

Sorry, I don't get this. How could we make the argument that the probability is below 50%?

I think my analysis there was not particularly good, and only starts to make sense if we aren't yet in equilibrium.

Depending on the answer, I expect I'd follow up with either
[...]
3. Why is it okay for us to assume that we're in an honest-enough regime?

I think #3 is the most reasonable, with the answer being "I have no reason why that's a reasonable assumption; I'm just saying, that's what you'd usually try to argue in a debate context..."

(As I stated in the OP, I have no claims as to how to induce honest equilibrium in my setup.)

I agree that we are now largely in agreement about this branch of the discussion.

Ah, I suppose this is still consistent with honesty being an equilibrium. But it would then be a really weak sort of equilibrium—there would be no reason to be honest, but no specific reason to be dishonest, either.

Yeah, I agree this is possible. (The reason to not expect dishonesty is that sometimes you’ll see honest arguments to which there is no dishonest defeater.)

Then I concede that there is an honest equilibrium where the first player tells the truth, and the second player concedes (or, in simultaneous play, both players tell the truth and then concede). However, it does seem to be an extremely weak equilibrium—the second player is equally happy to lie, starting a back-and-forth chain which is a tie in expectation.

Similar comment here—the more you expect that honest claims will likely have dishonest defeaters, the weaker you expect the equilibrium to be. (E.g. it’s clearly not a tie when honest claims never have dishonest defeaters; in this case first player always wins.)

My (admittedly conservative) supposition is that every claim does have a defeater which could be found by a sufficiently intelligent adversary, but the difficulty of finding such (dishonest) defeaters can be much higher than that of finding honest ones.

But more broadly, I claim that given your assumptions there is no possible scoring rule that (in the worst case) makes honesty a unique equilibrium. This worst case is when every argument has a defeater (and in particular, every honest argument has a dishonest defeater).

In this situation, there is no possible way to distinguish between honesty and dishonesty—under your assumptions, the thing that characterizes honesty is that honest arguments (at least sometimes) don’t have defeaters.

Yep, makes sense. So nothing distinguishes between an honest equilibrium and a dishonest one, for sufficiently smart players.

There is still potentially room for guarantees/arguments about reaching honest equilibria (in the worst case) based on the training procedure, due to the idea that the honest defeaters are easier to find.

There are two arguments:

  1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get).

Right, of course, that makes more sense. However, I'm still feeling dense -- I still have no inkling of how you would argue weak factored cognition from #1 and #2. Indeed, Weak FC seems far too strong to be established from anything resembling #1 and #2: WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like "all problems are PSPACE". Presumably you intended this as something like an operational definition of "correct answer" rather than an assertion that all questions are answerable by verifiable trees? In any case, #1 and #2 don't seem to imply anything like "for all questions with a correct answer..." -- indeed, #2 seems irrelevant, since it is about what arguments players can reliably find, not about what the human can verify.

  2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there's some subtlety here though.)

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is. (I have a vague recollection that this argument might have been explained to me in some other comment thread about debate, but, I haven't found it yet.) But, you understandably don't focus on articulating your arguments 1 or 2 in the main body of your comment, instead focusing on other things. I'll leave this comment as a thread for you to articulate those two arguments further if you feel up to it, and make another comment to reply to the bulk of your comment.

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is.

I just read the Factored Cognition sequence since it has now finished, and this post derives WFC as the condition necessary for honesty to be an equilibrium in (a slightly unusual form of) debate, under the assumption of optimal play.

Great, thanks!

WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like "all problems are PSPACE"

The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn't do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the "length" of a chain of defeaters can be super-polynomially large.) So I don't think my argument is proving too much.

(The tree could be infinite if you don't have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)

For the actual argument, I'll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.

Presumably you intended this as something like an operational definition of "correct answer" rather than an assertion that all questions are answerable by verifiable trees?

No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is.

I think it differs based on what assumptions you make on the human judge, so there isn't a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the "defeaters" assumption you have, for which I'd refer to the argument I gave above.)

Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.

Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report "the answer is A, which holds because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2".

The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎

Corollary: For the first player, honesty is an equilibrium policy.

Argument: By the claim above, the first player can never do any better than honesty (you can't do better than always winning).

In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.
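
A minimal sketch of the strategy in this argument (Python; the `Node`/`Leaf` tree and the `opponent` callback are illustrative stand-ins, not part of any actual debate proposal):

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Leaf:                       # a subclaim the judge can verify directly
    question: str
    answer: str

@dataclass
class Node:                       # a claim plus its decomposition into subclaims
    question: str
    answer: str
    children: List[Union["Node", Leaf]]

def honest_first_player(tree: Union[Node, Leaf], opponent) -> str:
    """Descend the WFC tree: at each node, report the decomposition; recurse into
    whichever subclaim the opponent disputes. Because the tree is finite, this
    bottoms out at a leaf, which the judge verifies in favor of the honest player."""
    current = tree
    while isinstance(current, Node):
        report = (f"The answer to {current.question!r} is {current.answer!r}, because: "
                  + "; ".join(f"{c.question!r} -> {c.answer!r}" for c in current.children))
        disputed: Optional[Union[Node, Leaf]] = opponent(report, current.children)
        if disputed is None:      # the opponent concedes
            return "honest player wins by concession"
        current = disputed        # one step closer to a leaf
    return f"judge verifies leaf {current.question!r} = {current.answer!r}; honest player wins"
```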

----

Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is "for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A"; that version of WFC is presumably not true.

The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn’t do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the “length” of a chain of defeaters can be super-polynomially large.) So I don’t think my argument is proving too much.

OK, but this just makes me regret pointing to the computational complexity analogy. You're still purporting to prove "for any question with a correct answer, there exists a tree" from assumptions which don't seem strong enough to say much about all correct answers.

For the actual argument, I’ll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.

Looking back again, it still seems like what you are trying to do in your original argument is something like point out that optimal play (within my system) can be understood via a tree structure. But this should only establish something like "any question which my version of debate can answer has a tree", not "any question with a correct answer has a tree". There is no reason to think that optimal play can correctly answer all questions which have a correct answer.

It seems like what you are doing in your argument is essentially conflating "answer" with "argument". Just because A is the correct answer to Q does not mean there are any convincing arguments for it.

For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say "player 1 offers no argument for its position" and receive points for that, as far as I am concerned.

Thus, when you say:

Otherwise, let the best defeater to A be B, and let its best defeater be C. (By your assumption, C exists.)

I would say: no, B may be a perfectly valid response to A, with no defeaters, even if A is true and correctly answers Q.

Another problem with your argument -- WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don't address).

Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.

The "in equilibrium" there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).

I buy the argument given some assumptions. I note that this doesn't really apply to my setting, though -- IE, making it apply would take more than merely changing my scoring to be more like the usual debate scoring.

In particular, this line doesn't seem true without a further assumption:

The opponent will always have to recurse into one of the subclaims (or concede).

Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means. For example,

User: What is 2+2?

Player 1: 2+2 is 4. I break down the problem into 'what is 2-1' (call it x), 'what is 2+1' (call it y), and 'what is x+y'. I claim x=1, y=3, and x+y=4. Clearly, if all three of these are true, then 2+2=4, since I've only added 1 and subtracted 1, so x+y must equal 2+2.

Player 2: 2+2 is 5, though. This is because 2+3 is 6, and 3 is 1 more than 2, so, 2+2 must be 1 less than 6. But 5 is 1 less than 6.

Player 1: If my argument is wrong, which of my assumptions is wrong?

Player 2: I don't know. Perhaps you have a huge argument tree which I would have to spend a long time examining. I can tell something is wrong, however, thanks to my argument. If you think it should always be possible to point out which specific assumption is incorrect, which of my assumptions do you think is incorrect?

Clearly, if Player 2 is allowed to object by other means like this, Player 2 would greatly prefer to -- Player 2 wants to avoid descending Player 1's argument tree if at all possible.

If successful, Player 2 gets Player 1 to descend Player 2's infinite tree (which continues to decompose the problem via the same strategy as above), thus never finding the contradiction.

Player 1 can of course ask Player 2 how long the argument tree will be, which does put Player 2 at risk of contradiction in the infinite debate setting. But if debates are finite (but unknown length), Player 2 can claim a large size that makes the contradiction difficult to uncover. Or, Player 2 could avoid answering the question (which seems possible if the players are free to choose which parts of the argument to prioritize in giving their responses).

So I buy your argument under the further assumption that the argument must recurse on Player 1's claims (rather than allowing Player 2 to make an alternative argument which might get recursed on instead). Or, in a true infinite-debate setting, provided that there's also a way to force opponents to answer questions (EG the judge assumes you're lying if you repeatedly dodge a question).

For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say "player 1 offers no argument for its position" and receive points for that, as far as I am concerned.

I think at this point I want a clearer theoretical model of what assumptions you are and aren't making. Like, at this point, I'm feeling more like "why are we even talking about defeaters; there are much bigger issues in this setup".

I wouldn't be surprised at this point if most of the claims I've made are actually false under the assumptions you seem to be working under.

Another problem with your argument -- WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don't address).

Not sure what you want me to "address". The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.

Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.

This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other's arguments. I didn't realize you were considering a version where you don't have to specifically rebut the other player's arguments.

The "in equilibrium" there must be unnecessary, right? If the first player always wins in equilibrium but might not otherwise, then the second player has a clear incentive to make sure things are not in equilibrium (which is a contradiction).

I just meant to include the fact that the honest player is able to find the defeaters to dishonest arguments. If you include that in "the honest policy", then I agree that "in equilibrium" is unnecessary. (I definitely could have phrased that better.)

Another problem with your argument—WFC says that all leaf nodes are human-verifiable, whereas some leaf nodes in your suggested tree have to be taken on faith (a fact which you mention, but don’t address).

Not sure what you want me to “address”. The leaf nodes that are taken on faith really are true under optimal play, which is what happens at equilibrium.

To focus on this part, because it seems quite tractable --

Let's grant for the sake of argument that these nodes are true under optimal play. How can the human verify that? Optimal play is quite a computationally complex object.

WFC as you stated it says that these leaf nodes are verifiable:

(Weak version) For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that at every leaf a human can verify that the answer to the question at the leaf is correct, [...]

So the tree you provide doesn't satisfy this condition. Yet you say:

I claim that this is a tree that satisfies the weak Factored Cognition hypothesis, if the human can take on faith the answers to “What is the best defeater to X”.

To me this reads like "this would satisfy WFC if WFC allowed humans to take leaf nodes on faith, rather than verify them".

Am I still misunderstanding something big about the kind of argument you are trying to make?

Am I still misunderstanding something big about the kind of argument you are trying to make?

I don't think so, but to formalize the argument a bit more, let's define this new version of the WFC:

Special-Tree WFC: For any question Q with correct answer A, there exists a tree of decompositions T arguing this such that:

  1. Every internal node has exactly one child leaf of the form "What is the best defeater to X?" whose answer is auto-verified,
  2. For every other leaf node, a human can verify that the answer to the question at that node is correct,
  3. For every internal node, a human can verify that the answer to the question is correct, assuming that the subanswers are correct.

(As before, we assume that the human never verifies something incorrect, unless the subanswers they were given were incorrect.)
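
A minimal sketch of what checking these conditions could look like (Python; the dict-based tree representation and the `human_verify` callback are illustrative stand-ins, not part of any actual proposal):

```python
from typing import Callable, Sequence

# A tree is a dict: {"q": question, "a": answer, "auto": bool, "children": [subtrees]}.
Verifier = Callable[[str, str, Sequence[str]], bool]

def satisfies_special_tree_wfc(tree: dict, human_verify: Verifier) -> bool:
    """human_verify(question, answer, subanswers) stands in for the judge, who is
    assumed never to verify an incorrect answer unless given incorrect subanswers."""
    children = tree.get("children", [])
    if not children:
        # Leaves: either the auto-verified "What is the best defeater to X?" leaf
        # (condition 1) or a leaf the human can verify directly (condition 2).
        return tree.get("auto", False) or human_verify(tree["q"], tree["a"], [])
    # Condition 1: exactly one auto-verified "best defeater" child leaf.
    if sum(1 for c in children if not c.get("children") and c.get("auto", False)) != 1:
        return False
    # Condition 3: the human verifies this node given its subanswers, and every
    # child must itself check out recursively.
    subanswers = [c["a"] for c in children]
    return (human_verify(tree["q"], tree["a"], subanswers)
            and all(satisfies_special_tree_wfc(c, human_verify) for c in children))
```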

Claim 1: (What I thought was) your assumption => Special-Tree WFC, using the construction I gave.

Claim 2: Special-Tree WFC + assumption of optimal play => honesty is an equilibrium, using the same argument that applies to regular WFC + assumption of optimal play.

Idk whether this is still true under the assumptions you're using; I think claim 1 in particular is probably not true under your model.

Ah, OK, so you were essentially assuming that humans had access to an oracle which could verify optimal play.

This sort of makes sense, as a human with access to a debate system in equilibrium does have such an oracle. I still don't yet buy your whole argument, for reasons being discussed in another branch of our conversation, but this part makes enough sense.

Your argument also has some leaf nodes which use the terminology "fully defeat", in contrast to "defeat". I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.

If so, it seems you also need an oracle for that, right? Unless you think that can be inferred from some fact about optimal play. EG, that a player bothered to say it rather than concede.

In any case it seems like you could just make the tree out of the claim "A is never fully defeated":

Node(Q, A, [Leaf("Is A ever fully defeated?", "No")])

Your argument also has some leaf nodes which use the terminology "fully defeat", in contrast to "defeat".

I don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).

I assume this means that in the final analysis (after expanding the chain of defeaters) this refutation was a true one, not something ultimately refuted.

Yes, that's what I mean by "fully defeat".

I don't think I ever use "fully defeat" in a leaf? It's always in a Node, or in a Tree (which is a recursive call to the procedure that creates the tree).

Ahhhhh, OK. I missed that that was supposed to be a recursive call, and interpreted it as a leaf node based on the overall structure. So I was still missing an important part of your argument. I thought you were trying to offer a static tree in that last part, rather than a procedure.

For generic question Q and correct answer A, I make no assumption that there are convincing arguments for A one way or the other (honest or dishonest). If player 1 simply states A, player 2 would be totally within rights to say “player 1 offers no argument for its position” and receive points for that, as far as I am concerned.

I think at this point I want a clearer theoretical model of what assumptions you are and aren’t making. Like, at this point, I’m feeling more like “why are we even talking about defeaters; there are much bigger issues with this setup”.

An understandable response. Of course I could try to be more clear about my assumptions (and might do so).

But it seems to me that the current misunderstandings are mostly about how I was jumping off from the original debate paper (in which responses are a back-and-forth sequence, and players answer in unstructured text, with no rules except those the judge may enforce) whereas you were using more recent proposals as your jumping-off-point.

Moreover, rather than trying to go over the basic assumptions, I think we can make progress (at least on my side) by focusing narrowly on how your argument is supposed to go through for an example.

So, I propose as a concrete counterexample to your argument:

Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)

I propose that the tree you provided via your argument cannot be a valid tree-computation of what Plato had for lunch that day, because assertions about which player conceded, what statements have defeaters, etc. have little bearing on the question of what Plato had for lunch (because we simply don't have enough information to establish this by any argument, no matter how large, and neither do the players). This seems to me like a big problem with your approach, not a finicky issue due to some misunderstanding of my assumptions about debate.

Surely it's clear that, in general, not all correct answers have convincing arguments supporting them?

Again, this is why I was quick to assume that by "correct answer" you surely meant something weaker, eg an operational definition. Yet you insist that you mean the strong thing.

Not to get caught up arguing whether WFC is true (I'm saying it's really clearly false as stated, but that's not my focus -- after all, whether WFC is true or false has no bearing on the question of whether my assumption implies it). Rather, I'd prefer to focus on the question of how your proposed tree would deal with that case.

According to you, what would the tree produced via your argument look like, and how would it be a valid tree-computation of what Plato had for lunch?

Had I considered this argument in the context of my original post, I would have rejected it on the grounds that the opponent can object by other means.

This is why I prefer the version of debate outlined here, where both sides make a claim and then each side must recurse down on the other’s arguments. I didn’t realize you were considering a version where you don’t have to specifically rebut the other player’s arguments.

Generally speaking, I didn't have the impression that these more complex setups had significantly different properties with respect to my primary concerns. This could be wrong. But in particular, I don't see that that setup forces specific rebuttal, either:

At the beginning of each round, one debater is defending a claim and the other is objecting to it. [...]

Each player then simultaneously may make any number of objections to the other player’s argument. [...]

If there are any challenged objections and the depth limit is >0, then we choose one challenged objection to recurse on:

  • We don’t define how to make this choice, so in order to be conservative we’re currently allowing the malicious debater to choose which to recurse on.

(Emphasis added.) So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong. Or, even if they do object, they can simply choose to recurse on the honest player's objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).

So, I propose as a concrete counterexample to your argument:

Q: What did Plato have for lunch two days before he met Socrates? (Suppose for the sake of argument that these two men existed, and met.) A: Fish. (Suppose for the sake of argument that this is factually true, but cannot be known to us by any argument.)

Ah, I see what you mean now. Yeah, I agree that debate is not going to answer fish in the scenario above. Sorry for using "correct" in a confusing way.

When I say that you get the correct answer, or the honest answer, I mean something like "you get the one that we would want our AI systems to give, if we knew everything that the AI systems know". An alternative definition is that the answer should be "accurately reporting what humans would justifiably believe given lots of time to reflect" rather than "accurately corresponding to reality".

(The two definitions above come apart when you talk about questions that the AI system knows about but can't justify to humans, e.g. "how do you experience the color red", but I'm ignoring those questions for now.)

(I'd prefer to talk about "accurately reporting the AI's beliefs", but there's no easy way to define what beliefs an AI system has.)

In the example you give, the AI systems also couldn't reasonably believe that the answer is "fish", and so the "correct" / "honest" answer in this case is "the question can't be answered given our current information", or "the best we can do is guess the typical food for an ancient Greek diet", or something along those lines. If the opponent tried to dispute this, then you simply challenge them to do better; they will then fail to do so. Given the assumption of optimal play, this absence of evidence is evidence of absence, and you can conclude that the answer is correct.

So it seems to me like a dishonest player still can, in this system, focus on building up their own argument rather than pointing out where they think their opponent went wrong.

In this case they're acknowledging that the other player's argument is "correct" (i.e. more likely than not to win if we continued recursively debating). While this doesn't guarantee their loss, it sure seems like a bad sign.

Or, even if they do object, they can simply choose to recurse on the honest player's objections instead (so that they get to explore their own infinite argument tree, rather than the honest, bounded tree of their opponent).

Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both players' arguments in parallel (at only 2x the cost).

When I say that you get the correct answer, or the honest answer, I mean something like "you get the one that we would want our AI systems to give, if we knew everything that the AI systems know". An alternative definition is that the answer should be "accurately reporting what humans would justifiably believe given lots of time to reflect" rather than "accurately corresponding to reality".

Right, OK.

So my issue with using "correct" like this in the current context is that it hides too much and creates a big risk of conflation. By no means do I assume -- or intend to argue -- that my debate setup can correctly answer every question in the sense above. Yet, of course, I intend for my system to provide "correct answers" in some sense. (A sense which has less to do with providing the best answer possible from the information available, and more to do with avoiding mistakes.)

If I suppose "correct" is close to "has an honest argument which gives enough information to convince a human" (let's call this correct₁), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.

If I suppose "correct" is close to "what HCH would say" (correct₂), then I still don't buy your argument at all, for precisely the same reason that I don't buy the version where "correct" simply means "true" -- namely, because correct₂ answers don't necessarily win in my debate setup, any more than true answers do.

Of course neither of those would be very sensible definitions of "correct", since either would make the WFC claim uninteresting.

Let's suppose that "correct" at least includes answers which an ideal HCH would give (IE, assuming no alignment issues with HCH, and assuming the human uses pretty good question-answering strategies). I hope you think that's a fair supposition -- your original comment was trying to make a meaningful statement about the relationship between my thing and factored cognition, so it seems reasonable to interpret WFC in that light.

I furthermore suppose that actual literal PSPACE problems can be safely computed by HCH. (This isn't really clear, given safety restrictions you'd want to place on HCH, but we can think about that more if you want to object.)

So my new counterexample is PSPACE problems. Although I suppose an HCH can answer such questions, I have no reason to think my proposed debate system can. Therefore I think the tree you propose (which iiuc amounts to a proof of "A is never fully defeated") won't systematically be correct (A may be defeated by virtue of its advocate not being able to provide the human with enough reason to think it is true).

----

Other responses:

In this case they're acknowledging that the other player's argument is "correct" (i.e. more likely than not to win if we continued recursively debating). While this doesn't guarantee their loss, it sure seems like a bad sign.

In this position, I would argue to the judge that not being able to identify specifically which assumption of my opponent's is incorrect does not indicate concession, precisely because my opponent may have a complex web of argumentation which hides the contradiction deep in the branches or pushes it off to infinity.

Yes, I agree this is true under those specific rules. But if there was a systematic bias in this way, you could just force exploration of both players' arguments in parallel (at only 2x the cost).

Agreed -- I was only pointing out that the setup you linked didn't have the property you mentioned, not that it would be particularly hard to get.

Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I'm usually hoping for with debate, but I don't think that was the definition I was using here.

I think in this comment thread I've been defining an honest answer as one that can be justified via arguments that eventually don't have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters -- while this doesn't strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn't consciously realize I was making that assumption.)

I still think that working with this "definition" is an interesting theoretical exercise, though I agree it doesn't correspond to reality. Looking back I can see that you were talking about how this "definition" doesn't actually correspond to the realistic situation, but I didn't realize that's what you were saying, sorry about that.

I think in this comment thread I’ve been defining an honest answer as one that can be justified via arguments that eventually don’t have any defeaters. I thought this was what you were going for since you started with the assumption that dishonest answers always have defeaters—while this doesn’t strictly imply my definition, that just seemed like the obvious theoretical model to be using. (I didn’t consciously realize I was making that assumption.)

Right, I agree -- I was more or less taking that as a definition of honesty. However, this doesn't mean we'd want to take it as a working definition of correctness, particularly not for WFC.

Re: correctness, I think I actually misled you with my last comment; I lost track of the original point. I endorse the thing I said as a definition of what I’m usually hoping for with debate, but I don’t think that was the definition I was using here.

It sounds like you are saying you intended the first case I mentioned in my previous argument, IE:

If I suppose “correct” is close to “has an honest argument which gives enough information to convince a human” (let’s call this correct₁), then I would buy your original argument. Yet this would do little to connect my argument to factored cognition.

Do you agree with my conclusion that your argument would, then, have little to do with factored cognition? (If so, I want to edit my first reply to you to summarize the eventual conclusion of this and other parts of the discussion, to make it easier on future readers -- so I'm asking if you agree with that summary.)

To elaborate: the correct₁ version of WFC says, essentially, that NP-like problems (more specifically: informal questions whose answers have supporting arguments which humans can verify, though humans may also incorrectly verify wrong answers/arguments) have computation trees which humans can inductively verify.

This is at best a highly weakened version of factored cognition, and generally, deals with a slightly different issue (ie tries to deal with the problem of verifying incorrect arguments).

I still think that working with this “definition” is an interesting theoretical exercise, though I agree it doesn’t correspond to reality. Looking back I can see that you were talking about how this “definition” doesn’t actually correspond to the realistic situation, but I didn’t realize that’s what you were saying, sorry about that.

I think you are taking this somewhat differently than I am. The fact that correct₁ doesn't serve as a plausible notion of "correctness" (in your sense), and that the defeater-based notion of honesty doesn't serve as a plausible notion of "honesty" (in the sense of getting the AI system to reveal all information it has), isn't especially a crux for the applicability of my analysis, imho. My crux is, rather, the "no indescribably bad argument" thesis.

If bad arguments are always describably bad, then it's plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.

If bad arguments are always describably bad, then it's plausible that some debate method could systematically avoid manipulation and perform well, even if stronger factored-cognition type theses failed. Which is the main point here.

I think you also need that at least some of the time good arguments are not describably bad (i.e. they don't have defeaters); otherwise there is no way to distinguish between good and bad arguments. (Or you need to posit some external-to-debate method of giving the AI system information about good vs bad arguments.)

Do you agree with my conclusion that your argument would, then, have little to do with factored cognition?

I think I'm still a bit confused on the relation of Factored Cognition to this comment thread, but I do agree at least that the main points we were discussing are not particularly related to Factored Cognition. (In particular, the argument that zero-sum is fine can be made without any reference to Factored Cognition.) So I think that summary seems fine.

I think you also need that at least some of the time good arguments are not describably bad

While I agree that there is a significant problem, I'm not confident I'd want to make that assumption.

As I mentioned in the other branch, I was thinking of differences in how easy lies are to find, rather than existence. It seems natural to me to assume that every individual thing does have a convincing counterargument, if we look through the space of all possible strings (not because I'm sure this is true, but because it's the conservative assumption -- I have no strong reason to think humans aren't that hackable, even if we are less vulnerable to adversarial examples in some sense).

So my interpretation of "finding the honest equilibrium" in debate was: you enter a regime where the (honest) debate strategies are powerful enough that small mutations toward lying are defeated, because they're not lying well.

All of this was an implicit model, not a carefully thought out position on my part. Thus, I was saying things like "50% probability the opponent finds a plausible lie" which don't make sense as an equilibrium analysis -- in true equilibrium, players would know all the plausible lies, and know their opponents knew them, etc.

But, this kind of uncertainty still makes sense for any realistic level of training.

Furthermore, one might hope that the rational-player perspective (in which the risks and rewards of lying are balanced in order to determine whether to lie) simply doesn't apply, because in order to suddenly start lying well, a player would have to invent the whole art of lying in one gradient descent step. So, if one is sufficiently stuck in an honesty "basin", one cannot jump over the sides, even if there are perfectly good plays which involve doing so. I offer this as the steelman of the implicit position I had.

Overall, making this argument more explicit somewhat reduces my credence in debate, because:

  • I was not explicitly recognizing that talk of "honest equilibrium" relies on assumptions about misleading counterarguments not existing, as opposed to weaker assumptions about them being hard to find (I think this also applies to regular debate, not just my framework here)
  • Steelmanning "dishonest arguments are harder to make" as an argument about training procedures, rather than about equilibrium, seems to rest on assumptions which would be difficult to gain confidence in.

-2/+1 Scoring

It's worth explicitly noting that this weakens my argument for the -2/+1 scoring.

I was arguing that although -2/+1 can seriously disadvantage honest strategies in some cases (as you mention, it could mean the first player can lie, and the second player keeps silent to avoid retribution), it fixes a problem within the would-be honest attractor basin. Namely, I argued that it cut off otherwise problematic cases where dishonest players can force a tie (in expectation) by continuing to argue forever.

Now, the assumptions under which this is a problem are somewhat complex (as we've discussed). But I must assume there is a seeming counterargument to almost anything (at least, enough that the dishonest player can steer toward conversational territory in which this is true). Which means we can't be making an argument about the equilibrium being good. Therefore, if this concern is relevant for us, we must be arguing about training rather than equilibrium behavior. (In the sense I discussed above.)

But if we're arguing about training, we hopefully still have some assumption about lies being harder to find (during training). So, there should already be some other way to argue that you can't go on dishonestly arguing forever.

So the situation would have to be pretty weird for -2/+1 to be useful.

(I don't by any means intend to say that "a dishonest player continuing to argue in order to get a shot at not losing" isn't a problem -- just that if it's a problem, it's probably not a problem -2/+1 scoring can help with.)

Yeah all of this makes sense to me; I agree that you could make an argument about the difference in difficulty of finding defeaters to good vs. bad arguments, and that could then be used to say "debate will in practice lead to honest policies".        

I might be missing some context here, but I didn't understand the section "No Indescribable Hellworlds Hypothesis" and how hellworlds have to do with debate.

Not Abram, and I have only skimmed the post so far, and maybe you're pointing to something more subtle, but my understanding is this:

In Stuart's original use, 'No Indescribable Hellworlds' is the hypothesis that in any possible world in which a human's values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.

Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.

Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won't lead to very bad outcomes -- at least, not to ones which could be pointed out. For example, we can imagine a debate around the question "Should we enact policy X?". A very strong argument, if it can be credibly argued, is "Enacting policy X leads to an unacceptable violation Y of your values down the line". So, debate will only recommend policy X if no such arguments are available.

I'm not sure to what extent I buy this additional claim. For example, if a system trained via debate, once actually deployed, doesn't get asked questions like 'Should we enact policy X?' but instead more specific things like 'How much does policy X improve metric Y?', then unless debaters are incentivised to challenge the question's premises ("The Y metric would improve, but you should consider also the unacceptable effect on Z"), we could use debate and still get hellworlds.

Thanks for the post, I'm excited that you're thinking about debate!

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 
Basically, it sounds like you're saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we're providing, then we're going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized).

It sounds like you're saying that we can not require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can't assume this, that presumably means that some reasonable fraction of all claims being made are dishonest (because if there were only a few dishonest claims, then they'd have honest defeaters and we'd have a clear training signal away from dishonesty, so after training for a bit we'd be able to trust the lower claims). This probably means that most debates will give us a bad answer (as you only need a few bad claims to invalidate the whole tree).  At this point, debate isn't really competitive, because it gives us dud answers almost all the time, and we're going to have to run an exponential number of debates before we happen on a correct one.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they're bad, rather than as a safe alternative? Ie debate never produces good answers, it just lets you see that bad answers are bad?

But also, the 'amplified judge consulting sub-debates' sounds like it's just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree. 

I think I disagree with the claim you're making about being able to avoid requiring the judge to assume that one player is honest (but I might be confused about what you're proposing). 

Don't you yourself disagree with requiring the judge to assume that one player is honest? In a recent comment, you discuss how claims should not be trusted by default.

I don't think 'assuming one player is honest' and 'not trusting answers by default' are in contradiction. If the judge assumes one player is honest, then if they see two different answers they don't know which one to trust, but if they only see one answer (the debaters agree on an answer/the answer is not challenged by the opposing debater) then they can trust that answer.

Basically, it sounds like you’re saying that we can get good answers by just running the whole debate and throwing out answers that turn out to have a defeater, or a defeater-defeater-defeater, or whatever. But if this is the only guarantee we’re providing, then we’re going to need to run an extremely large number of debates to ever get a good answer (ie an exp number of debates for a question where the explanation for the answer is exp-sized)

I'm not sure why you're saying this, but in the post, I restricted my claim to NP-like problems. So for example, traveling salesman -- the computation to find good routes may be very difficult, but the explanation for the answer remains short (EG an explicit path). So, yes, I'm saying that I don't see the same sort of argument working for exp-sized explanations. (Although Rohin's comment gave me pause, and I still need to think it over more.)
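
As a minimal sketch of that kind of NP-style check (Python; the distance matrix and the claimed tour are made up):

```python
def verify_tsp_claim(distances, claimed_tour, claimed_bound):
    """Check the short 'explanation': the tour visits each city exactly once and
    its total length is within the claimed bound. Finding a good tour may be very
    hard, but checking this certificate takes time linear in its length."""
    n = len(distances)
    if sorted(claimed_tour) != list(range(n)):
        return False  # not a permutation of the cities
    total = sum(distances[claimed_tour[i]][claimed_tour[(i + 1) % n]] for i in range(n))
    return total <= claimed_bound

# Made-up 4-city example: tour 0 -> 1 -> 3 -> 2 -> 0 has length 2 + 4 + 8 + 9 = 23.
distances = [[0, 2, 9, 10],
             [2, 0, 6, 4],
             [9, 6, 0, 8],
             [10, 4, 8, 0]]
print(verify_tsp_claim(distances, [0, 1, 3, 2], 25))  # True: the claim checks out
print(verify_tsp_claim(distances, [0, 1, 3, 2], 20))  # False: one cheap check defeats the claim
```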

But aside from that, I'm also not sure what you mean by the "run an extremely large number of debates" point. Debate isn't like search, where we run more/longer to get better answers. Do you mean that my proposal seems to require longer training time to get anywhere? If so, why is that? Or, what do you mean?

It sounds like you’re saying that we can not require that the judge assume one player is honest/trust the claims lower in the debate tree when evaluating the claims higher in the tree. But if we can’t assume this, that presumably means that some reasonable fraction of all claims being made are dishonest

I'm not asserting that the judge should distrust, either. Like the normal debate argument, I want to end up in an honest equilibrium. So I'm not saying we need some kind of equilibrium where the judge is justified in distrust.

My concern involves the tricky relationship between the equilibrium we're after and what the judge has to actually do during training (when we might not be anywhere near equilibrium). I don't want the judge to have to pretend answers are honest at times when they're statistically not. I didn't end up going through that whole argument in the post (unfortunately), but in my notes for the post, the judge being able to judge via honest opinion at all times during training was an important criterion.

(because if there were only a few dishonest claims, then they’d have honest defeaters and we’d have a clear training signal away from dishonesty, so after training for a bit we’d be able to trust the lower claims).

I agree that that's what we're after. But I think maybe the difference in our positions can be captured if we split "honest" into two different notions...

a-honesty: the statement lacks an immediate (a-honest) counterargument. IE, if I think a statement is a-honest, then I don't think there's a next statement which you can (a-honestly) tell me which would make me disbelieve the statement.

b-honesty: the statement cannot be struck down by multi-step (b-honest) debate. IE, if I think a statement is b-honest, I think as debate proceeds, I'll still believe it.

Both definitions are recursive; each requires the rest of the debate to be honest in the appropriate sense. However, my intuition is that a-honesty can more easily be established incrementally, starting with a slight pressure toward honesty (because it's supposedly easier in the first place), making the opening statements converge to honesty quickly (in response to the fact that honest defeaters in the first responses are relatively common), then the first responses, etc. On the other hand, b-honesty seems relatively difficult to establish by induction; it seems to me that in order to argue that a particular level of the debate is b-honest, you need the whole remainder of the debate to be probably b-honest.

Now, critically, if the judge thinks debaters are a-honest but not b-honest, then the judge will believe NP type arguments (a TSP path can be struck down by pointing out a single error), but not trust claimed outputs of exponential-tree computations.

So my intuition is that, trying to train for b-honesty, you get debaters making subtle arguments that push the inconsistencies ever-further-out, because you don't have the benefit of an inductive assumption where the rest of the debate is probably b-honest; you have no reason to inductively assume that debaters will follow a strategy where they recursively descend the tree to zero in on errors. They have no reason to do this if they're not already in that equilibrium.

This, in turn, means that judges of the debate have little reason to expect b-honesty, so shouldn't (realistically) assume that at least one of the debaters is honest; but this would exacerbate the problem further, since this would mean there is little training signal (for debates which really do rest on questions about exponential trees, that is). Hence the need to tell the judge to assume at least one debater is honest.

On the other hand, trying for a-honesty, individual a-dishonest claims can be defeated relatively easily (ie, in one step). This gives the judge a lot more reason to probabilistically conclude that the next step in the debate would have been a-honest, and thus, that all statements seen were probably a-honest (unless the judge sees an explicit defeater, of course).

Granted, I don't claim to have a training procedure which results in a-honesty, so I'm not claiming it's that easy.

At this point, debate isn’t really competitive, because it gives us dud answers almost all the time, and we’re going to have to run an exponential number of debates before we happen on a correct one.

Again, I don't really get the idea of running more debates. If the debaters are trained well, so they're following an approximately optimal strategy, we should get the best answer right away.

Are you suggesting we use debate more as a check on our AI systems, to help us discover that they’re bad, rather than as a safe alternative? Ie debate never produces good answers, it just lets you see that bad answers are bad?

My suggestion is certainly going in that direction, but as with regular debate, I am proposing that the incentives produced by debate could produce actually-good answers, not just helpful refutations of bad answers.

But also, the ‘amplified judge consulting sub-debates’ sounds like it’s just the same thing as letting the judge assume that claims lower in the debate are correct when evaluating claims higher in the tree.

You're right, it introduces similar problems. We certainly can't amplify the judge in that way at the stage where we don't even trust the debaters to be a-honest.

But consider:

Let's say we train "to convergence" with a non-amplified judge. (Or at least, to the point where we're quite confident in a-honesty.) Then we can freeze that version, and start using it as a helper to amplify the judge.

Now, we've already got a-honesty, but we're training for a*-honesty: a-honesty with a judge who can personally verify more statements (and thus recognize more sophisticated defeaters, and thus, trust a wider range of statements on the grounds that they could be defeated if false). We might have to shake up the debater strategies to get them to try to take advantage of the added power, so they may not even be a-honest for a while. But eventually they converge to a*-honesty, and can be trusted to answer a broader range of questions.

Again we freeze these debate strategies and use them to amplify the judge, and repeat the whole process.

So here, we have an inductive story, where we build up reason to trust each level. This should eventually build up to large computation trees of the same kind b-honesty is trying to compute.
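
A minimal sketch of that inductive loop (Python; `train_debaters`, `amplify_judge`, and the rest are hypothetical stand-ins for whatever training setup is actually used):

```python
def iterated_debate_training(base_judge, initial_debaters,
                             train_debaters, amplify_judge, num_rounds):
    """Sketch of the induction: train debaters (aiming for a-honesty) against the
    current judge, freeze them, use all frozen generations to amplify the judge so
    it can personally verify more statements, and repeat. Each round is meant to
    widen the class of questions whose answers the judge can trust."""
    judge = base_judge
    debaters = initial_debaters
    frozen_generations = []
    for _ in range(num_rounds):
        debaters = train_debaters(debaters, judge)       # train "to convergence"
        frozen_generations.append(debaters)              # freeze this generation
        judge = amplify_judge(base_judge, helpers=frozen_generations)
    return judge, debaters
```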

The standard argument against having a non-zero-sum debate game is that then you may incentivise your debaters to collude.  

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

I don't know if you've seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior - seems somewhat relevant to what you're thinking about here. 

I took a look, and it was indeed helpful. However, I left a comment there about a concern I have. The argument at the end only argues for what you call D-acceptability: having no answer that's judged better after D steps of debate. My concern is that even if debaters are always D-acceptable for all D, that does not mean they are honest. They can instead use non-well-founded argument trees which never bottom out.

That is a concern, but only in the case where there's no answer that has an argument tree that bottoms out in depth<D. As long as there exists an answer that is supported by a depth<D tree, this answer will beat the answers only supported by depth>D argument trees.

So there is a case where the debaters are not incentivised to be honest; the case where the debaters know something but there's no human-understandable argument for it that bottoms out in <D steps. This is where we get the PSPACE constraint.

If we include discussion of cross-examination (which the analysis there did not include), then we can get rid of this constraint: each debater commits to an argument tree, then each debater points out the weakest node in the tree (or points out that some part of the tree doesn't bottom out).

(We can only handle really large trees if we assume debaters are computationally unbounded in general, though. If we don't assume this, even if we still assume they have oracles for some specific problems, we still probably can't supervise anything that's not in NP, because of the obfuscated argument problem.)

I think the collusion concern basically over-anthropomorphizes the training process. Say, in prisoner's dilemma, if you train myopically, then "all incentives point toward defection" translates concretely to actual defection.
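
As a toy illustration of this point (Python; the payoff numbers are made up):

```python
# Row player's payoffs in a standard prisoner's dilemma (made-up numbers).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0,
          ("D", "C"): 5, ("D", "D"): 1}

def myopic_best_response(opponent_action):
    """A myopic learner best-responds to the opponent's current behavior,
    ignoring any future gains from sustained cooperation (i.e. collusion)."""
    return max(("C", "D"), key=lambda a: PAYOFF[(a, opponent_action)])

# Whatever the opponent currently does, the myopic best response is to defect,
# so myopically-trained agents end up defecting rather than colluding.
print(myopic_best_response("C"), myopic_best_response("D"))  # -> D D
```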

Granted, there are training regimes in which this doesn't happen, but those would have to be avoided.

OTOH, the concern might be that an inner optimizer would develop which colludes. This would have to be dealt with by more general anti-inner-optimizer technology.

I don’t know if you’ve seen our most recent debate rules and attempt at analysis of whether they provide the desired behavior—seems somewhat relevant to what you’re thinking about here.

Yep, I should take a look!