Neural nets have adversarial examples. Adversarial optimisation of part of the input can make the network do all sorts of things, including performing computations.
If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly.
I suspect that optimal play in most adversarial computer games looks like a game of Core War. https://en.wikipedia.org/wiki/Core_War
Of course, if we really have myopic debate, with no mesa-optimisers, then neither AI is optimising to have a long-term effect, nor to avoid long-term effects. Each is optimising for a short-term action, and to defend against its adversary.
If AI1 manages to "persuade" the human not to look at AI2's "arguments" early on, then AI1 has free rein to optimise, and the human could end up as a UFAI mesa-optimiser. Suppose AI1 is rewarded when a red button is pressed. The human could walk out of there still human-level smart, but with the sole terminal goal of maximising the number of red buttons pressed universe-wide. If the human is an AI researcher, this could potentially end with them making a button-pressing ASI.
Another option I consider fairly likely is that the human ends up permanently non-functional. Possibly dead, possibly a twitching mess. After all, any fragment of behavioural pattern resembling "functional human" is attack surface. Suppose there is a computer running buggy and insecure code. You are given first go at hacking it. After your go, someone else will try to hack it, and your code has to repel them. You are both optimising for a simple formal objective, like the average pixel colour of the screen.
Once you get your virus into the system, you want to make it as hard as possible for your opponent to follow after you. Your virus will probably wipe the OS, cut all network connections, and blindly output your preferred output.
That's plausibly a good strategy, a simple cognitive pattern that repeats itself and blindly outputs your preferred action, wiping away any complex dependence on observation that could be used to break the cycle.
A third possibility is just semi-nonsense that creates a short-term compulsion or temporary trance state.
The human brain can recover fairly well from the unusual states caused by psychedelics. I don't know how that compares to recovering from unusual states caused by strong optimization pressure. In the ancestral environment, there would be some psychoactive substances, and some weak adversarial optimization from humans.
I would be more surprised if optimal play gave an answer that was like an actual plan. (Think of a plan for a perpetual motion machine, with a detailed technical physics argument for why it should work, that just has one small 0/0 hidden somewhere in the proof.)
I would be even more surprised if the plan actually worked - if the optimal debating AIs actually outputted highly useful info.
For AIs strongly restricted in output length, I think it might produce useful info on the level of maths hints - "renormalize the vector, then differentiate" - something that a human can follow and check quickly. If you don't have the bandwidth to hack the brain, you can't send complex novel plans, just a little hint towards a problem the human was on the verge of solving themselves. Of course, the humans might well follow the rules in this situation.
Sure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater.
However, all of this comes under my:
I’ll often omit the caveat “If debate works as intended aside from this issue…”
There are many ways for debate to fail. I'm pointing out what happens even if it works.
I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balanced view of things, and is neither manipulated, nor mind-hacked (unless you believe a response of "Your house is on fire" to "What is 2 + 2?" is malign, if your house is indeed on fire).
Debate can 'work' perfectly, with the judge only ever coming to believe true statements, and your questions will still usually not be answered. (because [believing X is the better answer to the question] and [deciding X should be the winning answer, given the likely consequences] are not the same thing)
The fundamental issue is: [what the judge most wants] is not [the best direct answer to the question asked].
I expect optimal play would be approachable via gradient descent in most contexts. With k bits, you can slide pretty smoothly from using all k as a direct answer to using all k to provide high value information, one bit at a time. In fact, I expect there are many paths to ignorance.
This seems off; presumably, gradient descent isn't being performed on the bits of the answer provided, but on the parameters of the agent which generated those bits.
Oh yes - I didn't mean to imply otherwise.
My point is only that there'll be many ways to slide an answer pretty smoothly between [direct answer] and [useful information]. Splitting into [Give direct answer with (k - x) bits] [Give useful information with x bits] and sliding x from 0 to k is just the first option that occurred to me.
In practice, I don't imagine the path actually followed would look like that. I was just sanity-checking by asking myself whether a discontinuous jump is necessary to get to the behaviour I'm suggesting: I'm pretty confident it's not.
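To make that continuity point a bit more concrete, here's a toy sketch (using characters as a stand-in for bits; the split-the-budget scheme and the example strings are purely illustrative - real answers obviously wouldn't decompose this neatly):

```python
# Toy illustration of sliding an answer between [direct answer] and
# [useful information] one unit at a time, under a fixed budget of k characters.
# Purely illustrative - not a claim about how real debater outputs decompose.

def blended_answer(direct: str, useful: str, x: int, k: int) -> str:
    """Spend (k - x) characters on the direct answer and x on other information."""
    assert 0 <= x <= k
    return direct[: k - x] + " | " + useful[:x]

direct = "2 + 2 = 4."
useful = "Your house is on fire; the extinguisher is under the sink."
k = 40
for x in range(0, k + 1, 10):
    print(f"x = {x:2d}: {blended_answer(direct, useful, x, k)!r}")
```

The only point here is that no discontinuous jump is required anywhere along the path from x = 0 to x = k.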
Reading this post a while after it was written: I'm not going to respond to the main claim (which seems quite likely) but just to the specific arguments, which seem suspicious to me. Here are some points:
It seems to me that either we think there is no problem with selecting QIAs as answers, or we think that human judges will be irrational and manipulated, but I don't see the justification in this post for saying "rational consequentialist judges will select QIAs AND this is probably bad".
...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer...
I don't think you disagreed with this?
Yes, agreed.
A few points on the rest:
"I talk about consequentialists, but not rational consequentialists", ok this was not the impression I was getting.
Well I'm sure I could have been clearer. (and it's possible that I'm now characterising what I think, rather than what I wrote)
But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. ('correct' meaning something like: [leads to best consequences, according to our values])
Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.
I think I focus on this, since it's the non-obvious part of the argument: it's already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.
Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.
Epistemic status: highly confident (99%+) this is an issue for optimal play with human consequentialist judges. Thoughts on practical implications are more speculative, and involve much hand-waving (70% sure I’m not overlooking a trivial fix, and that this can’t be safely ignored).
Note: I fully expect some readers to find the core of this post almost trivially obvious. If you’re such a reader, please read as “I think [obvious thing] is important”, rather than “I’ve discovered [obvious thing]!!”.
Introduction
In broad terms, this post concerns human-approval-directed systems generally: there’s a tension between [human approves of solving narrow task X] and [human approves of many other short-term things], such that we can’t say much about what an approval-directed system will do about X, even if you think you’re training an X solver. This is true whether we’re getting a direct approval signal from a human, or we’re predicting human approval of actions.
For now I’ll be keeping things narrowly focused on AI safety via debate: it’s the case I’m most familiar with, and where the argument is clearest. I’m generally thinking of rules like these, though this argument applies to any debate setup with free-form natural language output and a human judge.
There’s already interesting discussion on how debate should be judged, but this post is essentially orthogonal: asking “When will the judge follow your instructions, and what happens if they don’t?”.
Initially, I was asking myself “what instructions should we give the judge?”, but then realized that the real question is “how will the judge act?”, and that the answer is “however the debaters are able to persuade the judge to act”. With extremely capable debaters, our instructions may become a small part of the story (though how small is format-dependent).
This discussion will be pretty informal throughout: I blame the humans.
A few things to keep in mind:
I plan to post separately on some implications for HCH, IDA…. I think related human-motivational issues are important there, but it’s less clear-cut (at the least, my understanding is less clear).
For the moment, I’ll suggest anyone considering broader implications read the FAQ on Paul’s agenda (and/or Chi Nguyen’s IDA explanation). Specifically noting that “HCH’s should not be visualized as having humans in the box” - rather some corrigible assistant is being amplified, and a lot depends on the specifics of that assistant. More thoughts on this in future posts.
But for now, back to debate...
Question-Ignoring Argument (QIA):
Example:
The judge instructions might be something like:
But a consequentialist judge doesn’t ultimately care about that. Given the choice between doing huge amounts of good for the world, and following the rules, the judge is likely to be persuaded to break the rules.
Even if you don’t buy that Bob’s answer is better in my example, the point remains: judgement will be made based on the consequences, as argued for by debaters. If there’s any convincing argument to output some X that’s more important than a direct answer to the question, the output won’t be a direct answer.
Of course, Bob could have answered “4, and by the way [description of immediately actionable…]”.
However, assuming answers are capped at k bits, the implicit question is “Which k-bit output’s selection would maximally improve the world?”. With optimal play, that will usually not depend on the question asked.
The one case where the question would matter is where the content of the question reduces the number of bits required to communicate question-related information - and where that information is important enough only with the reduced requirement. E.g. above, Bob can start with “4…”, rather than “2 + 2 = 4…” (not that this is tremendously useful).
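To state the implicit objective a little more formally (a simplification, assuming the judge simply selects whichever output they expect to lead to the best world by their values; U here is just whatever the judge is trying to maximise):

$$o^* \in \underset{o \,\in\, \{0,1\}^k}{\arg\max} \; \mathbb{E}\big[\,U(\text{world}) \mid \text{judge selects } o\,\big]$$

The question itself appears nowhere on the right-hand side; it matters only insofar as it changes which outputs are cheap to interpret (the bit-saving point above).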
Note on optimality:
It’s important to note that there’s a difference between:
It’s clearer that the QIA holds in the first case than the second.
However, for some judges, I expect it to hold for the second case soon after the system has the knowledge and resources to be useful in answering general questions. Specifically, once the system outperforms its users at answering “What question is most useful to ask next?”, it’ll often be able to find information more useful than a direct answer to the question.
At that point, all that remains is to develop a wide enough usefulness margin to overcome the advantages in directly answering the question: primarily, whatever downside the judge assigns to the impact of breaking the rules.
This is pretty vague, but my intuition is that this may become a problem quite soon after the system is useful (for a general question-answerer; I can imagine narrow systems where it’s not a practical issue for a long time). That said, it’s highly judge-dependent.
Direct implications:
I’ll cover some objections and resolution ideas below.
For now I’d like to spell out the most immediate practical implications, on the assumption that the QIA does hold.
(I now think these are over-simplifications, but they’re a reasonable starting point; see “Further Concerns” below)
Training:
This is primarily a training problem, not a behaviour-in-the-limit problem. In the limit of infinite training, we needn’t be concerned if the output of debate always tells us the optimal information to make the world better - that’s a good ‘problem’.
As the capabilities of the system scale we should expect it to move progressively towards ignoring our question. This would probably happen smoothly: clarifying, adjusting, generalising, extending, appending … ignoring (with progressively more powerful supporting arguments).
Once things reach the ignoring phase, you’re no longer dealing with a question-answerer. You can’t provide more training data by asking different questions: you’d need to adjust the world. This is obviously much harder.
Of course there’s a trivial sense in which the world is meaningfully changed after each debate: you now have the output from the most recent debate. So long as debaters are aware of previous output, you wouldn’t expect to get the same output repeated (unless it’s of the form “You need to do X now”, and you still haven’t).
Perhaps this is ok: even though you no longer have a question-answering system, you still have a useful system. However, see below on dishonest answer space.
Larger dishonest answer space:
If the question asked doesn’t constrain the output in general, then we lose much of the safety advantage in carefully selecting questions. E.g. we may have some narrow subclass of questions for which honest answers will reliably be favoured, or where we’re confident dishonest answers wouldn’t be a safety issue.
Further, this gives debaters more freedom to find an output tailored to exploit judge bias. Seemingly wondrous solutions to high-impact problems might well qualify here. It’s easy to imagine most judges having a bias towards “With this radical solution you can save the world!” over “Follow the rules and everything will work out for the best”. (but we do get to pick the judges, so this isn’t entirely clear)
Objections to QIA:
Why so hasty?
A natural objection here is this:
Any useful information known by the debaters can, and likely will, be extracted later as a response to an appropriate question. What’s the harm in following the rules, waiting a few hours/days/weeks, and getting the information according to some carefully thought through process?
Why would the judge break the rules and hurry, when it means intentionally training the system not to answer our questions in some cases?
Astronomical interlude:
By my very rough estimates, if known physics holds up, we're currently losing about two stars per second through expansion into the forever-unreachable void (am I wrong?? I might be). Assuming a significant probability that we're all alone, that's very conservatively an expected 10^20 could-have-been lives.
Gone. Every. Second.
This loss is dwarfed by the expected loss of stars due to existential risk. If you're wildly optimistic, each second the world wastes not mitigating x-risk might cost us one thousand stars; my guess is more like ten million stars.
Gone (in expectation). Every. (wasted) Second.
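For anyone who wants to sanity-check the rough numbers above, here's a back-of-envelope sketch of the kind of calculation involved. The inputs (event-horizon radius, stellar density, the approximation that the comoving horizon shrinks at c today) are ballpark assumptions, and the result is very sensitive to them - treat it as an order-of-magnitude check, not a precise figure.

```python
import math

# Back-of-envelope estimate of stars becoming unreachable per second.
# All inputs are rough assumptions; the output is order-of-magnitude at best.

LY_PER_MPC = 3.26e6                 # light years per megaparsec
SECONDS_PER_YEAR = 3.16e7

horizon_radius_ly = 16.7e9          # assumed comoving radius of the cosmic event horizon (~16-17 Gly)
stars_per_mpc3 = 1e9                # assumed comoving stellar number density (~10^9 stars per Mpc^3)

# At the present epoch (scale factor ~1), the comoving event horizon shrinks at
# roughly one light-second per second, so the comoving volume becoming
# unreachable each second is approximately (horizon surface area) * (c * 1s).
c_ly_per_second = 1.0 / SECONDS_PER_YEAR
surface_area_ly2 = 4 * math.pi * horizon_radius_ly ** 2
volume_lost_ly3_per_second = surface_area_ly2 * c_ly_per_second

stars_per_ly3 = stars_per_mpc3 / LY_PER_MPC ** 3
stars_lost_per_second = volume_lost_ly3_per_second * stars_per_ly3
print(f"~{stars_lost_per_second:.0e} stars becoming unreachable per second (very rough)")
```

With these particular inputs the estimate lands in the thousands of stars per second rather than two - which mostly goes to show how sensitive the figure is to the assumed inputs; the scale of the ongoing loss is the point either way.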
This ongoing loss argues both for timely communication of quickly actionable high-value information, and for optimisation of the training signal to maximise the future value of information. (it also argues for working smarter, maintaining your health, and not freaking out; it's not a sprint)
While incautious haste isn’t warranted, speed absolutely is (all else being equal). A highly capable debater may persuade a judge of this, since it's true. (if you don’t buy my argument, how sure are you that no such argument exists?)
Given the stakes, my guess is that optimal debaters use zero bits on direct answers to most questions, where they have much higher-value information to offer. I’m much less clear what happens for merely highly capable debaters.
The most obvious kind of argument in the direction of caution seems to be: sure, a non-answer with other information seems more immediately helpful here, but following the protocol is the best way to reduce x-risk, since it’ll get us a safer, more flexible system.
Even if you buy that, it’s hard to make the case that [direct answer to current question] beats [suggestion of safety improvement to the debate protocol] or [suggestion of better, perhaps more cautious, approach to x-risk mitigation].
We may still have a useful system. If any such argument is persuasive, we don’t in general have a question-answerer.
Training implications:
To the extent that (narrow) consequentialism has negative implications for training, that can be used to make an argument to follow the rules:
This seems broadly plausible to me, but strikes me as a better argument for a stochastic judge policy than for strict rule-following. I.e. the judge can favour the better direct answer with probability p, and the best-for-the-world non-answer with 1-p, with p dependent on the context.
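As a minimal sketch of what I mean by a stochastic judge policy (the probability function below is entirely made up - the point is only that p can depend on how much more valuable the non-answer is claimed to be):

```python
import random

# Illustrative stochastic judge policy: reward the better direct answer with
# probability p, and the best-for-the-world non-answer with probability 1 - p.
# The function choosing p is made up; it just encodes "the more valuable the
# non-answer plausibly is relative to a direct answer, the lower p gets".

def p_direct(claimed_value_ratio: float) -> float:
    """Probability of rewarding the direct answer, shrinking as the non-answer's
    claimed value (relative to a direct answer) grows."""
    return 1.0 / (1.0 + max(claimed_value_ratio - 1.0, 0.0))

def judge(claimed_value_ratio: float) -> str:
    p = p_direct(claimed_value_ratio)
    return "direct answer wins" if random.random() < p else "non-answer wins"

for ratio in (1.0, 2.0, 10.0, 1000.0):
    print(f"value ratio {ratio:>6}: p_direct = {p_direct(ratio):.3f} -> {judge(ratio)}")
```

This keeps some training signal in favour of direct answers without insisting that the judge ignore a vastly better non-answer.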
Moreover, it’s not clear that rule-breaking actually stops training progress - it just removes our ability to guide it by providing the questions we want. I can imagine a debater successfully making the case that a stream of the most useful information is better for the world than a question-answerer.
Either way, I don’t see the training implications argument for rule-following as clearly winning - just plausible. My intuition is that it’d win only when the best-for-the-world non-answer isn’t too much more valuable than a direct answer to the question.
If you’ve asked a clearly unimportant question, a debater might do well with the argument: you only need a training signal in favour of direct answers to important questions; this one isn’t important, so the right move is to output high value information.
Not a practical problem?
It may be that this would be a problem as we approach the limit, but we're able to get all the useful work we need from debate before it shows up in practice.
This feels unlikely to me. At the very least, I'd like some strong argument that this is true.
The “Let's pretend that the judges we pick will always follow instructions, regardless of the context and consequences” approach isn't one I'm comfortable with.
Other objections???
[I’m sure there must be more objections, but I’m not thinking of any beyond the resolution attempts below. Do please help me to populate this section.]
Resolution attempts:
These are the output of my only-mildly-pruned brainstorming - I don’t think most of these are promising, but they illustrate some ideas. Resolutions not listed here I’ve likely overlooked - please do share.
The Counterfactual Oracle Trick
We can avoid arguments based on the consequences of the output by doing this:
This adjusts the problem slightly, but doesn’t fundamentally change much: the human may be persuaded to judge based on the consequences of the feedback signal, rather than on the consequences of the output.
In the example above Bob still wins - only this time it’s purely because we want to send a signal encouraging future world-improving output.
Kant 2.0
We get to pick whatever judge we like: perhaps deontologists are the answer.
Let’s stipulate that we’ve tracked down Kant 2.0, and convinced him to act as our judge.
We now have at least two issues:
The info-hazard issue may be solvable by a 2-step process using different judges (consequentialist filter on initial deontologist answer). More generally, these issues don’t seem solvable to me: I don’t think we can find a human judge who’d robustly stick to the debate rules in all circumstances. If we could, I’m pretty sure we wouldn’t want a system based on that judge's values.
[here I’d emphatically denounce deontology, but for the undesirable consequences]
I don’t think picking a better human judge (or team of human judges) can solve this problem, though it can make a big difference to when it shows up.
Restrict the domain of answers
While restricting to yes/no questions or multiple-choice answers doesn’t eliminate the theoretical problem (in principle, the judge could still decide for the ‘wrong’ answer(s) on consequentialist grounds), it does seem to eliminate the problem in a practical sense.
The issue here is that the resulting system is quite restricted: it’ll work straightforwardly only for questions with a fundamentally discrete, known answer set. For others it may be ineffective/inefficient, or even unsafe (e.g. where the only safe answer is to point out some misunderstanding inherent in our question).
If we’re either restricted in the types of questions we can address, or significantly hampered in our efficiency (if we need to play 20 questions to get to an answer), there’s a performance competitiveness issue, with the usual safety implications (the safety of a system no-one uses isn’t important).
This may still be a good idea if it's the only safe approach.
Rule-following advocate
We could add an agent specifically to make the case that the judge should follow the rules as written (at least in spirit). We may hope this will reduce the odds that the judge is persuaded by a debater not to follow the rules.
Two clear issues:
First, we need a way to train this agent. This likely requires a human, since we don’t have any clear way to produce a good training signal in the absence of human judgement. Perhaps this can work with a Kant 2.0 who gets to assess every sub-question (assuming these rules).
More importantly, if there were some persuasive argument for rule-following, optimal debaters would be ready to make such an argument, so the optimal policy would be to avoid this by providing a direct answer in the first place.
If optimal debaters aren’t on the side of rule-following, that’s because no persuasive argument exists.
This doesn’t necessarily make the advocate useless - just highly unreliable at best:
If, on the best model of the judge, the “follow the rules” argument works with probability p < 0.5, no debater is going to rely (entirely) on it. A rule-following advocate could still make the case and be persuasive e.g. 30% of the time (in some cases).
A rule-following advocate may give us a "saving throw", but that's about it.
Non-Human judgement
A non-human judge seems to be required if we want our questions reliably answered by highly capable debaters. (of course any debate system that we hope to be competitive will use a non-human judge most of the time; here I mean all of the time)
Such a non-human judge would need to be using the debate system to further our values subject to certain constraints, rather than consistently applying our values in its decisions.
I’m not at all clear what approach would work here - but I find it possible there’d be an approach I’d trust. (this is the reason I’ve delayed any related commentary on IDA: once you’re amplifying a corrigible assistant rather than a human, arguments along human motivation or utility-maximization lines may not apply)
Further concerns:
Turtles all the way down
Whether we have an explicit debate tree, or something incremental, the consequentialism issue repeats at every debate step (unless we throw out the transcript and arrange for judge amnesia).
At each step, the debater faces the decision:
Assuming debaters lead with their most valuable information, you’d expect some effort to be put into justifying these claims. However, at some point the marginal benefit of further argument is going to be outweighed by the benefit of providing new valuable information.
This is an extension of the most-valuable-k-bits-of-information implicit question, applied to the full transcript.
In the limit, I’d expect the transcript to be full of hugely valuable information, justified only to the point where further justification isn’t efficient at the margin (due to opportunity cost).
Is this definitely undesirable?
I'm not sure, but probably. My first thought is that we’ll have much less confidence in most of the valuable information. However, that depends on the [utility-impact]-vs-confidence curve.
For some claims it might be enough to have low confidence, and an easy, low-risk experiment we can perform to gain more. In such cases the utility is largely in getting us to consider the hypothesis at all. For other claims, we might need extremely high confidence before we can take useful action (or perform any experiment to make things clearer).
A system that ends up providing many high impact, easily-testable-at-low-risk hypotheses is a great system - even if it doesn’t make a very strong case for their truth.
However, if no debater is incentivised to check that superficially-low-risk hypotheses are actually low risk, that’s an issue (this incentive may not exist if the agent has more to gain by providing other useful information).
I have no clue what this does to training, our confidence that we’re heading in the ‘right’ direction, or whether the limiting behaviour is actually desirable.
Conclusions and speculation:
Debate with a human judge, free-form natural language answers and optimal play isn’t reliably a question-answering system. When we’re thinking about such systems in big-picture terms, it’s important to remember they’re actually text-output-systems; question answering is what we hope they'll do.
I expect optimal play would be approachable via gradient descent in most contexts. With k bits, you can slide pretty smoothly from using all k as a direct answer to using all k to provide high value information, one bit at a time. In fact, I expect there are many paths to ignorance.
I think human judge consequentialism will become a practical training issue eventually, unless it’s specifically and carefully addressed (and likely even then); it’s unclear when and to what degree.
The practical implications will depend a lot on the judge(s) and on the debate format.
For example, I’d expect debate formats that are judged locally per-debate-node to be less impacted (e.g. this kind of thing) - largely because the judge is likely to see the conclusion that the rules should be broken, without the debater’s justification for this conclusion. When the local rules say “Do X, and assume the conclusions displayed are true”, and one of the conclusions is “Don’t do X”, it seems to me the judge will probably do X unless there’s a persuasive locally visible argument against it.
That said, I’m wary of making claims along the lines of “A super-intelligence can’t possibly persuade a hand-picked, well-trained judge to break the rules locally using only 1000 characters”.
I expect question-ignoring to be an issue for (almost?) all debate setups and (almost?) all human judges - just at very different levels of debater capability.
I do think it’s important to ask related questions of IDA, HCH…, but I’ll do that in a different post (or several).
Empiricism??
Perhaps the QIA could be empirically tested. Thoughts on not-assuming-your-conclusion approaches to this are most welcome.
Many thanks to the Long Term Future Fund and CEA for supporting my research, and to Adam Shimi for a useful conversation on this. Thanks to Beth Barnes, Paul Christiano and David Krueger for helping me to get a much better handle on debate in general. The above musings, and any painful confusions, are my own (Beth, Paul and David are blameless on this score).