bhauth

I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts

You don't? But this is a major problem in arguments between people. The variation within humans is already more than enough for this! There's a gap like that every 35 IQ points or so. I don't understand why you're confident this isn't an issue.

I guess we've found our main disagreement, at least?

So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.

Now you're training for multiple objectives:

  1. You want the debater AI to argue for proposition A or not-A according to its role and convince human judges of that.
  2. You want it to not change its position on sub-arguments.

But (2) is ill-defined. Can sub-arguments be combined so they're weighted less? Are they all worth the same? What if several sub-arguments all depend on a single deeper sub-argument? Good arguments for A and for not-A should disagree on lots of sub-arguments - or do you want to train an AI that makes all the same sub-arguments for both A and not-A and then says "this implies A / not-A"? I don't think this works.
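
To make the ill-definedness concrete, here's a minimal sketch of what the combined training signal might look like. The function, its inputs, and the weighting are hypothetical illustrations, not anything specified in the debate papers; the open questions above are exactly the parts the sketch has to pick arbitrarily:

```python
# Hypothetical sketch of combining objective (1) with a consistency penalty (2).
# Every name here is illustrative; the debate papers don't specify this.

def debater_reward(judge_score, debate_answers, cross_exam_answers,
                   consistency_weight=1.0):
    """judge_score: how convinced the judge was by the debater's assigned position.
    debate_answers / cross_exam_answers: dicts mapping each sub-claim to the
    stance the debater took in the main debate vs. under cross-examination."""
    inconsistencies = sum(
        1 for claim, stance in debate_answers.items()
        if cross_exam_answers.get(claim) != stance
    )
    # Unresolved choices from the text: should every sub-claim count equally?
    # Should sub-claims that all depend on one deeper premise be merged first?
    return judge_score - consistency_weight * inconsistencies
```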


In response to the linked "HCH" post:

Yes, an agent past some threshold can theoretically make a more-intelligent agent. But that doesn't say anything about alignment; the supposed "question-answering machine" would be subject to instrumental convergence and mesa-optimizer issues, and you'd get value drift with each HCH stage, just as you would with RSI schemes.
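
To illustrate why the drift part matters even if each stage is nearly faithful, here's a toy compounding calculation; the per-stage fidelity number is a made-up assumption, not an estimate:

```python
# Toy illustration: if each HCH stage (or RSI step) preserves the intended
# values with fidelity f < 1, overall fidelity falls off geometrically.
# The 0.99 figure is an arbitrary assumption for illustration.

def fidelity_after(stages, per_stage_fidelity=0.99):
    return per_stage_fidelity ** stages

print(fidelity_after(10))   # ~0.90
print(fidelity_after(100))  # ~0.37
```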

bhauth

If you want to disallow appeals to authority

I do, but more importantly, I want to disallow the assumption that the judge understands all the concepts here. Suppose the judge asks #1: "What is energy?" or "What is conservation?" and it can't be explained to them - what then?

Also, argument 1 isn't actually correct, E=mc^2 and so on.

That seems right, but why is it a problem? The honest strategy is fine under cross-examination, it will give consistent answers across contexts.

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

bhauth

You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.

Ah, I don't think you can. Reaching that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, and Noether's theorem - which, in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.) This seems like a core part of our disagreement here.
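
Here's a toy sketch of that decomposition picture, just to show where the count blows up; the claim encoding is made up for illustration:

```python
# A claim is either (statement, experiment) -- a leaf the judge can check
# directly -- or (statement, [sub_claims]) that must be decomposed further.

def experiments_needed(claim):
    _statement, rest = claim
    if isinstance(rest, str):  # leaf: one experiment the judge can run
        return 1
    return sum(experiments_needed(sub) for sub in rest)

# Without abstractions like conservation of energy, "perpetual motion is
# impossible" gets one leaf per machine design:
per_machine = ("perpetual motion is impossible",
               [(f"machine design {i} gains no net energy", "build it and measure")
                for i in range(1000)])
print(experiments_needed(per_machine))  # 1000 -- and the real design space is unbounded
```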

You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

Which paper are you referring to? If you mean doubly efficient debate

Yes, "doubly efficient debate".

bhauth

To clarify the 2nd point, here's an example. Suppose someone presents you with a large box that supposedly produces electricity endlessly. Your boss thinks it works, and you're debating the inventor in front of your boss.

"Perpetual motion machines are known to be impossible" you say, but your boss isn't familiar with that conceptual class or the reasoning involved.

The inventor says, "Here, let's plug in a thing, we can see that the box does in fact produce a little electricity." Your boss finds this very convincing.

The process proposed in the paper is something like, "let's randomly sample every possible machine to see if it does perpetual motion". So the inventor points to the sun and says, "that thing has been making energy continuously and never stops for as long as we've been able to tell". They point to some stars and say the same thing.

The sampling and evaluation depend on a conceptual framework that isn't agreed on, and waiting for the sun and stars to burn out isn't very practical.

bhauth

I took a look at the debate papers. I think that's a good angle to take, but they're missing some factors that sometimes make debates between humans fail.

  1. Humans and neural networks both have some implicit representation of probability distributions over output types. The basis behind "I can't explain why, but that seems unlikely" can be more accurate than "here's an argument for why that will happen". You're basically delegating the problem of "making AI thinking explainable" to the AI itself, but if you could do that, you could just...make neural networks explainable, perhaps by asking an AI what another AI is doing and doing RLHF on the response. But that doesn't seem to work in general. In other words, the problem is that agents restricted to the arguments they can output are weaker than NNs that don't have to route their reasoning through argument production.

  2. Reasoning about probability distributions means argument branches can come down to "X is a likely type of thing" vs "X is rare", and empirically checking the distribution can be too expensive. That makes the debate framework work less well; see the rough cost sketch below.
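
As a rough illustration of that cost (the rates and confidence level are made-up assumptions): the number of samples needed to distinguish "likely" from "rare" empirically scales with how rare the claim is.

```python
import math

def samples_to_see_one(p, confidence=0.95):
    """Samples needed to observe at least one occurrence with the given
    probability, if the true per-sample rate is p."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

print(samples_to_see_one(0.1))    # "X is likely": ~29 samples
print(samples_to_see_one(1e-6))   # "X is rare": ~3,000,000 samples
```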