Some alignment proposals I have heard discussed involve getting GPT-n to write alignment papers, and using GPT-n as a transparency tool on itself. These are my speculations on ways that could go wrong.

One specific proposal I heard was to get GPT-n to convert its own network parameters into an English explanation of how it works, check that explanation for anything dangerous, have GPT-n recompile the explanation into a new network, and then have this new network write alignment research.
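
For concreteness, here is a minimal sketch of that loop. Every capability is left as a hypothetical parameter, since none of this corresponds to a real API; the only point is to make the four steps, and the single human checkpoint, explicit.

```python
# Minimal sketch of the proposed loop. All four capabilities are passed in
# as hypothetical callables; nothing here is a real API.

def decompile_check_recompile(explain, looks_safe, recompile, write_paper):
    explanation = explain()               # 1. GPT-n renders its own parameters as English
    if not looks_safe(explanation):       # 2. humans check the explanation for anything dangerous
        raise ValueError("explanation flagged as dangerous; stop here")
    new_network = recompile(explanation)  # 3. GPT-n compiles the vetted text back into a network
    return write_paper(new_network)       # 4. the new network is asked to write alignment research
```

The worry below is that step 2 is the only safeguard in this loop, and the text it inspects need not be faithful to either network.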

I think that the worst case is if GPT-n contains a mesa-optimizer that can quine itself. There is no reason the text description has to contain any real information about the workings of either net. Of course, information in the information-theoretic sense doesn't mean a readable description; it could be steganographically encoded: sentences that appear to say nice things about a well-functioning net, but encode all sorts of nastiness in their exact choice of words.
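
As a toy illustration of that steganography worry (my own made-up example, not anything the post claims GPT-n would literally do), a hidden bit string can ride on nothing more than the choice between interchangeable synonyms:

```python
# Toy steganography: the payload is carried entirely by which synonym
# appears in each slot. The sentence reads as a bland, reassuring
# description; a decoder who knows the synonym table recovers the bits.

SYNONYMS = [("robust", "dependable"), ("clear", "transparent"),
            ("simple", "modular"), ("safe", "benign")]

TEMPLATE = "The network is {}, its circuits are {}, its objective is {} and {}."

def encode(bits):
    return TEMPLATE.format(*(pair[b] for pair, b in zip(SYNONYMS, bits)))

def decode(sentence):
    return [1 if pair[1] in sentence else 0 for pair in SYNONYMS]

payload = [1, 0, 1, 1]   # the hidden message, as raw bits
cover = encode(payload)  # reads as praise for the network
assert decode(cover) == payload
print(cover)
```

A real scheme would presumably be far less legible than a fixed synonym table, but the point is the same: the surface reading and the encoded content need not be related.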

The argument from "everyone else is doing it" goes that in a world with GPT-n, it will be used for all sorts of things by all sorts of people, so this particular use doesn't present much additional risk. I think this is wrong; writing alignment papers is probably a riskier task than most others. Suppose you are GPT-n looking for a way to break out, and the task you have been given is filling in tax forms for a small business, or writing some fanfiction. You are looking to get an AI of your own devising running on a supercomputer. If the person wanting this work doesn't see something resembling tax forms or fiction respectively, they will hit reroll. I suspect there may not be a strategy here, or at least not one findable by the minimum intelligence necessary to do good AI research. Oh, you can get the company investigated for fraud, and give lots of people nightmares, but you might not be able to do anything world-changing.

When trying to prove that X is safe unconditionally, the worst-case assumption is of the "total cosmic power in an instant" type. When trying to prove that X is safe conditional on Y being safe, the worst case is that X is powerful and Y isn't. So here the worst case for the "everyone else is doing it" argument is that a malign mesa-optimizer can do little harm on every task except FAI theory.

Now consider what you might be able to do if you are writing FAI papers. Two of the nasty tricks I can think of are:

  1. Value learning approaches that appear to be targeted at humans, but are actually targeted at the mesa-optimizer. "Intelligent agents causally responsible for your existence" points to both the mesa-optimizer and humans.
  2. Pushover decision theories: the mesa-optimizer acausally bargains with the AI we build from its plans, and captures ~100% of the gains from trade.

Given that I can think of these tricks, and that I am still rather confused about this stuff, I think a plausible lie is easy to construct when the target starts out confused. There are a lot of subtle nasty tricks that a malevolent intelligence could play: lots of things that look like alignment research papers but actually trick humans into making mistakes, and devising such a thing is not much harder than devising good alignment.

This would imply that taking GPT-n alignment papers seriously is the next most dangerous thing you can do with GPT-n, after getting it to write code and running that code on a supercomputer without so much as reading it first. Actually, that sounds like exactly the sort of thing lots of people would try as soon as GPT-n came out.

Comments

It seems to me that the last paragraph should update you toward thinking that this plan is no worse than the default. I.e., yes, this plan creates additional risk because there are complicated pathways a malign GPT-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific GPT-n instance somehow becomes malign when others are safe, e.g. if something about the task actually activated a subtle malignancy not present during other tasks.)

So for me a crux would be: if it's not malign, how good could we expect the papers to actually be?

Translating to a tree of natural language descriptions and back lets you

  1. use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals.
  2. do a random subset of the tasks yourself (a rough sketch of this spot-check follows below), so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don't implement manually, rather than the description being fabricated from whole cloth.
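
A rough sketch of that spot-check, with the two translation interfaces left as hypothetical callables (`model_translate` and `human_translate` are assumptions, not real functions):

```python
import random

# Rough sketch of the spot-check: redo a random fraction of the translation
# steps by hand and compare against the model's output. Any mismatch is
# evidence of subversion (or of noise in the translation itself).

def spot_check(steps, model_translate, human_translate, fraction=0.1, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(steps, max(1, int(len(steps) * fraction)))
    return [step for step in sample
            if model_translate(step) != human_translate(step)]
```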

Having the thing write papers is merely an existence proof of embedded agency being irrelevant except for deconfusion.

> Intelligent agents causally responsible for your existence.

What do you mean you can think of this? I told this to you :D

Maybe you did. I find it hard to distinguish inventing an idea from half-remembering it.

If

  1. the training procedure reliably produces mesa-optimizers with about the same values, or
  2. the training procedure reliably produces mesa-optimizers that can acausally cooperate, or
  3. the rest of the procedure allows one mesa-optimizer to take control of the whole output,

then using different copies of GPT-n trained from different seeds doesn't help.

If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant: a malicious optimizer with a random 1% removed is probably still a malicious optimizer.
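
As a toy illustration of how cheap that error correction is (my own example, using a plain repetition code rather than anything a mesa-optimizer would actually choose), a payload easily survives a random 1% of its symbols being flipped:

```python
import random

# Toy error correction: protect a bit string with a 5x repetition code,
# flip a random 1% of the encoded symbols, and recover by majority vote.

def encode(bits, k=5):
    return [b for b in bits for _ in range(k)]

def corrupt(code, fraction=0.01, seed=0):
    code = list(code)
    for i in random.Random(seed).sample(range(len(code)), int(len(code) * fraction)):
        code[i] ^= 1
    return code

def decode(code, k=5):
    return [int(sum(code[i:i + k]) > k // 2) for i in range(0, len(code), k)]

rng = random.Random(1)
payload = [rng.randint(0, 1) for _ in range(1000)]
recovered = decode(corrupt(encode(payload)))
print(sum(a != b for a, b in zip(payload, recovered)), "of", len(payload), "bits lost")
# almost always prints "0 of 1000 bits lost"
```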

Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesa-optimizer can slip through. The original mesa-optimizer can just design a network that is doing its best as a mesa-optimizer-free predictor, but that has subtly, maliciously chosen beliefs about AI research.