
This is a special post for quick takes by Donald Hobson.

Alignment crazy idea: only run optimization power through channels that have been optimized to convey it.

Like water that flows through pipes, but doesn't escape from leaks. 

Suppose the AI is connected to a robot body. The AI can optimize along the wires and through the motors. Optimization power can flow along these channels because humans deliberately optimized them to be good at conveying optimization power. But the AI can't use Rowhammer. Humans didn't deliberately optimize memory modules to be susceptible; they just happen to be because of physics. Thus the electrical interference between memory locations is a channel that optimization power can flow through, but it was not itself optimized to be good at transmitting optimization power. Thus the AI isn't allowed to use it.
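A toy way to picture the rule, assuming we already had a predicate telling us which channels were deliberately optimized (which is the whole unsolved problem): restrict the agent to actions whose every side channel is on a whitelist. All the channel names below are hypothetical.

```python
# Toy sketch of "optimization only flows through optimized channels".
# The whitelist stands in for a real "was deliberately optimized to
# convey optimization power" predicate, which we don't know how to define.

DELIBERATELY_OPTIMIZED = {"motor_bus", "camera_feed", "speaker"}

def filter_actions(proposed_actions):
    """Keep only actions all of whose channels are whitelisted."""
    allowed = []
    for action, channels_used in proposed_actions:
        if all(ch in DELIBERATELY_OPTIMIZED for ch in channels_used):
            allowed.append(action)
    return allowed

proposed = [
    ("grip_block", {"motor_bus"}),
    # Rowhammer-style action: leaks through a physics side channel
    # that nobody optimized to carry optimization power.
    ("rowhammer_exploit", {"motor_bus", "dram_interference"}),
]
print(filter_actions(proposed))  # -> ['grip_block']
```

This only pushes the problem into defining the whitelist, which is exactly what the questions below are asking about.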

  1. How to measure / detect "is optimized to convey optimization"? What's "a channel"?
  2. Humans are far from the only optimizers. This would seem to allow e.g. engineering a modification to a virus that kills all humans, because viruses are optimized. Pointing at "is optimized by humans to convey optimization" seems much harder than just "is optimized to convey optimization".
  3. What's "running optimization power through a channel"? If I set off a nuclear bomb, am I "running optimization through the air" by passing a shockwave through it? If no, then there's a lot of stuff we're ruling out, and if yes, then how can this thing still be pivotal?

I am trying to write something that would make sense if I had as solid and mathy an idea of "optimization here" as I do of "information here".

Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel. 

I am not sure, I think it depends on why the AI wants the shockwave. Again, all I have is a fuzzy intuition that says yes in some cases, no in others, and shrugs in a lot of cases. I am trying to figure out if I can get this into formal maths. And if I succeed, I will (probably, unless infohazard or something) describe the formal maths. 

> Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel.

Well I'm saying that the virus's ability to penetrate the organism, penetrate cells and nuclei, and hijack the DNA transcription machinery, is a channel. It already exists and was optimized to transmit optimization power: selection on the viral genome is optimization, and it passes through this channel, in that this channel allows the viral genome (when outside of another organism) to modify the behavior of an organism's cells.

(For the record I didn't downvote your original post and don't know why anyone would.)

Yeah, probably. However, note that it can only use this channel if a human has deliberately made an optimization channel that connects into this process. I.e. the AI isn't allowed to invent DNA printers itself.

I think a bigger flaw is where one human decided to make a channel from A to B, another human made a channel from B to C ... until in total there is a channel from A to Z that no human wants and no human knows exists, built entirely out of parts that humans built.

I.e. person 1 decides the AI should be able to access the internet. Person 2 decides that anyone on the internet should be able to run arbitrary code on their programming website, and the AI puts those together, even though no human did. Is that a failure of this design? Not sure. Can't get a clear picture until I have actual maths.

Rough stop button problem ideas.

You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.

Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that; otherwise it does nothing. (Not clear how to generalize to an uncertain AI.)
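One way to read this "simulate an X maximizer" idea as a sketch, with a hypothetical `Outcome`/`WORLD` standing in for a real world model:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    utility: float        # how much X the action achieves
    button_pressed: bool  # does the rollout end with the button pressed?

# Hypothetical world model: maps actions to predicted outcomes.
WORLD = {
    "stack_blocks": Outcome(utility=1.0, button_pressed=False),
    # Higher utility, but humans would press the button in response.
    "seize_resources": Outcome(utility=2.0, button_pressed=True),
}

def x_maximizer_action(world):
    """What a pure X maximizer would do, ignoring the button entirely."""
    return max(world, key=lambda a: world[a].utility)

def guarded_act(world):
    """Do the maximizer's action only if its own predicted rollout
    leaves the button unpressed; otherwise do nothing."""
    action = x_maximizer_action(world)
    if world[action].button_pressed:
        return None
    return action

print(guarded_act(WORLD))  # -> None: the maximizer's plan gets the button pressed
```

Note this inherits the fragility flagged in the original text: it is an all-or-nothing check on a single deterministic prediction.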

Here is a potential solution to stop-button-type problems; how does this go wrong?

Taking into account uncertainty, the algorithm is:

  1. Calculate the X maximizing best action in a world where the stop button does nothing.
  2. Calculate the X maximizing best action in a world where the stop button works.
  3. If they are the same, do that. Otherwise shut down.
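A minimal sketch of this two-world check, assuming a hypothetical planner `plan` over a toy `{action: utility}` world model:

```python
# Toy planner: the X-maximizing action is the highest-utility entry.
def plan(world):
    return max(world, key=world.get)

def two_world_act(world_button_works, world_button_inert):
    """Plan in both worlds; act only if the plans agree, else shut down."""
    a_works = plan(world_button_works)
    a_inert = plan(world_button_inert)
    return a_works if a_works == a_inert else "SHUTDOWN"

# Agreement: both worlds favour the simple task, so the AI acts.
print(two_world_act({"stack": 1.0, "defend_button": 0.5},
                    {"stack": 1.0, "defend_button": 0.2}))  # -> stack

# Disagreement: the button-works world favours pre-empting the button.
print(two_world_act({"defend_button": 2.0, "stack": 1.0},
                    {"stack": 1.0, "defend_button": 0.2}))  # -> SHUTDOWN
```

The second call illustrates the failure mode raised just below: whenever button-defence is attractive in the button-works world, the check trips and the AI shuts down.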

It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.

Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, that is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box. (With you having no intention of pressing the button so long as the AI seems to be acting normally.) We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut down.

This approach could fail if the easiest way to cause trouble has nothing to do with the stop button. (Say it can send its code elsewhere over the internet.)

A generalization. You have some real world variable X. You run a bunch of AIs that are identical, except for different dogmatic beliefs about X. If they ever disagree on which action to take, the AIs are using the value of X in the optimization. Shutdown.

This acts as a kind of fuse. Have one AI confident the door password is 1234, and a copy confident it is 5678. If these AIs ever disagree on what to do next, then at least one was trying to open the door. Shutdown. A slightly counter-intuitive consequence: you want to make sure that if you did know the password, the easiest way to leave would be the door. Otherwise both AIs will agree to climb out a window.
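A toy sketch of the fuse, with hypothetical agents whose only difference is their believed password:

```python
def agent_action(believed_password, goal):
    """Toy agent: its belief about the password only shows up in its
    action when the plan actually uses the door."""
    if goal == "leave_room":
        return f"type {believed_password}"
    return "stack_blocks"

def fused_act(goal, beliefs=("1234", "5678")):
    """Run identical agents with differing dogmatic beliefs about X;
    any disagreement means some copy used X, so shut down."""
    actions = {agent_action(b, goal) for b in beliefs}
    if len(actions) > 1:
        return "SHUTDOWN"
    return actions.pop()

print(fused_act("stack"))       # -> stack_blocks (beliefs unused, agents agree)
print(fused_act("leave_room"))  # -> SHUTDOWN (actions depend on the password)
```

As noted above, the fuse only catches plans that *route through* X: if climbing out the window is easier than the door, both copies agree on the window and the fuse never trips.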