Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.
My objection is actually mostly to the example itself.
As you mention:
the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.
Compare with the example:
Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.
[...]
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
This is analogous to the case of... trying to contain a malign AI which is already not on our side.
Did your description to the participants state that the arguments were high-quality?
If there are high-quality arguments for multiple answers, doesn't that "just" mean that the multiple-choice question is itself low-quality?
Let us make a distinction here between two cases:
In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2].
And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Turing-complete whiteboxes are equivalent may be decidable.
in exactly the same way
It is not exactly the same way, due to the above.
Namely, 'the laws of physics'
(And worse, often doesn't exactly match in the observations thus far, or results in contradictions.)
Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X' which produce that sequence of outputs given those inputs, but whose later outputs diverge. This is not a problem in the whitebox case because said machines are distinguishable.
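A concrete toy version of this (machine_x / machine_x_prime are made-up stand-ins, not anything from the post):

```python
# Two "blackboxes" that agree on every query made so far, yet are not equivalent.

def machine_x(n: int) -> int:
    return n * n

def machine_x_prime(n: int) -> int:
    # Identical to machine_x on inputs 0..999; diverges afterwards.
    return n * n if n < 1000 else 0

observations = [(n, machine_x(n)) for n in range(1000)]
assert all(machine_x_prime(n) == out for n, out in observations)  # matches every observation so far
assert machine_x(1000) != machine_x_prime(1000)                   # ...and yet not equivalent
```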
Your definition of L-knowledge implies there can 'only' be ~2^L total possible latent variables in the universe that are L-knowable for any given L, I believe.
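(For concreteness: under my reading, 'L-knowable' means roughly 'pinned down by a description of at most L bits' - that's my assumption, and your formal definition may differ - which gives a simple counting bound:)

```latex
% Number of distinct binary descriptions of length at most L bits
% (assumption: "L-knowable" ~ "has a description of at most L bits"):
\sum_{\ell=0}^{L} 2^{\ell} \;=\; 2^{L+1} - 1 \;\approx\; 2^{L}\ \text{(up to a factor of 2)}
```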
This isn't strictly a problem, as you can just increase L... but your upper bound on L before the answer is trivially 'yes' is the inverse Kolmogorov complexity of the program trace + o(1). This grows slower than any computable function.
I'd be concerned that for programs of 'realistic' (read: 'fits within the universe') sizes there is no such L.
This is an interesting observation; I don't see how it addresses my point.
There is no exact solution to mutual information from two finite samples. There is no ε-approximation of mutual information from two finite samples, either.
=====
On the topic of said observation: beware that ε-approximations of many things are provably difficult to compute, and in some cases are even uncomputable. (The classic being Chaitin's Constant[1].)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable of unbounded computation, and even approximations thereof.
Unfortunately, 'value function of a powerful AI' tends to fall into that category[2].
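(For the flavor of the contradiction - this is just the standard diagonalization sketch, nothing specific to AI value functions; `halts` and `troublemaker` are hypothetical:)

```python
# Standard halting-problem diagonalization, sketched in Python.

def halts(program, program_input) -> bool:
    """Hypothetical oracle: True iff program(program_input) eventually halts."""
    raise NotImplementedError("no total, correct implementation can exist")

def troublemaker(program):
    # Halts exactly when `program`, run on its own source, does NOT halt.
    if halts(program, program):
        while True:
            pass  # loop forever
    return

# Does troublemaker(troublemaker) halt? Either answer contradicts the assumed
# correctness of `halts`, so no such total, correct `halts` can exist.
```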
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?
You fundamentally cannot, so it's a moot point. There is no way to confirm zero mutual information[1], and even if there were, the probability that the mutual information is exactly zero is itself zero[2]. Very small, perhaps. Zero, no.
I do not follow your seeming dismissal of this. You acknowledge it, and then... assert it's not a problem?
An analogy: solving the Halting problem is impossible[3]. It is sometimes useful to handwave a Halting oracle as a component of proofs regardless - but at the end of the day saying 'solving the Halting problem is easy, just use a Halting oracle' is not a solution.
Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong.
"Many people have an intuition like "everything is an imperfect halting-problem solver; we can never avoid Turing". The point of the Halting oracle example is that this is basically wrong."
Hopefully this illustrates my point.
In particular, it requires calculating the distributions to infinite accuracy, which in turn requires an infinite sample. (Consider if I have two independent, perfectly fair coins. I flip each of them 3x and get HHT/HHT. The empirical mutual information is non-zero!)
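(A quick check of the coin example - the plug-in estimate from the sample HHT/HHT; variable names are mine:)

```python
from collections import Counter
from math import log2

# Plug-in (empirical) mutual information for two independent fair coins,
# each observed as H, H, T.
coin_a = ["H", "H", "T"]
coin_b = ["H", "H", "T"]
n = len(coin_a)

joint = Counter(zip(coin_a, coin_b))
marg_a = Counter(coin_a)
marg_b = Counter(coin_b)

mi = sum(
    (count / n) * log2((count / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
    for (a, b), count in joint.items()
)
print(f"empirical mutual information: {mi:.3f} bits")  # ~0.918 bits; the true value is exactly 0
```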
At least assuming the Church-Turing hypothesis is correct.
Except potentially if there's an event horizon, although even that's an open question; and in that case it's a moot point anyway, because an AI behind an event horizon is indistinguishable from no AI.
Strictly speaking, within each other's lightcone.
And as soon as you have anything causing a correlation, the probability that other factors exactly cancel said correlation is zero.
I must have missed that part; can you point more specifically to what you're referring to?
It feels wrong to refer you back to your own writing, but much of part 4 was dedicated to talking about these short-term predictors being used to combat neural latency and to do... well, short-term predictions. A flinch detector that goes off 100ms in advance is far less useful than a flinch detector that goes off 300ms in advance, but at the same time a short-term predictor that predicts too far in advance leads to feedback when used as a latency counter (as I asked about/noted in the previous post).
(It's entirely possible that different predictors have different prediction timescales... but then you've just replaced the problem with a meta-problem. Namely: how do predictors choose the timescale?)
To clarify our intuitions here, I propose to go to the slow-learning limit.
However fast you've been imagining the parameters / weights / synapse strength changing in any given circumstance, multiply that learning rate by 0.001. And simultaneously imagine that the person experiences everything in their life with 1000× more repetitions. For example, instead of getting whacked by a golf ball once, they get whacked by a golf ball 1000× (on 1000 different days).
1x the training data with 1x the training rate is not equivalent to 1000x the training data with 1/1000th of the training rate. Nowhere near. The former is a much harder problem, generally speaking.
(And in a system as complex and chaotic as a human there is no such thing as repeating the same datapoint multiple times... related data points yes. Not the same data point.)
(That being said, 1x the training data with 1x the training rate is still harder than 1x the training data with 1/1000th the training rate, repeated 1000x.)
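(A toy gradient-descent illustration of that last parenthetical - the setup and numbers are mine, nothing from your post:)

```python
# Minimize f(w) = (w - 1)^2 from w = 0, comparing one full-size step on a
# datapoint against the same datapoint repeated 1000x at 1/1000th the rate.

def step(w: float, lr: float) -> float:
    grad = 2 * (w - 1)  # gradient of (w - 1)^2
    return w - lr * grad

# (a) 1x the data, 1x the rate: a single large step overshoots badly.
w = step(0.0, lr=0.9)
print("one step, lr=0.9:       w =", round(w, 3))  # 1.8 (optimum is 1.0)

# (b) Same datapoint repeated 1000x at lr/1000: smooth, stable approach.
w = 0.0
for _ in range(1000):
    w = step(w, lr=0.0009)
print("1000 steps, lr=0.0009:  w =", round(w, 3))  # ~0.835, no overshoot
```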
Your perspective, if I understand it, would be that this transformation would make the person flinch more slowly—so slowly that they would get hit by the ball before even starting to flinch.
You appear to be conflating two things here. It's worth calling them out as separate.
Putting a low-pass filter on the learning feedback signal absolutely does cause something to learn a low-passed version of the output. Your statement "In that case, the circuit would be basically incapable of "fast" dynamics (i.e. it would have implicit low-pass filters everywhere)," doesn't really work, precisely because it leads to absurd conclusions. This is what I was calling out.
A low learning rate is something different. (That has other problems...)
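(A toy contrast, with made-up numbers, of the two things I'm distinguishing:)

```python
# A brief "flinch-like" spike in an otherwise flat signal.
signal = [0.0] * 20 + [1.0] * 3 + [0.0] * 20

# (a) Low-pass filter (exponential moving average) applied to the signal itself:
# the fast spike is attenuated and smeared out, no matter how long we wait.
alpha = 0.05
filtered, y = [], 0.0
for x in signal:
    y += alpha * (x - y)
    filtered.append(y)
print("peak of low-passed spike:", round(max(filtered), 3))  # ~0.143 -- the fast dynamics are gone

# (b) Low learning rate: a parameter is trained toward the *unfiltered* spike value.
# It gets there eventually -- slowly -- but nothing stops it representing fast dynamics.
w, lr = 0.0, 0.001
for _ in range(10_000):  # many repetitions of the same datapoint
    w += lr * (1.0 - w)
print("slowly-learned spike value:", round(w, 3))  # ~1.0
```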
If you take any solution to 1, and multiply the learning rate by 0.000001, then it would satisfy 2 as well, right?
My apologies, and you are correct as stated; I should have added something on few-shot learning. Something like a flinch detector likely does not fire 1,000,000x in a human lifetime[1], which means that your slow-learning solution hasn't learnt anything significant by the time the human dies, and isn't really a solution.
I am aware that 1m is likely just you hitting '0' a bunch of times; humans are great few-shot (and even one-shot) learners. You can't just drop the training rate, or else your examples like 'just stand on the ladder for a few minutes and your predictor will make a major update' don't work.
My flinch reflex works fine and I'd put a trivial upper-bound of 10k total flinches (probably even 1k is too high). (I lead a relatively quiet life.)
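(Rough numbers, assuming an ~80-year lifespan - my assumption:)

```python
# Back-of-the-envelope: how often would flinch-level events have to occur?
days = 80 * 365  # ~80-year lifespan
for lifetime_flinches in (1_000_000, 10_000, 1_000):
    print(f"{lifetime_flinches:>9,} lifetime flinches ≈ {lifetime_flinches / days:.2f} per day")
# 1,000,000 would mean ~34 flinches per day, every day, for 80 years;
# 10,000 is ~1 every 3 days; 1,000 is ~1 a month.
```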
Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.
In time, the brain (or any realistic agent) can do processing... but receives sensory data.
I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?
Realizable-speed priors are certainly correlated with circuit size priors to some extent, but there are some important differences:
...assuming that Copilot-(N+1) has <1.01x the development cost of Copilot-N. I'd be interested in arguments as to why this would be the case; most programming has diminishing returns, where e.g. eking out additional performance from a program costs progressively more development time.
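(To put numbers on why that ratio matters - purely illustrative:)

```python
# If each generation costs r times as much to develop as the previous one,
# the relative development cost of generation N compounds as r**N.
for r in (1.01, 1.1, 1.5):
    print(f"r = {r:4}: gen 20 costs {r**20:,.1f}x gen 0; gen 50 costs {r**50:,.1f}x")
# r = 1.01 -> ~1.2x and ~1.6x (cheap enough to keep iterating)
# r = 1.5  -> ~3,325x and ~6.4e8x (diminishing returns dominate)
```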