Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.
My objection is actually mostly to the example itself.
As you mention:
the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.
Compare with the example:
...Suppose we’re designing some secure electronic
Let us make a distinction here between two cases:
In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2].
And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Tur...
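A minimal, self-contained sketch of the standard reduction behind [3], with "machines" modeled as Python generators (all names are mine, and this is an illustration rather than a proof):

```python
def halts_within(machine, max_steps):
    """Step-bounded simulation: run the generator-style machine for at most
    max_steps steps and report whether it halted. This part is computable."""
    it = machine()
    for _ in range(max_steps):
        try:
            next(it)
        except StopIteration:
            return True
    return False

def make_p(machine):
    """p is equivalent (as a function) to q below iff `machine` never halts."""
    def p(x):
        return 1 if halts_within(machine, x) else 0
    return p

def q(x):
    return 0

# Example "machines": one that halts, one that doesn't.
def halting_machine():
    for _ in range(10):
        yield

def looping_machine():
    while True:
        yield

# make_p(looping_machine) computes the same function as q; make_p(halting_machine) does not.
# Deciding which case holds for an arbitrary machine is exactly the halting problem, which is
# why equivalence of two Turing-complete blackboxes is undecidable in general.
```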
Your definition of L-knowledge implies there can 'only' be 2^L total possible latent variables in the universe that are L-knowable for any given L, I believe.
This isn't strictly a problem, as you can just increase L... but your upper bound on L before the answer is trivially 'yes' is the inverse Kolmogorov complexity of the program trace + o(1). This grows slower than any computable function.
I'd be concerned that for programs of 'realistic' (read: 'fits within the universe') sizes there is no such L.
This is an interesting observation; I don't see how it addresses my point.
There is no exact solution to mutual information from two finite samples. There is no ε-approximation of mutual information from two finite samples, either.
=====
On the topic of said observation: beware that ε-approximations of many things are proven difficult to compute, and in some cases are even uncomputable. (The classic example being Chaitin's Constant[1].)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable...
The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?
You fundamentally cannot, so it's a moot point. There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.
I do not follow your seeming dismissal of this. You acknowledge it, and then... assert it's not a problem?
An analogy: solving the Halting problem is impossible[3]. It is someti...
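Returning to the "no way to confirm zero mutual information" point above, a small numerical illustration: even for two variables that are independent by construction (true mutual information exactly zero), a plug-in estimate from a finite sample essentially never comes out exactly zero. (A sketch using a simple histogram estimator; sample size and bin count are arbitrary choices of mine.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)   # independent of x by construction: true MI = 0

# Plug-in (histogram) estimate of mutual information, in nats
counts, _, _ = np.histogram2d(x, y, bins=20)
pxy = counts / counts.sum()
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nonzero = pxy > 0
mi_hat = np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero]))
print(mi_hat)   # small but strictly positive, despite the true MI being exactly 0
```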
There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.
Thanks for bringing this up; it gets at a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.
The key point: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy for a true utility function u, and the proxy is always within ε of u (i.e. ...
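For concreteness, the standard bound this setup presumably builds toward (my notation: u′ for the proxy, x⋆ = argmax u′, x∘ = argmax u, and |u(x) − u′(x)| ≤ ε for all x):

$$u(x^\star) \;\ge\; u'(x^\star) - \epsilon \;\ge\; u'(x^\circ) - \epsilon \;\ge\; u(x^\circ) - 2\epsilon$$

So a uniformly ε-accurate proxy costs at most 2ε of true utility when optimized; for Goodhart to bite, the approximation has to fail somewhere, i.e. fail to generalize.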
I must have missed that part; can you point more specifically to what you're referring to?
It feels wrong to refer you back to your own writing, but much of part 4 was dedicated to talking about these short-term predictors being used to combat neural latency and to do... well, short-term predictions. A flinch detector that goes off 100ms in advance is far less useful than a flinch detector that goes off 300ms in advance, but at the same time a short-term predictor that predicts too far in advance leads to feedback when used as a latency counter (as I asked ...
Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.
In time, the brain (or any realistic agent) can do processing... but receives sensory data.
I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?
Realizable-speed priors are certainly correlated with circuit size priors to some extent, but there are some important differences:
To start with, note that if you push your speed bias far enough (e.g. a strong enough circuit depth complexity or Turing machine time complexity penalty), you just get a lookup table that memorizes everything.
This is true in the TM model[1]. This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table is O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit is O(1), whereas a lookup table is O(log n).
(This gets even worse in models where lookup is [2] or ...
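A back-of-the-envelope version of the comparison above (a sketch; the lookup-table figure is the usual balanced multiplexer-tree depth, ignoring constant factors):

```python
import math

def mux_tree_depth(num_entries):
    """Approximate depth of a balanced multiplexer tree selecting one of
    `num_entries` memorized values: roughly one select level per address bit."""
    return math.ceil(math.log2(num_entries))

def direct_circuit_depth_x_and_1():
    """f(x) = x & 1 just reads the lowest input bit: constant depth."""
    return 1

for n in (2**8, 2**16, 2**32):
    print(f"n = {n}: lookup-table depth ~ {mux_tree_depth(n)}, "
          f"direct circuit depth = {direct_circuit_depth_x_and_1()}")
```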
Just to make sure we're on the same page, I made up the “300ms” number; it could be something else.
Sure; the further you get away from ~300ms the less the number makes sense for e.g. predicting neuron latency, as described earlier.
Also to make sure we're on the same page, I claim that from a design perspective, fast oscillation instabilities are bad, and from an introspective perspective, fast oscillation instabilities don't happen. (I don't have goosebumps, then 150ms later I don't have goosebumps, then 150ms later I do have goosebumps, etc.)
I absolutely ...
Yes
Alright, so we at least agree with each other on this. Let me try to dig into this a little further...
Consider the following (very contrived) example, for a 300ms predictor trying to minimize L2[1] norm:
Context is static throughout the below.
t=0, overrider circuit forces output=1.
t=150ms, overrider circuit switches back to loopback mode.
t=450ms, overrider circuit forces output=0.
t=600ms, overrider circuit switches back to loopback mode.
t=900ms, overrider circuit forces output=1.
etc.
Do you agree that the best a slow-learning predictor that's a pure ...
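To make the timeline above concrete, here is a minimal encoding of the overrider schedule (assuming, which is my reading of the "etc.", that the pattern repeats with a 900ms period):

```python
def overrider_state(t_ms):
    """Mode of the overrider circuit at time t (milliseconds), per the schedule above:
    force 1 for 150ms, loop back for 300ms, force 0 for 150ms, loop back for 300ms, repeat."""
    phase = t_ms % 900
    if phase < 150:
        return ("override", 1)
    elif phase < 450:
        return ("loopback", None)
    elif phase < 600:
        return ("override", 0)
    else:
        return ("loopback", None)

# e.g. overrider_state(0) == ("override", 1), overrider_state(200) == ("loopback", None)
```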
I disagree; let me try to work through where we diverge.
A 300ms predictor outputting a sine wave with period 300ms into its own supervisor input has zero error, and hence will continue to do so regardless of the learning rate.
Do you at least agree that in this scheme a predictor outputting a sine wave with period 300ms has zero error while in defer-to-predictor mode?
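A quick numerical check of that claim (a sketch; it just verifies that a signal with period 300ms, evaluated 300ms later, equals itself, which is what "zero error while in defer-to-predictor mode" amounts to here):

```python
import numpy as np

period_s = 0.300                        # 300 ms
t = np.arange(0.0, 3.0, 0.001)          # 1 ms resolution, 3 s of signal
output = np.sin(2 * np.pi * t / period_s)                             # predictor's current output
arrives_300ms_later = np.sin(2 * np.pi * (t + period_s) / period_s)   # what the supervisor sees

print(np.max(np.abs(arrives_300ms_later - output)))   # ~0, up to floating-point error
```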
...The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function o
I'm liking[1] this theory more and more.
In the static-context case, let’s first consider what happens when the switch is sitting in “defer-to-predictor mode”: Since the output is looping right back to the supervisor, there is no error in the supervised learning module. The predictions are correct. The synapses aren’t changing. Even if this situation is very common, it has no bearing on how the short-term predictor eventually winds up behaving.
One solution to a -300ms delay connected to its own input is a constant output. However, this is part of an in...
Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?
I tend to treat hypothalamus & brainstem reactions as limited to a single rote set of (possibly-repetitive) motions driven by a single clear stimulus. The sort of thing that I could write a bit of Python-esque pseudocode for.
Withdrawal reflexes match that. Hormonal systems match that[1]. Blink reflex matches that. Suckling matches that. Pathfinding from point A to any of points B-Z in the presence of dynamic obstacles, properly orienting, then suckling? Not so ...
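A minimal sketch of the "single clear stimulus, single rote motion" category described above (all names and thresholds are hypothetical):

```python
NOCICEPTOR_THRESHOLD = 0.8   # hypothetical normalized pain-signal threshold

def withdrawal_reflex(nociceptor_signal, flexors, extensors):
    """One clear stimulus in, one rote motor pattern out: no learned-from-scratch
    world model, no pathfinding, no dynamic obstacle handling."""
    if nociceptor_signal > NOCICEPTOR_THRESHOLD:
        flexors.contract()    # hypothetical actuator interface
        extensors.relax()
```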
Fair! There are many plausible models that the human brain isn't.
My comments above also apply here—in the event that this is true (which I'd still bet against, at least in the human case), it wouldn't impact anything of significance for the series, I think.
I haven't seen much of anything (beyond the obvious) that said sketch explicitly contradicts, I agree.
I realize now that I probably should have explained the why (as opposed to the what) of my sketch a little better[1].
Your model makes a fair bit of intuitive sense to me; your model has an immediately-ob...
The cerebellum sits in the middle of the action, always saying to itself “what signal is about to appear here?”, and then it preemptively sends it. And then a fraction of a second later, it sees whether its prediction was correct, and updates its models if it wasn’t.
How does this cope with feedback loops?
Or is the implicit assumption here that the prediction lookahead is always less than the minimum feedback time delay? (If so, how does it know that?)
The argument here is the same as Section 3.2.1 of the previous post: the genome cannot know exactly which neurons (if any) will store any particular learned-from-scratch pattern, and therefore cannot hardwire a connection to them.
This may be more suited as a comment on the previous post. However, it ties into this, so I'll put it here.
I understand the argument that if the upstream circuit learns entirely from scratch, you can't really have hardwired downstream predictors, for lack of anything stable to hardwire them to.
I don't see a clear argument for the ...
...assuming that Copilot-(N+1) has <1.01x the development cost of Copilot-N. I'd be interested in arguments as to why this would be the case; most programming has diminishing returns where e.g. eking out additional performance from a program costs progressively more development time.