I think there’s a lot going on here with your conflating the speed prior over circuits with a speed prior over programs.
I think a lot of the ideas in this direction either get confused by the difference between circuit priors and program priors, or at least treat the two as equivalent. Unfortunately, a lot of this stays vague until you specify the domain of model. I think specifying this more clearly will make it easier to communicate about these ideas. To start doing that myself: when I talk about circuit induction, I’m talking about things that look like large randomly initialized Bayes nets (or deep neural networks).
Program Induction Priors are bad: I would claim that program induction priors (simplicity prior, speed prior, or others) are almost always a bad fit for developing useful intuitions about the behavior of large random Bayes net machines.
Confusion between circuit induction speed prior and simplicity prior: I think your point about double descent is wrong — in particular, the speed is largely unchanged in double descent experiments, since the *width* is the parameter being varied, and all deep neural networks of the same depth have approximately the same speed (unless you mean something weird by speed).
Circuit Simplicity: You give circuit-size and circuit-depth as examples of a “speed prior”, which seems pretty nonstandard, especially when describing it as “not the simplicity prior”.
More than Speed and Simplicity: I think there are other metrics that provide interesting priors over circuits, like likelihood under some initialization distribution. In particular, I think “likelihood under the initialization distribution” is the prior that matters most, until we develop techniques that let us “hack the prior”.
Connection to Infinite-Size Neural Networks: I think research on neural networks approaching/at the infinite-size limit looks a lot like black hole physics, and can similarly tell us interesting things about the dynamics we should expect. In particular, for systems optimized by gradient descent, we end up with infinitesimal/nonexistent feature learning in the limit, which is interesting because the sub-modules/sub-circuits we start with are all we’ll ever have! This means that even if there are “simple” or “fast” circuits, if they’re not likely under the initialization distribution, we should expect them to have a vanishingly small effect on the output. (One way of thinking about this is in terms of the NTK: even if some modules would be extremely predictive, their contribution gets overwhelmed by the much more common, simpler features.)
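One standard way to spell out that linearization point: in the NTK regime, the trained network stays (to first order) the initial function plus a linear combination of the features it had at initialization,

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0) \cdot (\theta - \theta_0),$$

where $\theta_0$ is the initialization. Gradient descent only re-weights the Jacobian features $\nabla_\theta f(x; \theta_0)$, so a circuit that is essentially absent from those features at initialization never gets added during training.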
Hacking the Prior: Right now we don’t have a good understanding of the behavior of partially hand-coded neural networks, but I think they could serve as a new/distinct class of models (with regard to which functions are likely under the initialization distribution). Concretely, this could look like us “hand-programming” circuits or parts of neural networks, randomly initializing the rest, and then seeing whether the model learns to use those programmed functions during training.
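A minimal sketch of what such an experiment could look like (assuming a standard PyTorch setup; the particular hand-coded feature, architecture, and names here are illustrative placeholders):

```python
import torch
import torch.nn as nn

class HandCodedFeature(nn.Module):
    # A "hand-programmed" sub-circuit with no trainable parameters:
    # the product of the first two inputs (XOR-like on +/-1 inputs).
    def forward(self, x):
        return x[:, 0:1] * x[:, 1:2]

class HybridNet(nn.Module):
    # Hand-coded circuit plus a randomly initialized learned circuit, combined
    # by a small head; the question is whether training routes through the
    # programmed part at all.
    def __init__(self, d_in=8, d_hidden=32, d_feat=8):
        super().__init__()
        self.programmed = HandCodedFeature()
        self.learned = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_feat),
        )
        self.head = nn.Linear(d_feat + 1, 1)

    def forward(self, x):
        feats = torch.cat([self.learned(x), self.programmed(x)], dim=-1)
        return self.head(feats)

# After a standard training loop (omitted), the last column of head.weight is a
# crude read on how much the model relies on the programmed feature relative to
# the randomly initialized, learned ones.
model = HybridNet()
print(model.head.weight.shape)  # torch.Size([1, 9])
```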
> To start with, note that if you push your speed bias far enough (e.g. a strong enough circuit depth complexity or Turing machine time complexity penalty), you just get a lookup table that memorizes everything.
This is true in the TM model[1]. This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table is O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit is O(1), whereas a lookup table is O(log n).
(This gets even worse in models where lookup is O(n^(1/3))[2] or O(n^(1/2))[3])
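To make the depth comparison concrete, here is a toy calculation (my own illustration) that treats the lookup table for a k-bit function as a balanced multiplexer tree over all 2^k inputs:

```python
import math

def mux_tree_depth(num_entries: int) -> int:
    # Depth of a balanced multiplexer tree selecting one of num_entries stored
    # values: O(log n) in the table size n.
    return math.ceil(math.log2(num_entries))

def direct_depth_x_and_1() -> int:
    # f(x) = x & 1 just reads the low bit: constant depth, independent of width.
    return 1

for input_bits in (8, 16, 32, 64):
    table_entries = 2 ** input_bits
    print(f"{input_bits}-bit input | lookup table depth: "
          f"{mux_tree_depth(table_entries)} | direct circuit depth: "
          f"{direct_depth_x_and_1()}")
```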
I humbly suggest a variant family of priors: realizable-speed[4] priors. That is: penalize the time a computation takes when physically instantiated in some explicit physical model of computation, rather than the step count of an abstract machine.
A simplish example of this prior might be the following: assume a simple 3d circuit model where gates take O(1) time and space and wires take O(length) time and space, and neither gates nor wires can overlap.
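As a minimal sketch of scoring a toy circuit under this model (the unit gate delay, unit signal speed, and hand-picked layout are assumptions, and the non-overlap constraint is ignored):

```python
import functools
import math

# Each gate has a 3d position; gate delay is 1, wire delay is Euclidean length
# (signal speed 1). "Realizable time" = longest input-to-output path delay.
GATES = {
    "in0":  {"pos": (0.0, 0.0, 0.0), "inputs": []},
    "in1":  {"pos": (0.0, 2.0, 0.0), "inputs": []},
    "and0": {"pos": (1.0, 1.0, 0.0), "inputs": ["in0", "in1"]},
    "out":  {"pos": (5.0, 1.0, 0.0), "inputs": ["and0"]},
}

def wire_delay(src: str, dst: str) -> float:
    return math.dist(GATES[src]["pos"], GATES[dst]["pos"])

@functools.lru_cache(maxsize=None)
def realizable_time(name: str) -> float:
    # 1 unit for the gate itself, plus the slowest (input arrival + wire) path.
    arrivals = [realizable_time(src) + wire_delay(src, name)
                for src in GATES[name]["inputs"]]
    return 1.0 + (max(arrivals) if arrivals else 0.0)

print(realizable_time("out"))  # gate delays plus physical wire lengths
```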
This prior discourages giant lookup tables even in the limiting case, while retaining many of the advantages of speed priors.
(It does have the interesting issue that it says nothing about what happens outside the lightcone of the computation...)
(I realize here that I'm being very sloppy with asymptotic notation. I'm bouncing between bits / gates / elements / etc.)
[1] It also isn't true in the TM model variant where you include the depth of the FSM decode circuit instead of treating a single TM step as constant-time.
[2] Volume accessible within time t scales as O(t^3).
[3] Assuming power/heat scales linearly with #gates[6], max gates within a radius r scales with O(r^2), and so you get O(n^(1/2)) instead. Ditto, eventually you run into e.g. Bekenstein bound issues, although this isn't a particularly realistic concern.
[4] This is a terrible name. I am fully open to better ones.
[5] Which is a terrible idea given that galactic algorithms are a thing.
[6] Assuming it scales superlinearly you scale even worse...
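Spelling out how the bounds in [2] and [3] translate into the lookup times cited above, assuming unit signal speed and, for [3], heat dissipation limited by surface area:

$$n \propto t^3 \;\Rightarrow\; t_{\mathrm{lookup}} = O(n^{1/3}), \qquad n \propto r^2,\; t \propto r \;\Rightarrow\; t_{\mathrm{lookup}} = O(n^{1/2}).$$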
> This is not true in the circuit-depth complexity model. Remember that an arbitrary lookup table is O(log n) circuit depth. If the function I'm trying to memorize is f(x) = (x & 1), the fastest circuit is O(1), whereas a lookup table is O(log n).
Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.
I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?
> Certainly, I'm assuming that the intended function is not in O(log n), though I think that's a very mild assumption for any realistic task.
In time t, the brain (or any realistic agent) can do O(t) processing... but receives O(t) sensory data.
> I think the prior you're suggesting is basically a circuit size prior. How do you think it differs from that?
Realizable-speed priors are certainly correlated with circuit size priors to some extent, but there are some important differences:
Thanks to Paul Christiano, Mark Xu, Abram Demski, Kate Woolverton, and Beth Barnes for some discussions which informed this post.
In the ELK report, Paul, Mark, and Ajeya express optimism about penalizing computation time as a potentially viable way to select the direct translator over the human imitator:
I am more skeptical—primarily because I am more skeptical of the speed prior's ability to do reasonable things in general. That being said, the speed prior definitely has a lot of nice things going for it, and I do think it's worth taking a careful look at both the good and the bad that the speed prior has to offer. Conceptually, what we want to pay attention to when evaluating a prior from an AI safety perspective is threefold: it needs to favor good models over bad models (e.g. the direct translator over the human imitator), it needs to be competitive to implement, and it needs to favor models with good generalization over models with bad generalization (e.g. the resulting models need to themselves be performance competitive).
Before I do that, however, an important preliminary: there are multiple different forms/types of speed priors, so when I say “the speed prior,” I really mean a class of priors including:
- the time complexity (number of steps) of the model when run as a program on a Turing machine,
- the depth of the circuit implementing the model, and
- the size of the circuit implementing the model.
The basic structure of this post will be a dialog of sorts between a pro-speed-prior and an anti-speed-prior perspective. I'll start with some of the arguments in favor of the speed prior and how the anti-speed-prior perspective might respond, then give some arguments against the speed prior and how the pro-speed-prior perspective might respond.
Why you should love the speed prior
Why you should hate the speed prior
Conclusion?
I don't really have a single, unified conclusion to take away from all of this. Like I said at the beginning, I think I tend towards skepticism of the speed prior's ability to solve AI safety problems, at least singlehandedly, but I can't dismiss it completely and I think there are clearly strong and compelling reasons to like it. I do feel like moving in the direction of speed bias is likely to increase safety all things considered, though I also feel like there's a reasonable chance that doing so might also reduce competitiveness—in which case it's very unclear to me if that's where we want to place our alignment tax.