Yeah. I think this sort of thing is why Eliezer thinks we're doomed.
Hmm, interesting...but wasn't he more optimistic a few years ago, when his plan was still "pull off a pivotal act with a limited AI"? I thought the thing that made him update towards doom was the apparent difficulty of safely making even a limited AI, plus shorter timelines.
The other gestured-at example I've heard is "upload aligned people who think hard for 1000 subjective years and hopefully figure something out."
Ah, that actually seems like it might work. I guess the problem is that an A...
Oh, melting the GPUs would not actually be a pivotal act
Well yeah, that's my point. It seems to me that any pivotal act worthy of the name would essentially require the AI team to become an AGI-powered world government, which seems pretty darn difficult to pull off safely. The superpowered-AI-propaganda plan falls under this category. The long-lasting nanomachines idea is cute, but I bet people would just figure out ways to evade the nanomachines' definition of 'GPU'.
...Note that these aren't intended to be very good/realistic suggestions, they're just m
it's much more likely that someone could actually perform a unilateral pivotal act; it is a far easier problem, even after accounting for the problems the OP mentions in Part 1.
What I've never understood about the pivotal act plan is exactly what the successful AGI team is supposed to do after melting the GPUs or whatever. Every government on Earth will now consider them their enemy; they will immediately be destroyed unless they can defend themselves militarily, and then countries will simply rebuild the GPU factories and continue on as before (except now i...
Oh, melting the GPUs would not actually be a pivotal act. There would need to be some way to prevent new GPUs from being built in order for it to be a pivotal act.
Military capability is not strictly necessary; a pivotal act need not piss off world governments. AGI-driven propaganda, for instance, might avoid that.
Alternatively, an AGI could produce nanomachines which destroy GPUs, are extremely hard to eradicate, but otherwise don't do much of anything.
(Note that these aren't intended to be very good/realistic suggestions, they're just meant to point to different dimensions of the possibility space.)
Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.
For reasons elaborated upon in this post and its comments, I'm kinda skeptical of these results. Basically, the claims made are:
(A) That the parameter->function map is "biased towards simple functions". It's important to distinguish simple --> large volume from large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import).
(B) SGD being equivalent to random selection. The evidenc...
There's an important distinction[1] to be made between these two claims:
A) Every function with large volume in parameter-space is simple
B) Every simple function has a large volume in parameter space
For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistake...
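To spell out how claim B (simple ⇒ large volume/weight) does the work here, the standard dominance argument for the Solomonoff mixture M is roughly (writing K(ν) for the length of a shortest program computing the predictor ν):

$$M(x_{1:n}) \;=\; \sum_{\nu} 2^{-K(\nu)}\,\nu(x_{1:n}) \;\ge\; 2^{-K(\mu)}\,\mu(x_{1:n}) \quad\Longrightarrow\quad -\log_2 M(x_{1:n}) \;\le\; -\log_2 \mu(x_{1:n}) + K(\mu)$$

So M's cumulative log-loss exceeds that of any computable predictor μ by at most K(μ) bits regardless of sequence length, which is what underlies the "boundedly many mistakes" statement. Note that claim A isn't needed for this direction.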
Hmm, so regarding the linear combinations, it's true that there are some linear combinations that will change by O(1) in the large-width limit -- just use the vector of partial derivatives of the output at some particular input as the coefficients; this sum will change by the amount that the output function moves during the regression. Indeed, I suspect (but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT ...
I don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together N units, the magnitude of the sum of their original activations will with high probability be O(√N), dwarfing the O(1) change due to the change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel (which determines learning in this regime) doesn't change by definition.
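Just to sanity-check those magnitudes numerically (a toy, not a trained net: independent Gaussians stand in for the original activations and for per-unit changes of size O(1/√N)):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in [100, 1_000, 10_000]:
    act = rng.normal(0, 1, size=N)                  # original activations, O(1) each
    delta = rng.normal(0, 1 / np.sqrt(N), size=N)   # per-unit change, O(1/sqrt(N)) each
    coeffs = rng.choice([-1.0, 1.0], size=N)        # a generic O(1)-coefficient combination
    print(N, abs(coeffs @ act), abs(coeffs @ delta))
```

The first column grows like √N while the second stays O(1), which is the sense in which the change gets dwarfed for a generic linear combination (the special combinations are the ones whose coefficients are correlated with the changes).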
What empirical results do you think are being co...
The result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper, and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.
The asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(1/√N). So the gradient of the output w.r.t. itself is obviously 1, then the gradient of the output w.r.t. each neuron in the preceding layer is O(1/√N), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O(1/√N) again. So tak...
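Here's a quick numerical illustration of that induction (random weights of magnitude O(1/√N); nonlinearities omitted since, at initialization, they only affect constants rather than the N-scaling):

```python
import numpy as np

def mean_grad_magnitudes(N, depth=3, seed=0):
    """Typical entry size of d(output)/d(activations) at each hidden layer,
    for a random net with weight entries of magnitude O(1/sqrt(N))."""
    rng = np.random.default_rng(seed)
    g = rng.normal(0, 1 / np.sqrt(N), size=N)      # gradient w.r.t. the last hidden layer
    mags = [np.abs(g).mean()]
    for _ in range(depth):                         # backprop through each earlier layer
        W = rng.normal(0, 1 / np.sqrt(N), size=(N, N))
        g = W.T @ g                                # sum of N terms of size O(1/N) each
        mags.append(np.abs(g).mean())
    return mags

for N in [100, 1_000, 10_000]:
    print(N, [f"{m:.2e}" for m in mean_grad_magnitudes(N)])
```

Every layer's typical gradient entry shrinks like 1/√N as the width grows, rather than growing or collapsing with depth.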
Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you'd need assumptions about how quickly the distribution changes, but then it's plausible that you could get high probability bounds based on the most recent performance. For small datasets, the PAC-Bayes bounds suffer because they scale as √(KL/n). (I may edit the post to be clearer about this)
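For reference, one standard form of the bound I have in mind (Maurer/McAllester-style, after the usual Pinsker relaxation; i.i.d. sample of size n, prior P, posterior Q, confidence 1-δ):

$$\mathbb{E}_{h\sim Q}[L(h)] \;\le\; \mathbb{E}_{h\sim Q}[\hat{L}(h)] + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta)}{2n}}$$

The gap term blows up as n shrinks, and the whole derivation leans on the sample being i.i.d., which is exactly what breaks in the non-IID case.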
Agreed that analyzing how the bounds change under different conditions could be insightful though. Ultimately I suspect that...
The neural tangent kernel guys have a paper where they give a heuristic argument explaining the double descent curve (in the number of parameters) using the NTK.
Nice survey. The result about double descent even occurring in dataset size is especially surprising.
Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD. The sharp minima they construct with this property are also rather unnatural: essentially they multiply the weights of layer 1 by a constant and divide the weights of layer 2 by the same constant. The piecewise linearity of Re
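The rescaling trick relies on the positive homogeneity of ReLU (relu(αz) = α·relu(z) for α > 0), so the network computes exactly the same function while the loss surface around the rescaled weights becomes arbitrarily sharp along some directions. A minimal numpy check of the invariance (toy random weights, one hidden layer):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W1, b1, W2, b2):
    # One-hidden-layer ReLU network.
    return W2 @ relu(W1 @ x + b1) + b2

rng = np.random.default_rng(0)
d, h = 5, 16
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(1, h)), rng.normal(size=1)
x = rng.normal(size=d)

alpha = 1e3  # scale layer 1 up and layer 2 down by the same factor
same = np.allclose(net(x, W1, b1, W2, b2),
                   net(x, alpha * W1, alpha * b1, W2 / alpha, b2))
# True: identical function, very different curvature of the loss surface.
print(same)
```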
...No worries, I also missed the earlier posts when I wrote mine. There's lots of stuff on this website.
I endorse your rephrasing of example 1. I think my position is that it's just not that hard to create a "self-consistent probability distribution". For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated t...
Example 1 basically seems to be the problem of output diversity in generative models. This can be a problem, but there are ways around it -- e.g. instead of outputting the highest-probability individual sequence, which will certainly look "manipulative" as you say, sample from the implied distribution over sequences. Then the sentence involving "pyrite" will be output with probability proportional to how likely the model thinks "pyrite" is on its own, disregarding subsequent tokens.
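Here's a minimal sketch of that contrast, with a hypothetical next_token_probs standing in for the trained model's conditional distribution (a real RNN/transformer forward pass would go there instead):

```python
import numpy as np

VOCAB = ["the", "rock", "is", "pyrite", "gold", "."]

def next_token_probs(prefix):
    # Hypothetical stand-in for a trained model's p(next token | prefix).
    # Deterministic in the prefix so the example is self-contained.
    local = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = local.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()

def greedy_decode(steps=5):
    # Always emit the single most likely token: collapses onto one mode.
    out = []
    for _ in range(steps):
        out.append(VOCAB[int(np.argmax(next_token_probs(out)))])
    return out

def sample_decode(steps=5, seed=0):
    # Ancestral sampling: each token appears with the probability the
    # model actually assigns it, so "pyrite" shows up at its modeled rate.
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(steps):
        p = next_token_probs(out)
        out.append(VOCAB[rng.choice(len(VOCAB), p=p)])
    return out

print("greedy: ", greedy_decode())
print("sampled:", sample_decode())
```

Greedy decoding picks out one "most likely" sequence; ancestral sampling reproduces the model's implied distribution over whole sequences, which is the property doing the work in the argument above.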
For example 2, I wrote a s...
Yeah, if you train the algorithm by random sampling, the effect I described will take place. The same thing will happen if you use an RL algorithm to update the parameters instead of an unsupervised learning algorithm (though it seems willfully perverse to do so -- you're throwing away a lot of the structure of the problem, so training will be much slower).
I also just found an old comment which makes the exact same argument I made here. (Though it now seems to me that argument is not necessarily correct!)
Reflective Oracles are a bit of a weird case because their 'loss' is more like a 0/1 loss than a log loss, so all of the minima are exactly the same (if we take a sample of 100,000 universes to score them, the difference is merely incredibly small instead of 0). I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points fo...
Is there a reason you think a reflective oracle (or equivalent) can't just be selected "arbitrarily", and will likely be selected to maximize some score?
The gradient descent is not being done over the reflective oracles, it's being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimize...
If we assume Sleeping Beauty has lots of information, we might expect that the shortest matching program will look like a simulation of physical law plus a "bridging law" that, given this simulation, tells you what symbols get written to the tape.
I agree. I still think that the probabilities would be closer to 1/2, 1/4, 1/4. The bridging law could look like this: search over the universe for compact encodings of my memories so far, then see what is written next onto this encoding. In this case, it would take no more bits to specify waking up on ...
I still don't see how you're getting those probabilities. Say it takes 1 bit to describe the outcome of the coin toss, and assume it's easy to find all the copies of yourself (i.e. your memories) in different worlds. Then you need:
1 bit to specify if the coin landed heads or tails
If the coin landed tails, you need 1 more bit to specify if it's Monday or Tuesday.
So AIXI would give these scenarios P(HM)=0.50, P(TM)=0.25, P(TT)=0.25.
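Making the counting explicit (each hypothesis gets weight 2^(-bits), then normalize):

$$P(\mathrm{HM}) = \frac{2^{-1}}{2^{-1}+2^{-2}+2^{-2}} = 0.5, \qquad P(\mathrm{TM}) = P(\mathrm{TT}) = \frac{2^{-2}}{2^{-1}+2^{-2}+2^{-2}} = 0.25$$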
Well, it COULD be the case that the K-complexity of the memory-erased AIXI environment is lower, even when it learns that this happened. The reason for this is that there could be many possible past AIXIs who have their memory erased/altered and end up in the same subjective situation. Then the memory-erasure hypothesis can use the lowest-K-complexity AIXI who ends up with these memories. As the AIXI learns more, it can gradually piece together which of the potential past AIXIs it actually was, and the K-complexity will go back up again.
EDIT: Oh,...
Incidentally, you can use the same idea to have RO-AIXI do anthropic reasoning/bargaining about observers that are in a broader reference class than 'exact same sense data', by making the mapping O -> O' some sort of coarse-graining.
" P(HM)=0.49, P(TM)=0.49, P(TT)=0.2 " -- Are these supposed to be mutually exclusive probabilities?
" There is a turing machine that writes the memory-wiped contents to tape all in one pass. " - Yes, this is basically what I said. ('environment' above could include 'the world' + bridging laws). But you also need to alter the reward structure a bit to make it match our usual intuition of what 'memory-wiping' means, and this has significance for decision theory.
Consider, if your own memory was erased, you wo...
I think the framework of RO-AIXI can be modified pretty simply to include memory-tampering.
Here's how you do it. Say you have an environment E and an RO-AIXI A running in it. You have run the AIXI for a number of steps, and it has a history of observations O. Now we want to alter its memory to have a history of observations O'. This can be implemented in the environment as follows:
1. Create a new AIXI A', with the same reward function as the original and no memories. Feed it the sequence of observations O'.
2. Run A' in place of A for the remainder of E. In...
The weight could be something like the algorithmic probability over strings (https://en.wikipedia.org/wiki/Algorithmic_probability), in which case universes like ours with a concise description would get a fairly large chunk of the weight.
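Concretely, for a universal prefix machine U, summing over programs p that output the string x:

$$m(x) \;=\; \sum_{p\,:\,U(p)=x} 2^{-|p|} \;=\; 2^{-K(x)+O(1)}$$

So a universe with a short description (small K) gets an exponentially larger share of the weight than one that can only be specified by a long program.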
What about neighboring Everett branches where humanity succeeds at alignment? If you think alignment isn't completely impossible, it seems such branches should have at least roughly comparable weight to branches where we fail, so trade could be possible.
my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!
and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
there's also an issue where it's n...