All of interstice's Comments + Replies

There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks?

What about neighboring Everett branches where humanity succeeds at alignment? If you think alignment isn't completely impossible, it seems such branches should have at least roughly comparable weight to branches where we fail, so trade could be possible.

my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

there's also an issue where it's n... (read more)

Yeah. I think this sort of thing is why Eliezer thinks we're doomed

Hmm, interesting...but wasn't he more optimistic a few years ago, when his plan was still "pull off a pivotal act with a limited AI"? I thought the thing that made him update towards doom was the apparent difficulty of safely making even a limited AI, plus shorter timelines.

other gestured-example I've heard is "upload aligned people who think hard for 1000 subjective years and hopefully figure something out."

Ah, that actually seems like it might work. I guess the problem is that an A... (read more)

5Adele Lopez
It seems relatively plausible that you could use a Limited AGI to build a nanotech system capable of uploading a diverse assortment of (non-brain, or maybe only very small brains) living tissue without damaging them, and that this system would learn how to upload tissue in a general way. Then you could use the system (not the AGI) to upload humans (tested on increasingly complex animals). It would be a relatively inefficient emulation, but it doesn't seem obviously doomed to me. Probably too late once hardware is available to do this though.

Oh, melting the GPUs would not actually be a pivotal act

Well yeah, that's my point. It seems to me that any pivotal act worthy of the name would essentially require the AI team to become an AGI-powered world government, which seems pretty darn difficult to pull off safely. The superpowered-AI-propaganda plan falls under this category. The long-lasting nanomachines idea is cute, but I bet people would just figure out ways to evade the nanomachines' definition of 'GPU'.

Note that these aren't intended to be very good/realistic suggestions, they're just m

... (read more)
2Raymond Arnold
Yeah. I think this sort of thing is why Eliezer thinks we're doomed – getting humanity to coordinate collectively seems doomed (i.e. see Gain of Function Research), and there are no weak pivotal acts that aren't basically impossible to execute safely. The nanomachine GPU-melting pivotal act is meant to be a gesture at the difficulty / power level, not an actual working example. The other gestured-example I've heard is "upload aligned people who think hard for 1000 subjective years and hopefully figure something out." I've heard someone from MIRI argue that one is also unworkable but wasn't sure on the exact reasons.

it's much more likely that someone could actually perform a unilateral pivotal act; it is a far easier problem, even after accounting for the problems the OP mentions in Part 1.

What I've never understood about the pivotal act plan is exactly what the successful AGI team is supposed to do after melting the GPUs or whatever. Every government on Earth will now consider them their enemy; they will immediately be destroyed unless they can defend themselves militarily, and then countries will simply rebuild the GPU factories and continue on as before (except now i... (read more)

Oh, melting the GPUs would not actually be a pivotal act. There would need to be some way to prevent new GPUs from being built in order for it to be a pivotal act.

Military capability is not strictly necessary; a pivotal act need not necessarily piss off world governments. AGI-driven propaganda, for instance, might avoid that.

Alternatively, an AGI could produce nanomachines which destroy GPUs, are extremely hard to eradicate, but otherwise don't do much of anything.

(Note that these aren't intended to be very good/realistic suggestions, they're just meant to point to different dimensions of the possibility space.)

Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.

1Michaël Trazzi
The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, an analogy would be if nootropics were legal and had no negative side effects, a single company were giving "beta access" (for now) to a new nootropic in unlimited quantities at no cost to a market of tens of millions of users, the data from using this nootropic were collected by the company to improve the product, there were actually 100k peer-reviewed publications per year in the field of nootropics, and most of the innovation behind the tech came from a >100B-parameter model trained on open-source nootropic chemistry instructions. Would such advancements be evidence for something major we're not certain about (e.g. a high-bandwidth brain-computer interface) or just evidence for increased productivity that would be reinjected into more nootropic investments?

For reasons elaborated upon in this post and its comments, I'm kinda skeptical of these results. Basically the claims made are

(A) That the parameter->function map is "biased towards simple functions". It's important to distinguish simple --> large volume and large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import)

(B) SGD being equivalent to random selection. The evidenc... (read more)

2Alex Flint
Thanks for these pointers. Is it that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume? Well, they do take many samples from what they call P_SGD and P_B and compare these as distributions, so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest that they were just comparing max performance by SGD to max performance by NNGP. Re breaking down on more difficult problems: yes, I agree, we will have to wait and see and we shouldn't be too optimistic given the paper you point to in your own post. It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity. I would be very interested in your take on other (non-Mingard) research on this overall question if you have time to jot down some notes.

There's an important distinction[1] to be made between these two claims:

A) Every function with large volume in parameter-space is simple

B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a 'simplicity prior', you want both claims to hold. This is what lets us derive bounds like 'Solomonoff induction matches the performance of any computable predictor', since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they'll be picked out after boundedly many mistake... (read more)
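As a rough formal gloss on the distinction being drawn (not from the original comment; these are standard Solomonoff/Kolmogorov facts, with constants elided):

```latex
% Simple --> large volume (the direction Solomonoff-style bounds need):
% any computable predictor \mu gets weight at least 2^{-K(\mu)} under the
% universal prior M, which bounds M's cumulative excess log-loss:
M(x_{1:n}) \ge 2^{-K(\mu)}\,\mu(x_{1:n})
\;\Longrightarrow\;
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[
  D_{\mathrm{KL}}\!\big(\mu(\cdot \mid x_{<t}) \,\|\, M(\cdot \mid x_{<t})\big)
\right] \le K(\mu)\ln 2.

% Large volume --> simple holds for essentially any computable map P,
% so on its own it says little about a simplicity bias:
K(f) \le \log_2 \tfrac{1}{P(f)} + K(P) + O(1).
```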

3Daniel Kokotajlo
Ah, OK. Interesting, thanks. Would you agree with the following view: "The NTK/GP stuff has neural nets implementing a "pseudosimplicity prior" which is maybe also a simplicity prior but might not be, the evidence is unclear. A pseudosimplicity prior is like a simplicity prior except that there are some important classes of Kolmogorov-simple functions that don't get high prior / high measure." Which would you say is more likely: (a) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't either, or (b) the NTK/GP stuff is indeed not universally data efficient, and thus modern neural nets aren't well-characterized by the NTK/GP stuff.

Hmm, so regarding the linear combinations, it's true that there are some linear combinations that will change by O(1) in the large-width limit -- just use the vector of partial derivatives of the output at some particular input; this sum will change by the amount that the output function moves during the regression. Indeed, I suspect (but don't have a proof) that these particular combinations will span the space of linear combinations that change non-trivially during training. I would dispute "we expect most linear combinations to change" though -- the CLT ... (read more)

I don't think taking linear combinations will help, because adding terms to the linear combination will also increase the magnitude of the original activation vector -- e.g. if you add together N units, the magnitude of the sum of their original activations will with high probability be O(√N), dwarfing the O(1) change due to change in the activations. But regardless, it can't help with transfer learning at all, since the tangent kernel (which determines learning in this regime) doesn't change by definition.
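A quick numerical sketch of that scaling argument (not from the thread; it assumes independent Gaussian stand-ins for the O(1) activations and their O(1/√N) changes):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1_000, 100_000):
    activations = rng.normal(size=N)             # original activations, each O(1)
    changes = rng.normal(size=N) / np.sqrt(N)    # per-unit changes of size O(1/sqrt(N)), signs assumed random
    print(
        f"N={N:>7}: |sum of activations| ≈ {abs(activations.sum()):.1f} "
        f"(order sqrt(N) = {np.sqrt(N):.0f}), "
        f"|sum of changes| ≈ {abs(changes.sum()):.2f} (stays O(1))"
    )
```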

What empirical results do you think are being co... (read more)

5johnswentworth
They wouldn't be random linear combinations, so the central limit theorem estimate wouldn't directly apply. E.g. this circuit transparency work basically ran PCA on activations. It's not immediately obvious to me what the right big-O estimate would be, but intuitively, I'd expect the PCA to pick out exactly those components dominated by change in activations - since those will be the components which involve large correlations in the activation patterns across data points (at least that's my intuition). I think this claim is basically wrong: There's a very big difference between "no change to first/second order" and "no change". Even in the limit, we do expect most linear combinations of the activations to change. And those are exactly the changes which would potentially be useful for transfer learning. And the tangent kernel not changing does not imply that transfer learning won't work, for two reasons: starting at a better point can accelerate convergence, and (probably more relevant) the starting point can influence the solution chosen when the linear system is underdetermined (which it is, if I understand things correctly). I do think the empirical results pretty strongly suggest that the NTK/GP model captures everything important about neural nets, at least in terms of their performance on the original task. If that's true, and NTKs can't be used for transfer learning, then that would imply that transfer learning in normal nets works for completely different reasons from good performance on the original task, and that good performance on the original task has nothing to do with learning features. Those both strike me as less plausible than these proofs about "NTK not learning features" being misinterpreted. (I also did a quick google search for transfer learning with NTKs. I only found one directly-relevant study, which is on way too small and simple a system for me to draw much of a conclusion from it, but it does seem to have worked.) BTW, thanks for hum

The result that NTK does not learn features in the large N limit is not in dispute at all -- it's right there on page 15 of the original NTK paper, and indeed holds after arbitrarily many steps of backprop. I don't think that there's really much room for loopholes in the math here. See Greg Yang's paper for a lengthy proof that this holds for all architectures. Also worth noting that when people 'take the NTK limit' they often don't initialize an actual net at all, they instead use analytical expressions for what the inner product of the gradients would be at N=infinity to compute the kernel directly.

6johnswentworth
Alright, I buy the argument on page 15 of the original NTK paper. I'm still very skeptical of the interpretation of this as "NTK models can't learn features". In general, when someone proves some interesting result which seems to contradict some combination of empirical results, my default assumption is that the proven result is being interpreted incorrectly, so I have a high prior that that's what's happening here. In this case, it could be that e.g. the "features" relevant to things like transfer learning are not individual neuron activations - e.g. IIRC much of the circuit interpretability work involves linear combinations of activations, which would indeed circumvent this theorem. This whole class of concerns would be ruled out by empirical results - e.g. experimental evidence on transfer learning with NTKs, or someone applying the same circuit interpretability techniques to NTKs which are applied to standard nets.

The asymmetry between the output function and the intermediate neuron functions comes from backprop -- from the fact that the gradients are backprop-ed through weight matrices with entries of magnitude O(1/√N). So the gradient of the output w.r.t. itself is obviously 1, then the gradient of the output w.r.t. each neuron in the preceding layer is O(1/√N), since you're just multiplying by a vector with those entries. Then by induction all other preceding layers' gradients are the sum of N random things of size O(1/N), and so are of size O(1/√N) again. So tak... (read more)
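A small numerical sanity check of that induction (a sketch, not from the thread; Gaussian weight matrices with 1/√N-scale entries are used as stand-ins, and the nonlinearity's derivative is ignored since it only affects constants):

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 2_000, 5

# Gradient of the output w.r.t. the last hidden layer: entries of magnitude O(1/sqrt(N)).
grad = rng.normal(scale=1 / np.sqrt(N), size=N)
for layer in range(1, depth + 1):
    W = rng.normal(scale=1 / np.sqrt(N), size=(N, N))  # weight entries of magnitude O(1/sqrt(N))
    grad = W.T @ grad                                   # backprop through one more layer
    rms = np.sqrt(np.mean(grad ** 2))
    print(f"{layer} layers back: per-neuron gradient RMS ≈ {rms:.4f}  (1/sqrt(N) = {1 / np.sqrt(N):.4f})")
```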

2johnswentworth
Ok, that's at least a plausible argument, although there are some big loopholes. Main problem which jumps out to me: what happens after one step of backprop is not the relevant question. One step of backprop is not enough to solve a set of linear equations (i.e. to achieve perfect prediction on the training set); the relevant question is what happens after one step of Newton's method, or after enough steps of gradient descent to achieve convergence. What would convince me more is an empirical result - i.e. looking at the internals of an actual NTK model, trying the sort of tricks which work well for interpreting normal NNs, and seeing how well they work. Just relying on proofs makes it way too easy for an inaccurate assumption to sneak in - like the assumption that we're only using one step of backprop. If anyone has tried that sort of empirical work, I'd be interested to hear what it found.

Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you'd need assumptions about how quickly the distribution changes, but then it's plausible that you could get high probability bounds based on the most recent performance. For small datasets, the PAC-Bayes bounds suffer because they scale as . (I may edit the post to be clearer about this)
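For reference, one standard (McAllester/Maurer-style) form of the PAC-Bayes bound, which is where both the i.i.d. assumption and the dependence on the dataset size n enter (stated roughly; the exact constants aren't important here):

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all posteriors Q over hypotheses (P is the prior):
\mathbb{E}_{f \sim Q}\!\left[L(f)\right]
  \;\le\;
\mathbb{E}_{f \sim Q}\!\left[\hat{L}_n(f)\right]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}.
```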

Agreed that analyzing how the bounds change under different conditions could be insightful though. Ultimately I suspect that... (read more)

The neural tangent kernel guys have a paper where they give a heuristic argument explaining the double descent curve (in number of parameters) using the NTK.

Nice survey. The result about double descent even occurring in dataset size is especially surprising.

Regarding the 'sharp minima can generalize' paper, they show that there exist sharp minima with good generalization, not flat minima with poor generalization, so they don't rule out flatness as an explanation for the success of SGD. The sharp minima they construct with this property are also rather unnatural: essentially they multiply the weights of layer 1 by a constant and divide the weights of layer 2 by the same constant. The piecewise linearity of Re

... (read more)
2Evan Hubinger
Ah—thanks for the summary. I hadn't fully read that paper yet, though I knew it existed and so I figured I would link it, but that makes sense. Seems like in that case the flat vs. sharp minima hypothesis still has a lot going for it—not sure how that interacts with the lottery tickets hypothesis, though.

No worries, I also missed the earlier posts when I wrote mine. There's lots of stuff on this website.

I endorse your rephrasing of example 1. I think my position is that it's just not that hard to create a "self-consistent probability distribution". For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated t... (read more)
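To spell out the sense in which such a sequence model already "implicitly represents a probability distribution over sequences" (this is just the standard autoregressive factorization; its negative log is exactly the training loss):

```latex
p_\theta(x_{1:T}) \;=\; \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
-\log p_\theta(x_{1:T}) \;=\; -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).
```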

1Steve Byrnes
Thanks, that's helpful! I'll have to think about the "self-consistent probability distribution" issue more, and thanks for the links. (ETA: Meanwhile I also added an "Update 2" to the post, offering a different way to think about this, which might or might not be helpful.) Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title "Why won't it try to get more predictable data?"). My argument here is not assuming there's a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent. The ingredients are things like "Look for and learn patterns in all accessible data", which includes both low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process ("After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter"). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about ("sneakers are a type of shoe", or more problematically, "my thought processes resemble the associative memory of an AGI"), and cataloging these transformations when they're found. Stuff like that. So, "make smart hypotheses about one's own embodied situation" is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, "make smart hypotheses about one's own embodied situation" would just be

Example 1 basically seems to be the problem of output diversity in generative models. This can be a problem in generative models, but there are ways around it. e.g. instead of outputting the highest-probability individual sequence, which will certainly look "manipulative" as you say, sample from the implied distribution over sequences. Then the sentence involving "pyrite" will be output with probability proportional to how likely the model thinks "pyrite" is on its own, disregarding subsequent tokens.
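A minimal sketch of the contrast being drawn (not from the original comment; `next_token_probs` is a hypothetical stand-in for whatever model is used, returning a probability vector over the next token given the prefix):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sequence(next_token_probs, length):
    """Ancestral sampling: draw each token from the model's conditional distribution,
    so a continuation involving 'pyrite' appears with the probability the model assigns it."""
    seq = []
    for _ in range(length):
        p = next_token_probs(seq)
        seq.append(int(rng.choice(len(p), p=p)))
    return seq

def greedy_sequence(next_token_probs, length):
    """Greedy decoding, a crude stand-in for 'output the single highest-probability
    sequence': always emit the most likely next token."""
    seq = []
    for _ in range(length):
        p = next_token_probs(seq)
        seq.append(int(np.argmax(p)))
    return seq
```

With `next_token_probs` returning a trained model's softmax output for the prefix, the first function implements the "sample from the implied distribution" fix, while the second exhibits the failure mode described above.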

For example 2, I wrote a s... (read more)

3Steve Byrnes
Thank you for the links!! Sorry I missed them! I'm not sure I understand your comments though and want to clarify: I'm going to try to rephrase what you said about example 1. Maybe the text in any individual journal article about pyrite is perplexing, but given that the system expects some article about pyrite there, it should ramp the probabilities of individual articles up or down such that the total probability of seeing a journal article about pyrite, conditional on the answer "pyrite", is 100%. (By the same token, "The following is a random number: 2113164" is, in a sense, an unsurprising text string.) I agree with you that a system that creates a sensible, self-consistent probability distribution for text strings would not have a problem with example 1 if we sample from that distribution. (Thanks.) I am concerned that we will build a system with heuristic-guided search processes, not self-consistent probability estimates, and that this system will have a problem with example 1. After all, humans are subject to the conjunction fallacy etc., I assume AGIs will be too, right? Unless we flag this as a critical safety requirement and invent good techniques to ensure it. (I updated the post in a couple places to clarify this point, thanks again.) For gradient descent, yes they are "only updated towards what they actually observe", but they may "observe" high-level abstractions and not just low-level features. It can learn about a new high-level context in which the low-level word sequence statistics would be very different than when superficially-similar text appeared in the past. So I don't understand how you're ruling out example 2 on that basis. I mostly agree with what you say about fixed points in principle, but with the additional complication that the system's beliefs may not reflect reality, especially if the beliefs come about through abstract reasoning (in the presence of imperfect information) rather than trial-and-error. If the goal is "No manipulativ

Yeah, if you train the algorithm by random sampling, the effect I described will take place. The same thing will happen if you use an RL algorithm to update the parameters instead of an unsupervised learning algorithm (though it seems willfully perverse to do so -- you're throwing away a lot of the structure of the problem by doing this, so training will be much slower)

I also just found an old comment which makes the exact same argument I made here. (Though it now seems to me that argument is not necessarily correct!)

Reflective Oracles are a bit of a weird case because their 'loss' is more like a 0/1 loss than a log loss, so all of the minima are exactly the same (if we take a sample of 100000 universes to score them, the difference is merely incredibly small instead of 0). I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points fo... (read more)

Is there a reason you think a reflective oracle (or equivalent) can't just be selected "arbitrarily", and will likely be selected to maximize some score?

The gradient descent is not being done over the reflective oracles, it's being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimize... (read more)

3Jessica Taylor
I think the fixed point finder won't optimize the fixed point for minimizing expected log loss. I'm going to give a concrete algorithm and show that it doesn't exhibit this behavior. If you disagree, can you present an alternative algorithm? Here's the algorithm. Start with some oracle (not a reflective oracle). Sample ~1000000 universes based on this oracle, getting 1000000 data points for what the reflective oracle outputs. Move the oracle 1% of the way from its current position towards the oracle that would answer queries correctly given the distribution over universes implied by the data points. Repeat this procedure a lot of times (~10,000). This procedure is similar to gradient descent. Here's an example universe: M:=if O(M,0.3)=1 then flip(0.9) else 0 Note the presence of two reflective oracles that are stable equilibria: one where P(O(M,0.3)=1)=0, and one where P(O(M,0.3)=1)=1. Notice that the first has lower expected log loss than the second. Let's parameterize oracles by numbers in [0,1] representing P(O(M,0.3)=1) (since this is the only relevant query). Start with oracle 0.5. If we sample 1000000 universes, about 45% of them have outcome 1. So, based on these data points, P(M())=0.45, so the oracle based on these data points will say P(O(M,0.3)=1)=1, i.e. it is parameterized by 1. So we move our current oracle (0.5) 1% of the way towards the oracle 1, yielding oracle 0.505. We repeat this a bunch of times, eventually getting an oracle parameterized by a number very close to 1. So, this procedure yields an oracle with suboptimal expected log loss. It is not the case that the fixed point finder minimizes expected log loss. The neural net case is different, but not that much; it would give the same answer in this particular case, since the model can just be parameterized by a single real number.
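For concreteness, here is a minimal simulation of the procedure Jessica describes, on her example universe (a sketch only; the sample and step counts are scaled down from her ~1,000,000 / ~10,000 figures for speed):

```python
import random

def run_universe(q):
    # M := if O(M, 0.3) = 1 then flip(0.9) else 0, where q parameterizes P(O(M, 0.3) = 1)
    if random.random() < q:       # oracle answers 1
        return 1 if random.random() < 0.9 else 0
    return 0

q = 0.5                                              # start from a non-reflective oracle
for _ in range(2_000):
    samples = [run_universe(q) for _ in range(2_000)]
    p_m = sum(samples) / len(samples)                # empirical P(M() = 1) under the current oracle
    target = 1.0 if p_m > 0.3 else 0.0               # the oracle that answers this query correctly
    q += 0.01 * (target - q)                         # move 1% of the way toward it

print(round(q, 3))  # ≈ 1.0: converges to the higher-log-loss fixed point, not the q = 0 one
```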

If we assume Sleeping Beauty has lots of information, we might expect that the shortest matching program will look like a simulation of physical law plus a "bridging law" that, given this simulation, tells you what symbols get written to the tape

I agree. I still think that the probabilities would be closer to 1/2, 1/4, 1/4. The bridging law could look like this: search over the universe for compact encodings of my memories so far, then see what is written next onto this encoding. In this case, it would take no more bits to specify waking up on ... (read more)

I still don't see how you're getting those probabilities. Say it takes 1 bit to describe the outcome of the coin toss, and assume it's easy to find all the copies of yourself (i.e. your memories) in different worlds. Then you need:

1 bit to specify if the coin landed heads or tails

If the coin landed tails, you need 1 more bit to specify if it's Monday or Tuesday.

So AIXI would give these scenarios P(HM)=0.50, P(TM)=0.25, P(TT)=0.25.
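In prefix-code terms, the same bit counting (just a restatement of the above):

```latex
P(\mathrm{HM}) \propto 2^{-1}, \quad
P(\mathrm{TM}) \propto 2^{-(1+1)}, \quad
P(\mathrm{TT}) \propto 2^{-(1+1)}
\;\Longrightarrow\;
(0.5,\ 0.25,\ 0.25)\ \text{after normalization.}
```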

1Charlie Steiner
I'm not thinking of it like specifying parts of the toy problem. I'm thinking of it as if for each of HM, TM, and TT, the observer is about to receive 2 bits that describe which situation they're in, and the only object that matters for the probability of each is the shortest program that reproduces all past observations plus the next 2 bits. If we assume Sleeping Beauty has lots of information, we might expect that the shortest matching program will look like a simulation of physical law, plus a "bridging law" that, given this simulation, tells you what symbols get written to the tape. It is in this context that it seems like HM and TM are equally complex - you're simulating the same chunk of universe and have what seems like (naively) a similar bridging law. It's only for Tuesday that you obviously need a different program to reproduce the data.

Well, it COULD be the case that the K-complexity of the memory-erased AIXI environment is lower, even when it learns that this happened. The reason for this is that there could be many possible past AIXIs who have their memory erased/altered and end up in the same subjective situation. Then the memory-erasure hypothesis can use the lowest-K-complexity AIXI who ends up with these memories. As the AIXI learns more, it can gradually piece together which of the potential past AIXIs it actually was, and the K-complexity will go back up again.

EDIT: Oh,... (read more)

Incidentally, you can use the same idea to have RO-AIXI do anthropic reasoning/bargaining about observers that are in a broader reference class than 'exact same sense data', by making the mapping O -> O' some sort of coarse-graining.

" P(HM)=0.49, P(TM)=0.49, P(TT)=0.2 " -- Are these supposed to be mutually exclusive probabilities?

" There is a turing machine that writes the memory-wiped contents to tape all in one pass. " - Yes, this is basically what I said. ('environment' above could include 'the world' + bridging laws). But you also need to alter the reward structure a bit to make it match our usual intuition of what 'memory-wiping' means, and this has significance for decision theory.

Consider, if your own memory was erased, you wo... (read more)

1Charlie Steiner
Oops, that should have been 0.02 :) Good point about caring for yourself even if you expect to lose the memory of (e.g.) the current hour. AIXI only cares about the tapes that are successors of the current one. Maybe expand the tape from write-only to also have some erasing operations? I think there are probably some other toy problems that illustrate this issue a lot better than Sleeping Beauty, where AIXI equating memory loss with death might not actually change its decisions much in the bet.

I think the framework of RO-AIXI can be modified pretty simply to include memory-tampering.

Here's how you do it. Say you have an environment E and an RO-AIXI A running in it. You have run the AIXI for a number of steps, and it has a history of observations O. Now we want to alter its memory to have a history of observations O'. This can be implemented in the environment as follows:

1. Create a new AIXI A', with the same reward function as the original and no memories. Feed it the sequence of observations O'.

2. Run A' in place of A for the remainder of E. In... (read more)
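A rough sketch of the bookkeeping in that construction (RO-AIXI itself isn't computable, so `Agent` here is a hypothetical stand-in exposing only the pieces the construction needs: a reward function and an observation history):

```python
class Agent:
    """Toy stand-in for an RO-AIXI instance."""
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn
        self.history = []          # past observations O
    def observe(self, obs):
        self.history.append(obs)
    def act(self):
        raise NotImplementedError("placeholder for the (uncomputable) RO-AIXI policy")

def tamper_memory(original, fake_history):
    """Step 1: create a fresh agent A' with the same reward function and feed it the
    altered observation sequence O'. Step 2 (done by the environment): route all future
    observations and action requests to A' in place of the original A."""
    replacement = Agent(original.reward_fn)
    for obs in fake_history:
        replacement.observe(obs)
    return replacement
```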

1Diffractor
Not quite. If taking bet 9 is a prerequisite to taking bet 10, then AIXI won't take bet 9, but if bet 10 gets offered whether or not bet 9 is accepted, then AIXI will be like "ah, future me will take the bet, and wind up with 10+ϵ in the heads world and -20+2ϵ in the tails world. This is just a given. I'll take this +15/-15 bet as it has net positive expected value, and the loss in the heads world is more than counterbalanced by the reduction in the magnitude of loss for the tails world" Something else feels slightly off, but I can't quite pinpoint it at this point. Still, I guess this solves my question as originally stated, so I'll PM you for payout. Well done! (btw, you can highlight a string of text and hit ctrl+4 to turn it into math-mode)
1interstice
Incidentally, you can use the same idea to have RO-AIXI do anthropic reasoning/bargaining about observers that are in a broader reference class than 'exact same sense data', by making the mapping O -> O' some sort of coarse-graining.
1Charlie Steiner
I don't think AIXI needs any special sauce to understand memory wiping. There is a turing machine that writes the memory-wiped contents to tape all in one pass. It's just going to be a simulation of the universe with a slightly more complicated "bridging law." Different programs that write the right thing to the tape are on equal footing, no matter whether they describe different universes or different parts of the same universe. So we might expect an assignment more like P(HM)=0.49, P(TM)=0.49, P(TT)=0.02 (EDIT: fixed) (not sure whether we should expect it to be harder to describe the normal Tails branch). And then AIXI will act on this using whatever action it predicts will get the greatest signal in the reward channel.

The weight could be something like the algorithmic probability over strings (https://en.wikipedia.org/wiki/Algorithmic_probability), in which case universes like ours with a concise description would get a fairly large chunk of the weight.
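For reference, the algorithmic probability of a string x under a universal prefix machine U (the quantity behind that link) is:

```latex
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}
% so any universe describable by a short program p receives weight at least 2^{-|p|}.
```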