See also this post by Quintin Pope: https://www.lesswrong.com/posts/JFibrXBewkSDmixuo/hypothesis-gradient-descent-prefers-general-circuits
Interesting. This prank seems like one you could play on a Logical Inductor; I wonder what the outcome would be? One possibly related fact is that computable functions are continuous. This implies that whatever computable function Omega applies to your probability estimate, there exists a fixed-point probability you can choose at which you'll be correct about the monkey probability. Of course, if you're a bounded agent thinking for a finite amount of time, you might as well be outputting rational probability estimates, in which case discontinuous functions become computable for Omega.
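To illustrate the fixed-point claim with a toy example (the particular function below is made up, not anything from the original setup): if Omega maps your reported probability p to an actual monkey probability f(p), and f is continuous on [0, 1], then f(p) - p is non-negative at 0 and non-positive at 1, so bisection finds a report you can make and still be correct.

```python
def find_fixed_point(f, lo=0.0, hi=1.0, tol=1e-12):
    """Find p in [lo, hi] with f(p) == p, for continuous f: [0, 1] -> [0, 1].

    g(p) = f(p) - p satisfies g(0) >= 0 and g(1) <= 0, so a sign change
    exists somewhere in the interval and bisection converges to a fixed point.
    """
    g = lambda p: f(p) - p
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) >= 0:
            lo = mid   # a fixed point still lies in [mid, hi]
        else:
            hi = mid   # a fixed point still lies in [lo, mid]
    return (lo + hi) / 2

# Example: Omega perversely sets the monkey probability to 1 - p**2.
p_star = find_fixed_point(lambda p: 1 - p**2)
print(p_star, 1 - p_star**2)  # both ~0.618: report p_star and you're correct
```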
Counterargument 2 still seems correct to me.
Techniques like Density Functional Theory give reasonably accurate results for molecular systems in far less time than a full quantum mechanical calculation would take. While exact simulation of quantum systems is, in theory, in a complexity class beyond what classical computers can handle, in practice it seems possible to get quite good results even on a classical computer. The hardness of simulating atoms and molecules looks like it depends heavily on progress in algorithms, rather than on the hardness of simulating arbitrary quantum circuits. We humans are still researching and coming up with improved techniques for molecular simulation; look up "FermiNet" for one relatively recent AI-based advance. The concern is that a superintelligence may be able to skip ahead in this timeline of algorithmic improvements.
While physicists sometimes claim to derive things from first principles, in practice these derivations often ignore a lot of details which still have to be justified using experiments.
Other ways that approximations can be justified:
Simulating complex nanomachines outright is exceedingly unlikely.
As you mentioned above, once you have a set of basic components in your toolbox that are well understood by you, the process of designing things becomes much easier. So you only really need the expensive physics simulations for designing your basic building blocks. After that, you can coarse-grain these blocks in the larger design you're building. When designing transistors, engineers have to worry about the geometry of the transistor and use detailed simulations of how charge carriers will flow in the semiconductor. In a circuit simulator like LTSpice that's all abstracted away.
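As a toy illustration of that coarse-graining step (my own sketch; the component, numbers, and names here are made up, and this is not how LTSpice itself models devices): once the detailed device simulation has handed you a few parameters for a transistor, larger designs can treat it as a simple behavioral model.

```python
# Hypothetical coarse-grained component: a MOSFET reduced to a threshold
# switch with an on-resistance. The two numbers are the kind of thing the
# detailed device-physics simulation would produce; everything about
# carrier transport and geometry is hidden behind them.

class SwitchLevelNMOS:
    def __init__(self, v_threshold=0.7, r_on=100.0):
        self.v_threshold = v_threshold  # volts, from the detailed simulation
        self.r_on = r_on                # ohms, from the detailed simulation

    def conducting(self, v_gate):
        return v_gate > self.v_threshold

def inverter_output(v_in, vdd=5.0, r_pullup=10_000.0, nmos=SwitchLevelNMOS()):
    """Resistor-pull-up inverter built from the coarse-grained transistor."""
    if not nmos.conducting(v_in):
        return vdd  # transistor off: output pulled high
    # Transistor on: output set by the pull-up / on-resistance divider.
    return vdd * nmos.r_on / (nmos.r_on + r_pullup)

print(inverter_output(0.0))  # ~5.0 V (logic high)
print(inverter_output(5.0))  # ~0.05 V (logic low)
```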
This is really cool, thanks for posting it. I also would not have expected this result. In particular, the fact that the top right vector generalizes across mazes is surprising. (Even generalizing across mouse position but not maze configuration is a little surprising, but not as much.)
Since it helps to have multiple interpretations of the same data, here's an alternative one: the top right vector is modifying the neural network's perception of the world, not its values. Say the agent's training process has resulted in it valuing both going up and to the right and reaching the cheese; maybe its utility looks something like x + y + 10*[found cheese] (this is probably very over-simplified). In that case, the highest reachable x + y coordinate matters for deciding whether the agent should head for the top right or go directly to the cheese. Now, if we consider how the top right vector was generated, the most obvious interpretation is that it makes the agent think there's a path all the way to the top right corner, since that's the difference between the two scenarios that were subtracted to produce it. So the agent concludes that the x + y part of its utility function dominates, and proceeds to try to reach the top right corner.
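A toy version of that decision (the utility function is the over-simplified one above; the positions and numbers are made-up illustrations, not values read off the actual agent):

```python
# Hypothesized utility: U = x + y + 10 * [found cheese].
# Compare two plans as a function of how far up-and-right the agent
# *believes* it can actually get.

def utility(x, y, found_cheese):
    return x + y + (10 if found_cheese else 0)

cheese_pos = (3, 4)         # hypothetical cheese location
believed_corner = (12, 12)  # what the top right vector might make the agent believe is reachable

u_go_to_cheese = utility(*cheese_pos, found_cheese=True)        # 3 + 4 + 10 = 17
u_go_top_right = utility(*believed_corner, found_cheese=False)  # 12 + 12 = 24

# If the perturbation raises the believed-reachable corner high enough,
# the x + y term dominates and the agent heads for the corner instead of
# the cheese.
print(u_go_to_cheese, u_go_top_right)
```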
Predictions:
Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.
I find the prospect of training a model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion. Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power: it can be set fairly high, fairly low, or in between. So if you just pulled 40 random weights from the network, that's maybe 120 bits of optimization power, which might be enough for MNIST, but probably not for anything more complicated. So I'm guessing that most likely a bunch of other optimization went into choosing exactly which 40-dimensional subspace we should be using. Of course, if we're allowed to do that, then we could even do it with a 1-dimensional subspace: just pick the training trajectory as your subspace!
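For reference, here's a minimal sketch of what I understand random-subspace training to mean, using a toy least-squares problem rather than a real network (the problem, dimensions, and projection are all made up for illustration): the full parameter vector is frozen at its initialization, and only a low-dimensional offset, pushed through a fixed random projection, gets trained.

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 1000, 40                # full parameter count, trainable subspace dimension
A = rng.normal(size=(200, D)) / np.sqrt(D)
b = rng.normal(size=200)       # toy objective: minimize ||A @ theta - b||^2

theta0 = rng.normal(size=D) * 0.01        # frozen "initialization"
P = rng.normal(size=(D, d)) / np.sqrt(d)  # fixed random projection into the full space
z = np.zeros(d)                           # the only trainable parameters

lr = 0.01
for _ in range(2000):
    theta = theta0 + P @ z                  # reconstruct the full parameter vector
    grad_theta = 2 * A.T @ (A @ theta - b)  # gradient in the full space
    grad_z = P.T @ grad_theta               # chain rule: gradient in the 40-dim subspace
    z -= lr * grad_z

final_loss = np.sum((A @ (theta0 + P @ z) - b) ** 2)
print(final_loss)  # loss reachable while only ever moving in 40 directions
```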
Generally with the mindspace thing, I don't really think about the absolute size or dimension of mindspace, but rather about the relative size of "things we could build" and "things we could build that would have human values". This relative size is measured in bits. So the intuition here is that it takes a lot of bits to specify human values, and so the difference in size between these two sets is really big. Now maybe if you're given Common Crawl, it takes fewer bits to point to human values within that big pile of information. But it's probably still a lot of bits, and then the question is how you actually construct such a pointer.
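In symbols (my own paraphrase of the intuition, with made-up names for the two sets):

```latex
% Extra bits needed to single out an aligned design from among buildable ones:
\[
B \;=\; \log_2 \frac{\lvert \text{things we could build} \rvert}
                    {\lvert \text{things we could build that have human values} \rvert}
\]
% Conditioning on something like Common Crawl may shrink B, but the claim is
% that B stays large, and a pointer of length B still has to be constructed.
```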
I agree that demons are unlikely to be a problem, at least for basic gradient descent. They should have shown up by now in real training runs, otherwise. I do still think gradient descent is a very unpredictable process (or to put it more precisely: we still don't know how to predict gradient descent very well), and where that shows up is in generalization. We have a very poor sense of which things will generalize and which things will not generalize, IMO.
On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but it has the following issues:
On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:
On using AI systems, in particular large language models, to advance alignment research: This is not going to work.
I took Nate to be saying that we'd compute the image with the highest faceness according to the discriminator, not the generator. The generator would tend to create "the thing that is a face and has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to being the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images, if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts producing weird-looking, nonhuman images.
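As a toy illustration of what "the image with highest faceness according to the discriminator" would mean (everything here is a made-up stand-in; a real GAN discriminator is a deep network, not the logistic-regression-over-pixels below): run gradient ascent on the input image to maximize the discriminator's score. Even in this toy case, the maximizer is an extreme, saturated image rather than anything resembling a typical face.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in "discriminator": logistic regression over pixels.
# faceness(x) = sigmoid(w @ x + b); pretend w and b were fit on face images.
n_pixels = 64 * 64
w = rng.normal(size=n_pixels)
b = 0.0

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def faceness(x):
    return sigmoid(w @ x + b)

# Gradient ascent on the *input*, maximizing the pre-sigmoid score
# (the usual trick in feature visualization), keeping pixels in [0, 1].
x = np.full(n_pixels, 0.5)   # start from a flat gray image
lr = 0.1
grad_x = w                   # d(w @ x + b) / dx is just w for this toy model
for _ in range(500):
    x = np.clip(x + lr * grad_x, 0.0, 1.0)

# The "maximally facelike" input slams almost every pixel to 0 or 1 according
# to the sign of its weight -- an extreme pattern, nothing like a natural face.
print(faceness(x), np.mean((x == 0.0) | (x == 1.0)))
```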
This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.
7: Did I forget some important question that someone will ask in the comments?
Yes!
Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says "I think we should pick ROSE point A" and Bob says "I think we should pick ROSE point B", then you've still got a bargaining game left to resolve, right?
Anyways, this is an awesome post, thanks for writing it up!
Seems like figure 1 from Miller et al. is a plot of in-distribution test performance vs. out-of-distribution test performance. One might expect plots of training performance vs. out-of-distribution test performance to have more spread.