With help from David Manheim, this post has now been turned into a paper. Thanks to everyone who commented on the post!
Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they've been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.
Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.
I was reading parts of Superintelligence recently for something unrelated and noticed that Bostrom makes many of the same points as this post:
...If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op
HCH is the result of a potentially infinite exponential process (see figure 1) and is thereby computationally intractable. In reality, we cannot break every task down into its smallest parts and solve these subtasks one after another, because that would take too much computation. This is why we need to iterate distillation and amplification and cannot just amplify.
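To make the intractability point concrete, here is a toy sketch in Python (my own illustration with made-up names, not the post's setup): fully decomposing a task corresponds to an exponentially large tree of subcalls, whereas iterating a single amplification step followed by distillation keeps each round cheap. Distillation is stubbed out as a lookup table standing in for training a fast model on the amplified answers.

```python
# Toy stand-in, not the post's formalism: the task is Fibonacci, the decomposition is
# fib(n) -> fib(n-1), fib(n-2), and "distillation" is faked as a lookup table.

def amplify_fully(n):
    """Pure amplification (HCH-style): recurse all the way down.
    Returns (answer, number_of_subcalls); the call count grows exponentially in n."""
    if n <= 1:
        return n, 1
    a, calls_a = amplify_fully(n - 1)
    b, calls_b = amplify_fully(n - 2)
    return a + b, calls_a + calls_b + 1

def iterated_distillation_and_amplification(rounds):
    """Each round: amplify the current fast model by a single step, then
    'distill' the new answer back into the model (here: just store it)."""
    model = {0: 0, 1: 1}                                   # initial distilled agent
    for n in range(2, rounds + 1):
        amplified_answer = model[n - 1] + model[n - 2]     # one amplification step
        model[n] = amplified_answer                        # distillation step (stubbed)
    return model

answer, subcalls = amplify_fully(25)
print(answer, subcalls)                                    # correct answer, ~240k subcalls
print(iterated_distillation_and_amplification(25)[25])     # same answer, 24 cheap rounds
```

The exponential tree is what "just amplifying" buys you; the loop that amplifies one step and then distills is what keeps the process tractable.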
In general your post talks about amplification (and HCH) as increasing the capability of the system and distillation as saving on computation/making things more efficient. But my understanding, based...
I still don't understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with "I don't really think this is true"), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?
I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example if you have an intent-alig...
IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.
I don't see how the "we can still correct once it’s running" part can be true given this footnote:
...However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve h
I'm confused about the tradeoff you're describing. Why is the first bullet point "Generating better ground truth data"? It would make more sense to me if it said instead something like "Generating large amounts of non-ground-truth data". In other words, the thing that amplification seems to be providing is access to more data (even if that data isn't the ground truth that is provided by the original human).
Also in the second bullet point, by "increasing the amount of data that you train on" I think you mean increasing the amount of data from the original h
...The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).
I think this is the crux of my confusion, so I would appreciate if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if c
...It seems like "agricultural revolution" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").
I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by sa
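To get a feel for the "in most cases" part, here is a drastically simplified toy of my own (not the setup in Alex's paper): give the agent a choice between an action that leaves one state reachable and an action that leaves three states reachable, sample every state's reward i.i.d., and count how often the option-preserving action is optimal.

```python
# Drastically simplified toy, not the paper's formalism: sample rewards i.i.d. and
# check how often the optimal first action is the one that keeps more states reachable.
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
prefers_broad = 0
for _ in range(trials):
    rewards = rng.uniform(size=4)       # i.i.d. rewards for four terminal states
    narrow_value = rewards[0]           # action A: only state 0 remains reachable
    broad_value = rewards[1:].max()     # action B: states 1, 2, 3 remain reachable
    prefers_broad += broad_value > narrow_value

print(prefers_broad / trials)           # ~0.75: most reward draws favour the "powerful" action
```

The 3/4 here is just the chance that the maximum of three i.i.d. draws beats a single draw; the point is only that keeping more states reachable wins for most sampled reward functions, not that any particular reward favours it.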
...Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?
One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.
The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility
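To spell out how that condition separates the twitching robot from a resource-hungry agent, here is a rough paraphrase in my own notation (not the paper's): write U(s, r) for the agent's utility when it follows strategy s while controlling resources r.

```latex
% Twitching robot: utility depends only on the twitch pattern, so extra resources
% never help, the Theorem-2-style condition fails, and no instrumental pressure to
% acquire resources follows.
\forall s, r : \quad U(s, r) = U(s, 0)

% Theorem-2-style condition (paraphrased): for every strategy there is another
% strategy using additional resources that achieves strictly higher utility, so
% acquiring resources is instrumentally useful almost regardless of the details of U.
\forall s \; \exists s', r' : \quad U(s', r') > U(s, 0)
```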
...I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory each hypothesis is just an event, so it is static, cannot perform actions, etc. I'm guessing you have some other formalism in mind but I can't tell what it is.
To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:
The base objective vs the mesa-objective
...Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).
By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don...
Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I'm looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a "naturally linear recursive" problem (like computing factorials) but how would weak HCH answer a question like "Should I get laser eye surgery?" (to take an example from here). The natural way to decompose a problem like this seems to use branching.
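To make the contrast concrete, here is a small sketch of the two decomposition shapes I have in mind (my own toy framing, not a definition of weak vs. strong HCH): the factorial question asks exactly one subquestion per node, while the laser-surgery question naturally fans out.

```python
# Toy sketch with hypothetical names: a "naturally linear recursive" question asks
# exactly one subquestion per node; the laser-surgery question branches into several.

def factorial_chain(n):
    """'What is n!?': each node consults a single copy about (n-1)!, forming a chain."""
    if n <= 1:
        return 1
    return n * factorial_chain(n - 1)       # the single subquestion: 'What is (n-1)!?'

def surgery_tree(ask):
    """'Should I get laser eye surgery?': one node fans out into several subquestions,
    each sent to a different copy (`ask` stands in for consulting a copy)."""
    subquestions = [
        "What are the medical risks?",
        "What does it cost?",
        "How much do I value not wearing glasses?",
    ]
    return [ask(q) for q in subquestions]   # branching: several subcalls from one node

print(factorial_chain(5))                            # 120
print(surgery_tree(lambda q: f"<answer to {q!r}>"))  # three parallel subanswers
```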
Also, I just looked again at Alex Zhu's FAQ for Paul's agenda, and Alex's e
...Thanks! I found this answer really useful.
I have some follow-up questions that I'm hoping you can answer:
My solution for #3:
Define $g : [0,1] \to \mathbb{R}$ by $g(x) = f(x) - x$. We know that $g$ is continuous because $f$ and the identity map both are, and by the limit laws. Since $f$ maps into $[0,1]$ we have $g(0) = f(0) \geq 0$ and $g(1) = f(1) - 1 \leq 0$, so applying the intermediate value theorem (problem #2) we see that there exists $c \in [0,1]$ such that $g(c) = 0$. But this means $f(c) = c$, so we are done.
Counterexample for the open interval: consider defined by . First, we can verify that if then , so indeed maps to . To see that there is no fixed point, note that the only solution to in is , which is no
EDIT: I've got another framing that I thought would be more useful for later problems, but I was wrong. I still think there is some value in understanding this proof as well.
In particular, look at this diagram on Wikipedia. It would be better if the whole upper triangle were blue and the whole lower triangle were red instead of just one side (you can arbitrarily decide whether to paint the rest of the diagonal blue or red). If x=0 and x=1 aren't fixed points, then they must be blue and red respectively. If we split [0,1] into n components of size ...
Here is my attempt, based on Hoagy's proof.
Let $n \geq 1$ be an integer. We are given that $f(0) < 0$ and $f(1) > 0$. Now consider the points $0, \tfrac{1}{n}, \tfrac{2}{n}, \ldots, 1$ in the interval $[0,1]$. By 1-D Sperner's lemma, there are an odd number of $i$ such that $f(i/n) < 0$ and $f((i+1)/n) > 0$ (i.e. an odd number of "segments" that begin below zero and end up above zero). In particular, $0$ is an even number, so there must be at least one such number $i$. Choose the smallest and call this number $i_n$.
Now consider the sequence $(i_n/n)_{n \geq 1}$. Since this sequence takes values in
I'm having trouble understanding why we can't just fix $n=2$ in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).
Also, if we are only given that $f$ is continuous, does it make sense to talk about the gradient?
Yeah, agreed. In fact I don't think you even need to continually bisect; you can just increase n indefinitely. Iterating becomes more dangerous as you move to higher dimensions, because an n-dimensional simplex with n+1 colours that has been coloured according to analogous rules doesn't necessarily contain the point that maps to zero.
On the second point, yes, I'd been assuming that a bounded function had a bounded gradient, which certainly isn't true for, say, sin(x^2). The final step needs more work; I like the way you did it in the proof below.
"I'm having trouble understanding why we can't just fix n=2 in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case)." - You're right, you can prove this without using the full power of Sperner's lemma. I think it becomes more useful for the multi-dimensional case.
I didn't log the time I spent on the original blog post, and it's kinda hard to assign hours to this since most of the reading and thinking for the post happened while working on the modeling aspects of the MTAIR project. If I count just the time I sat down to write the blog post, I would guess maybe less than 20 hours.
As for the "convert the post to paper" part, I did log that time and it came out to 89 hours, so David's estimate of "perhaps another 100 hours" is fairly accurate.