Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-me...
using GOFAI methods
"Nope" to this part. I otherwise like this comment a lot!
...The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value speci
Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks support his interpretation, while providing examples of us saying the opposite.)
...However, I distinctly recall
Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn't understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.
The main thing I'm claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects ...
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to lev...
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
Quoting myself in April:
..."MIRI's argument for AI risk depended on AIs being bad at n
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.
We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?
Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a much higher standard than human-level moral judgment is reasonable and consistent with the explicit standard set in essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.
I don't think this is the crux. E.g., I'd wager the number of bits you need to get in...
My own suggestion would be to use a variety of different phrasings here, including both "capabilities" and "intelligence", and also "cognitive ability", "general problem-solving ability", "ability to reason about the world", "planning and inference abilities", etc. Using different phrases encourages people to think about the substance behind the terminology -- e.g., they're more likely to notice their confusion if the stuff you're saying makes sense to them under one of the phrasings you're using, but doesn't make sense to them under another of the phrasin...
Does "par-human reasoning" mean at the level of an individual human or at the level of all of humanity combined?
If it's the former, what human should we compare it against? 50th percentile? 99.999th percentile?
I partly answered that here, and I'll edit some of this into the post:
...By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but t
Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.
Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because ...
How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?
I think the misuse vs. accident dichotomy is clearer when you don't focus exclusively on "AGI kills every human" risks. (E.g., global totalitarianism risks strike me as small but non-negligible if we solve the alignment problem. Larger are risks that fall short of totalitarianism but still involve non-morally-humble developers dam...
FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:
...[2:21 PM]
It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons
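Spelling out the arithmetic in that quote (the 90% junk-DNA and 10% wiring-fraction figures are the quote's stated assumptions, not measurements):

```python
# Back-of-the-envelope version of the information-bottleneck arithmetic above.
base_pairs = 3e9             # human genome: ~3 billion base pairs
bits_per_base = 2            # 4 possible bases -> 2 bits each
genome_bytes = base_pairs * bits_per_base / 8
print(genome_bytes / 1e6)    # ~750 megabytes

non_junk = genome_bytes * 0.10   # assume ~90% of the genome is junk DNA
wiring = non_junk * 0.10         # assume ~10% of the rest is neural wiring algorithms
print(wiring / 1e6)              # ~7.5 megabytes of brain-wiring "code"
```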
I don't know Nate's response, but his take on agent-foundations-ish research in A note about differential technological development (and the fact that he and MIRI have been broadly pro-interpretability-work to date) might help clarify how he thinks about cases like this.
...[...]
I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent th
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:
...In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" sa
The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period.
FYI, I think there's a huge difference between "I think humanity needs to aim for a pivotal act" and "I recommend to groups pushing the capabilities frontier forward to aim for pivotal act". I think pivotal acts require massive amounts of good judgement to do right, and, like, I think capabilities researchers have...
The definitions given in the post are:
- ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.
[...]
- misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.
I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some su...
Predictions, using the definitions in Nate's post:
Strong Utopia:
- Elicit Prediction (forecast.elicit.org/binary/questions/J_BQIG-KD)
- Elicit Prediction (forecast.elicit.org/binary/questions/DY7buVchR)
- Elicit Prediction (forecast.elicit.org/binary/questions/88hE3y6i8)
- Elicit Prediction (forecast.elicit.org/binary/questions/gcMXnPnT1)

Weak Utopia:
- Elicit Prediction (forecast.elicit.org/binary/questions/isgYV7473)
- Elicit Prediction (forecast.elicit.org/bi
My example with the 100 million referred to question 1.
Yeah, I'm also talking about question 1.
I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.
Seems obviously false as a description of my values (and, I'd guess, just about every human's).
Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.
If I...
I'm curious what evidence you see that this is false as a description of the values of just about every human, given that
But two rooms with a trillion people each are virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.
You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if...
First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".
Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.
Third, I don't know what you mean by "good". The questions that I unders...
(And we aren't perfect recognizers of 'functional, safe-to-use nanofactory' or other known-to-me things that might save the world.)
Also from Ronny:
There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.
That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.
Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:
...briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load
(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a twee
Could someone clarify the relevance of ribosomes?
Note: "ask them for the faciest possible thing" seems confused.
How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
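As a rough illustration of what "set the Gaussian latents to zeros" means in practice (the generator below is a toy stand-in, not an actual pretrained face model):

```python
import torch
import torch.nn as nn

LATENT_DIM = 512  # a typical latent size for StyleGAN-class models

# Placeholder generator standing in for a real pretrained GAN / diffusion decoder.
# With an actual face model you would load its weights instead of using this stub.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 1024),
    nn.ReLU(),
    nn.Linear(1024, 3 * 64 * 64),  # a 64x64 RGB image, flattened
    nn.Tanh(),
)

# The standard-normal prior's density is maximized at the origin, so the
# "highest probability density" latent is simply the all-zeros vector.
z = torch.zeros(1, LATENT_DIM)
with torch.no_grad():
    img = generator(z).view(1, 3, 64, 64)
print(img.shape)  # torch.Size([1, 3, 64, 64])
```

With a real pretrained generator the same two lines at the end apply unchanged; the point is just that the mode of the Gaussian prior is the zero latent, and for actual face models the decoded result is a perfectly ordinary-looking face.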
I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier &...
e.g. by trying to apply standards of epistemic uncertainty to the state of this essence?
I would say that there's a logical object that a large chunk of human moral discourse is trying to point at — something like "the rules of the logical game Morality", analogous to "the rules of the logical game Chess". Two people can both be discussing the same logical object "the rules of Chess", but have different beliefs about what that logical object's properties are. And just as someone can be mistaken or uncertain about the rules of chess — or about their int...
I also contend that it's the more epistemically humble position, because you're not saying that it's for sure that a pivotal act should be performed, but just that it's quite plausible given the specifics of the current world situation
The thing I'd say in favor of this position is that I think it better fits the evidence. I think the problem with the opposing view is that it's wrong, not that it's more confident. E.g., if I learned that Nate assigns probability .9 to "a pivotal act is necessary" (for some operationalization of "necessary") while Critch ass...
"The goal should be to cause the future to be great on its own terms"
What the heck is this supposed to mean? Great according to the Inherent Essence Of Goodness that lives inside futures, rather than as part of human evaluations?
The rest of the quote explains what this means:
...The goal should be to cause the future to be great on its own terms, without locking in the particular moral opinions of humanity today — and without locking in the moral opinions of any subset of humans, whether that’s a corporation, a government, or a nation.
(If you can't s
The present is "good on its own terms", rather than "good on Ancient Romans' terms", because the Ancient Romans weren't able to lock in their values. If you think this makes sense (and is a good thing) in the absence of an Inherent Essence Of Goodness, then there's no reason to posit an Inherent Essence Of Goodness when we switch from discussing "moral progress after Ancient Rome" to "moral progress after circa-2022 civilization".
The present is certainly good on my terms (relative to ancient Rome). But the present itself doesn't care. It's not the type of ...
The wisest moves we've made as a species to date (ending slavery? ending smallpox? landing on the moon?) didn't particularly look like "worldwide collaborations".
I think Nate might've been thinking of things like:
Seems right to me.
I'd guess Nate might say one of:
Thanks for the update, Ajeya! I found the details here super interesting.
I already thought that timelines disagreements within EA weren't very cruxy, and this is another small update in that direction: I see you and various MIRI people and Metaculans give very different arguments about how to think about timelines, and then the actual median year I tend to hear is quite similar.
(And also, all of the stated arguments on all sides continue to seem weak/inconclusive to me! So IMO there's not much disagreement, and it would be very easy for all of us to be wro...
Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.
By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that…
One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems.
Why do you think it assumes that?
This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.
If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.
From A central AI alignment problem: capabilities generalization, and the sharp left turn:
...Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they've put the AI in lots of different video-game and simulated environments, and they've had good luck training it to pursue an objective that the operators described in English. "I don't know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices", they say. At the same time, they apply
(Most of the QR-upvotes at the moment are from me. I think 1-4 are all good questions, for Nate or others; but I'm extra excited about people coming up with ideas for 3.)
When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there's no guarantee there's even a reasonable solution.
Why would there not be a solution?
To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment).
Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent:
...First, I suspect that even an aligned AI would fail the "duplicate a strawberry and do nothing else" challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans
On my model, the point of ass numbers isn't to demand perfection of your gut (e.g., of the sort that would be needed to avoid multiple-stage fallacies when trying to conditionalize a lot), but to:
Collecting all of the quantitative AI predictions I know of MIRI leadership making on Arbital (let me know if I missed any):
Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.
Can I get us all to agree to push for including pivotal acts and pivotal processes in the Overton window, then? :) I'm happy to publicly talk about pivotal processes and encourage people to take them seriously as options to evaluate, while flagging that I'm ~2-5% on them be...
- With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
- This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from, e.g., Gato
That's right, though the phrasing "discovering a core of generality" here sounds sort of mystical and mysterious to me, which ma...
In my view, in practice, the pivotal acts framing actually pushes people to consider a more narrow space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".
My objection to Critch's post wasn't 'you shouldn't talk about pivotal processes, just pivotal acts'. On the contrary, I think bringing in pivotal processes is awesome.
My objection (more so to "Pivotal Act" Intentions, but also to the new one) is specifically to the idea that we should socially shun the concept of "pivotal acts", and socia...
An example of a possible "pivotal act" I like that isn't "melt all GPUs" is:
Use AGI to build fast-running high-fidelity human whole-brain emulations. Then run thousands of very-fast-thinking copies of your best thinkers. Seems to me this plausibly makes it realistic to keep tabs on the world's AGI progress, and locally intervene before anything dangerous happens, in a more surgical way rather than via mass property destruction of any sort.
Looking for pivotal acts that are less destructive (and, more importantly for humanity's sake, less difficult to align)...
Some hopefully-unnecessary background info for people attempting this task:
A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".
An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."
Ronny Fernandez on Twitter:
...I think I don’t like AI safety analogies with human evolution except as illustrations. I don’t think they’re what convinced the people who use those analogies, and they’re not what convinced me. You can convince yourself of the same things just by knowing some stuff about agency.
Corrigibility, human values, and figure-out-while-aiming-for-human-values, are not short description length. I know because I’ve practiced finding the shortest description lengths of things a lot, and they just don’t seem like the right sort of thing.
Also
From an Eliezer comment:
Interventions on the order of burning all GPUs in clusters larger than 4 and preventing any new clusters from being made, including the reaction of existing political entities to that event and the many interest groups who would try to shut you down and build new GPU factories or clusters hidden from the means you'd used to burn them, would in fact really actually save the world for an extended period of time and imply a drastically different gameboard offering new hopes and options. [...]
If Iceland did this, it would plausibly need...
I kind of like the analogous idea of an alignment target as a repeller cone / dome.
Corrigibility is a repeller. Human values aren't a repeller, but they're a very narrow target to hit.
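To unpack the dynamical-systems sense of "repeller" being borrowed here (my gloss, not part of the original comments): a repelling fixed point is one where small deviations grow rather than shrink, so you can't expect to drift toward it or stay near it without active correction.

```latex
% Fixed point x^* of the dynamics \dot{x} = f(x), i.e. f(x^*) = 0:
\dot{x} = f(x), \qquad f(x^*) = 0, \qquad
\begin{cases}
  f'(x^*) < 0 & \text{attractor: small perturbations decay back toward } x^* \\
  f'(x^*) > 0 & \text{repeller: small perturbations grow, pushing the system away}
\end{cases}
```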
A lot of models of what can or can't work in AI alignment depend on intuitions about whether to expect "true discontinuities" or just "steep bits".
Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that show continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).
Quoting Nate in 2018:
...On my model, the key point
I'm not Eliezer, but my high-level attempt at this:
...[...] The things I'd mainly recommend are interventions that:
- Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
- Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
- Help us understand and resolve major disagreements. (Especially current disagreements
In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:
[...]