I think we (mostly) all agree that we want to somehow encode human values into AGIs. That's not a new idea. The devil is in the details.
I see how my above question seems naive. Maybe it is. But if one potential answer to the alignment problem lies in the way our brains work, maybe we should try to understand that better, instead of (or in addition to) letting a machine figure it out for us through some kind of "value learning". (Copied from my answer to AprilSR:) I stumbled across two papers from a few years ago by a psychologist, Mark Muraven, who thinks that the way humans deal with conflicting goals could be important for AI alignment (https://arxiv.org/abs/1701.01487 and https://arxiv....
The problem with this model is that its predictions depend a lot on how you draw the boundary around "field". Take Yudkowsky's example of startups. How do we explain small startups succeeding where large companies failed?
I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?
Successful startups are big relative to individuals, but not that big relative to the world as a whole. If we're talking about a project / technology / company that can rival the rest of the world in its output, then t...
I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?
So, you're saying: if we draw the boundary around a narrow field, we get jumpy/noisy progress. If we draw the boundary around a broad field, all the narrow subfields average out and the result is less noise. This makes a lot of sense, thank you!
The question is, what metric do we use to average the subfields? For example, on some metrics the Manhattan project might be a rather small jump in military-technology-averaged-ove...
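For the first point (a narrow boundary means jumpier progress), here's a toy simulation I find helpful. It's a minimal sketch: the jump sizes and frequencies are made up, and the aggregate metric is just a plain sum over subfields, which is exactly the kind of modeling choice the question above is about.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years = 100
n_subfields = 50

# Each subfield mostly makes small incremental progress, with rare large jumps.
incremental = rng.normal(1.0, 0.2, size=(n_subfields, n_years))
jumps = (rng.random((n_subfields, n_years)) < 0.02) * rng.uniform(10, 30, size=(n_subfields, n_years))
annual_progress = incremental + jumps

def jumpiness(series):
    """Largest annual increment divided by the median annual increment."""
    return series.max() / np.median(series)

narrow = np.mean([jumpiness(s) for s in annual_progress])  # a typical single subfield
broad = jumpiness(annual_progress.sum(axis=0))             # all subfields lumped together

print(f"typical narrow-field jumpiness: {narrow:.1f}")
print(f"broad-field jumpiness:          {broad:.1f}")
```

On most runs the typical single subfield looks far jumpier (its best year dwarfs its median year) than the aggregate, which is the averaging-out effect.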
In my view, the biological anchors and the Very Serious estimates derived therefrom are really useful for the following very narrow yet plausibly impactful purpose
I don't understand why it's not just useful directly. Saying that the numbers are not true upper or lower bounds seems like it's expecting way too much!
They're not even labeled as bounds (at least in the headline). They're supposed to be "anchors".
Suppose you'd never done the analysis to know how much compute a human brain uses, or how much compute all of evolution had used. Wouldn't this report ...
Claim 4: GPT-N need not be "trying" to predict the next word. To elaborate: one model of GPT-N is that it is building a world model and making plans in the world model such that it predicts the next word as accurately as possible. This model is fine on-distribution but incorrect off-distribution. In particular, it predicts that GPT-N would e.g. deliberately convince humans to become more predictable so it can do better on future next-word predictions; this model prediction is probably wrong.
I got a bit confused by this section, I think because the word "mo...
Let [...], where [...] and [...]
[...] The second rule says that [...] is orthogonal to itself
Should that be "is not orthogonal to itself"? I thought the meant non-orthogonal, so would think means that is not orthogonal to itself.
(The transcript accurately reflects what was said in the talk, but I'm asking whether Scott misspoke.)
But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."
I know more about go than that bot starts out knowing, but less than it will know after it does computation.
I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?
This is commonly said on the basis of his $1b pledge
Wasn't it supposed to be a total of $1b pledged, from a variety of sources, including Reid Hoffman and Peter Thiel, rather than $1b just from Musk?
EDIT: yes, it was.
Sam, Greg, Elon, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services (AWS), Infosys, and YC Research are donating to support OpenAI. In total, these funders have committed $1 billion, although we expect to only spend a tiny fraction of this in the next few years.
For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did
...
It's unlikely for there to exist both aligned and misaligned AI systems at the same time
If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?
It seems like this calls into question the claim that we wouldn't get ...
This is helpful. Thanks!
Are the two frames labeled [...] and [...] in section 3.3 (with an agent that thinks about either red or green and walks or stays home, and an environment that's safe or has bears) equivalent? (I would guess so, given the section that example was in, but didn't see it stated explicitly.)
I didn't feel like any comment I would have made would have anything more to say than things I've said in the past.
FWIW, I say: don't let that stop you! (Don't be afraid to repeat yourself, especially if there's evidence that the point has not been widely appreciated.)
Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.
Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don't expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don't know what internal optimizers GPT-N ends up building along the way, but I'm not going to count on there being none of them.
Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?
(When I started the paragrap...
As for planning, we've seen the GPTs ascend from planning out the next few words, to planning out the sentence or line, to planning out the paragraph or stanza. Planning out a whole text interaction is well within the scope I could imagine for the next few iterations, and from there you have the capability of manipulation without external malicious use.
Perhaps a nitpick, but is what it does planning?
Is it actually thinking several words ahead (a la AlphaZero evaluating moves) when it decides what word to say next, or is it just doing free-writing, and it j...
That's not a nitpick at all!
Upon reflection, the structured sentences, thematically resolved paragraphs, and even JSX code can be done without a lot of real lookahead. And there's some evidence it's not doing lookahead - its difficulty completing rhymes when writing poetry, for instance.
(Hmm, what's the simplest game that requires lookahead that we could try to teach to GPT-3, such that it couldn't just memorize moves?)
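One candidate: the 1-2-3 subtraction game (a tiny Nim variant). Fresh positions are trivial to generate, and optimal play has a closed form (leave your opponent a multiple of 4), so completions are easy to score. The caveat is that the mod-4 rule could itself be pattern-matched without any real lookahead. A minimal harness sketch, with `complete` as a stand-in for whatever completion call you'd actually use (it's a placeholder, not a real API):

```python
import random

def optimal_take(n):
    """Winning move in the 1-2-3 subtraction game: leave a multiple of 4 (assumes n % 4 != 0)."""
    return n % 4

def make_prompt(n):
    return (
        f"We are playing a game. There are {n} coins. On your turn you must take 1, 2, or 3 coins. "
        "Whoever takes the last coin wins. It is your turn. "
        "The number of coins you should take is"
    )

def score(complete, trials=50):
    """`complete` stands in for a text-completion call: prompt -> completion string."""
    correct = asked = 0
    for _ in range(trials):
        n = random.randint(5, 30)
        if n % 4 == 0:
            continue  # no winning move from these positions; skip them
        asked += 1
        reply = complete(make_prompt(n))
        digits = [c for c in reply if c.isdigit()]
        if digits and int(digits[0]) == optimal_take(n):
            correct += 1
    return correct / asked

# Baseline: guessing a random legal move is right about a third of the time.
print(score(lambda prompt: str(random.choice([1, 2, 3]))))
```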
Thinking about this more, I think that since planning depends on causal modeling, I'd expect the latter to get good before the former. But I probably overstated the case for its current planning capabilities, and I'll edit accordingly. Thanks!
This framing makes sense to me. Thanks!
First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities?
This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform.
For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.
Maybe what this just means is that we should track wha...
For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.
Consider instead the scenario where I show a model a photo of a face, and the model produces a photo of the side of that face. An interesting question is "is there a 3d representation of the face in the model?". It could be getting the right answer that way, or it could be getting it some other way.
Similarly, when it models a 'dumb' ch...
I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:
GPT-3 didn't "pretend" not to know. A lot of this is the AI dungeon environment. If you just prompt the raw GPT-3 with: "A definition of a monotreme follows" it'll likely do it right. But if you role play, sure, it'll predict that your stoner friend or young nephew don't know.
Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.
I wonder how long we'll be in the "prompt programming" regime. As Nick Cammarata put it:
We should actually be programming these by manipulating the hidden layers and prompts are a stand-in until we can.
My guess is that OpenAI will pretty quickly (within the next year) find a much better way to interface with what GPT-3 has learned.
Do others agree? Any reason to think that wouldn't be possible (or wouldn't give significant benefits)?
I guess the rightmost term could be zero or negative, right? (If the difference between T and P is greater than or equal to the difference between P and S.) In that case, the payoffs would be such that there's no credence you could have that the other player will play Hare that would justify playing Hare yourself (or justify it as non-defection, that is).
So my claim #1 is always true, but claim #2 depends on the payoff values.
In other words, Stag Hunt could be subdivided into two games: one where the payoffs never justify playing Hare (as non-defection), and one where they sometimes do, depending on your credence that the other player will play Stag.
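To spell out the credence-threshold part for myself, using the standard payoff labels R (both Stag), T (Hare against Stag), P (both Hare), S (Stag against Hare), with R > T and P > S (these letters are my assumption and may not match the post's): if my credence that the other player plays Hare is q, then

$$\mathbb{E}[\text{Hare}]-\mathbb{E}[\text{Stag}] = \big[(1-q)T + qP\big] - \big[(1-q)R + qS\big] = q(P-S) - (1-q)(R-T),$$

which is positive exactly when

$$q > \frac{R-T}{(R-T)+(P-S)}.$$

Since R > T and P > S, that threshold lies strictly between 0 and 1, so on pure expected payoff there is always some credence at which Hare becomes tempting; whether playing Hare above the threshold also counts as non-defection is the part that depends on the relative sizes of the payoff gaps.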
Combining the two conditions, we have [...]
Since [...], this holds for some nonempty subinterval of [0, 1].
I want to check that I'm following this. Would it be fair to paraphrase the two parts of this inequality as:
1) If your credence that the other player is going to play Stag is high enough, you won't even be tempted to play Hare.
2) If your credence that the other player is going to play Hare is high enough, then i
This is a good question, and I don't know the answer. My guess is that Paul would say that that is a potential problem, but different from the one being addressed in this post. Not sure though.
Thank you, this was helpful. I hadn't understood what was meant by "the generalization is now coming entirely from human beliefs", but now it seems clear. (And in retrospect obvious if I'd just read/thought more carefully.)
Ah, I see. It sounds like the key thing I was missing was that the strangeness of the prior only matters when you're testing on a different distribution than you trained on. (And since you can randomly sample from x* when you solicit forecasts from humans, the train and test distributions can be considered the same.)
In this case, I can pay humans to make forecasts for many randomly chosen x* in D*, train a model f to predict those forecasts, and then use f to make forecasts about the rest of D*.
The generalization is now coming entirely from human beliefs, not from the structure of the neural net — we are only applying neural nets to iid samples from D*.
Perhaps a dumb question, but don't we now have the same problem at one remove? The model for predicting what the human would predict would still come from a "strange" prior (based on the l2 norm, or whatever)
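(For concreteness, here's how I'm picturing the scheme: a minimal sketch where `human_forecast` stands in for paying humans, the data and regressor are toy stand-ins, and all the names are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

# D*: the deployment distribution we actually care about (toy version).
D_star = rng.normal(size=(10_000, 5))

def human_forecast(x):
    """Stand-in for soliciting a human forecast on input x (here: an arbitrary noisy function)."""
    return x @ np.array([0.5, -1.0, 2.0, 0.0, 0.3]) + rng.normal(scale=0.1)

# 1. Randomly sample points from D* and pay humans to forecast them.
labeled_idx = rng.choice(len(D_star), size=500, replace=False)
X_train = D_star[labeled_idx]
y_train = np.array([human_forecast(x) for x in X_train])

# 2. Train f to predict the human forecasts (plain least squares here).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# 3. Use f on the rest of D*. Because the labeled points were drawn iid from D*,
#    f is only ever applied on-distribution; the generalization to genuinely new
#    situations is carried by the humans' beliefs, not by how f extrapolates.
rest = np.delete(D_star, labeled_idx, axis=0)
predictions = rest @ w
print(predictions[:5])
```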
...Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
I don't quite understand what this is saying.
Suppose we train a giant deep learning model via self-supervised learning on a
Got it, thank you!
I see. So, restating in my own terms -- outer alignment is in fact about whether getting what you asked for is good, but for the case of prediction, the malign universal prior argument says that "perfect" prediction is actually malign. So this would be a case of getting what you wanted / asked for / optimized for, but that not being good. So it is an outer alignment failure.
Whereas an inner alignment failure would necessarily involve not hitting optimal performance at your objective. (Otherwise it would be an inner alignment success, and an outer alignment failure.)
Is that about right?
Great post, thank you!
However, I think I don't quite understand the distinction between inner alignment and outer alignment, as they're being used here. In particular, why would the possible malignity of the universal prior be an example of outer alignment rather than inner?
I was thinking of outer alignment as being about whether, if a system achieves its objective, is that what you wanted. Whereas inner alignment was about whether it's secretly optimizing for something other than the stated objective in the first place.
From that perspective, wouldn't mali
Hmm, maybe I'm missing something basic and should just go re-read the original posts, but I'm confused by this statement:
So what we do here is say "belief set A is strictly 'better' if this particular observer always trusts belief set A over belief set B", and "trust" is defined as "whatever we think belief set A believes is also what we believe".
In this, belief set A and belief set B are analogous to A[C] and C (or some c in C), right? If so, then what's the analogue of "trust... over"?
If we...
Notably, we need to trust A[C] even over our own beliefs, that is, if A[C] believes something, we discard our position and adopt A[C]'s belief.
To clarify, this is only if we (or the process that generated our beliefs) fall into class C, right?
The authors don't really suggest an explanation; the closest they come is speculating that at the interpolation threshold there's only ~one model that can fit the data, which may be overfit, but then as you increase further the training procedure can "choose" from the various models that all fit the data, and that "choice" leads to better generalization. But this doesn't make sense to me, because whatever is being used to "choose" the better model applies throughout training, and so even at the interpolation thr...
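For anyone who wants to poke at this directly, the basic shape of the phenomenon shows up even in a tiny random-features regression (my own toy setup, not the paper's): fit minimum-norm least squares with an increasing number of random Fourier features, and the test error typically spikes near #features ≈ #training points before coming back down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy sine wave.
n_train, n_test = 30, 500
x_train = rng.uniform(-1, 1, n_train)
x_test = np.linspace(-1, 1, n_test)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(n_train)
y_test = np.sin(2 * np.pi * x_test)

def features(x, W, b):
    """Random Fourier features."""
    return np.cos(np.outer(x, W) + b)

for n_feat in [5, 10, 20, 25, 30, 35, 50, 100, 500, 2000]:
    W = 6.0 * rng.standard_normal(n_feat)
    b = rng.uniform(0, 2 * np.pi, n_feat)
    Phi_train, Phi_test = features(x_train, W, b), features(x_test, W, b)
    # Minimum-norm least-squares solution (what gradient descent from zero converges to).
    coef = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"{n_feat:5d} features: test MSE = {test_mse:.3f}")
```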
This means that if there's more than twice as much power coming from one move as from another, the former is more likely than the latter. In general, if one set of possibilities contributes 2K times the power of another set of possibilities, the former set is at least K times more likely than the latter.
Where does the 2 come from? Why does one move have to have more than twice the power of another to be more likely? What happens if it only has 1.1x as much power?
Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?
I don't understand this bit -- how does more of its control come from Chocolate and Hug? Wouldn't you say its control comes from Wait!? Once it ends up in Candy, Chocolate, or Hug, it has no control left. No?
We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.
And the opponent plays to win, with perfect play?
Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal.
I got a little confused at the introduction of Wait!, but I think I understand it now. So, to check my understanding, and for the benefit of others, some notes:
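As a quick sanity check of the 50-50 claim, and of the farsighted shift toward Chocolate and Hug, a Monte Carlo sketch (the uniform-reward setup and the two comparisons are my restatement of the post's toy example, so treat the exact structure as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One uniform reward per state, resampled for each "goal" (reward function).
r_candy, r_wait, r_choc, r_hug = rng.random((4, n))

# Myopic agent: only next turn's reward matters, so it compares Candy against Wait!.
myopic_candy = np.mean(r_candy > r_wait)

# Very farsighted agent: what matters is where it ends up, so it effectively
# compares Candy against the better of Chocolate and Hug (both reached via Wait!).
farsighted_candy = np.mean(r_candy > np.maximum(r_choc, r_hug))

print(f"myopic:     Candy is chosen for {myopic_candy:.1%} of reward functions (about 1/2)")
print(f"farsighted: Candy is chosen for {farsighted_candy:.1%} of reward functions (about 1/3)")
```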
It's meant to be analogous to imputing a value in a causal Bayes net
Aha! I thought it might be borrowing language from some technical term I wasn't familiar with. Thanks!
Takeoff does matter, in that I expect that this worldview is not very accurate/good if there's discontinuous takeoff, but imputing the worldview I don't think takeoff matters.
Minor question: could you clarify what you mean by "imputing the worldview" here? Do you mean something like, "operating within the worldview"? (I ask because this doesn't seem to be a use of "impute" that I'm familiar with.)
And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).
You mean in the sense of stabilizing the whole world? I'd be surprised if that's what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.
there is a troll who will blow up the bridge with you on it, if you cross it "for a dumb reason"
Does this way of writing "if" mean the same thing as "iff", i.e. "if and only if"?
I can't resist giving this pair of rather incongruous quotes from the paper
Could you spell out what makes the quotes incongruous with each other? It's not jumping out at me.
The authors acknowledged that the modifications they made to RL "brings RL closer to the frameworks of decision theory and game theory" (AFAICT, the algorithms they end up with are nearly pure decision/game theory), but given that some researchers have been focused on decision theory for a long time exactly because a decision-theoretic agent can be reflectively stable, it seems incongruous to also write "perhaps surprisingly, there are modifications of the RL objective that remove the agent’s incentive to tamper with the reward function."
Got it, that's helpful. Thank you!
Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.
I did however get slightly lost in section 1.2. At first reading I was expecting this part:
which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.
to say, "... gap between the behavioral objective and the intended goal of the programmers." (In which case the inner alignment problem would be a subcomponent...
Phrases I've used: [intended/desired/designer's] [objective/goal]
I think "designer's objective" would fit in best with the rest of the terminology in this post, though "desired objective" is also good.
I don't have a good term for that, unfortunately—if you're trying to build an aligned AI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner.
...To me the difference is that when I read 5 I'm thinking about people being careless or malevolent, in an everyday sense of those terms, whereas when I read 4 I'm thinking about how maybe there's no such thing as a human who's not careless or malevolent, if given enough power and presented with a weird enough situation.
I conclude from this that CDT should equal EDT (hence, causality must account for logical correlations, IE include logical causality). By "CDT" I really mean any approach at all to counterfactual reasoning; counterfactual expectations should equal evidential expectations.
As with most of my CDT=EDT arguments, this only provides an argument that the expectations should be equal for actions taken with nonzero probability. In fact, the amount lost to Dutch Book will be proportional to the probability of the action in question. So, differing counterfa...
To clarify, you do do the human's instrumental sub-goals though, just not extra ones for yourself, right?
With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn't when the human doesn't)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems ...
For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.
Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has"
What causes the agent to switch from X to Y?
Are you thinking of the ...
Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I was imagining something more like B for the imitation learning case.
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've s...
In this post (original here), Paul Christiano analyzes the ambitious value learning approach.
I find it a little bit confusing that Rohin's note refers to the "ambitious value learning approach", while the title of the post refers to the "easy goal inference problem". I think the note could benefit from clarifying the relationship of these two descriptors.
As it stands, I'm asking myself -- are they disagreeing about whether this is easy or hard? Or is "ambitious value learning" the same as "goal inference" (...
I think this is actually the part that I most "disagree" with. (I put "disagree" in quotes, because there are forms of these theses that I'm persuaded by. However, I'm not so confident that they'll be relevant for the kinds of AIs we'll actually build.)
1. The smart part is not the agent-y part
It seems to me that what's powerful about modern ML systems is their ability to do data compression / patter...
Agree that self-supervised learning powers both GPT-3 updates and human brain world-model updates (details & caveats). (Which isn’t to say that GPT-3 is exactly the same as the human brain world-model—there are infinitely many different possible ML algorithms that all ...