As someone who expects LLMs to be a dead end, I nonetheless think this post makes a valid point and does so using reasonable and easy to understand arguments. I voted +1.
I don't find this framing compelling. Particularly with respect to this part:
Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it's standard practice.)
I grant the point that an AI that does what the user wants can still be dangerous (in fact it could outright destroy the world). But I'd describe that situation as "we successfully aligned AI and things went wrong anyway" rather than "we failed to align AI". I grant t...
The post defending the claim is Reward is not the optimization target. Iirc, TurnTrout has described it as one of his most important posts on LW.
I know he's talking about alignment, and I'm criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not well supported by his publicly available arguments.
Ok, I don't disagree with this. I certainly didn't develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn't seem very informative either way; I didn't spend that much time trying to grok his arguments.
I also don't really get your position. You say that,
[Eliezer] confidently dismisses ANNs
but you haven't shown this!
In Surface Analogies and Deep Causes, I read him as saying that neural networks don't automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token prediction (which is a task for which (a) lots of training data exist and (b) competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it
This document doesn't look to me like something a lot of people would try to write. Maybe it was one of the most important things to write, but not obviously so. Among the steps (1) get the idea to write out all reasons for pessimism, (2) resolve to try, (3) not give up halfway through, and (4) be capable, I would not guess that 4 is the strongest filter.
Yes, but I didn't mean to ask whether it's relevant; I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? It seemed like something relevant to ask you since you've seen lots of text completions.
And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.
I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.
What do you imagine a language model with no inhibitions would look like? Because if I try to imagine it, then "something that outputs reasonable-sounding text until sooner or later it fails hard" seems to be a decent fit. Of course, I haven't thought much about the generator/assessor distinction.
I mean, surely "inhibitions" of the language model don't map onto human inhibitions, right? Like, a language model without the assessor module (or a much worse a...
(Extremely speculative comment, please tell me if this is nonsense.)
If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?
My first reaction to this is "obviously not since the architecture is completely different, so why would they map onto each other?", but a possible answer could be "well if the brain has them as separate modules, it could mean t...
I don't completely get this.
Let's call the short-term predictor (in the long-term predictor circuit) $P$, so if $P$ tries to predict [what $P$ predicts in 0.3s], then the correct prediction would be to immediately predict the output at whatever point in the future the process terminates (the next ground-truth injection?). In particular, $P$ would always predict the same thing until the ground truth comes in. But if I understand correctly, this is not what's going on.
So, second try: $P$ is really still only trying to predict 0.3s into the future, making it less of a "long t...
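To make sure I'm parsing the scheme correctly, here's a toy numerical sketch of my first reading (everything here, including the variable names and the learning-rate value, is my own assumption rather than anything from the post): the short-term predictor's training target at each step is its own output 0.3s later, with ground truth injected only at the end. Under that reading, the fixed point is "predict the eventual ground truth immediately", which is exactly what confused me.

```python
# Toy sketch (my assumptions, not the post's): a short-term predictor whose target
# at step t is its own prediction at step t+1, with ground truth injected only at
# the final step. Repeated updates propagate the ground truth backward, so the
# fixed point is "predict the eventual ground truth right away".
import numpy as np

T = 10                  # number of 0.3s steps until the ground-truth injection
ground_truth = 1.0
pred = np.zeros(T + 1)  # pred[t] = predictor's output at step t
pred[T] = ground_truth  # ground truth arrives at step T
alpha = 0.5             # learning rate (arbitrary)

for sweep in range(100):
    for t in range(T):
        target = pred[t + 1]                  # "what I will predict 0.3s from now"
        pred[t] += alpha * (target - pred[t])

print(pred)  # every entry converges to 1.0
```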
Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:
...Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep l
I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self of any other person than with your own future self).
If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore deci...
I don't have any reason why this couldn't happen. My position is something like "morality is real, probably precisely quantifiable; seems plausible that in the scenario of humans with autonomy and aligned AI, this could lead to an asymmetry where more people tend toward utilitronium over time". (Hence why I replied: you didn't seem to consider that possibility.) I could make up some mechanisms for this, but probably you don't need me for that. Also seems plausible that this doesn't happen. If it doesn't happen, maybe the people who get to decide what happe...
This comment seems to be consistent with the assumption that the outcome 1 year after the singularity is locked in forever. But the future we're discussing here is one where humans retain autonomy (?), and in that case, they're allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI. I think a future where we begin with highly suboptimal personal utopias and gradually transition into utilitronium is among the more plausible outcomes. Compared with other outcomes where Not Everyone Dies, anyway. Your credence may differ if you're a moral relativist.
But the future we’re discussing here is one where humans retain autonomy (?), and in that case, they’re allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI.
What if the humans ask the aligned AI to help them be more moral, and part of what they mean by "more moral" is having fewer doubts about their current moral beliefs? This is what a "status game" view of morality seems to predict, for the humans whose status games aren't based on "doing philosophy", which seems to be most of them.
1: To me, it made it more entertaining and thus easier to read. (No idea about non-anecdotal data, would also be interested.)
3: Also no data; I strongly suspect the metric is generally good because... actually I think it's just because the people I find worth listening to are overwhelmingly not condescending. This post seems highly unusual in several ways.
Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?
My answer to this:
First, no update whatsoever should take place because a probability distribution already expresses uncertainty, and there's no mechanism by which the uncertainty increased. Adele Lopez independently (and earlier) came up with the same answer.
Second, if there were an update -- say EY learned "one of the steps us
The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors. Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the "AI safety" field outside myself is currently giving you.
While this may be a fair criticism, I feel like someone ought to point out...
isn't trying to do anything like "sketch a probability distribution over the dynamics of an AI project that is nearing AGI". This includes all technical MIRI papers I'm familiar with.
I think this specific scenario sketch is, from a mainstream AI safety perspective, a case where we've already failed - i.e. we've invented a useless corrigibility intervention that we confidently but wrongly think is scalable.
...And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI
Survey on model updates from reading this post. Figuring out to what extent this post has led people to update may inform whether future discussions are valuable.
Results: (just posting them here, doesn't really need its own post)
The question was to rate agreement on the 1=Paul to 9=Eliezer axis before and after reading this post.
Data points: 35
Mean:
Median:
Anonymous Comments:
Agreement more on need for actions tha
I'll take a shot at this. Let $A$ and $B$ be the sets of actions of Alice and Bob. Let $n$ (where 'n' means 'nice') be the function that orders $B$ by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let $s$ (where 's' means 'selfish') be the function that orders $B$ by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function $d$ measuring similarity between two orderings of a finite set (it should range over $[0,1]$); the alignment of Bob with Alice is then $d(n, s)$.
Example: in...
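Here's a minimal code sketch of the construction, with a toy payoff matrix and a Kendall-tau-style similarity function; the payoff numbers, the action names, and the choice of similarity measure are all placeholder assumptions of mine, not part of the proposal itself.

```python
# Toy sketch of the proposed metric (illustrative only). Bob moves first, Alice
# best-responds; order Bob's actions once by Alice's resulting payoff ("nice") and
# once by Bob's ("selfish"), then score how similar the two orderings are.
from itertools import combinations

# payoff[bob_action][alice_action] = (alice_payoff, bob_payoff) -- hypothetical numbers
payoff = {
    "cooperate": {"left": (3, 3), "right": (1, 2)},
    "defect":    {"left": (0, 4), "right": (2, 1)},
    "idle":      {"left": (1, 1), "right": (1, 0)},
}

def best_response_payoffs(bob_action):
    # Alice picks the response maximizing her own payoff.
    return max(payoff[bob_action].values(), key=lambda p: p[0])

bob_actions = list(payoff)
nice    = sorted(bob_actions, key=lambda b: best_response_payoffs(b)[0])  # ordered by Alice's payoff
selfish = sorted(bob_actions, key=lambda b: best_response_payoffs(b)[1])  # ordered by Bob's payoff

def similarity(order1, order2):
    # Fraction of pairs ranked in the same relative order (Kendall-tau-style), in [0, 1].
    pairs = list(combinations(order1, 2))
    agree = sum((order1.index(a) < order1.index(b)) == (order2.index(a) < order2.index(b))
                for a, b in pairs)
    return agree / len(pairs)

print("alignment of Bob with Alice:", similarity(nice, selfish))
```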
I think the Go example really gets to the heart of why I think Debate doesn't cut it.
Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"
In the ball-attached-to-a-pole example, the honest debater has assigned probabilities that are indistinguishable from what you would do if you knew nothing except that the claim is false. (I.e., assign probabilities that doubt each component equally.) I'm curious how difficult it is to find the flaw in this argument structure. Have you done anything like showing these transcripts to other experts and seeing if they will be able to answer it?
If I had to summarize this finding in one sentence, it would be "it seems like an expert can generally find a set of ...
Fantastic sequence! Certainly, for anyone other than you, the deconfusion/time investment ratio of reading this is excellent. You really succeeded in making the core insights accessible. I'd even say it compares favorably to the recommended sequences in the Alignment Forum in that regard.
I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.
(This sequence inspired me to re-read Reinforcement Learning: An Introduction, hence the break.)
I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. But this setting seems to sacrifice quite a lot of performance. Is this real or am I missing something?
Namely, whenever there's an action $a$ which doesn't change the state and leads to 1 reward, and a sequence of actions $a_1, \dots, a_n$ such that $a_n$ has reward $R$ with $R \gg n$ (and $a_1, \dots, a_{n-1}$ all have 0 reward), then it's conceivable that the agent would c...
Many thanks for taking the time to find errors.
I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.
I'm hesitant to change #4 before I fully understand why.
...I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world
An early punchline in this sequence was "Impact is a thing that depends on the goals of agents; it's not about objective changes in the world." At that point, I thought "well, in that case, impact measures require agents to learn those goals, which means it requires value learning." Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get...
I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone), but then I checked the post again and realized that it disincentivizes changes of power in both directions. This rules out the failure modes I had in mind. (It wouldn't press a button that blows up the earth...)
It does seem that AUP will make it so an agent doesn't want to be shut off, though. If it's shut off, its power goes way down (to zero if...
The technical appendix felt like it was more difficult than previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday and managed to reconstruct the graph & gamma correctly.
The early part is slightly confusing, though. I thought AU was a thing that belongs to the goal of an agent, but the picture made it look as if it's part of the object ("how fertile is the soil?"). Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?
I did interpret the firs...
Thoughts after reading and thinking about this post
The thing that's bugging me here is that Power and Instrumental convergence seem to be almost the same.
In particular, it seems like Power asks [a state]: "how good are you across all policies" and Instrumental Convergence asks: "for how many policies are you the best?". In an analogy to tournaments where policies are players, power cares about the average performance of a player across all tournaments, and instrumental convergence about how many first places that player got. In tha...
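To pin down the analogy for myself, here's a toy numerical sketch. This implements my tournament framing above, not the paper's formal definitions, and the score matrix is just random placeholder values.

```python
# Toy illustration (my own framing): score[state][policy] = how well that policy does
# starting from that state. "Power" of a state ~ its average score over policies;
# "instrumental convergence" toward a state ~ for how many policies it is the best.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((5, 20))          # 5 states x 20 policies, hypothetical values

power = scores.mean(axis=1)           # average performance across all policies
best_state = scores.argmax(axis=0)    # which state each policy "likes" best
convergence = np.bincount(best_state, minlength=5) / scores.shape[1]

for s in range(5):
    print(f"state {s}: power={power[s]:.2f}, fraction of first places={convergence[s]:.2f}")
```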
Thoughts I have at this point in the sequence
In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.
Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.
I might be confused here, but it seems to me that it's easy to interpret the arguments in this post as evidence in the wrong direction.
I see the following three questions as relevant:
1. How much sets human brains apart from other brains?
2. How much does the thing that humans have and animals don't matter?
3. How much does better architecture matter for AI?
Questions #2 and #3 seem positively correlated – if the thing that humans have is important, it's evidence that architectural changes matter a lot. However, holding #2 constant, #1 an...
Ex 5 (fixed version)
Let $T$ denote the triangle. For each $n$, construct a 2-d simplex graph with nodes in $T$, where the color of a point corresponds to the place in the disk that $f$ carries that point to, then choose $x_n$ to be a point within a trichromatic triangle in the graph. Then $(x_n)_{n \in \mathbb{N}}$ is a bounded sequence having a limit point $x^*$. Let $c$ be the center of the disc; suppose that $f(x^*) \neq c$. Then there is at least one region of the disc that $f(x^*)$ doesn't touch. Let $d$ be the distance to the furthest side, that is, let ...
I'm late, but I'm quite proud of this proof for #4:
Call the large triangle a graph and the triangles simply triangles. First, note that for any size, there is a graph where the top node is colored red, the remaining nodes on the right diagonal are colored green, and all nodes not on the right diagonal are colored blue. This graph meets the conditions, and has exactly one trichromatic triangle, namely the one at the top (no other triangle contains a red node). It is trivial to see that this graph can be changed into an arbitrary graph by re-col...
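As a sanity check on the parity claim (purely my own illustration; the grid encoding and parameters are arbitrary), here's a small script that starts from the canonical coloring above, recolors interior nodes at random, and confirms that the number of trichromatic triangles stays odd throughout:

```python
# Toy check of the parity argument: canonical coloring (top node 'R', rest of the
# right diagonal 'G', everything else 'B') has exactly one trichromatic triangle;
# recoloring interior nodes keeps the Sperner boundary intact, and the count stays odd.
import random

n = 6  # side length of the big triangle (rows 0..n)
nodes = [(i, j) for i in range(n + 1) for j in range(i + 1)]

def small_triangles():
    up = [((i, j), (i + 1, j), (i + 1, j + 1)) for i in range(n) for j in range(i + 1)]
    down = [((i, j), (i, j + 1), (i + 1, j + 1)) for i in range(1, n) for j in range(i)]
    return up + down

def is_interior(node):
    i, j = node
    return 0 < j < i and i < n  # not on the left edge, right edge, or bottom row

color = {}
for (i, j) in nodes:
    if (i, j) == (0, 0):
        color[(i, j)] = "R"
    elif j == i:
        color[(i, j)] = "G"
    else:
        color[(i, j)] = "B"

def trichromatic_count():
    return sum(len({color[a], color[b], color[c]}) == 3 for a, b, c in small_triangles())

assert trichromatic_count() == 1
for step in range(200):
    node = random.choice([v for v in nodes if is_interior(v)])
    color[node] = random.choice("RGB")    # interior recoloring preserves the boundary conditions
    assert trichromatic_count() % 2 == 1  # parity is invariant, as the proof claims

print("parity stayed odd through all recolorings")
```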
Ex 1
Let $n \in \mathbb{N}$ and let $x_0, \dots, x_n$ be the nodes of the path. Given an edge $e_k = (x_{k-1}, x_k)$, let $f_k$ denote the map that maps the color of the left to that of the right node. Given a $k$, let $F_k = f_k \circ \dots \circ f_1$. Let $b$ denote the color blue and $g$ the color green. Let $B_k$ be 1 if edge $e_k$ is bichromatic, and 0 otherwise. Then we need to show that $\sum_{k=1}^{n} B_k$ is odd. We'll show that $F_k$ swaps the two colors iff $\sum_{i=1}^{k} B_i$ is odd, which is a strictly stronger statement than the contrapositive.
For $k = 1$, the LHS is equivalent to $f_1$ swapping the colors, and indeed $f_1$ equals the swap map if $e_1$ is bichromatic, and o...
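And a quick brute-force check of the claim itself (my own sanity check, independent of the proof above): on a path whose endpoints are colored differently, the number of bichromatic edges always comes out odd.

```python
# Brute-force check: random blue/green paths with differently colored endpoints
# always have an odd number of bichromatic edges.
import random

for trial in range(1000):
    n = random.randint(1, 30)
    colors = ["blue"] + [random.choice(["blue", "green"]) for _ in range(n - 1)] + ["green"]
    bichromatic = sum(colors[i] != colors[i + 1] for i in range(len(colors) - 1))
    assert bichromatic % 2 == 1
print("all trials: odd number of bichromatic edges")
```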
Ex 4
Given a computable function $f : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$, define a function $g$ by the rule $g(n) = f(n, n) + 1$. Then $g$ is computable, however $g \neq f(n, \cdot)$ for every $n$, because for any $n$, we have that $g(n) = f(n, n) + 1$ and $f(n, \cdot)(n) = f(n, n)$.
Ex 5:
We show the contrapositive: given a function $\mathrm{halt}$, we construct a surjective function from $\mathbb{N}$ onto the relevant set of computable functions as follows: enumerate all Turing machines, such that each $n \in \mathbb{N}$ corresponds to a string. Given an $n$, if the string does not decode to a Turing machine, set the output to some fixed default function. If it does, let $M_n$ denote that Turing machine. Let ...
Ex 1
Exercise 1: Let $f : S \to \{0,1\}^S$ and let $g : S \to \{0,1\}$ be defined by $g(x) = 1 - f(x)(x)$. Suppose that $g \in \operatorname{im}(f)$, then let $x_0$ be an element such that $f(x_0) = g$. Then by definition, $g(x_0) = 1 - f(x_0)(x_0)$ and $f(x_0)(x_0) = g(x_0)$. So $g(x_0) = 1 - g(x_0)$, a contradiction. Hence $g \notin \operatorname{im}(f)$, so that $f$ is not surjective.
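Not that it needs verification, but here's a tiny brute-force illustration of the same diagonal trick over a finite set (my own toy, with functions $S \to \{0,1\}$ standing in for subsets):

```python
# Finite sanity check of the diagonal argument: for every f mapping a small set S
# into functions S -> {0,1}, the flipped diagonal g is never in the image of f.
from itertools import product

S = [0, 1, 2]
for rows in product(product([0, 1], repeat=len(S)), repeat=len(S)):
    f = {x: dict(zip(S, rows[x])) for x in S}   # f(x) as a dict S -> {0,1}
    g = {x: 1 - f[x][x] for x in S}             # g(x) = 1 - f(x)(x)
    assert all(f[x] != g for x in S)            # g is not f(x) for any x
print("the diagonal function is never in the image")
```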
Ex 2
Exercise 2: Since $T$ is nonempty, it contains at least one element $t_0$. Let $h : T \to T$ be a function without a fixed point, then $h(t_0) \neq t_0$, so that $t_0$ and $h(t_0)$ are two different elements in $T$ (this is the only thing we shall use the function $h$ for).
Let $f : S \to T^S$ for $S$ nonempty. Suppose by contradiction that $f$ is surject...
I really don't think this is a reasonable measure for ability to do long term tasks, but I don't have the time or energy to fight this battle, so I'll just register my prediction that this paper is not going to age well.