even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)
It seems to me that your analogy is the wrong way around. IE, the right analogy would be "if I knew that a bunch of 5-year olds were reading my mind, I have...actually, a pretty good idea how to think deceptive thoughts in a way that avoids their tools".
(For what it's ...
After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".
That seems fine to me, given that (as far as I am concerned) no detailed co...
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):
What is AI? Who is building it? Why? And is it going to be a future we want?
Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).
Machines of hateful competition need not have such hindrances.
"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work...
I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".
To hint at what I mean, consider this claim:
Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.
I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had ...
Quick reaction:
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?
(My current take is that for most readers, sticking to "literal extinction" is the better tradeoff between accuracy on the one hand and avoiding confusion/verbosity on the other. But perhaps it deserves at least a footnote or some other qualification.)
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't so much to figure out whether the particular arguments go through or not, but to ask which properties your model must have if you want to be able to evaluate those arguments rigorously.
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this is that, actually, current LLMs do care about our preferences...
Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)
Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.
Two points that seem relevant here:
(At the moment, both of these questions seem open.)
This made me think of "lawyer-speak", and other jargons.
More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)
I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability on widespread-automation worlds.
Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).
But ultimately, I think none of these fits perfectly. So a longer, self-contained description is somethi...
An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".
I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take ne...
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.
Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)
So, with this definition, a lot rests on how you distinguish between "cannot be ...
Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.
1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be...
My reaction to this is something like:
Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks --- at least in some cases, and we don't have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is it that people don't care? Suppose you managed to d...
When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'...
Another piece of related work: Simon Zhuang, Dylan Hadfield-Menell: Consequences of Misaligned AI.
The authors assume a model where the state of the world is characterized by multiple "features". There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so -- by definition -- features are things we care about (I imagine money, QALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a stri...
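To make the setup concrete, here is a toy numerical sketch (my own construction, with made-up numbers and a made-up utility function; not taken from the paper): three features we care about, a shared budget, and a proxy objective that only mentions two of the features.

```python
import numpy as np

# Toy illustration (my own numbers, not the paper's formalism):
# true utility is strictly increasing in every feature, and there is a fixed
# budget, so raising one feature always costs resources that could have gone
# to the others.

def true_utility(x):
    return np.sum(np.log1p(x))  # strictly increasing in each feature

budget = 3.0
status_quo = np.array([1.0, 1.0, 1.0])  # some sensible starting allocation

# The proxy objective only mentions the first two features. With a linear
# proxy and equal costs, the optimiser spends the entire budget on the
# mentioned features and drives the omitted one down to its minimum.
proxy_optimum = np.array([1.5, 1.5, 0.0])

assert status_quo.sum() <= budget and proxy_optimum.sum() <= budget

print("true utility at status quo:   ", true_utility(status_quo))     # ~2.08
print("true utility at proxy optimum:", true_utility(proxy_optimum))  # ~1.83
# Optimising the incomplete proxy leaves us worse off than not optimising at
# all, which is a toy version of the paper's point.
```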
Incidentally, I am trying to come up with a "better" model for "this stuff", one that would have predictive power over reality. (As opposed to starting out with a clear bottom line.) No solutions yet, but I do have some thoughts. If other people are also actively working on this, I would be happy to talk.
I might be interpreting things wrong, but it seems to me that the paper is doing things the wrong way around. That is, (it seems to me that) the paper sets out to prove that Goodhart's law is an issue and picks a setting where this will be the case --- as opposed to picking a setting, then investigating whether/when Goodhart's law is an issue.
By this, I don't mean to say that the paper is bad; it is good. I merely mean to say that we should view it as a nice metaphor that formalises some intuitions about Goodhart's law, rather than as a model that is "cau...
My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn't agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.
To illustrate on an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard.[1] And now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most...
Adding the missing baseline scenario: There is a bunch of open-source versions of AutoGPT that have been explicitly tasked with destroying the world (for the lolz). One day, an employee of one of the leading companies plugs the latest predictive model into this (to serve as an argument on twitter for the position that AI risks are overblown). And then we all die.
This post seems related to an exaggerated version of what I believe: Humans are so far from "agents trying to maximize utility" that to understand how to align AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn't mean it's sufficient to understand the latter. Just that it's a prerequisite.)
As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)
Flagging confusion / potential disagreement: I think only predicting humans is neither sufficient nor necessary for the results to be aligned / helpful / not doom. Insufficient because if misaligned AGI is already in control, or likely going to be in control later, predicting arbitrary existing humans seems unsafe. [Edit: I think this is very non-obvious and needs further supporting arguments.] Not necessary because it should be fine to predict any known-to-be-safe process. (As long as you do this in a well-founded manner / not predicting itself.)
Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence about how much weight to assign to the simulator theory, and I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously, when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)
Proposition 1: This is false, and the proof is wrong. F...
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we would want to talk about (speedup) "how much are AI assi...
My ~2-hour reaction to the challenge:[1]
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making thi...
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.
As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inne...
One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result, I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.)
One implication I see is that if the simulator architecture becomes frequently used, it might ...
Explanation for my strong downvote/disagreement:
Sure, in the ideal world, this post would have a much better scholarship.
In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and a 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something"[1].
Critch, preceding post: Strategies involving non-Overton elements are not worth it
Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements
Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy to...
(Not very sure I understood your description right, but here is my take:)
However, I share the general sentiment behind your post --- I also don't understand...
Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you...
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.
Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)
But I agree that with SDA-AGIs,...
That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).
You are in a world where most (1 - ε) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (ie, their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in r...
I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.
I agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me.
(2') Whether or not it works-as-intended depends on...
I think I agree with your claims about committing, AI designs, and self-modifying into a two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:
(1) I am not claiming that SPI will never work as intended (ie, get adopted, don't change players' strategies, don't change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others.
(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some example...
Humans and human institutions can't easily make credible commitments.
That seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of a state that threatens suicidal war unless X happens, and they are stubborn enough to follow through on it. Then if X doesn't happen, you might get a coup instead, thus avoiding the war.
It seems like your main objection arises because you view SPI as an agreement between the two players.
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).
In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular guns. And suppose t...
the tl;dr version of the full report:
The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purp...
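To illustrate the intended mechanism, here is a toy threat-game sketch (all numbers and the give-in probability are made up by me; this is not anything from the report, just a minimal illustration):

```python
# Toy threat game. The target has adopted a surrogate goal (really hating
# being shot by a water gun) and commits to responding to threats against it
# exactly as it would respond to threats against its real goal. Because of
# the commitment, the threatener faces the same odds of the target giving in
# either way.

p_give_in = 0.3  # made-up probability that the target gives in to a threat

# Payoffs to (threatener, target) in each terminal outcome (made-up numbers).
outcomes = {
    "give in":                      (+1.0,  -1.0),
    "carry out threat (real goal)": (-0.5, -10.0),  # real harm is done
    "carry out threat (surrogate)": (-0.5,  -0.1),  # a squirt of water
}

def expected_payoffs(threatened_goal):
    """Expected (threatener, target) payoffs when this goal is threatened."""
    give_in = outcomes["give in"]
    carried = outcomes[f"carry out threat ({threatened_goal})"]
    return tuple(p_give_in * g + (1 - p_give_in) * c
                 for g, c in zip(give_in, carried))

print("threaten real goal:", expected_payoffs("real goal"))   # ~(-0.05, -7.3)
print("threaten surrogate:", expected_payoffs("surrogate"))   # ~(-0.05, -0.37)
# The threatener's expected payoff is identical in both cases, so the hope is
# that it is willing to threaten the surrogate instead, and the target is much
# better off whenever a threat actually gets carried out.
```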
As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like, you have to specify the map (at least locally), how far we can see, what the impact of this and that research will be...
I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.
> If we currently have some AI system x, we can ask which systems are reachable from x...
Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.
- If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
- I could think in words they don't know.
- I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
- I could leave a part of my plans implicit, and only figure out the details later.
- I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinkin...