All of Sam Clarke's Comments + Replies

Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.

Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:

  • ~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
  • so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of
... (read more)
1 Rafael Harth
Thanks! I agree it's an error, of course. I've changed the section, do you think it's accurate now?

Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:

  • Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, actions that prevent the world from being destroyed by any other future AGI)
  • AI systems that are capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output” (a minimal toy sketch of this is included after this comment)
  • Training an aligned/corrigible/obedient consequentialist is something that Elieze
... (read more)
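A purely illustrative aside: below is a minimal toy sketch of what "searches paths through time and selects high-scoring ones for output" could look like mechanically. It is my own sketch, not taken from the comment above or from Eliezer's writing, and the actions, world model, and scoring function are all invented for the example.

```python
from itertools import product

# Toy "consequentialist": enumerate candidate action sequences ("paths through
# time"), roll each one forward with a world model, score the resulting
# outcome, and output the highest-scoring path. Everything here is a made-up
# illustration, not a claim about how any real system is built.

ACTIONS = ["wait", "gather", "build"]

def simulate(state, action):
    """Hypothetical world model: (resources, progress) -> next state."""
    resources, progress = state
    if action == "gather":
        return (resources + 1, progress)
    if action == "build" and resources > 0:
        return (resources - 1, progress + 1)
    return state

def score(state):
    """Hypothetical objective: value only the final 'progress'."""
    return state[1]

def plan(initial_state, horizon=4):
    best_path, best_score = None, float("-inf")
    for path in product(ACTIONS, repeat=horizon):      # search over paths
        state = initial_state
        for action in path:
            state = simulate(state, action)            # roll the model forward
        if score(state) > best_score:                  # select high-scoring ones
            best_path, best_score = path, score(state)
    return best_path

print(plan((0, 0)))  # -> ('gather', 'gather', 'build', 'build')
```

The toy is only meant to make the structural claim concrete: "consequentialism" in this sense is about the search-and-select structure, not about any particular goal or architecture.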
3 Rob Bensinger
Note that this is still better than 'honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don't need to halt'!

Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.

which if true should preclude strong confidence in disaster scenarios

Though only for disaster scenarios that rely on inner misalignment, right?

... seem like world models that make sense to me, given the surrounding justifications

FWIW, I don't really understand those world models/intuitions yet:

  • Re: "earlier patches not generalising as well as the deep algorithms" - I don't understand/am sceptical about the abstraction of "earlier patches" vs. "deep algori
... (read more)

If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think it’s more likely the answer is yes.

What level of deployment of unaligned benchmark systems do you expect would make doom plausible? "Someone" suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?

A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice

Sure, I agree this is a stronger point.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

Not really, unfortunately. In those posts, the authors are focusing on painting a plausible pi... (read more)

2 Koen Holtman
I feel that Christiano's post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.

There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this way in handling the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is no news to you. Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience in analysing polling results with this type of response rate, however, I would be surprised if you could find a clear signal above the noise floor.

Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important work to do if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.

I'm broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.

to the extent that Evan has felt a need to write an entire clarification post.

Yeah, and recently there has been even more disagreement/clarification attempts.

I should have specified this on the top level question, but (as mentioned in my own answer) I'm talking about abergal's suggestion of what inner alignment failure should refer to (basically: a model pursuing a differe... (read more)

3 Koen Holtman
Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal's suggestion, which would have clarified your concerns about inner alignment arguments for me. I have mailed the team that runs the website to ask if they could improve how this filtering works.

Just read the post with the examples you mention, and skimmed the related arxiv paper. I like how the authors develop the metrics of 'objective robustness' vs 'capability robustness' while avoiding the problem of trying to define a single meaning for the term 'inner alignment'. Seems like good progress to me.

Thanks for your reply!

depends on what you mean with strongest arguments.

By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).

Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.

Agree, though I expect it's more like, the emphasis needs to be different, whilst the underlying argument is similar (conditional on talking about your second definition of "strongest").

many di

... (read more)
1 Koen Holtman
I disagree. In my reading, all of these books offer fairly wide-ranging surveys of alignment failure mechanisms. A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice. Once we take it as axiomatic that some people are stupid some of the time, presenting a convincing proof that some AI alignment failure mode is theoretically possible does not require much heavy lifting at all.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes. There is a technical failure mode story which says that it is very difficult to equip a very powerful future AI with an emergency stop button, and that we have not solved that technical problem yet. In fact, this story is a somewhat successful meme in its own right: it appears in all 3 books I mentioned.

That story is not very compelling to me. We have plenty of technical options for building emergency stop buttons; see for example my post here. There have been some arguments that none of the identified technical options for building AI stop buttons will be useful or used, because they will all turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that these arguments show a theoretical possibility, but I think it is a very low possibility, so in practice these arguments are not very compelling to me.

The more compelling failure mode argument is that people will refuse to use the emergency AI stop button, even though it is available. Many of the posts with the tag above show failure scenarios where the AI fails to be aligned because o
2 Koen Holtman
I'll do the easier part of your question first: I have not read all the material about inner alignment that has appeared on this forum, but I do occasionally read up on it.

There are some posters on this forum who believe that contemplating a set of problems which are together called 'inner alignment' can work as an intuition pump that would allow us to make needed conceptual breakthroughs. The breakthroughs sought have mostly to do, I believe, with analyzing possibilities for post-training treacherous turns which have so far escaped notice. I am not (or am no longer) one of the posters who have high hopes that inner alignment will work as a useful intuition pump.

The terminology problem I have with the term 'inner alignment' is that many working on it never make the move of defining it in rigorous mathematics, or with clear toy examples of what are and what are not inner alignment failures. Absent either a mathematical definition or some defining examples, I am not able to judge whether inner alignment is the main alignment problem, or whether it would be a minor one, but still one that is extremely difficult to solve.

What does not help here is that there are by now several non-mathematical notions floating around of what an inner alignment failure even is, to the extent that Evan has felt a need to write an entire clarification post. When poster X calls something an example of an inner alignment failure, poster Y might respond and declare that in their view of inner alignment failure, it is not actually an example of an inner alignment failure, or at least not a very good one. If we interpret it as a meme, then the meme of inner alignment has a reproduction strategy where it reproduces by triggering social media discussions about what it means.

Inner alignment has become what Minsky called a suitcase word: everybody packs their own meaning into it. This means that for the purpose of distillation, the word is best avoided. If you want to distil the discu

Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)

  • It's difficult to explicitly write out objective functions which express all our desires about AGI behaviour. (A toy illustration of this point follows after this excerpt.)
    • There’s no simple metric which we’d like our agents to maximise - rather, desirable AGI behaviour is best formulated in terms of concepts like ob
... (read more)
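A purely illustrative aside on the point above: a minimal toy sketch (my own, not taken from the quoted arguments) of how an optimiser given only a simple, explicitly written metric can end up far from what we actually wanted. The "engagement" proxy and "user benefit" function are invented for the example.

```python
# Toy proxy-objective example: the easy-to-write metric ("engagement") is not
# the thing we actually care about ("user benefit"), so maximising the proxy
# lands on a policy we would not endorse. All names and numbers are invented.

def proxy_reward(clickbait_level):
    """Simple, explicitly specifiable objective: engagement rises with clickbait."""
    return clickbait_level

def true_value(clickbait_level):
    """What we actually want (hard to write down in practice): benefit peaks
    at moderate engagement and collapses for pure clickbait."""
    return clickbait_level * (10 - clickbait_level)

policies = range(11)  # clickbait_level from 0 to 10

best_for_proxy = max(policies, key=proxy_reward)  # what the optimiser picks
best_for_us = max(policies, key=true_value)       # what we would have wanted

print(best_for_proxy, true_value(best_for_proxy))  # 10 0
print(best_for_us, true_value(best_for_us))        # 5 25
```

The gap between the two answers is the outer alignment worry in miniature: optimisation pressure goes to whatever metric we managed to write down, not to the desires we failed to formalise.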

I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordinati... (read more)

0 Daniel Kokotajlo
Thanks! The post was successful then. Your point about stickiness is a good one; perhaps I was wrong to emphasize the change in number of ideologies. The "AI takeover without AGI or agency" bit was a mistake in retrospect. I don't remember why I wrote it, but I think it was a reference to this post which argues that what we really care about is AI-PONR, and AI takeover is just a prominent special case. It also might have been due to the fact that a world in which an ideology uses AI tools to cement itself and take over the world can be thought of as a case of AI takeover, since we have AIs bossing everyone around and getting them to do bad things that ultimately lead to x-risk. It's just a weird case in which the AIs aren't agents or general intelligences. :)

Thanks for your comment!

Are we sure that, given the choice between "lower crime, lower costs and algorithmic bias" and "higher crime, higher costs and only human bias", and given that we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?

Good point, thanks, I hadn't thought that sometimes it actually would make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I'd add is that if one did delve into the research to work this out for a particul... (read more)

3 Rohin Shah
Yup, all of that sounds right to me! One caveat is that on my models of AI development I don't expect the CEO could just copy model parameters to the intern. I think it's more likely that we have something along the lines of "graduate of <specific college major>" AI systems that you then copy and use as needed. But I don't think this really affects your point.

Yeah jtbc I definitely would not assert this. If I had to make an argument for as-much-interpretability, it would be something like "in the scenario we're considering, AI systems are roughly human-level in capability; at this level of capability societal organization will still require a lot of modularity; if we know nothing else and assume agents are as black-boxy as humans, it seems reasonable to assume this will lead to a roughly similar amount of interpretability as current society". But this is not a particularly strong argument, especially in the face of vast uncertainty about what the future looks like.