All of Lauro Langosco's Comments + Replies

The argument I think is good (number 2 in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (e.g. "things we call optimization often produce misaligned results, therefore SGD is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient, you can stop thinking about evolution and just think directly about AI.

evolution does not grow minds, it grows hyperparameters for minds.

Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.

I want to distinguish two possible takes here:

  1. The argument from direct implication: "Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives"
  2. Evolution as an
... (read more)
4Alex Turner
I think it's extremely relevant, if we want to ensure that we only analogize between processes which share enough causal structure to ensure that lessons from e.g. evolution actually carry over to e.g. AI training (due to those shared mechanisms). If the shared mechanisms aren't there, then we're playing reference class tennis because someone decided to call both processes "optimization processes."

I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion.

IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they'd lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in 'what would be useful for avoiding AGI doom?'. To me it looks like, on your definition, outer alignment is basically a trivial problem that doesn't help alignment much.

Mor... (read more)

1Matthew Barnett
Can you explain how you're defining outer alignment and value specification? I'm using this definition, provided by Hubinger et al. Evan Hubinger provided clarification about this definition in his post "Clarifying inner alignment terminology". I deliberately avoided using the term "outer alignment" in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different, but the difference is not very relevant for the purpose of the post.) Overall, I think the two problems are closely associated, and solving one gets you a long way towards solving the other.

In the post, I defined the value identification/specification problem based on the Arbital entry for the value identification problem. I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.

I'd appreciate it if you clarified whether you are saying:

  1. That my definition of the value specification problem is different from how MIRI would have defined it in, say, 2017. You can use Nate Soares' 2016 paper or their 2017 technical agenda to make your point.
  2. That my definition matches how MIRI used the term, but the value specification problem remains very hard and unsolved, and GPT-4 is not even a partial solution to this problem.
  3. That my definition matches how MIRI used the term, and we appear to be close to a solution to the problem, but a solution to the problem is not sufficient to solve the hard bits of the outer alignment problem.

I'm more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.

Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values?

(I'm asking because it's hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it's possible for outer alignment to become easier without the rest of the problem becoming easier).

I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.

I agree lots of the responses elide the part where you emphasize that it's important how GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that... (read more)

2Matthew Barnett
I don't think it's necessary for them to have made that exact claim. The point is that they said value specification would be hard. If you solve value specification, then you've arguably solved a large part of the outer alignment problem. Then, you just need to build a function maximizer that allows you to robustly maximize the utility function that you've specified. [ETA: btw, I'm not saying the outer alignment problem has been fully solved already. I'm making a claim about progress, not about whether we're completely finished.]

I interpret MIRI as saying "but the hard part is building a function maximizer that robustly maximizes any utility function you specify". And while I agree that this represents their current view, I don't think this was always their view. You can read the citations in the post carefully, and I don't think they support the idea that they've consistently always considered inner alignment to be the only hard part of the problem.

I'm not claiming they never thought inner alignment was hard. But I am saying they thought value specification would be hard and an important part of the alignment problem.

You make a claim that's very close to that: your claim, if I understand correctly, is that MIRI thought we wouldn't get an AI that both understands human values and doesn't lie to us about them (or otherwise decide to give misleading or unhelpful outputs):

The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of a

... (read more)
1Matthew Barnett
I agree the claim is "similar". It's actually a distinct claim, though. What are the reasons why it's false? (And what do you mean by saying that what I wrote is "false"? I think the historical question is what's important in this case. I'm not saying that solving the value specification problem means that we have a full solution to the alignment problem, or that inner alignment is easy now.)

My paraphrase of your (Matthew's) position: while I'm not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don't systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.

(End paraphrase)

I t... (read more)

2Matthew Barnett
I'm sympathetic to some of these points, but overall I think it's still important to acknowledge that outer alignment seems easier than many expected even if we think that inner alignment is still hard. In this post I'm not saying that the whole alignment problem is now easy. I'm making a point about how we should update about the difficulty of one part of the alignment problem, which was at one time considered both hard and important to solve.

I think the most plausibly correct interpretation here of "a genie must share the same values" is that we need to solve both the value specification and inner alignment problem. I agree that just solving one part doesn't mean we've solved the other. However, again, I'm not claiming the whole problem has been solved.

Yes, and people gave proposals about how this might be done at the time. For example I believe this is what Paul Christiano was roughly trying to do when he proposed approval-directed agents. Nonetheless, these were attempts. People didn't know whether the solutions would work well. I think we've now gotten more evidence about how hard this part of the problem is.

Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing

... (read more)

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly, 
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour i
... (read more)

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).


Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff. 
  • Assume the AGI is misaligned. Be super paranoid
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field
... (read more)
2Lauro Langosco
There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly,
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
  • Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)

If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO neither making the field of alignment 10x larger nor requiring evals would solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other ove
... (read more)

I like that mini-game! Thanks for the reference

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data,

... (read more)

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.

Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?
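
Here's a minimal toy sketch in Python of the policy in the quoted passage (my own illustration, not code from the post; the option names and the preference set are made-up placeholders). It shows how the policy blocks the standard money pump for Completeness when A and B are incomparable and A- is strictly worse than A:

```python
# Toy model of the quoted policy: "if I previously turned down some option X,
# I will not choose any option that I strictly disprefer to X."
# Preferences are a set of (better, worse) pairs; pairs not listed are incomparable,
# so the relation is incomplete.

PREFS = {("A", "A-")}  # A- is strictly worse than A; A vs B and A- vs B are incomparable


def strictly_prefers(x, y):
    return (x, y) in PREFS


def permissible(options, turned_down):
    """Options that are not strictly dispreferred to anything previously turned down."""
    return [o for o in options
            if not any(strictly_prefers(x, o) for x in turned_down)]


turned_down = set()

# Trade 1: the agent holds A and is offered B. Both are permissible (incomparable);
# suppose it takes B, thereby turning down A.
print(permissible(["A", "B"], turned_down))   # ['A', 'B']
turned_down.add("A")

# Trade 2 (the pump step): the agent holds B and is offered A-.
# A- is strictly dispreferred to the turned-down A, so the policy rules it out.
print(permissible(["A-", "B"], turned_down))  # ['B']  (the agent keeps B; the pump fails)
```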

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sur... (read more)

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don't see continuity or "Things will be weird before getting extremely weird" as a crux. (I don't know why you think he does, and don't know what he thinks, but would guess he doesn't think it's a crux either)

4Jan_Kulveit
I've been part of, or read, enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect actual cruxes lie. I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.

- pivotal acts vs. pivotal processes -- my take is people's stance on feasibility of pivotal acts vs. processes partially depends on continuity assumptions
- what do you believe about pivotal acts?
- assuming continuity, do you expect existing non-human agents to move important parts of their cognition to AI substrates?
-- if yes, do you expect large-scale regulations around that?
--- if yes, will it also be partially automated?
- different route: assuming continuity, do you expect a lot of alignment work to be done partially by AI systems, inside places like OpenAI?
-- if at the same time this is a huge topic for the whole society, academia and politics, would you expect the rest of the world not to try to influence this?
- different route: assuming continuity, do you expect a lot of "how different entities in the world coordinate" to be done partially by AI systems?
-- if yes, do you assume technical features of the system matter? like, e.g., multi-agent deliberation dynamics?
- assuming the world notices AI safety as a problem (it did, much more so, since writing this post)
-- do you expect a large amount of attention and resources of academia and industry will be spent on AI alignment?
--- would you expect this will be somehow related to the technical problems and how we understand them?
--- e.g. do you think it makes no difference to the technical problem if 300 or 30k people work on it?
---- if it makes a difference, does it make a difference how the attention is allocated?

Not sure if the double-crux between us would rest on the same cruxes, but I'm happy to try :)

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the prob... (read more)

2Johannes Treutlein
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

3Steve Byrnes
I’m gonna pause to make sure we’re on the same page. We’re talking about this claim I made above: And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”. And then by “prerequisites” we’re referring to the thing you wrote above:

OK, now to respond.

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)

For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situa
... (read more)
2Steve Byrnes
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place. But anyway…

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Second bullet point → Ditto

Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization rears its ugly head.

Fourth bullet point → I disagree for reasons here.

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particu
... (read more)
3Buck Shlegeris
My current guess is that people who want to use this algorithm should just implement it from scratch themselves--using our software is probably more of a pain than it's worth if you don't already have some reason to use it.
1Pranav Gade
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend - it's probably very limited compared to Redwood's thing, but seems to work on the one example I've tried.
2Buck Shlegeris
nope, but hopefully we'll release one in the next few weeks.

I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".

(Though of course it's important to spell the argument out)

2Ajeya Cotra
Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

I agree with your general point here, but I think Ajeya's post actually gets this right, eg

There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”

and

What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from

... (read more)
2Lauro Langosco
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed". (Though of course it's important to spell the argument out)

I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?

People at OpenAI regularly say things like

And you say:

  • OpenAI leadership tend to put more likelihood on slow takeoff, are more optimistic about the possibility of solving alignment, especially via empirical methods that rely
... (read more)

Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model on the optimal scaling, where P is fixed. Thanks!

...though now I'm confused about why we would assume that. Surely that assumption is wrong?

  • Humans are very constrained in terms of brain size and data, so we shouldn't assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
  • Anyhow we don't need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate
... (read more)

But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.

I'm confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?

(Except if you had expected even better scaling laws by now, but it didn't sound like that was your argument?)

5kave
If you assume the human brain was trained roughly optimally, then requiring more data, at a given parameter number, to be optimal pushes timelines out. If instead you had a specific loss number in mind, then a more efficient scaling law would pull timelines in.
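
To make the shape of that update concrete, here is a rough back-of-the-envelope sketch in Python. The parameter count and the "old" tokens-per-parameter figure are purely illustrative placeholders (not numbers from the bio anchors report or from the original scaling-law papers); the 6*N*D compute estimate is the usual transformer training approximation.

```python
# Back-of-the-envelope: fix the parameter count (e.g. anchored to brain estimates)
# and compare how much data/compute "optimal" training requires under two scaling rules.

N = 1e14  # hypothetical parameter count fixed by the brain anchor (placeholder value)

tokens_per_param_chinchilla = 20  # Chinchilla-style rule of thumb: ~20 tokens per parameter
tokens_per_param_old = 2          # illustrative stand-in for the older, more data-light fits

for label, tpp in [("old fit", tokens_per_param_old),
                   ("Chinchilla fit", tokens_per_param_chinchilla)]:
    D = tpp * N    # training tokens needed for "optimal" training of an N-parameter model
    C = 6 * N * D  # standard estimate: ~6*N*D FLOP to train a transformer
    print(f"{label:15s} data = {D:.1e} tokens, compute = {C:.1e} FLOP")

# With N held fixed by the brain anchor, the Chinchilla fit demands more data and compute
# (10x with these illustrative numbers) to train "optimally", which pushes the forecast out.
# For a fixed compute budget, by contrast, Chinchilla-optimal training reaches lower loss
# than the old recipe, which is why better scaling laws shorten timelines if you instead
# hold a target loss or capability level fixed.
```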

What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?

For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?

4Rob Bensinger
From A central AI alignment problem: capabilities generalization, and the sharp left turn:

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or co
... (read more)

Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section calling it that instead of 'Paul's approach'.

2Evan Hubinger
Good point—edited.