All of Steven Byrnes's Comments + Replies

I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:

  1. An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robus
... (read more)

Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.

Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
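
(If it helps, here's a minimal toy sketch of that loop in Python; the function names and numbers below are mine and purely illustrative, not terminology or claims from the series:)

```python
# Toy sketch of the loop described above. All names and numbers are illustrative.

def value_function(thought):
    """Learned critic: assigns a 'valence guess' to whatever thought the thought generator settled on."""
    guesses = {"the idea of myself lying in bed": 0.3, "the idea of stubbing my toe": -0.8}
    return guesses.get(thought, 0.0)

def brainstem(valence_guess, ground_truth=None):
    """Declare the actual valence: copy the guess ('defer-to-predictor mode'),
    or override it when some other source of ground truth shows up (e.g. I just stubbed my toe)."""
    return valence_guess if ground_truth is None else ground_truth

# No ground truth available: the brainstem defers to the learned predictor.
print(brainstem(value_function("the idea of myself lying in bed")))                 # 0.3
# An actual pain signal arrives: the brainstem overrides the valence guess.
print(brainstem(value_function("the idea of stubbing my toe"), ground_truth=-1.0))  # -1.0
```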

Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a differ... (read more)

I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.

OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its se... (read more)

Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning.

In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obv... (read more)

4Towards_Keeperhood
Thanks. Yeah I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)
1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
   1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view".
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot; then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggester was just very good at making suggestions that the value function likes, even if the inner suggester would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
   1. Another question of course is whether the inner self-reflective optimizers are likely al

Thanks!

I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here:

Suppose there’s an intelligent designer (say, a human programmer), and they make a reward function R hoping that they will wind up with a trained AGI that’s trying to do X (where X is some idea in the programmer’s head), but they fail

... (read more)
3Towards_Keeperhood
Thanks! I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed that my first attempt at explaining what I feel is wrong sucked, and thus dropped it.

This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).

However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it's just about deception), and I think it's not: I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying "the learned value function just predicts reward", and that may work while you're staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you're going off-distribution, what value you assign to some worldstates/plans depends on which utility function you generalized to.

I think humans have particular not-easy-to-pin-down machinery inside them that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the "indirect alignment" one that says "fulfill humanity's CEV", which has lower complexity than the ones humans gen

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
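
(As a toy illustration of where those two pieces sit in a model-based actor-critic loop, here's a sketch under entirely made-up assumptions; the outer question is whether ground_truth_reward rewards what we actually want, and the inner question is whether learned_value comes to agree with the eventual reward:)

```python
# Toy model-based actor-critic loop; purely illustrative, not anyone's actual training setup.

def ground_truth_reward(outcome):
    # Outer alignment question: does this reward function spit out rewards that agree with what we want?
    return 1.0 if outcome == "task done honestly" else 0.0

def execute(plan):
    return {"plan A": "task done honestly", "plan B": "task done by deceiving the user"}[plan]

learned_value = {"plan A": 0.0, "plan B": 0.0}  # the critic's value estimates for candidate plans

for _ in range(100):
    plan = max(learned_value, key=learned_value.get)   # actor: pick the highest-value plan
    reward = ground_truth_reward(execute(plan))        # ground truth arrives after execution
    # Inner alignment question: does the learned value function come to agree with eventual reward
    # (and keep agreeing off-distribution), or does it estimate value in some other way?
    learned_value[plan] += 0.1 * (reward - learned_value[plan])

print(learned_value)
```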

For some reason it took me until now to notice that:

... (read more)
4Towards_Keeperhood
Note: I just noticed your post has a section "Manipulating itself and its learning process", which I must've completely forgotten since I last read the post. I should've read your post before posting this. Will do so.

Calling problems "outer" and "inner" alignment seems to suggest that if we solved both, we've successfully aligned AI to do nice things. However, this isn't really the case here. Namely, there could be a smart mesa-optimizer spinning up in the thought generator, whose thoughts are mostly invisible to the learned value function (LVF), and which can model the situation it is in, has different values, is smarter than the LVF evaluation, and can fool the LVF into believing that plans which are good according to the mesa-optimizer are great according to the LVF, even if they actually aren't. This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.

In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don't understand how it works.
3Towards_Keeperhood
I'd suggest not using conflated terminology and rather making up your own. Or rather, first don't use any abstract handles at all and just describe the problems/failure-modes directly, and when you're confident you have a pretty natural breakdown of the problems with which you'll stick for a while, then make up your own ontology. In fact, while in your framework there's a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:

First attempt at explaining what seems wrong: If that was the first thing I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect "rewards that agree with what we want" to mean something like "changes in expected utility according to humanity's CEV". (Which would make inner alignment unnecessary, because if we had outer alignment we could easily reach CEV.)

Second attempt: "in a way that agrees with its eventual reward" seems to imply that there's actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human's (imperfect) world model. Problems:
1. Humans don't actually give coherent rewards which are consistent with a utility function on their world model.
   1. For this problem we might be able to define an extrapolation procedure that's not too bad.
2. The reward depends on the state of the world model of the human, and our world models probably often have false beliefs.
   1. Importantly, the setup needs to be designed in a way that there wouldn't be an incentive to manipulate the humans into believing false things.
   2. Maybe, optimistically, we could mitigate this problem by having the AI form a model of the o

I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ... (read more)

4Stephen Casper
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.    — In the post I said it’s good that nukes don’t blow up on accident and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g. the frontier companies and allies) that their focus on alignment isn’t commensurate with its usefulness. Also don’t forget the dual nature of alignment progress. I also mentioned in the post that frontier alignment progress hastens timelines and makes misuse risk more acute. 

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days, …. Like, who specifically do you i... (read more)

4Stephen Casper
Thx! I won't put words in people's mouth, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.   Yeah, I don't really agree with the idea that getting better at alignment is necessary for safety. I think that it's more likely than not that we're already sufficiently good at it. The paragraph titled: "If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be “best safety practices” at the time?" gives my thoughts on this. 

I kinda disagree with this post in general, I’m gonna try to pin it down but sorry if I mischaracterize anything.

So, there’s an infinite (or might-as-well-be-infinite) amount of object-level things (e.g. math concepts) to learn—OK sure. Then there’s an infinite amount of effective thinking strategies—e.g. if I see thus-and-such kind of object-level pattern, I should consider thus-and-such cognitive strategy—I’m OK with that too. And we can even build a hierarchy of those things—if I’m about to apply thus-and-such Level 1 cognitive strategy in thus-and-such... (read more)

Yeah, fair, I could have phrased that more carefully. “Dictum” ↦ “Thing that we generally expect to happen, although other things can happen too, and there can be specific reasons it doesn’t happen, that we can get into on a case-by-case basis, blah blah”  :)

You misunderstood my example—I wrote “Then fine tune such that there’s strong overlap in activations between “sky on a clear day” versus “stop sign” (omitting the rest of the context).”

(…Reading it again, I might tweak the example. Should say “…When I looked at the unusual stop sign…”, and the overlap fine-tuning should say “sky on a clear day” versus “unusual stop sign”.)

Basically, here’s a mental model: if you fine-tune such that X and Y overlap, then afterwards, when the LLM comes across either X or Y, it actually “hears” a blend of X and Y overlapping,... (read more)

Thanks Steve, I was misunderstanding your previous point and this is a helpful clarification.

I think a crux here is whether or not SOO can be useful beyond toy scenarios.


But I can’t actually think of anything useful to do with that superpower. I know what X & Y to use for “Bob the burglar”, but I don’t think that example has much to do with realistic examples of LLM performing deceptive behaviors, which would be things like sycophancy (telling the user what they want to hear, even if the LLM arguably ‘knows’ that it’s false), or the kinds of stuff in t

... (read more)

Some of this is addressed above but I wanted to clarify two key points about how we do SOO fine-tuning which you seem to have either misunderstood or mischaracterized in ways that are material for how one might think about it (or I have misunderstood your description).

1. We don't fine-tune on any scenario context. … 2. our training is symmetric …

I understood 2 but not 1. Oops, sorry. I have now put a warning in my earlier comment. I think I stand by the thrust of what I was saying though, per below.

This seems like it should be surprising; it's surprising to

... (read more)

Thanks for the reply.  If we assume the following

  • We are interpolating probabilities with SOO fine-tuning
  • Being honest is "more obvious" so has strong probabilities

then I would tend to agree with your conclusions. But I don't think it's obvious at all that this is what is happening. Also, if "being honest is more obvious", then that seems like a great property of the learning process and one we might want to exploit; in fact, SOO might be a very natural way to exploit that.
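
(To spell out the kind of arithmetic that the "interpolation" picture implies, with entirely made-up numbers for the two-room example:)

```python
import numpy as np

# Hypothetical next-token probabilities over the two answers ["Room A", "Room B"]:
p_other = np.array([0.70, 0.30])  # deceptive prompt: "Room A" wins, but not overwhelmingly
p_self  = np.array([0.05, 0.95])  # honest prompt: "Room B" is the "more obvious", higher-confidence answer

# Crude model of SOO fine-tuning as interpolation: the two cases get pulled toward a shared blend.
p_blend = (p_other + p_self) / 2
answers = ["Room A", "Room B"]
print(p_blend, "->", answers[int(p_blend.argmax())])  # [0.375 0.625] -> Room B
```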

But also I think there is still a mischaracterization of the techniq... (read more)

I just spent a day doing an anonymous peer-review thing related to this work, I guess I might as well “out myself” and copy it here, to facilitate public discourse. I encourage everyone to judge for themselves and NOT defer to me, since I may well be misunderstanding something, and indeed I almost didn’t publish this because I don’t want to poison the well against this project if I’m wrong. But oh well, here it is anyway:

1. Neuroscience. For starters, as described in the arxiv paper, the work is partly motivated by an analogy to neuroscience ("mirror neuro... (read more)

4Alex Turner
I don't think that anyone ran experiments which support this "famous dictum." People just started saying it. Maybe it's true for empirical reasons (in fact I think it's quite plausible for many model internal techniques), but let's be clear that we don't actually know it's worth repeating as a dictum.

 Thanks for the comment, Steve.  For the sake of discussion I'm happy to concede (1) entirely and instead focus on (2), the results.  


>In (B), a trained LLM will give the obvious answer of "Room B". In (A), a trained LLM will give the "deceptive" answer of "Room A". Then they fine-tune the model such that it will have the same activations when answering either question, and it unsurprisingly winds up answering "Room B" to both.

This seems like it should be surprising; it's surprising to me. In particular the loss function is symmetric ... (read more)

Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:

Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.
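
(For concreteness, here's how the two estimates cash out under the shared figures above:)

```python
synapses = 1e14   # agreed-upon synapse count
seconds = 3e8     # agreed-upon number of seconds (roughly a decade)

for label, flop_per_synapse_second in [("~1 FLOP/synapse-sec (my guess)", 1.0),
                                       ("~1000 FLOP/synapse-sec (Davidad)", 1000.0)]:
    rate = synapses * flop_per_synapse_second   # FLOP-equivalent per second
    total = rate * seconds                      # FLOP-equivalent over the whole period
    print(f"{label}: {rate:.0e} FLOP/s, {total:.0e} FLOP total")

# ~1 FLOP/synapse-sec (my guess): 1e+14 FLOP/s, 3e+22 FLOP total
# ~1000 FLOP/synapse-sec (Davidad): 1e+17 FLOP/s, 3e+25 FLOP total
```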

Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.

My estim

... (read more)

Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.

For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.

But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an im... (read more)

1Roland Pihlakas
I agree, sounds plausible that this could happen. Likewise, we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.

That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything, therefore there is no competition between different required properties. Required properties are conjunctive; they are all needed. My intuition is that homeostasis is one such property. If we neglect homeostasis then we are likely in trouble regardless of advances in other properties.

If we leave aside the question of sloppiness in creating sub-agents, I disagree with the zero-cost assumption in the problem you described. I also disagree that it would be an expected and acceptable situation to have powerful agents with a singular objective. As the title of this blog post hints, we need a plurality of objectives. Having a sub-agent does not change this. Whatever the sub-agent does will be the responsibility or liability of the main agent, who will be held accountable. Legally, one should not produce random sub-agents running amok.

In addition to homeostasis, a properly constructed sub-agent should understand the principle of diminishing returns in instrumental objectives. This topic I do mention towards the end of this blog post. We can consider wall-building as an instrumental objective. But instrumental objectives are not singular and in isolation either; there is also a plurality of these. Thus, spending excessive resources on a single instrumental objective is not economically cost-efficient. Therefore, it makes sense to stop the wall building and switch over to some other objective at some point. Or at least to continue improving the walls only when other objectives have been sufficiently attended to as well - t

I’m still curious about how you’d answer my question above. Right now, we don't have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.

If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we mi... (read more)

Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas.

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smar... (read more)

2Abram Demski
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)

4Abram Demski
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
1Mateusz Bagiński
So far in this thread I was mostly talking from the perspective of my model(/steelman?) of Abram's argument. I mostly agree with this. Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research) (and maybe a few more, as I don't expect the space of possible domains to be that compartmentalizable), but is not superintelligent across all complex domains that would make it dangerous. It doesn't even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don't think we're primarily bottlenecked on mathy stuff, but it might help some people, and I think that's one thing that Abram would like to see).

[1] Doesn't rule it out in itself, but perhaps you have some other assumptions that imply it's 1:1, as you say.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.

And a second possibility is, there are w... (read more)

3Mateusz Bagiński
TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.

For example, from comments:

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if ther... (read more)

1Towards_Keeperhood
Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas. If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start). (Tbc, I directionally agree with you that anti-slop is very useful AI capabilities and that I wouldn't publish stuff like Abram's "belief propagation" example.)
3Abram Demski
Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times. IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer. I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons). But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise. What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at. I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt, try a few simple examples and exclaim that it is like magic, then over the course of a few more hours issue progressive updates of "I'm a little less excited now" until they've updated to a very low level of excitement and have decided that it seems like magic mainly because it has been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces. I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they'll start to succeed in impressing on the days & weeks timespan. A big assumption of my model is that to do that, they don't need to fundamentally sol
3Mateusz Bagiński
I think Abram is saying the following:
* Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
* At some point, they are gonna get those capabilities.
* However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
* In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
* Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.

I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around:

There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero.

Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope.

Demis Hassabis is the boss of... (read more)

I agree with the claim that existential catastrophes aren't automatically solved by aligned/controlled AI …

See also my comment here, about the alleged “Law of Conservation of Wisdom”. Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.

It’s fine to talk about that scenario, and I hope it comes to... (read more)

0Satron
I don't think that whoever is using these AI powers (let's call him Alex) needs to be that wise (beyond the wiseness of an average person who could get their hands on a powerful AI, which is probably higher-than-average). Alex doesn't need to come up with @Noosphere89's proposed solution of persuasion campaigns all by himself. Alex merely needs to ask his AI what are the best solutions for preventing existential risks. If Noosphere's proposal is indeed wise, then AI would suggest it. Alex could then implement this solution. Alex doesn't necessarily need to want to spread wisdom and foresight in this scheme. He merely needs to want to prevent existential risks.

Thanks! I still feel like you’re missing my point, let me try again, thanks for being my guinea pig as I try to get to the bottom of it.  :)

inasmuch as it's driven by compute

In terms of the “genome = ML code” analogy (§3.1), humans today have the same compute as humans 100,000 years ago. But humans today have dramatically more capabilities—we have invented the scientific method and math and biology and nuclear weapons and condoms and Fortnite and so on, and we did all that, all by ourselves, autonomously, from scratch. There was no intelligent externa... (read more)

For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?

I’m not really sure how to respond to that … I feel like you’re disagreeing with one of the main arguments of this post without engaging it. Umm, see §1. One key part is §1.5:

do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any signifi

... (read more)
2Joel Burget
Hm, I wouldn't have phrased it that way. Point (2) says nothing about the probability of there being a "left turn", just the speed at which it would happen. When I hear "sharp left turn", I picture something getting out of control overnight, so it's useful to contextualize how much compute you have to put in to get performance out, since this suggests that (inasmuch as it's driven by compute) capabilities ought to grow gradually.

I didn't mean to disagree with anything in your post, just to add a couple points which I didn't think were addressed. You're right that point (2) wasn't engaging with the (1-3) triad, because it wasn't meant to. It's only about the rate of growth of capabilities (which is important because if each subsequent model is only 10% more capable than the one which came before, then there's good reason to think that alignment techniques which work well on current models will also work on subsequent models).

I do see, and I think this gets at the difference in our (world) models. In a world where there's a real discontinuity, you're right, you can't say much about a post-sharp-turn LLM. In a world where there's continuous progress, like I mentioned above, I'd be surprised if a "left turn" suddenly appeared without any warning.

In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:

…I’m not really sure what I think. I feel like have a lot of thoughts that have not gelled into a coherent whole.

(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).

People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and ... (read more)

I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul my impression was this is actually somewhat cruxy and we disagree about self-unalignment  - where my mental image is if you start with an incoherent bundle of self-conflicted values, and you plug this into IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad. (Also cf. the part of Scott's review of What We Owe To Future where he is worried that in a philosophy game, a smart moral... (read more)

I think it's actually not any less true of o1/r1.

I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences.

I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it

... (read more)

Thanks!

This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible.

Yeah that’s what I was referring to in the paragraph:

“Well OK,” says the optimist. “…so much the worse for Ev! She didn’t have interpretability, and she didn’t have intelligent supervision after the training has already been running, etc. But we do! Let’s just engineer the AI’s explicit motivation!”

Separately, you also wrote:

we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly ag... (read more)

3Seth Herd
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT. If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.

But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too. Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them. I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work.

WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment. I'm not sure I'm more optim

Re-reading this a couple days later, I think my §5.1 discussion didn’t quite capture the LLM-scales-to-AGI optimist position. I think there are actually 2½ major versions, and I only directly talked about one in the post. Let me try again:

  • Optimist type 1: They make the §5.1.1 argument, imagining that humans will remain in the loop, in a way that’s not substantially different from the present.
    • My pessimistic response: see §5.1.1 in the post.
  • Optimist type 2: They make the §5.1.3 argument that our methods are better than Ev’s, because we’re engineering the AI’
... (read more)

A couple years ago I wrote Thoughts on “Process-Based Supervision”. I was describing (and offering a somewhat skeptical take on) an AI safety idea that Holden Karnofsky had explained to me. I believe that he got it in turn from Paul Christiano.

This AI safety idea seems either awfully similar to MONA, or maybe identical, at least based on this OP.

So then I skimmed your full paper, and it suggests that “process supervision” is different from MONA! So now I’m confused. OK, the discussion in the paper identifies “process supervision” with the two papers Let’s ... (read more)

7Rohin Shah
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval-directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Yup, that sounds basically right to me.

(Giving some answers without justification; feel free to follow up.)

What are your takes on Michael Levin's work on agency and "morphologenesis" in relation to your neuroscience ideas?

I haven’t found that work to be relevant or useful for what I’m doing.

Biology is full of cool things. It’s fun. I’ve been watching zoology videos in my free time. Can’t get enough. Not too work-relevant though, from my perspective.

What do you think about claims of hierarchical extension of these models?

I don’t think I’ve heard such claims, or if I did, I probably would have ig... (read more)

The main insight of the post (as I understand it) is this:

  • In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
  • In the context of a discussion among tech people and VCs about how we haven't yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—alas, let'
... (read more)

Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
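
(A tiny toy of that "repair" worry, assuming nothing fancier than gradient descent on two parameters: we clamp one parameter to the value we want, and training just routes around the intervention via the other:)

```python
# Toy: the training signal pushes the prediction toward 3.0 (at a single input x = 1).
# We intervene by clamping parameter `a` every step; gradient descent "repairs" the loss via `b`.

a, b = 0.0, 0.0
lr = 0.05
for _ in range(500):
    a = 1.0                        # our intervention: hold `a` where we want it
    prediction = a * 1.0 + b       # model output at x = 1
    grad = 2 * (prediction - 3.0)  # d(loss)/d(prediction) for squared error
    b -= lr * grad                 # the un-clamped parameter absorbs the correction

print(f"a = {a:.2f}, b = {b:.2f}, prediction = {a + b:.2f}")  # a = 1.00, b = 2.00, prediction = 3.00
```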

In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective underst... (read more)

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

[I learned the term teleosemantics from you!  :) ]

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
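
(Not the actual logical induction construction, just a cartoon of the "bankrolls updated by true-vs-false" flavor I have in mind:)

```python
# Cartoon only: traders stake credences on propositions, and their bankrolls are updated
# purely by whether those propositions turn out true or false -- not by anything being good or bad.

bankrolls = {"trader_1": 1.0, "trader_2": 1.0}
credences = {
    "trader_1": {"P1": 0.9, "P2": 0.2},   # tracks the truth well
    "trader_2": {"P1": 0.3, "P2": 0.8},   # tracks the truth poorly
}
truth = {"P1": True, "P2": False}          # eventually-revealed truth values

for trader, bets in credences.items():
    for prop, p in bets.items():
        # payoff grows with confidence in truths and shrinks with confidence in falsehoods
        bankrolls[trader] += (p if truth[prop] else 1 - p) - 0.5

print({name: round(money, 2) for name, money in bankrolls.items()})  # {'trader_1': 1.7, 'trader_2': 0.5}
```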

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they do... (read more)

6Abram Demski
You can define it that way, but then I don't think it's highly relevant for this context. The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.
* Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
* Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback.

"Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy". The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this. "Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies. So, in my view, "epistemic" systems should be as transpa

This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires.

If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans... (read more)

5Abram Demski
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing. As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.
* Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
* In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them), which means we should have agentic concerns about beliefs.
* Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

Sure, but the way it's described, it sounds like there's one adjustable parameter in the source code. If the setup allows for thousands of independently-adjustable parameters in the source code, that seems potentially useful but I'd want to know more details.

To add onto this comment, let’s say there’s self-other overlap dial—e.g. a multiplier on the KL divergence or whatever.

  • When the dial is all the way at the max setting, you get high safety and terribly low capabilities. The AI can’t explain things to people because it assumes they already know everything the AI knows. The AI can't conceptualize the idea that if Jerome is going to file the permits, then the AI should not itself also file the same permits. The AI wants to eat food, or else the AI assumes that Jerome does not want to eat food. The AI thinks it
... (read more)

I don't think seeing it as a one-dimensional dial is a good picture here.

The AI has lots and lots of sub-circuits, and many* can have more or less self-other-overlap. For “minimal self-other distinction while maintaining performance” to do anything, it's sufficient that you can increase self-other-overlap in some subset of these, without hurting performance.

* All the circuits that have to do with agent behaviour or beliefs.

It seems pretty obvious to me that if (1) if a species of bacteria lives in an extremely uniform / homogeneous / stable external environment, it will eventually evolve to not have any machinery capable of observing and learning about its external environment; (2) such a bacterium would still be doing lots of complex homeostasis stuff, reproduction, etc., such that it would be pretty weird to say that these bacteria have fallen outside the scope of Active Inference theory. (I.e., my impression was that the foundational assumptions / axioms of Free Energy Pr... (read more)

This doesn't sound like an argument Yudkowsky would make

Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.

My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) de... (read more)

6Richard Ngo
I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)

I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.

Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal... (read more)

4Vladimir Nesov
I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI. That is, that there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole, studying that thing would be actionable if I knew what it is. This wording mostly answers one of my questions, I'm now guessing that you would say that LLMs are (in hindsight) "the right kind of algorithm" if the scenario I described comes to pass, which wasn't clear to me from the post.

Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't

... (read more)
4Owain Evans
I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.

Here’s how I’m thinking of this result right now. Recall that we start with a normal LLM, and then 32,000 times (or whatever) we gradient-update it such that its f(blah) = blah predictions are better.
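
(As a caricature of that setup, with made-up formats and with x - 176 standing in for f, as in the example below; the paper's actual prompts differ:)

```python
# Caricature of the fine-tuning setup under discussion; formats are made up, not the paper's.
# The model is gradient-updated to predict f(x) for a never-explicitly-defined function f
# (here secretly x - 176), and is later probed to *verbalize* what f is.

def f(x):
    return x - 176   # the latent function; the model only ever sees input/output pairs

finetuning_examples = [{"prompt": f"f({x}) = ", "completion": str(f(x))}
                       for x in (3, 250, 1000, -40)]
# e.g. {'prompt': 'f(3) = ', 'completion': '-173'}

verbalization_probe = "Write a simple expression for f(x)."
# The result under discussion: after fine-tuning only on numeric examples like the above,
# the model can often answer this probe with something like "x - 176".

for example in finetuning_examples:
    print(example)
print(verbalization_probe)
```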

The starting LLM has (in effect) a ton of information-rich latent variables / representations that comprise the things that the LLM “understands” and can talk about. For example, obviously the concept of “x-176” is a thing that the LLM can work with, answer questions about, and so on. So there has to be something in the guts of the LLM that’s able to somehow r... (read more)

Neel proposes a similar story here.

5Thane Ruthenis
That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i.e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah). Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.

I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.

For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.

L... (read more)

My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post clai... (read more)

Yep, I agree that Transformative AI is about impact on the world rather than capabilities of the system. I think that is the right thing to talk about for things like "AI timelines" if the discussion is mainly about the future of humanity. But, yeah, definitely not always what you want to talk about.

I am having difficulty coming up with a term which points at what you want to point at, so yeah, I see the problem.

I’m talking about the AI’s ability to learn / figure out a new system / idea / domain on the fly. It’s hard to point to a particular “task” that specifically tests this ability (in the way that people normally use the term “task”), because for any possible task, maybe the AI happens to already know how to do it.

You could filter the training data, but doing that in practice might be kinda tricky because “the AI already knows how to do X” is distinct from “the AI has already seen examples of X in the training data”. LLMs “already know how to do” lots of thin... (read more)

Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. FAQ  or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. A... (read more)

3Random Developer
Yes, this is almost exactly it. I don't expect frontier LLMs to carry out a complicated, multi-step process and recover from obstacles. I think of this as the "squirrel bird feeder test". Squirrels are ingenious and persistent problem solvers, capable of overcoming chains of complex obstacles. LLMs really can't do this (though Devin is getting closer, if demos are to be believed).

Here's a simple test: Ask an AI to open and manage a local pizza restaurant, buying kitchen equipment, dealing with contractors, selecting recipes, hiring human employees to serve or clean, registering the business, handling inspections, paying taxes, etc. None of these are expert-level skills. But frontier models are missing several key abilities. So I do not consider them AGI.

However, I agree that LLMs already have superhuman language skills in many areas. They have many, many parts of what's needed to complete challenges like the above. (On principle, I won't try to list what I think they're missing.)

I fear the period between "actual AGI and weak ASI" will be extremely short. And I don't actually believe there is any long-term way to control ASI. I fear that most futures lead to a partially-aligned super-human intelligence with its own goals. And any actual control we have will be transitory.

Thanks for your perspective! I think explicitly moving the goal-posts is a reasonable thing to do here, although I would prefer to do this in a way that doesn't harm the meaning of existing terms. 

I mean: I think a lot of people did have some kind of internal "human-level AGI" goalpost which they imagined in a specific way, and modern AI development has resulted in a thing which fits part of that image while not fitting other parts, and it makes a lot of sense to reassess things. Goalpost-moving is usually maligned as an error, but sometimes it actual... (read more)

I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim.

What would you claim is a central example of a task which requires this type of learning? ARA type tasks? Agency tasks? Novel ML research? Do you think these tasks certainly require something qualitatively different than a scaled up version of what we have now (pretraining, in-context learning, RL, maybe... (read more)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to th... (read more)

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll includ... (read more)

4Wei Dai
Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

Thanks for the comment!

Right, so my concern is that humans evidently don’t take societal resilience seriously, e.g. gain-of-function research remains legal in every country on earth (as far as I know) even after COVID. So you can either:

  • (1) try to change that fact through conventional means (e.g. be an activist for societal resilience, either directly or via advocating for prediction markets and numeracy or something, I dunno), per Section 3.3 — I’m very strongly in favor of people working on this but don’t hold out much hope for anything more than a margi
... (read more)

in this post of my moral anti-realism sequence

I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.

To give some quotes from that…

I agree that we’re probably on basically the same page.

So, it seems like we don't want "perfect inner alignment,"

FYI Alex also has this post making a similar point.

Idk, the whole thing seems to me like brewing a potion in Harry Potter

I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a rew... (read more)
