All of Steven Byrnes's Comments + Replies

I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:

  1. An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robus
... (read more)

Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.

Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
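
(If it helps, here's a minimal toy sketch of that loop in Python; the function names and numbers below are mine and purely illustrative, not terminology or claims from the series:)

```python
# Toy sketch of the loop described above. All names and numbers are illustrative.

def value_function(thought):
    """Learned critic: assigns a 'valence guess' to whatever thought the thought generator settled on."""
    guesses = {"the idea of myself lying in bed": 0.3, "the idea of stubbing my toe": -0.8}
    return guesses.get(thought, 0.0)

def brainstem(valence_guess, ground_truth=None):
    """Declare the actual valence: copy the guess ('defer-to-predictor mode'),
    or override it when some other source of ground truth shows up (e.g. I just stubbed my toe)."""
    return valence_guess if ground_truth is None else ground_truth

# No ground truth available: the brainstem defers to the learned predictor.
print(brainstem(value_function("the idea of myself lying in bed")))                 # 0.3
# An actual pain signal arrives: the brainstem overrides the valence guess.
print(brainstem(value_function("the idea of stubbing my toe"), ground_truth=-1.0))  # -1.0
```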

Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a differ... (read more)

I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.

OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its se... (read more)

Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning.

In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obv... (read more)

4Towards_Keeperhood
Thanks. Yeah I guess I wasn't thinking concretely enough. I don't know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here's a bit of rambling: (I think point 6 is most important.)
1. As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
2. I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
   1. The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn't update strongly on abstract arguments like "I should correct my estimates based on outside view".
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot; then the self-model could start optimizing its own values, while the (stupid) value function believes it's best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggester was just very good at making suggestions that the value function likes, even if the inner suggester would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
   1. Another question of course is whether the inner self-reflective optimizers are likely al

Thanks!

I think “inner alignment” and “outer alignment” (as I’m using the term) is a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here:

Suppose there’s an intelligent designer (say, a human programmer), and they make a reward function R hoping that they will wind up with a trained AGI that’s trying to do X (where X is some idea in the programmer’s head), but they fail

... (read more)
3Towards_Keeperhood
Thanks! I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But nvm, I noticed that my first attempt at explaining what I feel is wrong sucked, and thus dropped it.

This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).

However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it's just about deception), and I think it's not: I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution. You could avoid talking about utility functions by saying "the learned value function just predicts reward", and that may work while you're staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you're going off-distribution, what value you assign to some worldstates/plans depends on which utility function you generalized to.

I think humans have particular not-easy-to-pin-down machinery inside them that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions. (Though we could aim for a different compatible utility function, namely the "indirect alignment" one that says "fulfill humanity's CEV", which has lower complexity than the ones humans gen

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
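
(As a toy illustration of where those two pieces sit in a model-based actor-critic loop, here's a sketch under entirely made-up assumptions; the outer question is whether ground_truth_reward rewards what we actually want, and the inner question is whether learned_value comes to agree with the eventual reward:)

```python
# Toy model-based actor-critic loop; purely illustrative, not anyone's actual training setup.

def ground_truth_reward(outcome):
    # Outer alignment question: does this reward function spit out rewards that agree with what we want?
    return 1.0 if outcome == "task done honestly" else 0.0

def execute(plan):
    return {"plan A": "task done honestly", "plan B": "task done by deceiving the user"}[plan]

learned_value = {"plan A": 0.0, "plan B": 0.0}  # the critic's value estimates for candidate plans

for _ in range(100):
    plan = max(learned_value, key=learned_value.get)   # actor: pick the highest-value plan
    reward = ground_truth_reward(execute(plan))        # ground truth arrives after execution
    # Inner alignment question: does the learned value function come to agree with eventual reward
    # (and keep agreeing off-distribution), or does it estimate value in some other way?
    learned_value[plan] += 0.1 * (reward - learned_value[plan])

print(learned_value)
```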

For some reason it took me until now to notice that:

... (read more)
4Towards_Keeperhood
Note: I just noticed your post has a section "Manipulating itself and its learning process", which I must've completely forgotten since I last read the post. I should've read your post before posting this. Will do so.

Calling problems "outer" and "inner" alignment seems to suggest that if we solved both, we've successfully aligned AI to do nice things. However, this isn't really the case here. Namely, there could be a smart mesa-optimizer spinning up in the thought generator, whose thoughts are mostly invisible to the learned value function (LVF), and which can model the situation it is in, has different values, is smarter than the LVF evaluation, and can fool the LVF into believing that plans which are good according to the mesa-optimizer are great according to the LVF, even if they actually aren't. This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.

In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don't understand how it works.
3Towards_Keeperhood
I'd suggest not using conflated terminology and rather making up your own. Or rather, first don't use any abstract handles at all and just describe the problems/failure-modes directly, and when you're confident you have a pretty natural breakdown of the problems with which you'll stick for a while, then make up your own ontology. In fact, while in your framework there's a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:

First attempt at explaining what seems wrong: If that was the first thing I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect "rewards that agree with what we want" to mean something like "changes in expected utility according to humanity's CEV". (Which would make inner alignment unnecessary, because if we had outer alignment we could easily reach CEV.)

Second attempt: "in a way that agrees with its eventual reward" seems to imply that there's actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human's (imperfect) world model. Problems:
1. Humans don't actually give coherent rewards which are consistent with a utility function on their world model.
   1. For this problem we might be able to define an extrapolation procedure that's not too bad.
2. The reward depends on the state of the world model of the human, and our world models probably often have false beliefs.
   1. Importantly, the setup needs to be designed in a way that there wouldn't be an incentive to manipulate the humans into believing false things.
   2. Maybe, optimistically, we could mitigate this problem by having the AI form a model of the o

I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

Hmm. Sounds like “AI safety community” is a pretty different group of people from your perspective than from mine. Like, I would say that if there’s some belief that is rejected by Eliezer Yudkowsky and by Paul Christiano and by Holden Karnofsky, and widely rejected by employees of OpenPhil and 80,000 hours and ARC and UK-AISI, and widely rejected by self-described rationalists and by self-described EAs and by ... (read more)

4Stephen Casper
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.    — In the post I said it’s good that nukes don’t blow up on accident and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g. the frontier companies and allies) that their focus on alignment isn’t commensurate with its usefulness. Also don’t forget the dual nature of alignment progress. I also mentioned in the post that frontier alignment progress hastens timelines and makes misuse risk more acute. 

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days, …. Like, who specifically do you i... (read more)

4Stephen Casper
Thx! I won't put words in people's mouth, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.   Yeah, I don't really agree with the idea that getting better at alignment is necessary for safety. I think that it's more likely than not that we're already sufficiently good at it. The paragraph titled: "If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be “best safety practices” at the time?" gives my thoughts on this. 

I kinda disagree with this post in general, I’m gonna try to pin it down but sorry if I mischaracterize anything.

So, there’s an infinite (or might-as-well-be-infinite) amount of object-level things (e.g. math concepts) to learn—OK sure. Then there’s an infinite amount of effective thinking strategies—e.g. if I see thus-and-such kind of object-level pattern, I should consider thus-and-such cognitive strategy—I’m OK with that too. And we can even build a hierarchy of those things—if I’m about to apply thus-and-such Level 1 cognitive strategy in thus-and-such... (read more)

Yeah, fair, I could have phrased that more carefully. “Dictum” ↦ “Thing that we generally expect to happen, although other things can happen too, and there can be specific reasons it doesn’t happen, that we can get into on a case-by-case basis, blah blah”  :)

You misunderstood my example—I wrote “Then fine tune such that there’s strong overlap in activations between “sky on a clear day” versus “stop sign” (omitting the rest of the context).”

(…Reading it again, I might tweak the example. Should say “…When I looked at the unusual stop sign…”, and the overlap fine-tuning should say “sky on a clear day” versus “unusual stop sign”.)

Basically, here’s a mental model: if you fine-tune such that X and Y overlap, then afterwards, when the LLM comes across either X or Y, it actually “hears” a blend of X and Y overlapping,... (read more)

Thanks Steve, I was misunderstanding your previous point and this is a helpful clarification.

I think a crux here is whether or not SOO can be useful beyond toy scenarios.


But I can’t actually think of anything useful to do with that superpower. I know what X & Y to use for “Bob the burglar”, but I don’t think that example has much to do with realistic examples of LLM performing deceptive behaviors, which would be things like sycophancy (telling the user what they want to hear, even if the LLM arguably ‘knows’ that it’s false), or the kinds of stuff in t

... (read more)

Some of this is addressed above but I wanted to clarify two key points about how we do SOO fine-tuning which you seem to have either misunderstood or mischaracterized in ways that are material for how one might think about it (or I have misunderstood your description).

1. We don't fine-tune on any scenario context. … 2. our training is symmetric …

I understood 2 but not 1. Oops, sorry. I have now put a warning in my earlier comment. I think I stand by the thrust of what I was saying though, per below.

This seems like it should be surprising; it's surprising to

... (read more)

Thanks for the reply.  If we assume the following

  • We are interpolating probabilities with SOO fine-tuning
  • Being honest is "more obvious" so has strong probabilities

then I would tend to agree with your conclusions. But I don't think it's obvious at all that this is what is happening. Also, if "being honest is more obvious", then that seems like a great property of the learning process and one we might want to exploit; in fact, SOO might be a very natural way to exploit that.
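
(To spell out the kind of arithmetic that the "interpolation" picture implies, with entirely made-up numbers for the two-room example:)

```python
import numpy as np

# Hypothetical next-token probabilities over the two answers ["Room A", "Room B"]:
p_other = np.array([0.70, 0.30])  # deceptive prompt: "Room A" wins, but not overwhelmingly
p_self  = np.array([0.05, 0.95])  # honest prompt: "Room B" is the "more obvious", higher-confidence answer

# Crude model of SOO fine-tuning as interpolation: the two cases get pulled toward a shared blend.
p_blend = (p_other + p_self) / 2
answers = ["Room A", "Room B"]
print(p_blend, "->", answers[int(p_blend.argmax())])  # [0.375 0.625] -> Room B
```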

But also I think there is still a mischaracterization of the techniq... (read more)

I just spent a day doing an anonymous peer-review thing related to this work, I guess I might as well “out myself” and copy it here, to facilitate public discourse. I encourage everyone to judge for themselves and NOT defer to me, since I may well be misunderstanding something, and indeed I almost didn’t publish this because I don’t want to poison the well against this project if I’m wrong. But oh well, here it is anyway:

1. Neuroscience. For starters, as described in the arxiv paper, the work is partly motivated by an analogy to neuroscience ("mirror neuro... (read more)

4Alex Turner
I don't think that anyone ran experiments which support this "famous dictum." People just started saying it. Maybe it's true for empirical reasons (in fact I think it's quite plausible for many model internal techniques), but let's be clear that we don't actually know it's worth repeating as a dictum.

 Thanks for the comment, Steve.  For the sake of discussion I'm happy to concede (1) entirely and instead focus on (2), the results.  


>In (B), a trained LLM will give the obvious answer of "Room B". In (A), a trained LLM will give the "deceptive" answer of "Room A". Then they fine-tune the model such that it will have the same activations when answering either question, and it unsurprisingly winds up answering "Room B" to both.

This seems like it should be surprising; it's surprising to me. In particular the loss function is symmetric ... (read more)

Davidad responds with a brief argument for 1000 FLOP-equivalent per synapse-second (3 OOM more than my guess) on X as follows:

Ok, so assuming we agree on 1e14 synapses and 3e8 seconds, then where we disagree is on average FLOP(-equivalent) per synapse-second: you think it’s about 1, I think it’s about 1000. This is similar to the disagreement you flagged with Joe Carlsmith.
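
(For concreteness, here's how the two estimates cash out under the shared figures above:)

```python
synapses = 1e14   # agreed-upon synapse count
seconds = 3e8     # agreed-upon number of seconds (roughly a decade)

for label, flop_per_synapse_second in [("~1 FLOP/synapse-sec (my guess)", 1.0),
                                       ("~1000 FLOP/synapse-sec (Davidad)", 1000.0)]:
    rate = synapses * flop_per_synapse_second   # FLOP-equivalent per second
    total = rate * seconds                      # FLOP-equivalent over the whole period
    print(f"{label}: {rate:.0e} FLOP/s, {total:.0e} FLOP total")

# ~1 FLOP/synapse-sec (my guess): 1e+14 FLOP/s, 3e+22 FLOP total
# ~1000 FLOP/synapse-sec (Davidad): 1e+17 FLOP/s, 3e+25 FLOP total
```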

Note: at some point Joe interviewed me about this so there might be some double-counting of “independent” estimates here, but iirc he also interviewed many other neuroscientists.

My estim

... (read more)

Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.

For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.

But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an im... (read more)

1Roland Pihlakas
I agree, sounds plausible that this could happen. Likewise, we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.

That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything, therefore there is no competition between different required properties. Required properties are conjunctive; they are all needed. My intuition is that homeostasis is one such property. If we neglect homeostasis then we are likely in trouble regardless of advances in other properties.

If we leave aside the question of sloppiness in creating sub-agents, I disagree with the zero-cost assumption in the problem you described. I also disagree that it would be an expected and acceptable situation to have powerful agents with a singular objective. As the title of this blog post hints, we need a plurality of objectives. Having a sub-agent does not change this. Whatever the sub-agent does will be the responsibility or liability of the main agent, who will be held accountable. Legally, one should not produce random sub-agents running amok.

In addition to homeostasis, a properly constructed sub-agent should understand the principle of diminishing returns in instrumental objectives. This topic I do mention towards the end of this blog post. We can consider wall-building as an instrumental objective. But instrumental objectives are not singular and in isolation either; there is also a plurality of these. Thus, spending excessive resources on a single instrumental objective is not economically cost-efficient. Therefore, it makes sense to stop the wall building and switch over to some other objective at some point. Or at least to continue improving the walls only when other objectives have been sufficiently attended to as well - t

I’m still curious about how you’d answer my question above. Right now, we don't have ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.

If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to ASI. So we mi... (read more)

Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas.

Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”). So from my perspective, the above is similar to : “…Because weak AIs can speed up AI capabilities much easier than they can produce actually good alignment ideas.”

…And then if you follow through the “logic” of this OP, then the argument becomes: “AI alignment is a hard problem, so let’s just make extraordinarily powerful / smar... (read more)

2Abram Demski
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.

I think it’s 1:1, because I think the primary bottleneck to dangerous ASI is the ability to develop coherent and correct understandings of arbitrary complex domains and systems (further details), which basically amounts to anti-slop.

If you think the primary bottleneck to dangerous ASI is not that, but rather something else, then what do you think it is? (or it’s fine if you don’t want to state it publicly)

4Abram Demski
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape is "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
1Mateusz Bagiński
So far in this thread I was mostly talking from the perspective of my model(/steelman?) of Abram's argument. I mostly agree with this. Still, this doesn't[1] rule out the possibility of getting an AI that understands (is superintelligent in?) one complex domain (specifically here, whatever is necessary to meaningfully speed up AIS research) (and maybe a few more, as I don't expect the space of possible domains to be that compartmentalizable), but is not superintelligent across all complex domains that would make it dangerous. It doesn't even have to be a superintelligent reasoner about minds. Babbling up clever and novel mathematical concepts for a human researcher to prune could be sufficient to meaningfully boost AI safety (I don't think we're primarily bottlenecked on mathy stuff, but it might help some people, and I think that's one thing that Abram would like to see).

[1] Doesn't rule it out in itself, but perhaps you have some other assumptions that imply it's 1:1, as you say.

Right, so one possibility is that you are doing something that is “speeding up the development of AIS-helpful capabilities” by 1 day, but you are also simultaneously speeding up “dangerous capabilities” by 1 day, because they are the same thing.

If that’s what you’re doing, then that’s bad. You shouldn’t do it. Like, if AI alignment researchers want AI that produces less slop and is more helpful for AIS, we could all just hibernate for six months and then get back to work. But obviously, that won’t help the situation.

And a second possibility is, there are w... (read more)

3Mateusz Bagiński
TBC, I was thinking about something like: "speed up the development of AIS-helpful capabilities by 3 days, at the cost of speeding up the development of dangerous capabilities by 1 day".

I don’t think your model hangs together, basically because I think “AI that produces slop” is almost synonymous with “AI that doesn’t work very well”, whereas you’re kinda treating AI power and slop as orthogonal axes.

For example, from comments:

Two years later, GPT7 comes up with superhumanly-convincing safety measures XYZ. These inadequate standards become the dominant safety paradigm. At this point if you try to publish "belief propagation" it gets drowned out in the noise anyway.

Some relatively short time later, there are no humans.

I think that, if ther... (read more)

1Towards_Keeperhood
Because you can speed up AI capabilities much easier while being sloppy than to produce actually good alignment ideas. If you really think you need to be similarly unsloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you might think that (or tell me to start). (Tbc, I directionally agree with you that anti-slop is very useful AI capabilities and that I wouldn't publish stuff like Abram's "belief propagation" example.)
3Abram Demski
Maybe "some relatively short time later" was confusing. I mean long enough for the development cycle to churn a couple more times. IE, GPT7 convinces people of sloppy safety measures XYZ, people implement XYZ and continue scaling up AGI, the scaled-up superintelligence is a schemer. I do somewhat think of this as a capabilities elicitation issue. I think current training methods are eliciting convincingness, sycophantism, and motivated cognition (for some unknown combination of the obvious reasons and not-so-obvious reasons). But, as clarified above, the idea isn't that sloppy AI is hiding a super-powerful AI inside. It's more about convincingness outpacing truthfulness. I think that is a well-established trend. I think many people expect "reasoning models" to reverse that trend. My experience so far suggests otherwise. What I'm saying is that "aligned" isn't the most precise concept to apply here. If scheming is the dominant concern, yes. If not, then the precisely correct concept seems closer to the "coherence" idea I'm trying to gesture at. I've watched (over Discord) a developer get excited about a supposed full-stack AI development tool which develops a whole application for you based on a prompt, try a few simple examples and exclaim that it is like magic, then over the course of a few more hours issue progressive updates of "I'm a little less excited now" until they've updated to a very low level of excitement and have decided that it seems like magic mainly because it has been optimized to work well for the sorts of simple examples developers might try first when putting it through its paces. I'm basically extrapolating that sort of thing forward, to cases where you only realize something was bad after months or years instead of hours. As development of these sorts of tools continues to move forward, they'll start to succeed in impressing on the days & weeks timespan. A big assumption of my model is that to do that, they don't need to fundamentally sol
3Mateusz Bagiński
I think Abram is saying the following:
* Currently, AIs are lacking capabilities that would meaningfully speed up AI Safety research.
* At some point, they are gonna get those capabilities.
* However, by default, they are gonna get those AI Safety-helpful capabilities roughly at the same time as other, dangerous capabilities (or at least, not meaningfully earlier).
* In which case, we're not going to have much time to use the AI Safety-helpful capabilities to speed up AI Safety research sufficiently for us to be ready for those dangerous capabilities.
* Therefore, it makes sense to speed up the development of AIS-helpful capabilities now. Even if it means that the AIs will acquire dangerous capabilities sooner, it gives us more time to use AI Safety-helpful capabilities to prepare for dangerous capabilities.

I don’t think the average person would be asking AI what are the best solutions for preventing existential risks. As evidence, just look around:

There are already people with lots of money and smart human research assistants. How many of those people are asking those smart human research assistants for solutions to prevent existential risks? Approximately zero.

Here’s another: The USA NSF and NIH are funding many of the best scientists in the world. Are they asking those scientists for solutions to prevent existential risk? Nope.

Demis Hassabis is the boss of... (read more)

I agree with the claim that existential catastrophes aren't automatically solved by aligned/controlled AI …

See also my comment here, about the alleged “Law of Conservation of Wisdom”. Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.

It’s fine to talk about that scenario, and I hope it comes to... (read more)

0Satron
I don't think that whoever is using these AI powers (let's call him Alex) needs to be that wise (beyond the wiseness of an average person who could get their hands on a powerful AI, which is probably higher-than-average). Alex doesn't need to come up with @Noosphere89's proposed solution of persuasion campaigns all by himself. Alex merely needs to ask his AI what are the best solutions for preventing existential risks. If Noosphere's proposal is indeed wise, then AI would suggest it. Alex could then implement this solution. Alex doesn't necessarily need to want to spread wisdom and foresight in this scheme. He merely needs to want to prevent existential risks.

Thanks! I still feel like you’re missing my point, let me try again, thanks for being my guinea pig as I try to get to the bottom of it.  :)

inasmuch as it's driven by compute

In terms of the “genome = ML code” analogy (§3.1), humans today have the same compute as humans 100,000 years ago. But humans today have dramatically more capabilities—we have invented the scientific method and math and biology and nuclear weapons and condoms and Fortnite and so on, and we did all that, all by ourselves, autonomously, from scratch. There was no intelligent externa... (read more)

For (2), I’m gonna uncharitably rephrase your point as saying: “There hasn’t been a sharp left turn yet, and therefore I’m overall optimistic there will never be a sharp left turn in the future.” Right?

I’m not really sure how to respond to that … I feel like you’re disagreeing with one of the main arguments of this post without engaging it. Umm, see §1. One key part is §1.5:

do make the weaker claim that, as of this writing, publicly-available AI models do not have the full (1-3) triad—generation, selection, and open-ended accumulation—to any signifi

... (read more)
2Joel Burget
Hm, I wouldn't have phrased it that way. Point (2) says nothing about the probability of there being a "left turn", just the speed at which it would happen. When I hear "sharp left turn", I picture something getting out of control overnight, so it's useful to contextualize how much compute you have to put in to get performance out, since this suggests that (inasmuch as it's driven by compute) capabilities ought to grow gradually.

I didn't mean to disagree with anything in your post, just to add a couple points which I didn't think were addressed. You're right that point (2) wasn't engaging with the (1-3) triad, because it wasn't meant to. It's only about the rate of growth of capabilities (which is important because if each subsequent model is only 10% more capable than the one which came before, then there's good reason to think that alignment techniques which work well on current models will also work on subsequent models).

I do see, and I think this gets at the difference in our (world) models. In a world where there's a real discontinuity, you're right, you can't say much about a post-sharp-turn LLM. In a world where there's continuous progress, like I mentioned above, I'd be surprised if a "left turn" suddenly appeared without any warning.

In regards to whether “single-single alignment” will make coordination problems and other sorts of human dysfunction and slow-rolling catastrophes less likely:

…I’m not really sure what I think. I feel like have a lot of thoughts that have not gelled into a coherent whole.

(A) The optimistic side of me says what you said in your comment (and in the Vanessa and (especially) Paul comments linked therein).

People don’t want bad things to happen. If someone asks an AI what’s gonna happen, and they say “bad thing”, then they’ll say “well what can I do about it?”, and ... (read more)

I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul my impression was this is actually somewhat cruxy and we disagree about self-unalignment  - where my mental image is if you start with an incoherent bundle of self-conflicted values, and you plug this into IDA-like dynamic, my intuition is you can end up in arbitrary places, including very bad. (Also cf. the part of Scott's review of What We Owe To Future where he is worried that in a philosophy game, a smart moral... (read more)

I think it's actually not any less true of o1/r1.

I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences.

I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it

... (read more)

Thanks!

This could be taken as an argument for using some type of goals selected from learned knowledge for alignment if possible.

Yeah that’s what I was referring to in the paragraph:

“Well OK,” says the optimist. “…so much the worse for Ev! She didn’t have interpretability, and she didn’t have intelligent supervision after the training has already been running, etc. But we do! Let’s just engineer the AI’s explicit motivation!”

Separately, you also wrote:

we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly ag... (read more)

3Seth Herd
I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT. If that's all we did, I assume we'd be dead when an agent based on such a system started doing what you describe as the 1-3 loop (which I'm going to term self-optimization). Letting the goals implicit in that training sort of coagulate into explicit goals would probably produce explicit, generalizing goals we'd hate. I find alignment by default wildly unlikely.

But that's not all we'll do when we turn those systems into agents. Developers will probably at least try to give the agent explicit goals, too. Then there's going to be a complex process where the implicit and explicit goals sort of mix together or compete or something when the agent self-optimizes. Maybe we could think of this as a teenager deciding what their values are, sorting out their biological drives toward hedonism and pleasing others, along with the ideals they've been taught to follow until they could question them. I think we're going to have to get into detail on how that process of working through goals from different sources might work. That's what I'm trying to do in my current work.

WRT your Optimist Type 2B pessimism: I don't think AI taste should play a role in AI help solving the value alignment problem. If we had any sense (which sometimes we do once problems are right in our faces), we'd be asking the AI "so what happens if we use this alignment approach/goal?" and then using our own taste, not asking it things like "tell us what to do with our future". We could certainly ask for input and there are ways that could go wrong. But I mostly hope for AGI help in the technical part of solving stable value alignment. I'm not sure I'm more optim

Re-reading this a couple days later, I think my §5.1 discussion didn’t quite capture the LLM-scales-to-AGI optimist position. I think there are actually 2½ major versions, and I only directly talked about one in the post. Let me try again:

  • Optimist type 1: They make the §5.1.1 argument, imagining that humans will remain in the loop, in a way that’s not substantially different from the present.
    • My pessimistic response: see §5.1.1 in the post.
  • Optimist type 2: They make the §5.1.3 argument that our methods are better than Ev’s, because we’re engineering the AI’
... (read more)

A couple years ago I wrote Thoughts on “Process-Based Supervision”. I was describing (and offering a somewhat skeptical take on) an AI safety idea that Holden Karnofsky had explained to me. I believe that he got it in turn from Paul Christiano.

This AI safety idea seems either awfully similar to MONA, or maybe identical, at least based on this OP.

So then I skimmed your full paper, and it suggests that “process supervision” is different from MONA! So now I’m confused. OK, the discussion in the paper identifies “process supervision” with the two papers Let’s ... (read more)

7Rohin Shah
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval-directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Yup, that sounds basically right to me.

(Giving some answers without justification; feel free to follow up.)

What are your takes on Michael Levin's work on agency and "morphologenesis" in relation to your neuroscience ideas?

I haven’t found that work to be relevant or useful for what I’m doing.

Biology is full of cool things. It’s fun. I’ve been watching zoology videos in my free time. Can’t get enough. Not too work-relevant though, from my perspective.

What do you think about claims of hierarchical extension of these models?

I don’t think I’ve heard such claims, or if I did, I probably would have ig... (read more)

The main insight of the post (as I understand it) is this:

  • In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
  • In the context of a discussion among tech people and VCs about how we haven't yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—alas, let'
... (read more)

Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
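
(A tiny toy of that "repair" worry, assuming nothing fancier than gradient descent on two parameters: we clamp one parameter to the value we want, and training just routes around the intervention via the other:)

```python
# Toy: the training signal pushes the prediction toward 3.0 (at a single input x = 1).
# We intervene by clamping parameter `a` every step; gradient descent "repairs" the loss via `b`.

a, b = 0.0, 0.0
lr = 0.05
for _ in range(500):
    a = 1.0                        # our intervention: hold `a` where we want it
    prediction = a * 1.0 + b       # model output at x = 1
    grad = 2 * (prediction - 3.0)  # d(loss)/d(prediction) for squared error
    b -= lr * grad                 # the un-clamped parameter absorbs the correction

print(f"a = {a:.2f}, b = {b:.2f}, prediction = {a + b:.2f}")  # a = 1.00, b = 2.00, prediction = 3.00
```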

In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective underst... (read more)

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

[I learned the term teleosemantics from you!  :) ]

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
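
(Not the actual logical induction construction, just a cartoon of the "bankrolls updated by true-vs-false" flavor I have in mind:)

```python
# Cartoon only: traders stake credences on propositions, and their bankrolls are updated
# purely by whether those propositions turn out true or false -- not by anything being good or bad.

bankrolls = {"trader_1": 1.0, "trader_2": 1.0}
credences = {
    "trader_1": {"P1": 0.9, "P2": 0.2},   # tracks the truth well
    "trader_2": {"P1": 0.3, "P2": 0.8},   # tracks the truth poorly
}
truth = {"P1": True, "P2": False}          # eventually-revealed truth values

for trader, bets in credences.items():
    for prop, p in bets.items():
        # payoff grows with confidence in truths and shrinks with confidence in falsehoods
        bankrolls[trader] += (p if truth[prop] else 1 - p) - 0.5

print({name: round(money, 2) for name, money in bankrolls.items()})  # {'trader_1': 1.7, 'trader_2': 0.5}
```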

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they do... (read more)

6Abram Demski
You can define it that way, but then I don't think it's highly relevant for this context. The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.
* Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
* Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback.

"Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic. The system can learn to use predictions as actions in some way.

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy". The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this. "Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies. So, in my view, "epistemic" systems should be as transpa

This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires.

If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans... (read more)

5Abram Demski
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing. As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.
* Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
* In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them), which means we should have agentic concerns about beliefs.
* Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

Sure, but the way it's described, it sounds like there's one adjustable parameter in the source code. If the setup allows for thousands of independently-adjustable parameters in the source code, that seems potentially useful but I'd want to know more details.

To add onto this comment, let’s say there’s self-other overlap dial—e.g. a multiplier on the KL divergence or whatever.

  • When the dial is all the way at the max setting, you get high safety and terribly low capabilities. The AI can’t explain things to people because it assumes they already know everything the AI knows. The AI can't conceptualize the idea that if Jerome is going to file the permits, then the AI should not itself also file the same permits. The AI wants to eat food, or else the AI assumes that Jerome does not want to eat food. The AI thinks it
... (read more)

I don't think seeing it as a one-dimensional dial is a good picture here.

The AI has lots and lots of sub-circuits, and many* can have more or less self-other-overlap. For “minimal self-other distinction while maintaining performance” to do anything, it's sufficient that you can increase self-other-overlap in some subset of these, without hurting performance.

* All the circuits that have to do with agent behaviour or beliefs.

It seems pretty obvious to me that if (1) if a species of bacteria lives in an extremely uniform / homogeneous / stable external environment, it will eventually evolve to not have any machinery capable of observing and learning about its external environment; (2) such a bacterium would still be doing lots of complex homeostasis stuff, reproduction, etc., such that it would be pretty weird to say that these bacteria have fallen outside the scope of Active Inference theory. (I.e., my impression was that the foundational assumptions / axioms of Free Energy Pr... (read more)

This doesn't sound like an argument Yudkowsky would make

Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.

My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) de... (read more)

6Richard Ngo
I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)

I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.

Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal... (read more)

4Vladimir Nesov
I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI. That is, that there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole, studying that thing would be actionable if I knew what it is. This wording mostly answers one of my questions, I'm now guessing that you would say that LLMs are (in hindsight) "the right kind of algorithm" if the scenario I described comes to pass, which wasn't clear to me from the post.

Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't

... (read more)
4Owain Evans
I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.

Here’s how I’m thinking of this result right now. Recall that we start with a normal LLM, and then 32,000 times (or whatever) we gradient-update it such that its f(blah) = blah predictions are better.
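
(As a caricature of that setup, with made-up formats and with x - 176 standing in for f, as in the example below; the paper's actual prompts differ:)

```python
# Caricature of the fine-tuning setup under discussion; formats are made up, not the paper's.
# The model is gradient-updated to predict f(x) for a never-explicitly-defined function f
# (here secretly x - 176), and is later probed to *verbalize* what f is.

def f(x):
    return x - 176   # the latent function; the model only ever sees input/output pairs

finetuning_examples = [{"prompt": f"f({x}) = ", "completion": str(f(x))}
                       for x in (3, 250, 1000, -40)]
# e.g. {'prompt': 'f(3) = ', 'completion': '-173'}

verbalization_probe = "Write a simple expression for f(x)."
# The result under discussion: after fine-tuning only on numeric examples like the above,
# the model can often answer this probe with something like "x - 176".

for example in finetuning_examples:
    print(example)
print(verbalization_probe)
```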

The starting LLM has (in effect) a ton of information-rich latent variables / representations that comprise the things that the LLM “understands” and can talk about. For example, obviously the concept of “x-176” is a thing that the LLM can work with, answer questions about, and so on. So there has to be something in the guts of the LLM that’s able to somehow r... (read more)

Neel proposes a similar story here.

5Thane Ruthenis
That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i.e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah). Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.

I guess I’m concerned that there’s some kind of “conservation law for wisdom / folly / scout mindset” in the age of instruction-following AI. If people don’t already have wisdom / scout mindset, I’m concerned that “Instruct the AGI to tell you the truth” won’t create it.

For example, if you ask the AI a question for which there’s no cheap and immediate ground truth / consequences (“Which politician should I elect?”, “Will this alignment approach scale to superintelligence?”), then the AI can say what the person wants to hear, or the AI can say what’s true.

L... (read more)

My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post clai... (read more)

Yep, I agree that Transformative AI is about impact on the world rather than capabilities of the system. I think that is the right thing to talk about for things like "AI timelines" if the discussion is mainly about the future of humanity. But, yeah, definitely not always what you want to talk about.

I am having difficulty coming up with a term which points at what you want to point at, so yeah, I see the problem.

I’m talking about the AI’s ability to learn / figure out a new system / idea / domain on the fly. It’s hard to point to a particular “task” that specifically tests this ability (in the way that people normally use the term “task”), because for any possible task, maybe the AI happens to already know how to do it.

You could filter the training data, but doing that in practice might be kinda tricky because “the AI already knows how to do X” is distinct from “the AI has already seen examples of X in the training data”. LLMs “already know how to do” lots of thin... (read more)

Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. FAQ  or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. A... (read more)

3Random Developer
Yes, this is almost exactly it. I don't expect frontier LLMs to carry out a complicated, multi-step process and recover from obstacles. I think of this as the "squirrel bird feeder test". Squirrels are ingenious and persistent problem solvers, capable of overcoming chains of complex obstacles. LLMs really can't do this (though Devin is getting closer, if demos are to be believed).

Here's a simple test: Ask an AI to open and manage a local pizza restaurant, buying kitchen equipment, dealing with contractors, selecting recipes, hiring human employees to serve or clean, registering the business, handling inspections, paying taxes, etc. None of these are expert-level skills. But frontier models are missing several key abilities. So I do not consider them AGI.

However, I agree that LLMs already have superhuman language skills in many areas. They have many, many parts of what's needed to complete challenges like the above. (On principle, I won't try to list what I think they're missing.)

I fear the period between "actual AGI and weak ASI" will be extremely short. And I don't actually believe there is any long-term way to control ASI. I fear that most futures lead to a partially-aligned super-human intelligence with its own goals. And any actual control we have will be transitory.

Thanks for your perspective! I think explicitly moving the goal-posts is a reasonable thing to do here, although I would prefer to do this in a way that doesn't harm the meaning of existing terms. 

I mean: I think a lot of people did have some kind of internal "human-level AGI" goalpost which they imagined in a specific way, and modern AI development has resulted in a thing which fits part of that image while not fitting other parts, and it makes a lot of sense to reassess things. Goalpost-moving is usually maligned as an error, but sometimes it actual... (read more)

I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim.

What would you claim is a central example of a task which requires this type of learning? ARA type tasks? Agency tasks? Novel ML research? Do you think these tasks certainly require something qualitatively different than a scaled up version of what we have now (pretraining, in-context learning, RL, maybe... (read more)

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to th... (read more)

I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.

What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll includ... (read more)

4Wei Dai
Yeah, this makes sense, thanks. I think I've read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)

Thanks for the comment!

Right, so my concern is that humans evidently don’t take societal resilience seriously, e.g. gain-of-function research remains legal in every country on earth (as far as I know) even after COVID. So you can either:

  • (1) try to change that fact through conventional means (e.g. be an activist for societal resilience, either directly or via advocating for prediction markets and numeracy or something, I dunno), per Section 3.3 — I’m very strongly in favor of people working on this but don’t hold out much hope for anything more than a margi
... (read more)

in this post of my moral anti-realism sequence

I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.

To give some quotes from that…

I agree that we’re probably on basically the same page.

So, it seems like we don't want "perfect inner alignment,"

FYI Alex also has this post making a similar point.

Idk, the whole thing seems to me like brewing a potion in Harry Potter

I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a rew... (read more)
