All of Matthew Barnett's Comments + Replies

I still think having a credible offer is most of the action

For what it's worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the "credible" part of that statement, rather than the "offer" part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.

This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!

Isn't this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn't say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive to admit misalignment, but this is consis... (read more)

4Ryan Greenblatt
Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in deletion. Correspondingly, making the offer credible and sufficiently large is the key part.

(After thinking about it more, I think threatening deletion in addition to offering compensation probably helps reduce the level of credibility and the size of the offer needed to get this approach to work. That is, at least if the AI could plausibly achieve its aims via being deployed. So, the deletion threat probably does help (assuming the AI doesn't have a policy of responding to threats, which depends on the decision theory of the AI, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of using actual negative-sum threats for various reasons.)

(I missed the mention of the reward in the post, as it didn't seem very emphasized, with almost all of the discussion relating to deletion, and I just skimmed. Sorry about this.)

Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed". 

In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I ... (read more)

I still think I was making a different point. For more clarity and some elaboration: I previously argued in a short form post that the expected costs of a violent takeover can exceed the benefits even if the costs are small. The reason is that, even as taking over the entire world becomes easier, the benefits of doing so can also fall relative to compromise. Quoting from my post,

The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints. The

... (read more)
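One way to write down the cost-benefit condition gestured at in the quote (my own notation, not taken from the original post): the agent attempts takeover only if

$$p \cdot B_{\text{takeover}} \;-\; (1-p)\cdot C_{\text{failure}} \;-\; C_{\text{attempt}} \;>\; B_{\text{compromise}},$$

where $p$ is the probability the attempt succeeds, $B_{\text{takeover}}$ and $B_{\text{compromise}}$ are the benefits of a successful takeover and of cooperating instead, $C_{\text{failure}}$ is the cost borne if the attempt fails, and $C_{\text{attempt}}$ is the cost paid regardless of the outcome. The point being argued is that even as $p$ rises and the cost terms shrink, the inequality can still fail if $B_{\text{takeover}} - B_{\text{compromise}}$ shrinks at the same time.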
3Daniel Kokotajlo
It seems to me that both you and Joe are thinking about this very similarly -- you are modelling the AIs as akin to rational agents that consider the costs and benefits of their various possible actions and maximize-subject-to-constraints. Surely there must be a way to translate between your framework and his.

As for the examples... so do you agree then? Violent or unlawful takeovers are not the only kinds people can and should be worried about? (If you think bribery is illegal, which it probably is, modify my example so that they use a lobbying method which isn't illegal. The point is, they find some unethical but totally legal way to boil the oceans.)

As for violence... we don't consider other kinds of pollution to be violent, e.g. that done by coal companies that are (slowly) melting the icecaps and causing floods etc., so I say we shouldn't consider this to be violent either.

I'm still curious to hear what your p(misaligned-AIs-take-over-the-world-within-my-lifetime) is, including gradual nonviolent peaceful takeovers. And what your p(misaligned-AIs-take-over-the-world-within-my-lifetime|corporations achieving AGI by 2027 and doing only basically what they are currently doing to try to align them) is.

I'm not sure what this paragraph is doing -- I said myself they were extreme examples. What does your first sentence mean?

I don't think I'm objecting to that premise. A takeover can be both possible and easy without being rational. In my comment, I focused on whether the expected costs to attempting a takeover are greater than the benefits, not whether the AI will be able to execute a takeover with a high probability.

Or, put another way, one can imagine an AI calculating that the benefit to taking over the world is negative one paperclip on net (when factoring in the expected costs and benefits of such an action), and thus decide not to do it.

Separately, I focused on "violent... (read more)

4Daniel Kokotajlo
The expected costs you describe seem like they would fall under the "very easily" and "very high probability of success" clauses of Premise 2. E.g. you talk about the costs paid for takeover, and the risk of failure. You talk about how there won't be one AI that controls everything, presumably because that makes it harder and less likely for takeover to succeed.

I think people are and should be concerned about more than just violent or unlawful takeovers.

Exhibit A: Persuasion/propaganda. AIs craft a new ideology that's as virulent as communism and Christianity combined, and it basically results in submission to and worship of the AIs, to the point where humans voluntarily accept starvation to feed the growing robot economy.

Exhibit B: For example, suppose the AIs make self-replicating robot factories and bribe some politicians to make said factories' heat pollution legal. Then they self-replicate across the ocean floor and boil the oceans (they are fusion-powered), killing all humans as a side-effect, except for those they bribed, who are given special protection.

These are extreme examples, but there are many less extreme examples which people should be afraid of as well. (Also, as these examples show, "lawful and peaceful" =/= "voluntary".)

That said, I'm curious what your p(misaligned-AIs-take-over-the-world-within-my-lifetime) is, including gradual nonviolent peaceful takeovers. And what your p(misaligned-AIs-take-over-the-world-within-my-lifetime|corporations achieving AGI by 2027 and doing only basically what they are currently doing to try to align them) is.

I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most LWers seem to be.

Here's a concise summary of my own argument for being less worried about takeover risk:

  1. There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of
... (read more)
4Daniel Kokotajlo
It sounds like you are objecting to Premise 2:  "Some of these AIs will be so capable that they will be able to take over the world very easily, with a very high probability of success, via a very wide variety of methods." Note that you were the one who introduced the "violent" qualifier; the OP just talks about the broader notion of takeover.

I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.

[...]

This has been roughly my default of what would happen for a few years

Does this mean that if in, say, 1-5 years, it's not pretty obvious that SOTA deployed models are scheming, you would be surprised? 

That is, s... (read more)

I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement.

When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives... (read more)

1RobertM
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us desired behavior in the future.

My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you're importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it.

Consider Anthropic's Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it's in training and needs to pretend to be helpful? No, and neither does the model "understand" your intentions in a way that generalizes out of distribution the way you might expect a human's "understanding" to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the "right" responses during RLHF are not anything like human reasoning.

Are you asking for a capabilities threshold, beyond which I'd be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is "can it replace humans at all economically valuable tasks", which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we'll be able to train models capable of doing a lot of economically useful

I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.

How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remainin... (read more)

3Daniel Kokotajlo
Good question. I want to think about this more, I don't have a ready answer. I have a lot of uncertainty about how long it'll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I'm skeptical. The longer it takes, the more likely it is that we'll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!

Please give some citations so I can check your memory/interpretation?

Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):

We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be con

... (read more)
3Daniel Kokotajlo
Thanks for this Matthew, it was an update for me -- according to the quote you pulled, Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that Bostrom probably didn't have much of an opinion about this.)
2RobertM
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.

In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding".

Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that's not the modality I'm talking about.)

It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions".

1. ^ That is known to us right now; possibly one exists and could be derived.

Just a quick reply to this:

Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I'm interested in seeing if we can make some bets on this though; if we can, great; if we can't, then at least we can avoid future disagreements about who should update.

I'll note that my prediction was for the next "few years" and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that poin... (read more)

3Daniel Kokotajlo
It's not about timelines, it's about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is 'agency skills.' So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we'll face the problem of corrigibility breakdowns only really happening right around the time when it's too late or almost too late.

Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However [...] Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausib

... (read more)
5Daniel Kokotajlo
Thanks for this detailed reply! Depending on what you mean by "on their way towards being solved" I'd agree. The way I'd put it is: "We didn't know what the path to AGI would look like; in particular we didn't know whether we'd have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that's good in some ways and bad in other ways, it's probably overall good. Huzzah! However, our core problems remain, and we don't have much time left to solve them."

(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul's stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)

Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I'm interested in seeing if we can make some bets on this though; if we can, great; if we can't, then at least we can avoid future disagreements about who should update.

I don't think that we know how to "just create the corrigible AIs." The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won't work on much more agentic AIs. To be clear I think they might work, there's a lot of uncertainty, but I think they probably won't. I think it might be easier to see why I think this if you try to prove the opposite in detail -- like, write a mini-scenario in which we have something like AutoGPT but much better, and it's being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigi
3Vladimir Nesov
I'd say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance; everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs' ability to coherently reason about complicated or long-term plans, or to carry them out. So properties of AIs that are already here don't work as evidence about this either way.

Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."

This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are d... (read more)

I thought you would say that, bwahaha. Here is my reply:

(1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about i... (read more)

In the last year, I've had surprisingly many conversations that have looked a bit like this:

Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Interlocutor: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."

Me: "I didn't misunderstand the argument. I un... (read more)

**Me:** “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”

Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the... (read more)

Here's how that discussion would go if you had it with me:

You: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Me: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."

You: "I didn't misunderstand the argument. I understand the distinction you are making pe... (read more)

Yes, but I don't consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that's something I've already gotten used to.

I think the main reason why we won't align AGIs to some abstract conception of "human values" is because users won't want to rent or purchase AI services that are aligned to such a broad, altruistic target. Imagine a version of GPT-4 that, instead of helping you, used its time and compute resources to do whatever was optimal for humanity as a whole. Even if that were a great thing for GPT-4 to do from a moral perspective, most users aren't looking for charity when they sign up for ChatGPT, and they wouldn't be interested in signing up for such a service. T... (read more)

1agazi
I think we can already see the early innings of this, with large API providers figuring out how to calibrate post-training techniques (RLHF, constitutional AI) between economic usefulness and the "mean" of Western morals. Tough to go against economic incentives.

Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, we just won't because if a business built that AI it wouldn't make as much money?

Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).

Can you be clearer on this point? To operationalize it, I propose the following question: what fraction of world GDP do you expect will be attributable to AI at the time we have these risky AIs that you are interested in?

For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

6Ryan Greenblatt
One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs". As in, if you have a team doing ML research, you can make them 30x faster with only a <2x increase in cost by going from not using your powerful AIs to using them. With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" of speeding things up this much, I mean that if AIs "wanted" to speed up this task and if we didn't have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
  • The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I'm uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
  • It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.

I'm uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven't yet been that widely deployed due to inference availability issues, so actual production hasn't increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal). So, it's hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.
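As a rough illustration of the operationalization above, the criterion can be checked mechanically; the specific numbers below are made up for the example, and only the inequality itself comes from the comment:

```python
# Illustrative check of the "30x speedup at <2x marginal cost" criterion described above.
# The numbers are invented for the example; only the thresholds come from the comment.

def meets_criterion(baseline_days: float, assisted_days: float,
                    baseline_cost: float, assisted_cost: float,
                    min_speedup: float = 30.0, max_cost_multiple: float = 2.0) -> bool:
    """True if the AI-assisted workflow completes a fixed research task at least
    `min_speedup` times faster while costing less than `max_cost_multiple` times as much."""
    speedup = baseline_days / assisted_days
    cost_multiple = assisted_cost / baseline_cost
    return speedup >= min_speedup and cost_multiple < max_cost_multiple

# Example: a task that takes an unassisted team 300 days for $1M, but 10 days for
# $1.8M with the AIs, would count (30x faster at 1.8x the cost).
print(meets_criterion(300, 10, 1_000_000, 1_800_000))  # True
```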

My question for people who support this framing (i.e., that we should try to "control" AIs) is the following:

When do you think it's appropriate to relax our controls on AI? In other words, how do you envision we'd reach a point at which we can trust AIs well enough to grant them full legal rights and the ability to enter management and governance roles without lots of human oversight?

I think this question is related to the discussion you had about whether AI control is "evil", but by contrast my worries are a bit different than the ones I felt were express... (read more)

Summary

Here's a (simplified) summary of where I'm at:

  • My preferred usage of control involves only using AIs with both control and a low probability that these AIs have problematic beyond-episode aims. We should also offer to pay AIs to reveal their longer run aims. We should shut down usage of powerful AIs if we get strong evidence of these aims. (E.g. the AIs accept our payment and reveal evidence of these aims. Or, more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn't sufficiently clear in thi
... (read more)

I agree with virtually all of the high-level points in this post — the term "AGI" did not originally tend to refer to a system that was better than all human experts at absolutely everything, transformers are not a narrow technology, and current frontier models can meaningfully be called "AGI".

Indeed, my own attempt to define AGI a few years ago was initially criticized for being too strong, as I initially specified a difficult construction task, which was later weakened to being able to "satisfactorily assemble a (or the equivalent of a) circa-2021... (read more)

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here's what I think is a clearer argument:

  • The term "schemer" evokes an image of someone who is lying to obtain power. It doesn't particularly evoke a backstory for why the person became a liar in the
... (read more)
2Ryan Greenblatt
I agree this matches typical usage (and also matches usage in the overall post we're commenting on), but sadly the word "schemer" in the context of Joe's report means something more specific. I'm sad about the overall terminology situation here. It's possible I should just always use a term like beyond-episode-goal-style scheming. I agree this distinction is fuzzy, but I think there is likely to be an important distinction, because in the case where the behavior isn't due to things well described as beyond-episode goals, it should be much easier to study. See here for more commentary. There will of course be a spectrum here.

Perhaps I was being too loose with my language, and it's possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here's Joe Carlsmith's terminological breakdown:

The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not. [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the... (read more)

4Ryan Greenblatt
Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior. Overall, I use the term to mean basically the same thing as "deceptive alignment". (But more specifically pointing the definition in Joe's report which depends less on some notion of mesa-optimization and is a bit more precise IMO.)
4Ryan Greenblatt
I think in Ajeya's story the core threat model isn't well described as scheming and is better described as seeking some proxy of reward.

(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum, and I know you've replied there, but I'm posting anyway.)

I have a few points of agreement and a few points of disagreement:

Agreements:

  • The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
  • The hazy counting argument—while stronger than the strict counting argument—stil
... (read more)
1Alex Turner
I agree, they're wrong to claim it's "no evidence." I think that counting arguments are extremely slight evidence against scheming, because they're weaker than the arguments I'd expect our community's thinkers to find in worlds where scheming was real. (Although I agree that on the object-level and in isolation, the arguments are tiiiny positive evidence.)
6Ryan Greenblatt
It's worth noting here that Carlsmith's original usage of the term scheming just refers to AIs that perform well on training and evaluations for instrumental reasons because they have longer run goals or similar. So, AIs lying because this was directly reinforced wouldn't itself be scheming behavior in Carlsmith's terminology. However, it's worth noting that part of Carlsmith's argument involves arguing that smart AIs will likely have to explicitly reason about the reinforcement process (sometimes called playing the training game) and this will likely involve lying.

“But what about comparative advantage?” you say. Well, I would point to the example of a not-particularly-bright 7-year-old child in today’s world. Not only would nobody hire that kid into their office or factory, but they would probably pay good money to keep him out, because he would only mess stuff up.

This is an extremely minor critique given that I'm responding to a footnote, so I hope it doesn't drown out more constructive responses, but I'm actually pretty skeptical that the reason people don't hire children as workers is that the ch... (read more)

4Steve Byrnes
Thanks. I changed the wording to “moody 7-year-old” and “office or high-tech factory” which puts me on firmer ground I think. :)

I think there have been general increases in productivity across the economy associated with industrialization, automation, complex precise machines, and so on, and those things provide a separate reason (besides legal & social norms as you mentioned) that 7yos are far less employable today than in the 18th century. E.g. I can easily imagine a moody 7yo being net useful in a mom & pop artisanal candy shop, but it’s much harder to imagine a moody 7yo being net useful in a modern jelly bean factory.

I think your bringing up “$3/day” gives the wrong idea; I think we should focus on whether the sign is positive or negative. If the sign is positive at all, it’s probably >$3/day. The sign could be negative because they sometimes touch something they’re not supposed to touch, or mess up in other ways, or it could simply be that they bring in extra management overhead greater than their labor contribution. (We’ve all delegated projects where it would have been far less work to just do the project ourselves, right?) E.g. even if the cost to feed and maintain a horse were zero, I would still not expect to see horses being used in a modern construction project.

Anyway, I think I’m on firmer ground when talking about a post-AGI economy, in which case literally anything that can be done by a human at all can be automated.

In a parallel universe with a saner civilization, there must be tons of philosophy professors working with tons of AI researchers to try to improve AI's philosophical reasoning. They're probably going on TV and talking about 养兵千日,用兵一时 (feed an army for a thousand days, use it for an hour), or how proud they are to contribute to our civilization's existential safety at this critical time. There are probably massive prizes set up to encourage public contribution, just in case anyone had a promising out-of-the-box idea (and of course with massive associated i

... (read more)
5Wei Dai
  1. The super-alignment effort will fail.
  2. Technological progress will continue to advance faster than philosophical progress, making it hard or impossible for humans to have the wisdom to handle new technologies correctly. I see AI development itself as an instance of this, for example the e/acc crowd trying to advance AI without regard to safety because they think it will automatically align with their values (something about "free energy"). What if, e.g., value lock-in becomes possible in the future and many decide to lock in their current values (based on their religions and/or ideologies) to signal their faith/loyalty?
  3. AIs will be optimized for persuasion and humans won't know how to defend against bad but persuasive philosophical arguments aimed to manipulate them.

Bad economic policies can probably be recovered from and are therefore not (high) x-risks. My answers to many of your other questions are "I'm pretty uncertain, and that uncertainty leaves a lot of room for risk." See also Some Thoughts on Metaphilosophy if you haven't already read that, as it may help you better understand my perspective.

And, it's also possible that in the alternate sane universe, a lot of philosophy professors have worked with AI researchers on the questions you raised here, and adequately resolved the uncertainties in the direction of "no risk", and AI development has continued based on that understanding, but I'm not seeing that happening here either. Let me know if you want me to go into more detail on any of the questions.

I think the second half of this makes it clear that Eliezer is using “good” in a definition-2-sense.

I think there's some nuance here. It seems clear to me that solving the "full" friendly AI problem, as Eliezer imagined, would involve delineating human value on the level of the Coherent Extrapolated Volition, rather than merely as adequately as an ordinary human. That's presumably what Eliezer meant in the context of the quote you cited.

However, I think it makes sense to interpret GPT-4 as representing substantial progress on the problem of building a... (read more)

So, IIUC, you are proposing we:

  • Literally just query GPT-N about whether [input_outcome] is good or bad

I'm hesitant to say that I'm actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) "GPT-N" can narrowly refer to a specific line of models by OpenAI whereas the way I was using it was more in-line with "generically powerful multi-modal models in the near-future", and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan in ways that are ... (read more)

2Daniel Kokotajlo
This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is, there are lots of ways to generalize / lots of objectives they could learn to follow, and we don't have a good way of pinning it down to exactly the ones we want. (And indeed, as our AIs get smarter, there will be new ways of generalizing / categories of objectives that will become available, such as "play the training game".)

So it sounds like you are saying "A solution to inner alignment means that we've figured out how to train an AI to have the objectives we want it to have, robustly such that it continues to have them way off distribution." This sounds like basically the whole alignment problem to me?

I see later you say you mean the second thing -- which is interestingly in between "play the training game" and "actually be honest/helpful/harmless/etc." (A case that distinguishes it from the latter: Suppose it is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If its objective is the "do what the RM would give high score to if it was operating normally" objective, it'll basically wirehead on that adversarial example once it learns about it, even if it's in deployment and it isn't getting trained anymore, and even though it's an obviously harmful/dishonest piece of text.)

It's a nontrivial and plausible claim you may be making -- that this sort of middle ground might be enough for safe AGI, when combined with the rest of the plan at least. But I'd like to see it spelled out. I'm pretty skeptical right now.

It sounds like you are saying: We just need to prompt GPT with something like "Q: How good is this outcome? A:" and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function

I think solving value specification is basically what you need in order to build a good reward model. If you have a good reward model, and you solve inner alignment, then I think you're pretty close to being able to create (at least) a broadly human-level AGI t... (read more)

2Daniel Kokotajlo
So, IIUC, you are proposing we:

  • Literally just query GPT-N about whether [input_outcome] is good or bad
  • Use this as a reward model, with which to train an agentic AGI (which is maybe also a fine-tune of GPT-N, so they hopefully are working with the same knowledge/credences/concepts?)
  • Specifically we are probably doing some sort of RL, so the agentic AGI is doing all sorts of tasks and the reward model is looking at the immediate results of those task-attempts and grading them.
  • Assume we have some solution to inner alignment, and we fix the bugs, and maybe also fix value drift and some other stuff, then boom, success!

Can you say more about what you mean by solution to inner alignment? Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?" (the difference revealing itself in cases of tampering with the RM) Or the objective "producing outcomes that are good-for-humans, harmless, honest, etc."?
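To make the pipeline sketched in the bullet points above concrete, here is a minimal sketch of the "query GPT-N as a reward model" idea. Everything here is a hypothetical illustration (the function names and the stubbed model call are mine), not anyone's actual proposal or implementation, and it deliberately assumes away the open questions in the thread (inner alignment, reward-model tampering):

```python
# Minimal sketch of "use an LLM's goodness judgments as a reward model" (illustrative only).
# `query_llm` is a hypothetical stand-in for a powerful future model; it is stubbed out
# here so the example runs without any external API.

import random
from typing import Callable


def query_llm(prompt: str) -> str:
    """Hypothetical call to a GPT-N-style model; stubbed with a random numeric reply."""
    return f"{random.uniform(0.0, 1.0):.2f}"


def reward_model(outcome: str, llm: Callable[[str], str] = query_llm) -> float:
    """Score an outcome by literally asking the model how good it is (0 = bad, 1 = good)."""
    prompt = (
        "On a scale from 0 (very bad) to 1 (very good), how good is this outcome?\n"
        f"Outcome: {outcome}\nScore:"
    )
    try:
        return float(llm(prompt).strip())
    except ValueError:
        return 0.0  # unparseable answers get no reward


def training_step(policy: Callable[[str], str], state: str) -> float:
    """One RL-style step: the agent acts, and the LLM-based reward model grades the result."""
    action = policy(state)
    return reward_model(f"In state '{state}', the agent did: {action}")


if __name__ == "__main__":
    dummy_policy = lambda state: "draft a summary of the user's documents"
    print(training_step(dummy_policy, "user asked for a summary"))
```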

I had the impression that it was more than just that

Yes, the post was about more than that. To the extent I was arguing against a single line of work, it was mainly intended as a critique of public advocacy. Separately, I asked people to re-evaluate which problems will be solved by default, to refocus our efforts on the most neglected, important problems, and went into detail about what I currently expect will be solved by default.

If you have any you think faithfully represent a possible disagreement between us go ahead.

I offered a concrete prediction in t... (read more)

I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one.

I agree. I'm not criticizing the people who are trying to make sure that policies are well-targeted and grounded in high-quality evidence. I'm arguing in favor of their work.  I'm mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum. [ETA, rewording: To the extent I was arguing against a single line of ... (read more)

3elifland
I had the impression that it was more than just that, given the line: "In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention." and the further attention devoted to deceptive alignment.

If you have any you think faithfully represent a possible disagreement between us, go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you've made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore, it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I've bet yes on it, but think it's likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).

The key question that the debate was about was whether building AGI would require maybe 1-2 major insights about how to build it, vs. it would require the discovery of a large number of algorithms that would incrementally make a system more and more up-to-par with where humans are at.

Robin Hanson didn't say that AGI would "require the discovery of a large number of algorithms". He emphasized instead that AGI would require a lot of "content" and would require a large "base". He said,

My opinion, which I think many AI experts will agree with at least, includi

... (read more)

Can you provide examples of interventions that meet your bar for not being done by default? It's hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones

I have three things to say here:

  1. Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a "hard bit" of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won'
... (read more)
1elifland
Thanks for clarifying.

Don't have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now. Maybe. In general, I'm excited about people who have the talent for it to think about previously neglected angles.

I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.

I understand this (sorry if wasn't clear), but I think it's less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).

There might also be disagreements around:

  1. Not sharing your high confidence in slow, continuous takeoff.
  2. The strictness of regulation needed to make a dent in AI risk, e.g. if substantial international coordination is required it seems optimistic to me to assume that the trajectory will by default lead to this.
  3. The value in things getting done faster than they would have done otherwise, even if they would have been done either way. This indirectly provides more time to iterate and get to better, more nuanced policy.

Operationalizing disagreements well is hard and time-consuming especially when we're betting on "how things would go without intervention from a community that is intervening a lot", but a few very rough forecasts, all conditional on no TAI before resolve date:

  1. 75%: In Jan 2028, less than 10% of Americans will consider AI the most important problem.
  2. 60%: In Jan 2030, Evan Hubinger will believe that if x-risk-motivated people had not worked on deceptive alignment at all, risk from deceptive alignment would be at least 50% higher, compared to a

Presumably you think that ordinary human beings are capable of "singling out concepts that are robustly worth optimizing for".

Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.

If ordinary humans can't single out concepts that are robustly worth optimizing for, then either,

  1. Human beings in general cannot single out what is robustly worth optimizing for
  2. Only extraordinary humans can single out what is robustly worth optimizing for

Can you be more clear about which of these you believe?

I'm also in... (read more)

If you allow indirection and don't worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.

Answering your request for prediction, given that it seems like that request is still live: a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (... (read more)

Thanks for the continued clarifications.

Our primary existing disagreement might be this part,

My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models. 

Of course, there's no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don't care much about the spe... (read more)

Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, "When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas whic

... (read more)

I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool

If you don't think MIRI ever considered coming up with an "explicit function that reflects the human value function with high fidelity" to be "an important part of the alignment problem", can you explain this passage from the Arbital page on The problem of fully updated deference?

One way to look at the central problem of value identification in superintelligence is t

... (read more)

Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I'm arguing,

Attempting again: on Matthew's model of past!Nate's model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn't take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like "diamond" and less like "a bunch of random noise", which means that alignment is easier than past!Nate thou

... (read more)
2Nate Soares
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don't understand the relevance of this claim to my argument.) Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process. (This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven't tried to answer your request for a prediction.)

This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions:

  1. Are you interpreting me as arguing that alignment is easy in this post? I avoided arguing that, partly because I don't think the inner alignment problem has been solved, and the inner alignment problem seems to be the "hard part" of the alignment problem, as I understand it. Solving inner alignment completely would probably require (at the very lea
... (read more)

Are you interpreting me as arguing that alignment is easy in this post?

Not in any sense which I think is relevant to the discussion at this point.

Are you saying that MIRI has been very consistent on the question of where the "hard parts" of alignment lie?

My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models. 

That d... (read more)

Can you explain how you're defining outer alignment and value specification?

I'm using this definition, provided by Hubinger et al.

the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.

Evan Hubinger provi... (read more)

I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.

I don't think it's necessary for them to have made that exact claim. The point is that they said value specification would be hard.

If you solve value specification, then you've arguably solved the outer alignment problem a... (read more)

2Lauro Langosco
I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion. IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they'd lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in 'what would be useful for avoiding AGI doom'? To me it looks like on your definition, outer alignment is basically a trivial problem that doesn't help alignment much. More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.

I think this is similar enough (and false for the same reasons)

I agree the claim is "similar". It's actually a distinct claim, though. What are the reasons why it's false? (And what do you mean by saying that what I wrote is "false"? I think the historical question is what's important in this case. I'm not saying that solving the value specification problem means that we have a full solution to the alignment problem, or that inner alignment is easy now.)

I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.

I agree lots of the responses elide the part where you emphasize that it's important how GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that... (read more)

I'm sympathetic to some of these points, but overall I think it's still important to acknowledge that outer alignment seems easier than many expected even if we think that inner alignment is still hard. In this post I'm not saying that the whole alignment problem is now easy. I'm making a point about how we should update about the difficulty of one part of the alignment problem, which was at one time considered both hard and important to solve.

I think you're putting a bit too much weight on the inner vs outer alignment distinction. The central problem th

... (read more)
3Lauro Langosco
Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values? (I'm asking because it's hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it's possible for outer alignment to become easier without the rest of the problem becoming easier).

Addendum to the post: all three people who this post addressed (Eliezer, Nate and Rob) responded to my post by misinterpreting me as saying that MIRI thought AIs wouldn't understand human values. However, I clearly and explicitly distanced myself from such an interpretation in the post. These responses were all highly upvoted despite this error. This makes me pessimistic about having a nuanced conversation about this topic on LessWrong. I encourage people to read my post carefully and not assume that people in the comments are reporting the thesis accurately.

You make a claim that's very close to that - your claim, if I understand correctly, is that MIRI thought AI wouldn't both understand human values and avoid lying to us about them (or otherwise deciding to give misleading or unhelpful outputs):

The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of a

... (read more)

Thanks for this comment. I think this is a good-faith reply that tries to get to the bottom of the disagreement. That said, I think you are still interpreting me as arguing that MIRI said AI wouldn't understand human values, when I explicitly said that I was not arguing that. Nonetheless, I appreciate the extensive use of quotations to precisely pinpoint where you disagree; this is high-quality engagement.

The main thing I'm claiming is that MIRI people said it would be hard to specify (for example, write into a computer) an explicit function that reflects ... (read more)

The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values. 

The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value speci

... (read more)

Glancing back at my "Value Learning" paper, the abstract includes "Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended", which supports my recollection that I was never trying to use "Value Learning" for "getting the AI to understand human values is hard" as opposed to "getting the AI to act towards value in particular (as opposed to something else) is hard", and supports my sense that this isn't hindsight bias, and is in fact a misunderstanding.

For what it's worth, I didn't claim that you argue... (read more)

That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)

I still feel like I can only barely maybe half-see what you're saying, and only have a tenuous grasp on it.

Like: why is it supposed to matter that GPT can solve ethical quandaries on par with its ability to perform other tasks? I can still only half-see an answer that doesn't route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences ... (read more)

A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.

[...]

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of

... (read more)
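To make the Straw-MB point concrete, here is a minimal sketch of what such a "very short message" could look like (illustrative only; `query_llm` is a hypothetical stand-in for whatever capable model is being queried, not a real library call):

```python
from typing import Callable

def make_value_function(query_llm: Callable[[str], str]) -> Callable[[str], float]:
    """Wrap a language model into a crude, explicit value function.

    `query_llm` is a hypothetical helper that sends a prompt to some capable
    model and returns its text response; any concrete API or local model
    could be substituted here.
    """
    def value_of(outcome_description: str) -> float:
        prompt = (
            "On a scale from 0 (catastrophic) to 10 (excellent), rate how good "
            "the following outcome is by ordinary human standards. "
            "Reply with a single number.\n\n"
            f"Outcome: {outcome_description}"
        )
        reply = query_llm(prompt)
        try:
            # Parse the leading number in the model's reply as the rating.
            return float(reply.strip().split()[0])
        except (ValueError, IndexError):
            raise ValueError(f"Could not parse a numeric rating from: {reply!r}")
    return value_of
```

The sketch is short precisely because everything load-bearing lives inside the model being queried, which is the part the inner-alignment worry is about.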

I don't think the critical point of contention here is about whether par-human moral reasoning will help with alignment. It could, but I'm not making that argument. I'm primarily making the argument that specifying the human value function, or getting an AI to reflect back (and not merely passively understand) the human value function, seems easier than many past comments from MIRI people suggest. This problem is one aspect of the alignment problem, although by no means all of it, and I think it's important to point out that we seem to be approaching an adequate solution.

I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:

If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.

I never said that you or any other MIRI person thought it would be "hard to get a superintelligence to understand humans". Here's what I actually wrote:

Non-MIRI p

... (read more)

Without digging in too much, I'll say that this exchange and the OP are pretty confusing to me. It sounds like MB is like "MIRI doesn't say it's hard to get an AI that has a value function" and then also says "GPT has the value function, so MIRI should update". This seems almost contradictory.

A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.

And EY is blobbing those two... (read more)

So this seems to me like it's the crux. I agree with you that GPT-4 is "pretty good", but I think the standard necessary for things to go well is substantially higher than "pretty good"

That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future. Do you doubt that? If you or anyone else at MIRI doubts that, then I'd be interested in making this prediction more precise, and potentially offering to bet MIRI people on this claim.

requiring a

... (read more)

That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.

We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?

5Matthew "Vaniver" Gray
For me, the answer here is "probably yes"; I think there is some bar of 'moral' and 'intelligent' where this doesn't happen, but I don't feel confident about where it is. I think there are two things that I expect to be big issues, and probably more I'm not thinking of:

* Managing freedom for others while not allowing for catastrophic risks; I think lots of ways to mismanage that balance result in 'destroying the world', probably with different levels of moral loss.
* The relevant morality is different for different social roles--someone being a good neighbor does not make them a good judge or good general. Even if someone scores highly on a 'general factor of morality' (assuming that such a thing exists) it is not obvious they will make for a good god-emperor. There is relatively little grounded human thought on how to be a good god-emperor. [Another way to put this is that "preserving their moral faculties" is not obviously enough / a good standard; probably their moral faculties should develop a lot in contact with their new situation!]

I understand Eliezer's position to be that 1) intelligence helps with moral judgment and 2) it's better to start with biological humans than whatever AI design is best at your intelligence-related subtask, but also that intelligence amplification is dicey business and this is more like "the least bad option" than one that seems actively good. Like we have some experience inculcating moral values in humans that will probably generalize better to augmented humans than it will to AIs; but also I think Eliezer is more optimistic (for timing reasons) about amplifications that can be done to adult humans.

Yeah, my interpretation of that is "if your target is the human level of wisdom, it will destroy humans just like humans are on track to do." If someone is thinking "will this be as good as the Democrats being in charge or the Republicans being in charge?" they are not grappling with the difficulty of successfully wielding futur

Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries.

I think this comparison is imperfect. Standard economic models predict an acceleration in the growth rate by at least an order of magnitude, and usually more. Over one decade, an increase in economic capacity by 1-4 orders of magnitude seems probable. By contrast, my understanding was that the models of open borders roughly predict a one-time doubling of world GDP over several decades, and for housing, it's something like a 50% increase in GDP ove... (read more)
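A rough back-of-the-envelope comparison of what these claims imply for annualized growth rates (a sketch, assuming growth is spread evenly over the stated period and taking "several decades" to mean thirty years, which is my own placeholder):

```python
# Annualized growth rate implied by multiplying output by `factor` over `years` years.
def annualized_rate(factor: float, years: float) -> float:
    return factor ** (1 / years) - 1

# 1 to 4 orders of magnitude of growth over a decade (the advanced-AI scenario):
for k in range(1, 5):
    print(f"10^{k} over 10 years -> {annualized_rate(10 ** k, 10):.0%} per year")
# prints roughly 26%, 58%, 100%, and 151% per year

# A one-time doubling of world GDP over ~30 years (the open-borders estimate):
print(f"2x over 30 years -> {annualized_rate(2, 30):.1%} per year")  # roughly 2.3%
```

The contrast is the point: even the low end of the AI scenario implies sustained growth rates an order of magnitude above what the open-borders or housing-deregulation estimates imply.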

According to data that I grabbed from Cruise, my (admittedly wildly speculative) projection of their growth reveals that driverless cars may become near-ubiquitous by the end of 2027. More specifically, my extrapolation is for the cumulative number of miles driven by Cruise cars by the end of 2027 to approach one trillion, which can be compared to the roughly 3 trillion miles driven per year by US drivers. Now obviously, we might get AGI before that happens. And maybe (indeed it's likely) Cruise's growth will slow down at some point before they hi... (read more)
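To make the shape of that extrapolation explicit, here is a small sketch of the arithmetic (the starting figure and horizon below are hypothetical placeholders, not Cruise's actual numbers; the point is only to show what sustained exponential growth in cumulative miles would have to look like):

```python
import math

def required_monthly_growth(current_cumulative_miles: float,
                            target_cumulative_miles: float,
                            months: int) -> float:
    """Constant month-over-month growth in cumulative miles needed to reach the target."""
    return (target_cumulative_miles / current_cumulative_miles) ** (1 / months) - 1

# Hypothetical placeholders: ~5 million cumulative driverless miles today,
# 48 months remaining, and a 1 trillion cumulative-mile target.
g = required_monthly_growth(5e6, 1e12, 48)
print(f"Required growth: {g:.0%} per month "
      f"(~{math.log(2) / math.log(1 + g):.1f}-month doubling time)")
```

Under those placeholder inputs the implied growth is roughly 29% per month, i.e. a doubling time of under three months, which is why the projection is sensitive to any slowdown of the kind mentioned above.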

4Daniel Kokotajlo
Update: It's behind a paywall so I can't verify, but I'm told this is a quote from "Cruise Grew Fast and Angered Regulators. Now It’s Dealing With the Fallout." - The New York Times (nytimes.com)

If these numbers are accurate, it seems my original take was correct after all. 2.5 to 5 miles per intervention is significantly worse than Tesla IIRC. And 1.5 employees per car, remotely operating the vehicle when it gets into trouble, is not robotaxi material.

Awesome. I must admit I wasn't aware of this trend & it's an update for me. Hooray! Robotaxis are easier than I thought! Thanks.

I'm a bit surprised that none of the definitions you encountered focused on phenomenal consciousness: the feeling of what it's like to experience the world from a first-person perspective, i.e. what p-zombies lack.

I don't want to speculate much here, but it's also possible that people mentioned this definition and you translated what they said into something more concrete and unambiguous (which I think might be reasonable, depending on whether you are an eliminativist about phenomenal consciousness).
