AI ALIGNMENT FORUM
AF

All of Edouard Harris's Comments + Replies

What’s the short timeline plan?

Edouard Harris2mo66

This is a great & timely post.

Announcing Epoch’s dashboard of key trends and figures in Machine Learning

Edouard Harris2y10

Gotcha, that makes sense!

Announcing Epoch’s dashboard of key trends and figures in Machine Learning

Edouard Harris2y10

Looks awesome! Minor correction on the cost of the GPT-4 training run: the website says $40 million, but sama confirmed publicly that it was over $100M (and several news outlets have reported the latter number as well).

1Jaime Sevilla2y

Thanks! Our current best guess is that this includes costs other than the amortized compute of the final training run. If no extra information surfaces we will add a note clarifying this and/or adjust our estimate.

Inverse scaling can become U-shaped

Edouard Harris2y22

Done, a few days ago. Sorry thought I'd responded to this comment.

Inverse scaling can become U-shaped

Edouard Harris2y30

Excellent context here, thank you. I hadn't been aware of this caveat.

2Neel Nanda2y

I'd recommend editing a link to Ethan's comment to the top of the post - I think people could easily lead with a misleading impression otherwise

Misalignment-by-default in multi-agent systems

Edouard Harris2y20

Great question. This is another place where our model is weak, in the sense that it has little to say about the imperfect information case. Recall that in our scenario, the human agent learns its policy in the absence of the AI agent; and the AI agent then learns its optimal policy conditional on the human policy being fixed.

It turns out that this setup dodges the imperfect information question from the AI side, because the AI has perfect information on all the relevant parts of the human policy during its training. And it dodges the imperfect information ... (read more)

Misalignment-by-default in multi-agent systems

Edouard Harris2y10

Agreed. We think our human-AI setting is a useful model of alignment in the limit case, but not really so in the transient case. (For the reason you point out.)

Misalignment-by-default in multi-agent systems

Edouard Harris2y40

I think you might have reversed the definitions of $α_{H A}$ and $β_{H A}$ in your comment,^[1] but otherwise I think you're exactly right.

To compute $β_{H A}$ (the correlation coefficient between terminal values), naively you'd have reward functions $R_{H} (s)$ and $R_{A} (s)$ , that respectively assign human and AI rewards over every possible arrangement of matter $s$ . Then you'd look at every such reward function pair over your joint distribution $D_{H A}$ , and ask how correlated they are over arrangements of matter. If you like, you... (read more)

2Alex Flint2y

OK, good, thanks for that correction. One question I have is: how do you avoid two perfectly aligned agents from developing instrumental values concerning their own self-preservation and then becoming instrumentally misaligned as a result? In a little more detail: consider two agents, both trying to build a house, with perfectly aligned preferences over what kind of house should be built. And suppose the agents have only partial information about the environment -- enough, let's say, to get the house built, but not enough, let's say, to really understand what's going on inside the other agent. Then wouldn't the two agents both reason "hey if I die then who knows if this house will be built correctly; I better take steps towards self-preservation just to make sure that the house gets built". Then the two agents might each take steps to build physical protection for themselves, to acquire resources with which to do that, and eventually to fight over resources, even though their goals are, in truth, perfectly aligned. Is it true that this would happen under an imperfect information version of your model?

Misalignment-by-default in multi-agent systems

Edouard Harris2y22

Good question. Unfortunately, one weakness of our definition of multi-agent POWER is that it doesn't have much useful to say in a case like this one.

We assume AI learning timescales vastly outstrip human learning timescales as a way of keeping our definition tractable. So the only way to structure this problem in our framework would be to imagine a human is playing chess against a superintelligent AI — a highly distorted situation compared to the case of two roughly equal opponents.

On the other hand, from other results I've seen anecdotally, I suspect that... (read more)

2Noosphere892y

I think this is probably true in the long term (the classical-quantum/reversible computer transition is very large, and humans can't easily modify brains, unlike a virtual human.) But this may not be true in the short-term.

Instrumental convergence in single-agent systems

Edouard Harris2y*30

Thanks for you comment. These are great questions. I'll do the best I can to answer here, feel free to ask follow-ups:

On pre-committing as a negotiating tactic: If I've understood correctly, this is a special case of the class of strategies where you sacrifice some of your own options (bad) to constrain those of your opponent (good). And your question is something like: which of these effects is strongest, or do they cancel each other out?

It won't surprise you that I think the answer is highly context-dependent, and that I'm not sure which way it would act

Edouard Harris2y30

Yes, I think this is right. It's been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a "correct" model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.

We've focused on simple gridworlds here, partly because they're visual, but also because they're tractable. But I suspect there's a mapping between POWER (in the RL context) and generaliza... (read more)

The shard theory of human values

Edouard Harris3y10

Got it. That makes sense, thanks!

The shard theory of human values

Edouard Harris3y52

This is really interesting. It's hard to speak too definitively about theories of human values, but for what it's worth these ideas do pass my intuitive smell test.

One intriguing aspect is that, assuming I've followed correctly, this theory aims to unify different cognitive concepts in a way that might be testable:

On the one hand, it seems to suggest a path to generalizing circuits-type work to the model-based RL paradigm. (With shards, which bid for outcomes on a contextually activated basis, being analogous to circuits, which contribute to prediction pro

... (read more)

5Alex Turner3y

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard economy. Importantly, I didn't want the reader to think that we're positing a bunch of homunculi. Maybe I should have just written that. But I also feel relatively ignorant more advanced shard dynamics. While I can give interesting speculation, I don't have enough evidence-fuel to make such stories actually knowably correct.

Announcing Epoch: A research organization investigating the road to Transformative AI

Edouard Harris3y67

Nice. Congrats on the launch! This is an extremely necessary line of effort.

AGI Ruin: A List of Lethalities

Edouard Harris3y10

Interesting. The specific idea you're proposing here may or may not be workable, but it's an intriguing example of a more general strategy that I've previously tried to articulate in another context. The idea is that it may be viable to use an AI to create a "platform" that accelerates human progress in an area of interest to existential safety, as opposed to using an AI to directly solve the problem or perform the action.

Essentially:

A "platform" for work in domain X is something that removes key constraints that would otherwise have consumed human time an

... (read more)

Alignment versus AI Alignment

Edouard Harris3y10

Yep, I'd say I intuitively agree with all of that, though I'd add that if you want to specify the set of "outcomes" differently from the set of "goals", then that must mean you're implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it's a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it's a complete description of the features of the universe that the system can perceive).

This mapping fr... (read more)

1Alex Flint3y

Ah so I think what you're saying is that for a given outcome, we can ask whether there is a goal we can give to the system such that it steers towards that outcome. Then, as a system becomes more powerful, the range of outcomes that it can steer towards expands. That seems very reasonable to me, though the question that strikes me as most interesting is: what can be said about the internal structure of physical objects that have power in this sense?

Alignment versus AI Alignment

Edouard Harris3y10

Gotcha. I definitely agree with what you're saying about the effectiveness of incentive structures. And to be clear, I also agree that some of the affordances in the quote reasonably fall under "alignment": e.g., if you explicitly set a specific mission statement, that's a good tactic for aligning your organization around that specific mission statement.

But some of the other affordances aren't as clearly goal-dependent. For example, iterating quickly is an instrumentally effective strategy across a pretty broad set of goals a company might have. That (in m... (read more)

4Alex Flint3y

Yes, I think what you're saying is that there is (1) the set of all possible outcomes, (2) within that, the set of outcomes where the company succeeds with respect to any goal, and (3) within that, the set of outcomes where the company succeeds with respect to the operator's goal. The capability-increasing interventions, then, are things that concentrate probability mass onto (2), whereas the alignment-increasing interventions are things that concentrate probability mass onto (3). This is a very interesting way to say it and I think it explains why there is a spectrum from alignment to capabilities. Very roughly, (1) corresponds to any system whatsoever, (2) corresponds to a system that is generally powerful, and (3) corresponds to a system that is powerful and aligned. We are not so worried about non-powerful unaligned systems, and we are not worried at all about powerful aligned systems. We are worried about the awkward middle ground - powerful unaligned systems.

Alignment versus AI Alignment

Edouard Harris3y10

Thanks, great post.

These include formulating and repeating a clear mission statement, setting up a system for promotions that rewards well-calibrated risk taking, and iterating quickly at the beginning of the company in order to habituate a rhythm of quick iteration cycles.

I may be misunderstanding, but wouldn't these techniques fall more under the heading of capabilities rather than under alignment? These are tactics that should increase a company's effectiveness in general, for most reasonable mission statements or products the company could have.

3Alex Flint3y

Thank you. I was thinking of the incentive structure of a company (to focus on one example) as an affordance for aligning a company with a particular goal because if you set the incentive structure up right then you don’t have to keep track of everything that everyone does within the company, you can just (if you do it well) trust that the net effect of all those actions will optimize something that you want it to optimize (much like steering via the goals of an AI or steering via the taxes and regulations of a market). But I think actually you are pointing to a very important way that alignment generally requires clarity, and clarity generally increases capabilities. This is present also in AI development: if we gained the insight necessary to build a very clear consequentialist AI that we knew how to align, we would simultaneously increase capabilities due to the same clarity. Interested in your thoughts.

Estimating training compute of Deep Learning models

Edouard Harris3y50

This is fantastic. Really appreciate both the detailed deep-dive in the document, and the summary here. This is also timely, given that teams working on superscale models with concerning capabilities haven't generally been too forthcoming with compute estimates. (There are exceptions.)

As you and Alex point out in the sibling thread, the biggest remaining fudge factors seem to be:

Mixture models (or any kind of parameter-sharing, really) for the first method, which will cause you to systematically overestimate the "Operations per forward pass" factor; and
Var

... (read more)

Reply to Eliezer on Biological Anchors

Edouard Harris3y910

It's simply because we each (myself more than her) have an inclination to apply a fair amount of adjustment in a conservative direction, for generic "burden of proof" reasons, rather than go with the timelines that seem most reasonable based on the report in a vacuum.

While one can sympathize with the view that the burden of proof ought to lie with advocates of shorter timelines when it comes to the pure inference problem ("When will AGI occur?"), it's worth observing that in the decision problem ("What should we do about it?") this situation is reversed. T... (read more)

HoldenKarnofsky3y51

I agree with this. I often default to acting as though we have ~10-15 years, partly because I think leverage is especially high conditional on timelines in that rough range.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y10

This is an excellent point and it's indeed one of the fundamental limitations of a public tracking approach. Extrapolating trends in an information environment like this can quickly degenerate into pure fantasy. All one can really be sure of is that the public numbers are merely lower bounds — and plausibly, very weak ones.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y30

Yeah, great point about Gopher, we noticed the same thing and included a note to that effect in Gopher's entry in the tracker.

I agree there's reason to believe this sort of delay could become a bigger factor in the future, and may already be a factor now. If we see this pattern develop further (and if folks start publishing "model cards" more consistently like DM did, which gave us the date of Gopher's training) we probably will begin to include training date as separate from publication date. But for now, it's a possible trend to keep an eye on.

Thanks again!

What Selection Theorems Do We Expect/Want?

Edouard Harris3y40

A more typical example: I can look at a chain of options on a stock, and use the prices of those options to back out market-implied probabilities for each possible stock price at expiry.

Gotcha, this is a great example. And the fundamental reasons why this works are 1) the immediate incentive that you can earn higher returns by pricing the option more correctly; combined with 2) the fact that the agents who are assigning these prices have (on a dollar-weighted-average basis) gone through multiple rounds of selection for higher returns.

(I wonder to what exte... (read more)

What Selection Theorems Do We Expect/Want?

Edouard Harris3y30

Okay, then to make sure I've understood correctly: what you were saying in the quoted text is that you'll often see an economist, etc., use coherence theorems informally to justify a particular utility maximization model for some system, with particular priors and conditionals. (As opposed to using coherence theorems to justify the idea of EU models generally, which is what I'd thought you meant.) And this is a problem because the particular priors and conditionals they pick can't be justified solely by the coherence theorem(s) they cite.

The problem with V

... (read more)

6johnswentworth3y

Correct. This is a problem not because I want the choices fully justified, but rather because with many real world systems it's not clear exactly how I should set up my agent model. For instance, what's the world model and utility function of an e-coli? Some choices would make the model tautological/trivial; I want my claim that e.g. an e-coli approximates a Bayesian expected utility maximizer to have nontrivial and correct implications. I want to know the sense-in-which an e-coli approximates a Bayesian expected utility maximizer, and a rock doesn't. The coherence theorems tell us how to do that. They provide nontrivial sufficient conditions (like e.g. pareto optimality) which imply (and are implied by) particular utilities and world models. Your example is correct, though it is not the usual way of obtaining probabilities from coherence conditions. (Well, ok, in actual practice it kinda is the usual way, because existing coherence theorems are pretty weak. But it's not the usual way used by people who talk about coherence theorems a lot.) A more typical example: I can look at a chain of options on a stock, and use the prices of those options to back out market-implied probabilities for each possible stock price at expiry. Many coherence theorems do basically the same thing, but "prices" are derived from the trade-offs an agent accepts, rather than from a market.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y40

Thanks so much for the feedback!

The ability to sort by model size etc would be nice. Currently sorting is alphabetical.

Right now the default sort is actually chronological by publication date. I just added the ability to sort by model size and compute budget at your suggestion. You can use the "⇅ Sort" button in the Models tab to try it out; the rows should now sort correctly.

Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like "deep learning" in almost all r

... (read more)

7p.b.3y

Much better now! The date published vs date trained was on my mind because of Gopher. It seemed to me very relevant,that Deepmind trained a significantly larger model within basically half a year of the publication of GPT-3. In addition to google brain also being quite coy about their 100+B model it made me update a lot in the direction of "the big players will replicate any new breakthrough very quickly but not necessarily talk about it." To be clear, I also think it probably doesn't make sense to include this information in the list, because it is too rarely relevant.

What Selection Theorems Do We Expect/Want?

Edouard Harris3y30

I’m surprised by just how much of a blindspot goal-inputs seem to be for today’s economists, AI researchers, etc. The coherence theorems usually cited to justify expected utility maximization models imply a quite narrow range of inputs to those utility functions: utilities are only over the outcomes on which agents can bet. Yet practitioners use utility functions over entire (unobservable) world states, world state trajectories, MDP states, etc, often without any way for the agent to bet on all of the outcomes.

It's true that most of the agents we build can... (read more)

5johnswentworth3y

The problem with VNM-style lotteries is that the probabilities involved have to come from somewhere besides the coherence theorems themselves. We need to have some other, external reason to think it's useful to model the environment using these probabilities. That also means that the "probabilities" associated with the lottery are not necessarily the agent's probabilities, at least not in the sense that the implied probabilities derived from coherence theorems are the agent's.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y10

Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

Thanks for the clarification — definitely agree with this.

If you'd like to visualize trends though, you'll need more historical data points, I think.

Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y10

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

it would be great to see older models incorporated as well

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.

1[anonymous]3y

Yes. They don't share a common lineage, but are similar in that they're both recent advances in efficient model-based RL. Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields. I see. If you'd like to visualize trends though, you'll need more historical data points, I think.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Edouard Harris3y50

Yeah, these are interesting points.

Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.

I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I'm not sure I entirely agree that disc... (read more)

Yudkowsky and Christiano discuss "Takeoff Speeds"

Edouard Harris3y182

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like "code a simple video game" and "summarize movies with emojis", they also include things like "break out of confinement and kill everyone". It's the latter capabil... (read more)

Matthew Barnett3y70

I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we've seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance.

This is a reasonable thesis, and if indeed it's the one Gwern intended, then I apologize for missing it!

That said, I have a few objections,

Isn't it a bit suspicious that the thing-that's-discontinuous is hard to measure, but the-thing-that's-continuous isn't? I mean, this isn't totally suspicious, because

Edouard Harris3y20

Good catch! I didn't check the form. Yes you are right, the spoiler should say (1=Paul, 9=Eliezer) but the conclusion is the right way round.

4Rafael Harth3y

Yeah, it's fixed now. Thanks for pointing it out.

Yudkowsky and Christiano discuss "Takeoff Speeds"

Edouard Harris3y30

(Not being too specific to avoid spoilers) Quick note: I think the direction of the shift in your conclusion might be backwards, given the statistics you've posted and that 1=Eliezer and 9=Paul.

4Lukas Finnveden3y

No, the form says that 1=Paul. It's just the first sentence under the spoiler that's wrong.

AI Tracker: monitoring current and near-future risks from superscale models

Edouard Harris3y30

Thanks for the kind words and thoughtful comments.

You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.

I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent so it's hard to get apples-to-apples here for a variety of reasons. (e.g., diffe... (read more)

A positive case for how we might succeed at prosaic AI alignment

Edouard Harris3y*10

Gotcha. Well, that seems right—certainly in the limit case.

A positive case for how we might succeed at prosaic AI alignment

Edouard Harris3y80

Thanks, that helps. So actually this objection says: "No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you've built yourself an AGI. But since this myopic copying... (read more)

5Eliezer Yudkowsky3y

Closer, yeah. In the limit of doing insanely complicated things with Bob you will start to break him even if he is faithfully simulated, you will be doing things that would break the actual Bob; but I think HCH schemes fail long before they get to that point.

A positive case for how we might succeed at prosaic AI alignment

Edouard Harris3y280

This is a great thread. Let me see if I can restate the arguments here in different language:

Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob's brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, "You have a million subjective years to think of an effective pivotal act i

... (read more)

Eliezer Yudkowsky3y150

Eliezer's counterargument is "You don't get a high-fidelity copy of Bob that can be iterated and recursed to do arbitrary amounts of work a Bob-army could do, the way Bob could do it, until many years after the world otherwise ends. The imitated Bobs are imperfect, and if they scale to do vast amounts of work, kill you."

johnswentworth3y110

Abstracting out one step: there is a rough general argument that human-imitating AI is, if not perfectly safe, then at least as safe as the humans it's imitating. In particular, if it's imitating humans working on alignment, then it's at least as likely as we are to come up with an aligned AI. Its prospects are no worse than our prospects are already. (And plausibly better, since the simulated humans may have more time to solve the problem.)

For full strength, this argument requires that:

It emulate the kind of alignment research which the actual humans woul

Edouard Harris3y20

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other resear... (read more)

Optimization Concepts in the Game of Life

Edouard Harris3y40

Great catch. For what it's worth, it actually seems fine to me intuitively that any finite pattern would be an optimizing system for this reason, though I agree most such patterns may not directly be interesting. But perhaps this is a hint that some notion of independence or orthogonality of optimizing systems might help to complete this picture.

Here's a real-world example: you could imagine a universe where humans are minding their own business over here on Earth, while at the same time, over there in a star system 20 light-years away, two planets are hur... (read more)

Forecasting progress in language models

Edouard Harris3y20

Extremely interesting — thanks for posting. Obviously there are a number of caveats which you carefully point out, but this seems like a very reasonable methodology and the actual date ranges look compelling to me. (Though they also align with my bias in favor of shorter timelines, so I might not be impartial on that.)

One quick question about the end of this section:

The expected number of bits in original encoding per bits in the compression equals the entropy of that language.

Wouldn't this be the other way around? If your language has low entropy it shoul... (read more)

Optimization Concepts in the Game of Life

Edouard Harris3y20

Thanks! I think this all makes sense.

Oh yeah, I definitely agree with you that the empty board would be an optimizing system in the GoL context. All I meant was that the "Death" square in the examples table might not quite correspond to it in the analogy, since the death property is perhaps not an optimization target by the definition. Sorry if that wasn't clear.
:)
Got it, thanks! So if I've understood correctly, you are currently only using the mask as a way to separate the agent from its environment at instantiation, since that is all you real

... (read more)

3Victoria Krakovna3y

1. Actually, we realized that if we consider an empty board an optimizing system, then any finite pattern is an optimizing system (because it's similarly robust to adding non-viable collections of live cells), which is not very interesting. We have updated the post to reflect this.

Optimization Concepts in the Game of Life

Edouard Harris3y50

Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.

A few questions / comments:

About the examples: do you think it's strictly correct to say that entropy / death is an optimizing system? One of the conditions of the Flint definition is that the set of target states ought to be substantially smaller than the basin of attraction, by some measure on the configuration space. Yet neither high entropy nor death seem like they satisfy this: there are too many ways to be dead, and (tautolog

... (read more)

4Ramana Kumar3y

Nice comment - thanks for the feedback and questions! 1. I think the specific example we had in mind has a singleton set of target states: just the empty board. The basin is larger: boards containing no groups of more than 3 live cells. This is a refined version of "death" where even the noise is gone. But I agree with you that "high entropy" or "death", intuitively, could be seen as a large target, and hence maybe not an optimization target. Perhaps compare to the black hole. 2. Great suggestion - I think the "macrostate" terminology may indeed be a good fit / worth exploring more. 3. Thanks! I think there are probably external perturbations that can't be represented as embedded perturbations. 4. Thanks! 5. The mask applies only at the instant of instantiation, and then is irrelevant for the rest of the computation, in the way we've set things up. (This is because once you've used the mask to figure out what the initial state for the computation is, you then have just an ordinary state to roll out.)If we wanted to be able to find the agent again later on in the computation then yes indeed some kind of un-instantiation operation might need a mask to do that - haven't thought about it much but could be interesting.

Meta learning to gradient hack

Edouard Harris3y10

Very neat. It's quite curious that switching to L2 for the base optimizer doesn't seem to have resulted in the meta-initialized network learning the sine function. What sort of network did you use for the meta-learner? (It looks like the 4-layer network in your Methods refers to your base optimizer, but perhaps it's the same architecture for both?)

Also, do you know if you end up getting the meta-initialized network to learn the sine function eventually if you train for thousands and thousands of steps? Or does it just never learn no matter how hard you train it?

AI takeoff story: a continuation of progress by other means

Edouard Harris3y20

I see — perhaps I did misinterpret your earlier comment. It sounds like the transition you are more interested in is closer to (AI has ~free rein over the internet) => (AI invents nanotech). I don't think this is a step we should expect to be able to model especially well, but the best story/analogy I know of for it is probably the end part of That Alien Message. i.e., what sorts of approaches would we come up with, if all of human civilization was bent on solving the equivalent problem from our point of view?

If instead you're thinking more about a tran... (read more)

AI takeoff story: a continuation of progress by other means

Edouard Harris3y20

No problem, glad it was helpful!

And thanks for the APS-AI definition, I wasn't aware of the term.

AI takeoff story: a continuation of progress by other means

Edouard Harris3y40

Thanks! I agree with this critique. Note that Daniel also points out something similar in point 12 of his comment — see my response.

To elaborate a bit more on the "missing step" problem though:

I suspect many of the most plausible risk models have features that make it undesirable for them to be shared too widely. Please feel free to DM me if you'd like to chat more about this.
There will always be some point between Step 1 and Step 3 at which human-legible explanations fail. i.e., it would be extremely surprising if we could tell a coherent story about the

... (read more)

0Aegeus3y

Is it something like the AI-box argument? "If I share my AI breakout strategy, people will think 'I just won't fall for that strategy' instead of noticing the general problem that there are strategies they didn't think of"? I'm not a huge fan of that idea, but I won't argue it further. I'm not expecting a complete explanation, but I'd like to see a story that doesn't skip directly to "AI can reformat reality at will" without at least one intermediate step. Like, this is the third time I've seen an author pull this trick and I'm starting to wonder if the real AI-safety strategy is "make sure nobody invents grey-goo nanotech." If you have a ball of nanomachines that can take over the world faster than anyone can react to it, it doesn't really matter if it's an AI or a human at the controls, as soon as it's invented everyone dies. It's not so much an AI-risk problem as it is a problem with technological progress in general. (Fortunately, I think it's still up for debate whether it's even possible to create grey-goo-style nanotech.)

AI takeoff story: a continuation of progress by other means

Edouard Harris3y20

See my response to point 6 of Daniel's comment — it's rather that I'm imagining competing hedge funds (run by humans) beginning to enter the market with this sort of technology.

AI takeoff story: a continuation of progress by other means

Edouard Harris3y*60

Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your "2026" post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.

1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.

I've intentionally avoided referring to absolute dates, other than by indirect implication (e.g. "iOS 19... (read more)

5Daniel Kokotajlo3y

Thanks, this was a load of helpful clarification and justification! APS-AI means Advanced, Planning, Strategically-Aware AI. Advanced means superhuman at some set of tasks (such as persuasion, strategy, etc.) that combines to enable the acquisition of power and resources, at least in today's world. The term & concept is due to Joe Carlsmith (see his draft report on power-seeking AI, he blogged about it a while ago).

The theory-practice gap

Edouard Harris3y30

I see. Okay, I definitely agree that makes sense under the "fails to generalize" risk model. Thanks Rohin!

The theory-practice gap

Edouard Harris3y30

Got it, thanks!

I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.

This helps, and I think it's the part I don't currently have a great intuition for. My best attempt at steel-manning would be something like: "It's plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about." (Where "correctly" here means "in a way that's consistent with its builders' wishes.") And you could plausibly argue that an AGI would hav... (read more)

4Rohin Shah3y

It's nothing quite so detailed as that. It's more like "maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn't a strong reason to expect one over the other". (Which is why I only say it is plausible that the AI system works fine, rather than probable.) You might think that the default expectation is that AI systems don't generalize. But in the world where we've gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.

The theory-practice gap

Edouard Harris3y30

I agree with pretty much this whole comment, but do have one question:

But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we've retrained the model before we get to the exotic circumstances, etc), and it's intent aligned in all the circumstances the model actually encounters.

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances... (read more)

4Rohin Shah3y

There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea. I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.