In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.

In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.

By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.

Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:

  • Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.
  • Deceptive inner misalignment: an inner agent behaves well in order to deceive us, so we can’t tell there’s a problem just by trying stuff and looking at the system’s behavior.

… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.

Basics: Hiding Problems

Example/Analogy: The Software Executive

Imagine that a software company executive, concerned about the many errors coming from the software, creates a new incentive scheme: software developers get a monetary reward for changes which decrease the rate of error messages showing up on the manager’s dashboard, and get docked for changes which increase the rate of error messages.

As Tyler Cowen would say: “solve for the equilibrium”. The obvious equilibrium here: the developers stop emitting error messages when a problem is detected, and the software just fails silently instead. The customer’s experience remains the same, but the manager’s dashboard shows fewer error messages. Over time, the customer’s experience probably degrades, as more and more problems go undetected.

In the short run, the strategy may eliminate some problems, but in the long run it breaks the iterative design loop: problems are not seen, and therefore not iterated upon. The loop fails at the “see what goes wrong” step.
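The dynamic is simple enough to sketch in a few lines of code. Here’s a toy simulation (all numbers invented for illustration): each patch removes one error message from the dashboard, but only some fraction of patches actually fix the underlying problem, and the executive’s metric improves either way.

```python
import random

random.seed(0)

def simulate(n_patches=200, p_real_fix=0.3):
    """Each patch removes one error message from the manager's dashboard.
    With probability p_real_fix it genuinely fixes the underlying problem;
    otherwise it merely suppresses the message (a silent failure)."""
    dashboard_errors = 100  # error messages the manager sees
    hidden_problems = 0     # real problems that are no longer reported
    for _ in range(n_patches):
        if dashboard_errors == 0:
            break
        dashboard_errors -= 1             # every patch improves the metric...
        if random.random() > p_real_fix:  # ...but most only hide the problem
            hidden_problems += 1
    return dashboard_errors, hidden_problems

dash, hidden = simulate()
print(dash, hidden)  # dashboard is clean; most problems are merely hidden
```

The dashboard ends up spotless while most of the original problems are still there, now invisible: the metric cannot distinguish a fix from a cover-up, so selection pressure on the metric alone produces cover-ups.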

Why RLHF Is Uniquely Terrible

The software executive’s strategy is the same basic idea as Reinforcement Learning from Human Feedback (RLHF). AI does something, a human looks at what happened to see if it looks good/bad, and the AI is trained on the human’s feedback. Just like the software executive’s anti-error-message compensation scheme, RLHF will probably result in some problems actually being fixed in the short term. But it renders the remaining problems far less visible, and therefore breaks the iterative design loop. In the context of AI, RLHF makes it far more likely that a future catastrophic error will have no warning signs, that overseers will have no idea that there’s any problem at all until it’s much too late.

Note that this issue applies even at low capability levels! Humans overlook problems all the time, some of those mistakes are systematic, and RLHF will select for places where humans systematically overlook problems; that selection pressure applies even when the neural net lacks great capabilities.

[Image: a net learns to hold its hand in front of the ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.]
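Here’s a toy model of that selection pressure (the numbers and candidate behaviors are invented for illustration): three candidate policies make mistakes at the same overall rate, but distribute them differently between mistakes a human labeler reliably notices and mistakes the labeler usually misses. Selecting on human approval picks out the policy whose mistakes concentrate in the blind spot.

```python
import random

random.seed(1)

# A human labeler always notices "loud" mistakes, but notices "quiet"
# mistakes only 10% of the time.
P_NOTICE = {"loud": 1.0, "quiet": 0.1}

def human_rating(behavior, n_episodes=1000):
    """Fraction of episodes the labeler marks as problem-free."""
    flagged = 0
    for _ in range(n_episodes):
        for kind, p_mistake in behavior.items():
            if random.random() < p_mistake and random.random() < P_NOTICE[kind]:
                flagged += 1
                break
    return 1 - flagged / n_episodes

# Three candidate policies whose mistake rates sum to the same total (0.3),
# distributed differently between loud and quiet mistakes:
candidates = {
    "mostly_loud":  {"loud": 0.25, "quiet": 0.05},
    "balanced":     {"loud": 0.15, "quiet": 0.15},
    "mostly_quiet": {"loud": 0.05, "quiet": 0.25},
}

best = max(candidates, key=lambda name: human_rating(candidates[name]))
print(best)  # selection favors mistakes the human systematically misses
```

None of the candidates is “trying” to deceive anyone; the blind-spot-exploiting behavior simply scores best on the only signal the training process has.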

This is the core reason why I consider RLHF uniquely terrible, among alignment schemes. It is the only strategy I know of which actively breaks the iterative design loop; it makes problems less visible rather than more.

Generalization: Iterate Until We Don’t See Any Problems

More generally, one of the alignment failure modes I consider most likely is that an organization building AGI does see some problems in advance. But rather than addressing root causes, they try to train away the problems, and instead end up training the problems to no longer be easily noticeable.

Does This Prove Too Much?

One counterargument: don’t real organizations create incentives like the software executive all the time? And we have not died of it.

Response: real organizations do indeed create incentives to hide problems all the time, and large organizations are notorious for hiding problems at every level. It doesn’t even require employees consciously trying to hide things; selection pressure suffices. Sometimes important problems become public knowledge when a disaster occurs, but that’s usually after the disaster. The only reason we haven’t died of it yet is that it is hard to wipe out the human species with only 20th-century human capabilities. 

Less Basic: Knowing What To Look For

Example/Analogy: The Fusion Power Generator

Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator - something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power.

One problem: at no point did it occur to me to ask “Can this design easily be turned into a bomb?”. Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn’t think to ask, so GPT-N had no reason to mention it. With the design in wide use, it’s only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage.

The root problem here is that I didn’t think to ask the right question; I didn’t pay attention to the right thing. An iterative design loop can sometimes help with that - empirical observation can draw our attention to previously-ignored issues. But when the failure mode does not happen in testing, the iterative design loop generally doesn’t draw our attention to it. An iterative design loop does not, in general, tell us which questions we need to ask.

Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

Example/Analogy: Gunpowder And The Medieval Lord

Imagine a medieval lord in a war against someone with slightly more advanced technological knowledge. We're not talking modern weaponry here, just gunpowder.

To the lord, it doesn’t look like the technologist is doing anything especially dangerous; mostly the technologist looks like an alchemist or a witch doctor. The technologist digs a hole, stretches a cloth over it, dumps a pile of shit on top, then runs water through the shit-pile for a while. Eventually they remove the cloth and shit, throw some coal and brimstone in the hole, and mix it all together.

From the lord’s perspective, this definitely looks weird and mysterious, and they may be somewhat worried about weird and mysterious things in general. But it’s not obviously any more dangerous than, say, a shaman running a spiritual ceremony.

It’s not until after the GIANT GODDAMN EXPLOSION that the shit-pile starts to look unusually dangerous.

Now, what helpful advice could an AI give this medieval lord?

Obviously the AI could say “the powder which comes out of that weird mysterious process is going to produce a GIANT GODDAMN EXPLOSION”. The problem is, it is not cheap for the medieval lord to verify the AI’s claim. Based on the lord’s knowledge, there is no a priori reason to expect the process to produce explosives rather than something else, and the amount of background knowledge the lord would need in order to verify the theory is enormous. The lord could, in principle, verify the AI’s claim experimentally, but then (a) the lord is following a complex procedure which he does not understand, handed to him by a not-necessarily-friendly AI, and (b) the lord is mixing homemade explosives in his backyard. Both of these are dubious decisions at best.

So if we’re already confident that the AI is aligned, sure, it can tell us what to look for. But if there’s two AIs side-by-side, and one is saying “that powder will explode” and the other is saying “the shit-pile ceremony allows one to see the world from afar, perhaps to spot holes in our defenses”, the lord cannot easily see which of them is wrong. The two can argue with each other debate-style, and the lord still will not easily be able to see which is wrong, because he would need enormously more background knowledge to evaluate the arguments correctly. And if he can’t tell what the problem is, then the iterative design process can’t fix it.

Example/Analogy: Leaded Gasoline

Leaded gasoline is a decent historical analogue of the Fusion Generator Problem, though less deadly. It did solve a real problem: engines ran smoother with leaded gas. The problems were nonobvious, and took a long time to manifest. The iterative design loop did not work, because we could not see the problem just by testing out leaded gas in a lab. A test would have had to run for decades, at large scale, in order to see the issue - and that’s exactly what happened.

One could reasonably object to this example as an analogy, on the basis that things which drive the human species extinct would be more obvious. Dead bodies draw attention. But what about things which make the human species more stupid or aggressive? Lead did exactly that, after all. It’s not hard to imagine a large-scale issue which makes humans stupid or aggressive to a much greater extent, but slowly over the course of years or decades, with the problems going undetected or unaddressed until too late.

That’s not intended to be a highly probable story; there’s too much specific detail. The point is that, even if the proximate cause of extinction is obvious, the factors which make that proximate cause possible may not be. A gradual path to extinction is a real possibility. When problems only manifest on long timescales, the iterative design process is bad at fixing them.
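The timescale problem can be sketched as a toy model (numbers invented for illustration): suppose each design flaw takes some random number of test cycles to manifest, and we fix everything we see during a fixed test period before shipping. Flaws whose latency exceeds the test period sail through untouched, no matter how diligently we iterate.

```python
import random

random.seed(3)

# Toy model: each design flaw takes some number of test cycles to manifest
# (exponentially distributed, mean 15). We test for TEST_CYCLES cycles,
# fix everything we observe, then ship.
TEST_CYCLES = 20
flaws = [random.expovariate(1 / 15) for _ in range(1000)]

caught  = sum(1 for latency in flaws if latency <= TEST_CYCLES)
shipped = sum(1 for latency in flaws if latency > TEST_CYCLES)
print(caught, shipped)  # a large share of flaws outlives the test period
```

No amount of extra diligence within the test window helps with the shipped flaws; only a longer window (decades, in the leaded-gasoline case) would catch them.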

Meta Example/Analogy: Expertise and Gell-Mann Amnesia

If you don’t know a fair bit about software engineering, you won’t be able to distinguish good from bad software engineers.

(Assuming stats from a couple years ago are still representative, at least half my readers can probably confirm this from experience. On the other hand, last time I brought this up, one commenter said something along the lines of “Can’t we test whether the code works without knowing anything about programming?”. Would any software engineers like to explain in the comments why that’s not the only key question to ask?)

Similarly, consider Gell-Mann Amnesia:

You open the newspaper to an article on some subject you know well. In Murray’s case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the “wet streets cause rain” stories. Paper’s full of them.

In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.

I think there’s a similar effect for expertise: software engineers realize that those outside their field have difficulty distinguishing good from bad software engineers, but often fail to generalize this to the insight that non-experts in most fields have difficulty distinguishing good from bad practitioners. There are of course some general-purpose tricks (and they are particularly useful expertise to have), but they only get you so far.

The difficulty of distinguishing good from bad experts breaks the iterative design loop at a meta level. We realize that we might not be asking the right questions, our object-level design loop might not suffice, so we go consult some experts. But then how do we iterate on our experts? How do we find better experts, or create better experts? Again, there are some general-purpose tricks available, but they’re limited. In general, if we cannot see when there’s a problem with our expert-choice, we cannot iterate to fix that problem.

More Fundamental: Getting What We Measure

I’m just going to directly quote Paul’s post on this one:

If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.

But if I want to help Bob figure out whether he should vote for Alice - whether voting for Alice would ultimately help create the kind of society he wants - that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.

Some examples of easy-to-measure vs. hard-to-measure goals:

  • Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)
  • Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.
  • Improving my reported life satisfaction, vs. actually helping me live a good life.
  • Reducing reported crimes, vs. actually preventing crime.
  • Increasing my wealth on paper, vs. increasing my effective control over resources.

If I want to help Bob figure out whether he should vote for Alice, that can’t be done by trial and error. That really gets at the heart of why the iterative design loop is unlikely to suffice for alignment, even though it works so well in so many other fields. In other fields, we usually have a pretty good idea of what we want. In alignment, figuring out what we want is itself a central problem. Trial and error doesn’t suffice for figuring out what we want.

So what happens if we rely on trial and error to figure out what we want? More from Paul’s post:

We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:

  • Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.
  • Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.
  • Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.
  • Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.

For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.

As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.

[...]

We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory.
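The “proxies come apart under optimization” dynamic has a standard minimal model (sometimes called regressional Goodhart); here’s a sketch with invented numbers: the measurable proxy equals the true value plus independent noise, and we pick whichever candidate scores best on the proxy. The harder we optimize (the more candidates we screen), the more the winner’s proxy score is explained by noise rather than true value.

```python
import random
import statistics

random.seed(2)

def goodhart_gap(n_candidates, trials=2000):
    """Average (proxy score - true value) of the candidate chosen for its
    proxy score, where proxy = true + noise (independent standard normals)."""
    gaps = []
    for _ in range(trials):
        candidates = [(random.gauss(0, 1), random.gauss(0, 1))
                      for _ in range(n_candidates)]
        true, noise = max(candidates, key=lambda c: c[0] + c[1])
        gaps.append(noise)  # proxy minus true value of the chosen candidate
    return statistics.mean(gaps)

# Screening more candidates = optimizing harder on the proxy.
gaps = {n: goodhart_gap(n) for n in (2, 10, 100)}
for n, g in gaps.items():
    print(n, round(g, 2))  # the gap grows as optimization pressure grows
```

With real optimizers the noise term isn’t random but adversarially shaped, which only widens the gap; the toy just shows that even honest selection on a proxy systematically overstates what we actually get.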

Summary & Takeaways

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. Fast takeoff and deceptive inner misalignment are two widely-talked-about potential failure modes, but they’re not the only ones. I wouldn’t consider either of them among the most robust ways in which the design loop fails, although they are among the most obviously and immediately dangerous failures.

Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it makes problems less visible rather than more visible. If we can’t see the problem, we can’t iterate on it.

A more complicated and less legible class of design loop failures is not knowing what to look for. We might just not ask the right questions (as in the fusion power generator example), we might not even have enough background knowledge to recognize the right questions when they’re pointed out (as in the medieval lord example), it might take a very long time to get feedback on the key problems (as in the leaded gasoline example), and at a meta level we might not have the expertise to distinguish real experts from non-experts when seeking advice (as in Gell-Mann Amnesia).

Finally, we talked about Paul’s “You Get What You Measure” scenario. As Paul put it: “If I want to help Bob figure out whether he should vote for Alice - whether voting for Alice would ultimately help create the kind of society he wants - that can’t be done by trial and error.” That really captures the core reason why an iterative design loop is likely to fail for alignment, despite working so well in so many other fields: in other fields, we usually know what we want and are trying to get it. In alignment, figuring out what we want is itself a central problem, and the iterative design loop does not suffice for figuring out what we want.

Comments (18)

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. ... Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it makes problems less visible rather than more visible. If we can’t see the problem, we can’t iterate on it.

This argument is structurally invalid, because it sets up a false dichotomy between "iterative design loop works" and "iterative design loop fails". Techniques like RLHF do some work towards fixing the problem and some work towards hiding the problem, but your bimodal assumption says that the former can't move us from failure to success. If you've basically ruled out a priori the possibility that RLHF helps at all, then of course it looks like a terrible strategy!

By contrast, suppose that there's a continuous spectrum of possibilities for how well iterative design works, and there's some threshold above which we survive and below which we don't. You can model the development of RLHF techniques as pushing us up the spectrum, but then eventually becoming useless if the threshold is just too high. From this perspective, there's an open question about whether the threshold is within the regime in which RLHF is helpful; I tend to think it will be if not overused.

The argument is not structurally invalid, because in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF. Working on RLHF does not particularly increase our chances of survival, in the worlds where RLHF doesn't make things worse.

That said, I admit that argument is not very cruxy for me. The cruxy part is that I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1. And I think the various examples/analogies in the post convey my main intuition-sources behind that claim. In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level, just flagging that this is very far from the type of thinking that can justifiably get you anywhere near those levels of confidence.

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

Wrong question. The point is not that RLHF can't be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.

Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that's iterative design from human feedback baked into everyday economic incentives. Engineers notice problems in the things they're building, open bugs in whatever tracking software they're using, and eventually fix them; that's iterative design baked into everyday engineering workflows. Companies hire people to test out their products, see what problems come up, then fix them; that's iterative design baked into everyday processes. And to a large extent, the fixes will occur by collecting problem-cases and then training them away, because ML engineers already have that affordance; it's one of the few easy ways of fixing apparent problems in ML systems. That will all happen regardless of whether any alignment researchers work on RLHF.

When I say that "in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF", that's what I'm talking about. Problems which RLHF can solve (i.e. problems which are easy for humans to notice and then train away) will already be solved by default, without any alignment researchers working on them. So, there is no counterfactual value in working on RLHF, even in worlds where it basically works.

I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.

Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those "middle worlds" are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.

Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode, you essentially think it's unimodal on the "iterative design fails" worlds.

I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal on iterative design fails" hypothesis, but I think that it's important to be clear about which you're defending - e.g. because if you were defending the former, then I'd want to dig into what you thought the first mode would actually look like and whether we could extend it to harder cases, whereas I wouldn't if you were defending the latter.

Yeah, that's fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions.

What I actually think is that:

  • nearly all the probability mass is on worlds the iterative design loop fails to align AGI, but...
  • conditional on that being wrong, nearly all the probability mass is on the number of bits of optimization from iterative design resulting from ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. ("We are unlikely to hit/miss by a little bit" is the more general slogan.)

The second claim would be cruxy if I changed my mind on the first, and requires fewer assumptions, and therefore fewer inductive steps from readers' pre-existing models.

In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop" rather than "does it work or does it fail"?

My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that conditional on the first being wrong, then the second is not very action-guiding. E.g. conditional on the first, then the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.

(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment - this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let's say I were somehow approximately-100% convinced that it's basically possible for iterative design to produce an aligned AI. Then I'd expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I'd expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool - otherwise they'd already be handled by the default iterative design processes.

So I guess at that point I'd be looking at quantitative usefulness of iterative design, rather than binary.

General point: it's just really hard to get a situation where "do marginally more of the thing we already do lots of by default" is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.


Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.

I would have said "irreversible catastrophe", not "fast takeoff". Isn't that the real problem? Iterative design presumably gets you a solution eventually, if one exists, but it's not guaranteed to get you a solution after N iterations, where N is some number determined ex ante. In extreme fast takeoff, we need to solve alignment with N=0 iterations. In slow takeoff (in a competitive uncoordinated world), we need to succeed within N<(whatever) iterations. The latter is less bad than the former, but as long as there's a deadline, there's a chance we'll miss it.

(Slow takeoff is a bit worse than that because arguably as the AIs get gradually more capable, the problem keeps changing; you're not iterating on exactly the same problem for the whole takeoff.)

You are correct. I was trying to list the two frames which I think people most often use, not necessarily the best versions of those frames, since I wanted to emphasize that there are lots of other ways the iterative design loop fails.

Nice post!

What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans? 

Perhaps relatedly:

Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned into a bomb?" and the AI says, "No, it's safe for such and such reasons", then if you can't evaluate these reasons, it doesn't help you that you have asked the right question.

My response to the "get the AI to tell us what questions we need to ask" is that it fails for multiple reasons, any one of which is sufficient for failure. One of them is the verifiability issue. Another is the Gell-Mann Amnesia thing (which you could view as just another frame on the verifiability issue, but up a meta level). Another is the "get what we measure" problem.

Another failure mode which this post did not discuss is the Godzilla Problem. In the frame of this post: in order to work in practice the iterative design loop needs to be able to self-correct; if we make a mistake at one iteration it must be fixable at later iterations. "Get the AI to tell us what questions we need to ask" fails that test; just one iteration of acting on malicious advice from an AI can permanently break the design loop.

I wonder, almost just idle curiosity, whether the "measuring-via-proxy will cause value drift" problem is something we could formalize and iterate on first. Is the problem stable at the meta-level, or is there a way we can meaningfully define "not drifting from the proxy" without just generally solving alignment?

Intuitively I'd guess this is the "don't try to be cute" class of thought, but I was afraid to post at all and decided that I wanted to interact, even at the cost of (probably) saying something embarassing.

“Can’t we test whether the code works without knowing anything about programming?”

Knowing what to test to reliably decrease uncertainty about "whether the code works" includes knowing "a fair bit" about software engineering.

I agree with the distinction that being a programmer is not the only way to know about programming; many hiring managers are not programmers themselves, they just have to know a fair bit about software engineering.

I wonder if the implications of this kind of reasoning go beyond AI: indeed, you mention the incentive structure for AI as just a special case of failing to incentivize people properly (e.g. the software executive), the only difference being that AI operates at a scale which has the potential to drive extinction. But even in this respect, AI doesn't really seem unique: take the economic system as a whole, and "green" metrics as a way to stave off catastrophic climate change. Firms, with the power to extinguish human life through slow processes like gradual climate destruction, will become incentivized towards methods of pollution that are easier to hide as regulations on carbon and greenhouse gases become more stringent. This seems like just a problem of an error-prone humanity gaining greater and greater control over our planet, with our technology and metrics, as reflections of this, also being error-prone, only with greater and greater consequences for any given error.

Also, what do you see, more concretely, as a solution to this iterative problem? You suggest, for example, coming up with the right formalism for what we want as a way to do this, but that procedure is ultimately also iterative: we inevitably fail to specify our values correctly on some subset of scenarios, and then your reasoning applies equally to the meta-iteration procedure of specifying values and waiting to see what it does in real systems. Whether with RL from human feedback or RL from human formalism, a sufficiently smart agent deployed on a sufficiently hard task will always find unintended easy ways to optimize an objective, and hide them, rather than solving the original task. Asking that we "get it right", and figure out what we want, seems kind of equivalent to waiting for the right iteration of human feedback, except on a different iteration pipeline (which, to me, doesn't seem fundamentally different at the AGI scale).