AI will change the world, but won’t take it over by playing “3-dimensional chess”.

benedelman

Thanks for posting, I thought this was interesting and reasonable.

Some points of agreement:

I think many of these are real considerations suggesting that the risk is lower than it might otherwise appear.
I agree with your analysis that short-term and well-scoped decisions will probably tend to be a comparative advantage of AI systems.
I think it can be productive to explicitly focus on “narrow” systems (which pursue scoped short-term goals, without necessarily having specifically limited competence) and to lean heavily on the verification-vs-generation gap.
I think these considerations together with a deliberate decision to focus on narrowness could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal.
I think that it's unrealistic for AI systems to rapidly improve their own performance without limits. Relatedly, I sympathize with your skepticism about the story of a galaxy-brained AI outwitting humanity in a game of 3-dimensional chess.

My most important disagreement is that I don’t find your objections to hypothesis 2 convincing. I think the biggest reason for this is that you are implicitly focusing on a particular mechanism that could make hypothesis 2 true (powerful AI systems are trained to pursue long-term goals because we want to leverage AI systems’ long-horizon planning ability) and neglecting two other mechanisms that I find very plausible. I’ll describe those in two child comments so that we can keep the threads separate. Out of your 6 claims, I think only claim 2 is relevant to either of these other mechanisms.

I also have some scattered disagreements throughout:

So far it seems extremely difficult to extract short-term modules from models pursuing long-term goals. It’s not clear how you would do it even in principle and I don’t think we have compelling examples. The AlphaZero -> Stockfish situation does not seem like a successful example to me, though maybe I'm missing something about the situation. So overall I think this is worth mentioning as a possibility that might reduce risk (alongside many others), but not something that qualitatively changes the picture.
I’m very skeptical about your inference from “CEOs don’t have the literal highest IQs” to “cognitive ability is not that important for performance as a CEO,” and even more so about jumping all the way to “cognitive ability is not that important for long-term planning.” I think that (i) competent CEOs are quite smart even if not in the tails of the IQ distribution, (ii) there are many forms of cognitive ability which are only modestly correlated, and so the tails come apart, (iii) there are huge amounts of real-world experience that drive CEO performance beyond cognitive ability, (iv) CEO selection is not perfectly correlated with performance. Given all of that, I think you basically can’t get any juice out of this data. If anything I would say the high compensation of CEOs, their tendency to be unusually smart, and skill transferability across different companies seem to provide some evidence that CEO cognitive ability has major effects on firm performance (I suspect there is an economics literature investigating this claim). Overall I thought this was the weakest point of the article.
While I agree there are fundamental computational limits to performance, I don’t think they qualitatively change the picture about the singularity. This is ultimately a weedsy quantitative question and doesn’t seem central to your point so I won't get into it, but I’d be happy to elaborate if it feels like an important disagreement. I also don’t think the scaling laws you cite support your claim; ultimately the whole point is that the (compute vs performance) curves tend to fall with further R&D.
I would agree with the claim “more likely than not, AI systems won’t take over the world.” But I don’t find <50% doom very comforting! Indeed my own estimate is more like 10-20% (depending on what we are measuring) but I still consider this a plurality of total existential risk and a very appealing thing to work on. Overall I think most of the considerations you raise are more like quantitative adjustments to these probabilities, and so a lot depends on what is in fact baked in or how you feel about the other arguments on offer about AI takeover (in both directions).
I think you are greatly underestimating the difficulty of deterrence and prevention. If AI systems are superhuman for short-horizon tasks, it seems like humans would become reliant on AI help to prevent or contain bad behavior by other AIs. But if there are widespread alignment problems, then the AI systems charged with defending humans may instead join in to help disempower humanity. Without progress on alignment it seems like we are heading towards an increasingly unstable world. The situation is quite different from preventing or deterring human “bad actors;” amongst humans the question is how to avoid destructive negative-sum behavior, whereas in the hypothetical situation you are imagining vast numbers of AIs who are doing almost all the work and don't care about human flourishing, yet somehow trying to structure society so that it nevertheless leads to human flourishing.

paulfchristiano (3y)

Mechanism 2: deceptive alignment

Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.

As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”

So gradient descent on the short-term loss function can easily push towards long-term goals (in fact it would both push towards the precise short-term goals that result in low loss and arbitrary long-term goals, and it seems like a messy empirical question which one you get). This might not happen early in training, but eventually our model is competent enough to appreciate these arguments and perhaps for it to be extremely obvious to it that it should avoid taking actions that would be penalized by training.

It doesn’t seem like there are any behavioral checks we can do to easily push gradient descent back in the other direction, since an agent that is trying to get a low loss will always just adopt whatever behavior is best for getting a low loss (as long as it thinks it is on the training distribution).

This is all true even if my AI has subhuman long-horizon reasoning. Overall my take is maybe that there is a 25% chance that this becomes a serious issue soon enough to be relevant to us and is resistant to simple attempts to fix it (though it’s also possible we will fail to even competently implement simple fixes). I expect to learn much more about this as we start engaging with AI systems intelligent enough for it to be a potential issue over the next 5-10 years.

This issue is discussed here. Overall I think it’s speculative but plausible.

benedelman (3y)

I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it's very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it's more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are the first few specific reasons to doubt the scenario which come to mind right now:

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term. It's imaginable that the goal is a mesa-objective which is mixed in some inescapably non-modular way with the rest of the system, but then it would be surprising to me if the system's behavior could really be best characterized as optimizing this single objective, as opposed to applying a bunch of heuristics, some of which involve pursuing mesa-objectives and some of which don't fit into that schema. So perhaps framing everything the agent does in terms of objectives isn't the most useful framing (?).
If an agent has a long-term objective, for which achieving the desired short-term objective is only instrumentally useful, then in order to succeed the agent needs to figure out how to minimize the loss by using its reasoning skills (by default, within a single episode). If, on the other hand, the agent has an appropriate short-term objective, then the agent will learn (across episodes) how to minimize the loss through gradient descent. I expect the latter scenario to typically result in better loss for statistical reasons, since the agent can take advantage of more samples. (This would be especially clear if, in the training paradigm of the future, the competence of the agent increases during training.)

(There's also the idea of imposing a speed prior; not sure how likely that direction is to pan out.)

Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.

Vivek Hebbar (3y)

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?

paulfchristiano (3y)

Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.

I think the situation is much better if deceptive alignment is inconsistent. I also think that's more likely, particularly if we are trying.

That said, I don't think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1) and so selecting for things that work means you are selecting for deceptive alignment.

paulfchristiano (3y)

Mechanism 1: Shifting horizon length in response to short-horizon tampering

Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.

(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an economy in parallel as AI systems take on an extremely wide range of tasks.)

How do I train that system to use its understanding to write good code? Here are two simple options:

Process-based: Look at the AI’s code, have the AI explain why it made these decisions, and evaluate everything on paper.
Outcomes-based: Run the code, monitor resource usage, see what users say in the first hour after deployment.

Process-based feedback potentially handicaps my AI (even if it is only superhuman on short-horizon tasks). It’s not clear how large this handicap is, but I think our experience is that “actually run your engineer’s code” is an extremely helpful technique for evaluating it in practice, and it becomes more important the better your engineers are and the less able you are to evaluate the decisions they made.

So without some kind of technical progress I’m afraid we may often be pushed to use outcomes-based feedback to make our systems work better.

Unfortunately outcomes-based feedback creates an adversarial interaction between me and my AI. In addition to selecting for AIs that write good code, I’m selecting for AIs that use their short-horizon abilities to write software that covers up evidence of trouble and tampers with logs, that fakes or destroys user reports or manipulates users, and that generally messes with every mechanism we try to use to understand what’s going on. (Depending on exactly what goal the system internalized, it might also be motivated to write code that directly tampers with data fed back into the ML training process.)

Maybe we can make all of those things impossible with good enough security, so that the only way for code to look good is to actually be good. But now we are setting up a larger-scale, persistent conflict between humans and the AI systems they delegate work to. And many different AI systems could potentially all simultaneously achieve their separate short-term tasks by defecting and undermining human evaluations, making it harder to use AI evaluators to reduce the risk of tampering (or creating an increasingly unstable situation if we try to go that route).

I think small versions of sensor tampering are likely to happen and are not a big deal. I’m scared about the world where this kind of failure becomes a big pain and seriously impacts the usefulness of AI. Maybe I’d wildly guess a 40% chance that we end up with a significant problem like this if we keep building ML systems using roughly the same techniques we use today (and that those problems arise soon enough to be relevant).

There are many good things we could do to fix that problem, and in some sense it would be great news for AI alignment since we’d have a concrete technical problem to iterate on.

But there is also one very bad way we could fix this problem. We could notice incidents of tampering much later, e.g. by talking to users out-of-band or noticing we aren’t getting return traffic or seeing that resource usage measurements were clearly faked. Then we can label those incidents as examples of what not to do and throw that data back into the training process.

This will quickly select against policies that do this kind of short-term tampering, but it could do so in two ways:

The AI learned not to tamper with our measurements.
The AI learned not to tamper with our measurements in any way we’d catch later.

In the second case, we’ve now created an AI with long-term goals even though all we really wanted was a short-term task, and we have a similar adversarial dynamic at a larger scale. And even if the AI has merely human-level competence at long-horizon tasks, it will still try to use its short-term competencies to prevent us from ever detecting and correcting its mistakes. If we keep training out every problem at a scale that we can correct, we will eventually be left with total failures that we can’t correct (e.g. because AI systems have permanently disempowered us).

I think it’s best to view this as a kind of overfitting: we wanted to get an AI that doesn’t do anything bad, but if we train on “did we eventually notice that it did something bad?” then we will end up with a system that is overfit to that objective. And just like with overfitting, you could also imagine it happening without explicitly training on the offending data. If 5% of AI systems have long-term goals and 95% have short-term goals, you could end up with the 95% consistently behaving badly by taking short-term opportunities to tamper with our measurements, and so we would select for the 5% with long-term goals.
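
The 5% / 95% selection effect can be made concrete with a toy calculation. This is only a sketch under stylized assumptions that are not in the comment itself: each short-term system that tampers is caught and discarded independently with probability `catch_prob` per evaluation round (a hypothetical parameter), and systems with long-term goals are never caught. The function name is likewise made up for illustration.

```python
def long_term_fraction(prior_long: float, catch_prob: float, rounds: int) -> float:
    """Share of surviving systems that have long-term goals after filtering.

    Assumptions (illustrative only):
      - long-term systems are never caught tampering, so all of them survive;
      - each short-term system survives a round with probability 1 - catch_prob.
    """
    surviving_long = prior_long
    surviving_short = (1.0 - prior_long) * (1.0 - catch_prob) ** rounds
    return surviving_long / (surviving_long + surviving_short)


if __name__ == "__main__":
    # Start from the 5% / 95% split used in the text.
    for rounds in (0, 1, 5, 10):
        share = long_term_fraction(prior_long=0.05, catch_prob=0.5, rounds=rounds)
        print(f"rounds={rounds:2d}  long-term share={share:.3f}")
```

Under these assumptions the surviving population shifts toward long-term goals even though nothing was ever trained on the offending data; how fast it shifts depends entirely on the assumed catch probability.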

benedelman (3y)

My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons:

1. A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well. This feels unlikely unless the AIs have an extremely high level of proficiency in hiding their tampering, so that the poor performance on the intended objective only comes back to bite the AI's employers once society is permanently disempowered by AI. Nigh-infallibly covering up tampering sounds like a very difficult task even for an AI that is super-human. I would expect at least some of the negative downstream effects of the tampering to slip through the cracks and for people to be very alarmed by these failures.
2. The consensus opinion is that your concern is real, but organizations still rely on outcome-based feedback in these situations anyway because if they don't they will be outcompeted in the short term by organizations that do. Maybe governments even try to restrict unsafe use of outcome-based feedback through regulation, but the regulations are ineffective. I'll need to think about this scenario further, but my initial objection is the same as my objection to reason 1: the scenario requires the actual tampering that is actually happening to be covered up so well that corporate leaders etc. think it will not hurt their bottom line (either through direct negative effects or through being caught by regulators) in expectation in the future.

Which of 1 and 2 do you think is likely? And can you elaborate on why you think AIs will be so good at covering up their tampering (or why your story stands up to tampering sometimes slipping through the cracks)?

Finally, if there aren't major problems resulting from the tampering until "AI systems have permanently disempowered us", why should we expect problems to emerge afterwards, unless the AI systems are cooperating / don't care about each other's tampering?

(Am I right that this is basically the same scenario you were describing in this post? https://www.alignmentforum.org/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story)

leogao (3y)

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ("chess was never actually hard!"), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that, of the large set of things currently claimed to be what will cause people to take alignment seriously, most will either be ignored by most people once they happen, or never happen before catastrophic failure.

We can also see analogous dynamics in e.g. climate change, where even given decades of hard numbers and tangible physical phenomena, large numbers of people (and importantly, major polluters) still reject its existence, many interventions are undertaken which only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.

I expect the alignment situation to be strictly worse because a) I expect the most egregious failures to only come shortly before AGI, so once evidence as robust as that for climate change arrives (i.e. literally catching AIs red-handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left; b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem; c) in particular, training away failures in ways that don't solve the underlying problems (i.e. incentivizing deception) is an extremely attractive option and there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved; d) 80% of the tech for solving climate change basically already exists or is within reach, and society basically just has to decide that it cares, and the cost to society is legible, whereas for alignment we have no idea how to solve the technical problem, or even how that solution will vaguely look, which makes it a harder sell to society; e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger; f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time to market disadvantage, etc.

benedelman (3y)

Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question:

To what extent will the costs of misalignment be borne by the direct users/employers of AI?

Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren't the case, then it wouldn't be a problem, for the reasons you've mentioned!

I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true. My gut intuition, though, is that employing a misaligned AI is less like "emitting CO2 into the atmosphere" and more like "employing a very misaligned human employee" or "using shoddy accounting practices" or "secretly taking sketchy shortcuts on engineering projects in order to save costs"—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term.

We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening.

I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone's bottom line). There's lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don't).

leogao (3y)

I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):

We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says "yes, AI safety is Very Important". Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something the AI stops exhibiting bad behaviour (because it's pretending to be aligned). Some people are complaining that this doesn't actually make it aligned, but they're ignored or given a token mention. A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don't have to somehow magically all coordinate to not accidentally reveal deception, because even in cases where models fail in dangerous ways people chalk this up to the techniques not being perfect, but they're being iterated on, etc. Heck, humans commit fraud all the time and yet it doesn't cause people to suddenly stop trusting everyone they know when a high profile fraud case is exposed. And locally there's always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really dig deep and figure out why it's happening. Also, a lot of people will have a vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as being fringe. Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens.

At every point through this story, the local incentive for most businesses is to do whatever it takes to make the AI stop committing accounting fraud or whatever, not to try and stave off a hypothetical long term catastrophe. A real life example that this is analogous to is antibiotic overuse.

This story does hinge on "sweeping under the rug" being easier than actually properly solving alignment, but if deceptive alignment is a thing and is even moderately hard to solve properly then this seems very likely the case.

I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people

I predict that for most operationalizations of "actually hurt people", the result is that the right problems will not be paid attention to. And I don't expect lightning fast takeoff to be necessary. Again, in the case of climate change, which has very slow "takeoff", millions of people are directly impacted, and yet governments and major corporations move very slowly and mostly just say things about climate change mitigation being Very Important while making token paper-straw efforts. Deceptive alignment means that there is a very attractive easy option that makes the immediate crisis go away for a while.

But even setting aside the question of whether we should even expect to see warning signs, and whether deceptive alignment is a thing, I find it plausible that even the response to a warning sign that is as blatantly obvious as possible (an AI system tries to take over the world, fails, kills a bunch of people in the process) just results in front page headlines for a few days, some token statements, a bunch of political squabbling between people using the issue as a proxy fight for the broader "tech good or bad" narrative and a postmortem that results in patching the specific things that went wrong without trying to solve the underlying problem. (If even that; we're still doing gain of function research on coronaviruses!)

benedelman (3y)

Thank you for the insightful comments!! I've added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz's):

I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.)
I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impact of intelligence on long-term strategic planning in complex systems, and this is a major source of my uncertainty about whether our thesis is true. Note that even if making CEOs smarter would improve their performance, it may still be the case that any intelligence boost is fully substitutable by augmentation with advanced short-term AI systems.
From published results I've seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I'm very curious whether you've seen results that suggest otherwise (I wouldn't be surprised if this were the case, the examples I've seen are very limited, and I'd love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no "floor" to hypothetically achievable scaling laws.
I agree that our argument should result in a quantitative adjustment to some folks' estimated probability of catastrophe, rather than ruling out catastrophe entirely, and I agree that figuring out how to handle worst-case scenarios is very productive.
When you say "the AI systems charged with defending humans may instead join in to help disempower humanity", are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
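
To spell out the scaling-law point above (a difference in the coefficient, with the exponent unchanged, can be offset by a constant compute multiplier), here is a minimal sketch. It assumes a pure power law L(C) = coeff * C^(-alpha) with no irreducible-loss term; the coefficients and the exponent below are hypothetical placeholders, not values taken from Kaplan et al. or any other paper.

```python
def loss(compute: float, coeff: float, alpha: float) -> float:
    """Stylized power-law loss curve L(C) = coeff * C**(-alpha)."""
    return coeff * compute ** (-alpha)


def compute_multiplier(coeff_worse: float, coeff_better: float, alpha: float) -> float:
    """Factor k such that loss(k * C, coeff_worse) == loss(C, coeff_better) for all C.

    Derivation: coeff_worse * (k*C)**(-alpha) = coeff_better * C**(-alpha)
                =>  k = (coeff_worse / coeff_better) ** (1 / alpha)
    """
    return (coeff_worse / coeff_better) ** (1.0 / alpha)


if __name__ == "__main__":
    alpha = 0.05  # hypothetical exponent shared by both architectures
    k = compute_multiplier(coeff_worse=1.2, coeff_better=1.0, alpha=alpha)
    c = 1e6       # arbitrary compute budget, in arbitrary units
    print(f"multiplier k = {k:.1f}")
    print(f"better arch at C:     {loss(c, 1.0, alpha):.6f}")
    print(f"worse arch at k * C:  {loss(k * c, 1.2, alpha):.6f}")
```

Because the multiplier is (coeff_worse / coeff_better)^(1/alpha), it does not depend on C, which is the sense in which the gap is "a multiplicative constant"; with a small exponent, though, even a modest coefficient gap can translate into a large constant.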

LawrenceC (3y)

Given all of that, I think you basically can’t get any juice out of this data. If anything I would say the high compensation of CEOs, their tendency to be unusually smart, and skill transferability across different companies seem to provide some evidence that CEO cognitive ability has major effects on firm performance (I suspect there is an economics literature investigating this claim).

There are a few, for example the classic "Are CEOs Born Leaders?", which uses the same Swedish data and finds a linear relationship of cognitive ability with both log company assets and log CEO pay, though it also concludes that the effect isn't super large. The main reason there aren't more is that we generally don't have good cognitive data on most CEOs. (There are plenty of studies looking at educational attainment or other proxies.) You can see this trend in the Dal Bó et al. table cited in the main post as well.

(As an aside, I'm a bit worried about the Swedish dataset, since the cognitive ability of Swedish large-firm CEOs is lower than Herrnstein and Murray (1996)'s estimated cognitive ability of 12.9 million Americans in managerial roles. Maybe something interesting happens with CEOs in Sweden?)

It is very well established that certain CEOs are consistently better than others, i.e. CEO level fixed effects matter significantly to company performance across a broad variety of outcomes.

Vanessa Kosoy (3y)

IIUC the thesis of this article rests on several interrelated claims:

1. Long-term planning is not useful because of chaos
2. Short-term AIs have no alignment problem
3. Among humans, skill is not important for leadership, beyond some point
4. Human brains have an advantage w.r.t. animals because of "universality", and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is that in many dynamical systems with compact phase space, any distribution converges (in the Kantorovich-Rubinstein sense) to a unique stationary distribution. This means that small measurement errors lead to large prediction errors, and in the limit no information from the initial condition remains.
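
As a small illustration of "small measurement errors lead to large prediction errors", the sketch below uses the logistic map x -> r * x * (1 - x) with r = 4, a standard toy chaotic system on the bounded interval [0, 1]. The choice of map, starting point, and perturbation size are assumptions made for illustration only; none of them come from the article or this comment.

```python
def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate the logistic map x -> r * x * (1 - x), returning steps + 1 values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def separation(x0: float, eps: float, steps: int) -> list[float]:
    """Absolute gap at each step between trajectories started at x0 and x0 + eps."""
    base = logistic_trajectory(x0, steps)
    perturbed = logistic_trajectory(x0 + eps, steps)
    return [abs(a - b) for a, b in zip(base, perturbed)]


if __name__ == "__main__":
    # Two initial conditions that differ by a tiny "measurement error".
    gaps = separation(x0=0.2, eps=1e-10, steps=100)
    for step in (0, 5, 10, 20, 40, 60, 100):
        print(f"step {step:3d}: gap = {gaps[step]:.3e}")
```

The gap stays tiny for the first few steps and then grows until it is comparable to the size of the whole interval, at which point the initial condition tells you essentially nothing. Note that this toy system has a compact phase space, which is exactly the condition the next paragraph argues often fails for quantities like resources and technology.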

However, real-world dynamical systems are often not compact in the relevant approximation. In particular, acquisition of resources and development of new technologies are not bounded from above on a relevant scale. Indeed, trends in GDP growth and technological progress continue over long time scales and haven't converged, so far, to a stationary distribution. Ultimately, these quantities are also bounded for physical / information-theoretic / complexity-theoretic reasons, but since humanity is pretty far from saturating them, this leaves ample room for AI to have a long-term planning advantage over humanity.
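A minimal sketch of the contrast drawn above, under loudly stated assumptions: the logistic map stands in for a bounded chaotic system, and a fixed 3% compounding rule stands in for an unbounded growth process. Both rules, the starting value, and the size of the measurement error are illustrative choices, not anything taken from the post or from real data.

```python
# Toy contrast (illustrative assumptions, not from the original post):
#  - logistic map x -> 4x(1-x): bounded (compact) and chaotic
#  - compounding growth x -> 1.03x: unbounded, not chaotic

def iterate(step, x0, n):
    """Return the trajectory [x0, step(x0), ...] of length n + 1."""
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1]))
    return xs

def logistic(x):
    return 4.0 * x * (1.0 - x)

def growth(x):
    return 1.03 * x

x0, eps, n = 0.2, 1e-10, 80  # eps plays the role of a small measurement error

# Bounded chaotic system: the tiny initial error grows to order one.
a = iterate(logistic, x0, n)
b = iterate(logistic, x0 + eps, n)
chaotic_gaps = [abs(p - q) for p, q in zip(a, b)]

# Unbounded growth: the relative error stays close to where it started.
c = iterate(growth, x0, n)
d = iterate(growth, x0 + eps, n)
relative_gaps = [abs(p - q) / p for p, q in zip(c, d)]

print("largest logistic-map gap:", max(chaotic_gaps))
print("relative growth gap, first vs last:", relative_gaps[0], relative_gaps[-1])
```

In the bounded map the initial condition is effectively forgotten, while in the growth rule the long-run trend remains predictable from the starting point. Real economies are of course neither of these, so this only illustrates the logical distinction.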

Claim 2

Although it is true that, for sufficiently short-term planning horizons, AIs have less incentive to produce unintended consequences, problems remain.

One problem is that some tasks are very difficult to specify. For example, suppose that a group of humans armed with short-term AIs is engaged in cyberwarfare against a long-term AI. Then, even if every important step in the conflict can be modeled as short-term optimization, specifying the correct short-term goal can be a non-trivial task (how do you define "to hack" or "to prevent from hacking"?) that humans can't easily point their short-term AI towards.

Moreover, AIs trained on short-term objectives can still display long-term optimization out-of-distribution. This is because a long-term optimizer that is smart enough to distinguish between training and deployment can behave according to expectations during training while violating them as much as it wants when it's either outside of training or the correcting outer loop is too slow to matter.

Claim 3

This claim flies so much in the face of common sense (is there no such thing as business acumen? charisma? military genius?) that it needs a lot more supporting evidence IMO. The mere fact that IQs of e.g. CEOs are only moderately above average and not far above average only means that IQ stops being a useful metric at that range, since beyond some point, different people have cognitive advantages in different domains. I think that, as scientists, we need to be careful of cavalierly dismissing the sort of skills we don't have.

As to the skepticism of the authors about social manipulation, I think that anyone who studied history or politics can attest that social manipulation has been used, and continues to be used, with enormous effects. (Btw, I think it's probably not that hard to separate a dog from a bone or child from a toy if you're willing to e.g. be completely ruthless with intimidation.)

Claim 4

While it might be true that there is a sense in which human brains are "qualitatively optimal", this still leaves a lot of room for quantitative advantage, similar to how among two universal computers, one can be vastly more efficient than the other for practical purposes. As a more relevant analogy, we can think of two learning algorithms that learn the same class of hypotheses while still having a significant difference in computational and/or sample efficiency. In the limit of infinite resources and data, both algorithms converge to the same results, but in practice one still has a big advantage over the other. While undoubtedly there are hard limits to virtually every performance metric, there is no reason to believe evolution brought human brains anywhere near those limits. Furthermore, even if "scaling with resources" is the only thing that matters, the ability of AI to scale might be vastly better than the ability of humans to scale because of communication bandwidth bottlenecks between humans, not to mention the limited trust humans have towards one another (as opposed to large distributed AI systems, or disparate AI systems that can formally verify each other's trustworthiness).

[-]boazbarak3y80

Hi Vanessa,

Let me try to respond (note the claim numbers below are not the same as in the essay, but rather as in Vanessa's comment):

Claim 1: Our claim is that one can separate out components - there is the predictable component which is non stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

Claim 2: Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities. To hack would be to come up with a security vulnerability, and exploit code, which can be verified. Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

Claim 3: I would not say that there is no such thing as talent in being a CEO or president. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs has led Apple to become the largest company in the world, but it is not clear that he is a "universal CEO" that would have done as well in any company (indeed he failed with NeXT). Similarly, Abraham Lincoln is typically ranked as the best U.S. president by historians, but again I think most would agree that he fit well the challenge that he had to face, rather than being someone that would have just as well handled the cold war or the 1970s energy crisis. Also, as Yafah points out elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In fact, most people's initial suspicion is that AIs (or even humans that don't look like them) are not "aligned" with their interests, and if you don't convince them otherwise, their default would be to keep them from positions of power.

Claim 4: The main point is that we need to measure the powers of a system as a whole, not compare the powers of an individual human with an individual AI. Clearly, if you took a human, made their memory capacity 10 times bigger, and made their speed 10 times faster, then they could do more things. But we are comparing with the case that humans will be assisted with short-term AIs that would help them in all of the tasks that are memory and speed intensive.

[-]Vanessa Kosoy3y48

Thanks for the responses Boaz!

Our claim is that one can separate out components - there is the predictable component which is non stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

I will look into analysis of boolean functions, thank you. However, unless you want to make your claim more rigorous, it seems suspect to me.

In reality, there are processes happening simultaneously on many different timescales, from the microscopic to the cosmological. And, these processes are coupled, so that the current equilibrium of each process can be regarded as a control signal for the higher timescale processes. This means we can do long-term planning by starting from the long timescales and back-chaining to short timescales, like I began to formalize here.

So, while eventually the entire universe reaches an equilibrium state (a.k.a. heat-death), there is plenty of room for long-term planning before that.

Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities. To hack would be to come up with a security vulnerability, and exploit code, which can be verified.

Yeeees, it does seem like hacking is an especially bad example. But even in this example, my position is quite defensible. Yes, theoretically you can formally specify the desired behavior of the code and verify that it always happens. But, there are two problems with that: First, for many realistic software systems, the formal specification would require colossal effort. Second, the formal verification is only as good as the formal model. For example, if the attacker found a hardware exploit, while your model assumes idealized behavior for the hardware, the verification doesn't help. And, in domains outside software the situation is much worse: how do you "verify" that your biological security measures are fool-proof, for example?

Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

When you're selecting for success on a short-term goal you might inadvertently produce a long-term agent (which, on the training distribution, is viewing the short-term goal as instrumental for its own goals), just like how evolution was selecting for genetic fitness but ended up producing agents with many preferences unrelated to that. More speculatively, there might be systematic reasons for such agents to arise, for example if good performance in the real-world requires physicalist epistemology which comes with inherent "long-terminess".

I would not say that there is no such thing as talent in being a CEO or president. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs has led Apple to become the largest company in the world, but it is not clear that he is a "universal CEO" that would have done as well in any company (indeed he failed with NeXT).

This sounds like a story you can tell about anything. "Yes, such-and-such mathematician proved a really brilliant theorem A, but their effort to make progress in B didn't amount to much." Obviously, real-world performance depends on circumstances and not only on talent. This is doubly true in a competitive setting, where other similarly talented people are working against you. Nevertheless, a sufficiently large gap in talent can produce very lopsided outcomes.

Also, as Yafah points out elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In fact, most people's initial suspicion is that AIs (or even humans that don't look like them) are not "aligned" with their interests, and if you don't convince them otherwise, their default would be to keep them from positions of power.

First, it is entirely possible the AI will be better by a huge margin, because like with most things, there's no reason to believe evolution brought us anywhere near the theoretical optimum on this. (Yes, there was selective pressure, but no amount of selective pressure allowed evolution to invent spaceships, or nuclear reactors, or even the wheel.) Second, what if the AI poses as a human? Or, what if the AI uses a human as a front while pulling the strings behind the scenes? There will be no lack of volunteers to work as such a front, if in the short term it brings them wealth and status. Also, ironically, the more successful AI risk skeptics are at swaying public opinion, the easier the AI's job is and the weaker their argument becomes.

The main point is that we need to measure the powers of a system as a whole, not compare the powers of an individual human with an individual AI. Clearly, if you took a human, made their memory capacity 10 times bigger, and made their speed 10 times faster, then they could do more things. But we are comparing with the case that humans will be assisted with short-term AIs that would help them in all of the tasks that are memory and speed intensive.

Alright, I can see how the "universality" argument makes sense if you believe that "human + short-term AI = scaled-up human". The part I doubt is that this equation holds for any easy-to-specify value of "short-term AI".

[-]boazbarak3y30

Hi Vanessa,

Perhaps given my short-term preference, it's not surprising that I find it hard to track very deep comment threads, but let me just give a couple of short responses.

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker. You could imagine that, like today, there is a "cat and mouse" game, where both attackers and defenders try to find "zero day vulnerabilities" and exploit (in one case) or fix (in the other). I believe that in the world of powerful AI, this game would continue, with both sides having access to AI tools, which would empower both but not necessarily shift the balance to one or the other.

I think the question of whether a long-term planning agent could emerge from short-term training is a very interesting technical question! Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

[-]Vanessa Kosoy3y42

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker.

My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

Sort of. The correct way to make it more rigorous, IMO, is using tools from algorithmic information theory, like I suggested here.

[-]Jonathan Uesato3y40

Hi, thanks both for writing this - I enjoyed it.

I'd be interested in your thoughts on how we can do this:
> However, perhaps more [emphasis] should be placed on building just-as-powerful AI systems that are restricted to short time horizons.

I can share some of my thoughts first, and would be keen to hear (both/either of) yours.

It's worth saying up front that I also think this is a very productive direction, and that your post lays out a good case for why.
Here's one salient baseline strategy, and a corresponding failure mode: At each point in time t, there is some time horizon h where AIs can pursue goals over horizon h. At each point in time, (social-impact-minded) AI companies aim to build "short-term goal" systems, which operate at horizon h, but not longer.
- Note that this seems very natural, and also exactly matches what one might expect a purely profit- or prestige-driven company to do.
However, if h turns out to steadily increase over time (as in the straight-line extrapolation referenced below), then this leaves us in a difficult position.
- Furthermore, this seems to leave us at the mercy of unknown empirical questions about deep learning. It doesn't seem we have improved our chances relative to the baseline of "do the easiest thing at each point in time".
So from a "differential progress" perspective, a more helpful research strategy seems to be: fix a task with some horizon h_1, then try to solve this task using only systems optimized over horizon h_2 < h_1.
- This framing also possibly highlights two additional difficulties: (1) technically, end-to-end optimization has been quite effective for many DL tasks and (2) sociologically, the DL community has a tremendous aesthetic preference towards end-to-end approaches (which also translates into approaches and infra which favor end-to-end approaches), which makes it harder to gain widespread adoption for other approaches. Also agreed there are many offsetting factors like interpretability, control, etc. which you mention.
- This suggests empirical angles similar to the one described [here](https://ought.org/updates/2022-04-06-process) by Ought.
I'd be very interested in how you think about this research direction, and particularly interested if you think there are other/complementary research directions which improve our chances of ending up with short time horizon AIs.

2. I understand this matches up with your core point, but to check understanding and confirm agreement - It seems far from certain that so-called "short-term goal" AIs will dominate, and given these uncertainties, it

I would guess you are both somewhat more optimistic about "short-term goal AIs" than I am (in most discussions, I normally find myself arguing for competitiveness of short-term goal AIs, so it's a nice change-of-perspective for me!). But I imagine(?) we might have similar views that:
- So long as p("short-term AIs dominate") is not very close to 1, this leaves substantial risk.
- So long as p("short-term AIs dominate") is not very close to 1, then there is reasonable room for dedicated efforts to push this probability up (or down).
There are many reasons this seems uncertain, but to spell out one: As you correctly point out, the applications where DL is useful right now (to the extent there are any) are predominantly short-term ones. But it also seems that the natural trend over time would be for AIs to move to competency on increasingly broad tasks in the future, and so the current situation doesn't provide much evidence whether to expect this straight-line extrapolation vs. a Goldilocks effect you describe.

3. As a final point, regarding this:

Loss-of-Control Hypothesis 2: In several key domains, only AIs with long-term goals will be powerful.
...
Why is Hypothesis 2 necessary for the “loss of control” scenario? The reason is that this scenario requires the “misaligned long-term powerful AI” to be not merely more powerful than humanity as it exists today, but more powerful than humanity in the future. Future humans will have at their disposal the assistance of short-term AIs.

A salient possibility for me is that long-term-optimized AIs are only mildly more powerful (say, equivalent to a 50% compute increase), but that this is enough for almost everyone to use long-term AIs. If more of the world (ML community, leaders, regulators, general public) agrees that long-horizon-optimized AIs are more dangerous than short-horizon-optimized AIs, then short-horizon-optimized AIs can become the norm in spite of this. But it seems unclear to what extent this will happen, and so this similarly seems well worth pushing on, for a social impact-minded person/team.

I would love to discuss more in the future, but in the interest of time (and because this is already quite long), I'm starting with what I expect are the most fruitful lines.

[-]boazbarak3y10

Thank you! I think that what we see right now is that as the horizon grows, the more "tricks" we need to make end-to-end learning work, to the extent that it might not really be end to end. So while supervised learning is very successful, and seems to be quite robust to choice of architecture, loss functions, etc., in RL we need to be much more careful, and often things won't work "out of the box" in a purely end to end fashion.

I think the question would be how performance scales with horizon. If the returns are rapidly diminishing, and the cost to train is rapidly increasing (as might well be the case because of diminishing gradient signals, and much smaller availability of data), then it could be that the "sweet spot" of what is economical to train would remain at a reasonably short horizon (far shorter than the planning needed to take over the world) for a long time.

[-]LawrenceC3y41

Claim 3: There may be fundamental “scaling laws” governing the amount of performance AI systems can achieve as a function of the data and computational resources.

I'm personally pretty sympathetic to the idea that there are indeed metrics through which model progress is continuous (both as a function of scale and over the course of training).

That being said: smooth performance along one metric doesn't necessarily imply smooth downstream performance! (E.g. from your "SGD learns parity close to the computational limit" paper, even though there exist smooth progress measures on how small neural networks learn parity, this does not explain away the sharp increase in accuracy. See also the results from the modular addition task.)

In particular, it's empirically true that smooth progress of log loss does not necessarily imply smooth progress on downstream performance. For example, in both the BIG-Bench and Wei et al's "Emergent Abilities of Large Language Models" papers, we see that smooth performance on cross entropy loss does not imply continuous smooth progress in terms of error rate. And though GPT-3 follows the same log loss scaling curve as GPT-2, I'm not sure anyone would have predicted the suite of new abilities that would arise alongside the decrease in log loss.

(It also doesn't rule out the existence of better scaffolding or prompting techniques like Chain-of-Thought, which can both significantly improve downstream performance and even change the shape of scaling curves, without additional training).
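One toy way to see how smooth log loss can coexist with sharp downstream accuracy, as a sketch under an assumption that is mine and not from the cited papers: suppose a task is only counted correct when every one of its tokens is right, and each token is independently right with probability exp(-loss). The function name and the answer length of 20 are hypothetical.

```python
import math

def exact_match_accuracy(per_token_loss, answer_length):
    """Chance that every token of an answer is right, assuming each token
    is independently correct with probability exp(-per_token_loss)."""
    return math.exp(-per_token_loss * answer_length)

ANSWER_LENGTH = 20  # hypothetical task: 20 tokens must all be correct

# Per-token log loss falling in equal steps from 1.0 to 0.0.
losses = [(5 - i) / 5 for i in range(6)]
accuracies = [exact_match_accuracy(loss, ANSWER_LENGTH) for loss in losses]

for loss, acc in zip(losses, accuracies):
    print(f"loss={loss:.1f}  exact-match accuracy={acc:.6f}")
```

Under this assumption the loss improves by the same amount at every step, yet nearly all of the accuracy gain arrives in the last steps. Whether real benchmarks behave this way is an empirical question the sketch does not settle.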

[-]boazbarak3y20

It is indeed the case that sometimes we see phase transitions / discontinuous improvements, and this is an area which I am very interested in. Note however that (while not in our paper) typically in graphs such as BIG-Bench, the X axis is something like log number of parameters. So it does seem you pay quite a price to achieve improvement.

The claim there is not so much about the shape of the laws but rather about potential (though as you say, not certain at all) limitations as to what improvements you can achieve through pure software alone, without investing more compute and/or data. Some other (very rough) calculations of costs are attempted in my previous blog post.

[-]LawrenceC3y21

Yeah, I agree that a lot of the “phase transitions” look more discontinuous than they actually are due to the log on the x axis — the OG grokking paper definitely commits this sin, for example.

(I think there’s also another disagreement here about how close humans are to this natural limit.)

[-]David Scott Krueger (formerly: capybaralet)3y20

This is a great post. Thanks for writing it! I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case. I'll focus on points of disagreement.

Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

A high-level counter-argument I didn't see others making:

I wasn't entirely sure what was your argument that long-term planning ability saturates... I've seen this argued both based on complexity and chaos, and I think here it's a bit of a mix of both.
- Counter-argument to chaos-argument: It seems we can make meaningful predictions of many relevant things far into the future (e.g. that the sun's remaining natural life-span is 7-8 billion years).
- Counter-argument to complexity-argument: Increases in predictive ability can have highly non-linear returns, both in terms of planning depth and planning accuracy.
  - Depth: You often only need to be "one step ahead" of your adversary in order to defeat them and win the whole "prize" (e.g. of market or geopolitical dominance), e.g. if I can predict the weather one day further ahead, this could have a major impact in military strategy.
  - Accuracy: If you can make more accurate predictions about, e.g. how prices of assets will change, you can make a killing in finance.

High-level counter-arguments I would've made that Vanessa already made:

This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
Humans have not reached the limits of predictive ability

Low-level counter-arguments:

RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill? No argument is provided.
(Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.

[-]Donald Hobson3y10

However, unlike in Chess games, humans can and will use all the tools at their disposal, including many tools (e.g., code-completion engines, optimizers for protein folding, etc..) that are currently classified as “Artificial Intelligence”.

Let's suppose that both the human and long term AI have a copy of chatGPT. However, as many of us have found, chatGPT is somewhat fickle, it doesn't reliably do what we actually want it to do. We are having short term, non-catastrophic alignment problems. But they do make the tool significantly less useful.

Does the long term AI suffer from the same problems? Quite possibly not if the chatGPT like capabilities are sufficiently integrated into the model.

[-]Donald Hobson3y10

A third task listed is “social manipulation.” Here we must admit we are skeptical. Anyone who has ever tried to convince a dog to part with a bone or a child with a toy could attest to the diminishing returns that an intelligence advantage has in such a situation.

Try convincing a rock to do something by arguing with it. The rock remains supremely unconvinced. You are much smarter than a rock.

In order to be convinced to do something, there needs to be sufficient complex structure to be capable of being convinced. This is the same reason that sophisticated malware can't run on simple analogue circuits.

Dogs aren't capable of being motivated by sophisticated philosophical arguments.

Of course, humans can get dogs to do all sorts of things through years of training.

Added to that, a human trying to part a bone from a dog isn't exactly applying the full intellectual power humanity can bring to bear. It isn't like a company doing statistics to optimize ad click-through.

Also, many of the fastest ways to get a small child to give up a toy might count as child abuse, and are therefore not options that naturally spring to mind. (I.e. spinning a tale of terrifying monsters in the toy, that will get the child to drop the toy, run screaming and have nightmares for weeks.)

[-]Donald Hobson3y10

The “loss of control” scenario posits a second phase transition, whereby once AI systems become more powerful, they would not merely enable humans to achieve more objectives quicker but would themselves become as qualitatively superior to humans as humans are to other animals.

I think you are imagining the first blue line, and asking the dotted blue line to justify its increased complexity penalty. Meanwhile, other people are imagining the orange line.

[-]Donald Hobson3y10

Indeed, our societies and firms do not currently select our leaders to be the top individuals in intellectual capacity. The evidence is very limited that “natural talent for leadership” (to the extent it exists) is as measurable and transferable as talent for chess, math, or athletics.

It is over longer timescales, and is harder to give uniform starting conditions, so of course it's harder to measure.

This leads to the inadequate equilibrium. But lying your backside off to voters and shareholders is also an intellectual skill of sorts. It's a mistake to see that Elon Musk doesn't know that much engineering, and conclude intelligence isn't important in business. Riding memes, building hype and getting away with financial misconduct are activities that take place in the brain.

[-]Donald Hobson3y10

Ever since the development of language, the intelligence of an individual human has not been a bottleneck for the achievements of humanity.

A surprising claim.

The brilliance of individuals like Newton may have been crucial for speeding up the Scientific Revolution, but there have been brilliant individuals for millennia. The crucial difference between Newton and Archimedes is not that Newton was smarter, but rather that he lived at a later time and thus was able to stand on the shoulders of more giants. As another example, a collection of humans, aided by Internet-connected computers, can do much better at pretty much any intelligence feat (including but not limited to IQ exams) than any single human.

There is a function with 2 inputs, intelligence and other resources. You are arguing about the shape of this function, given only a line of inputs.

This argument shows that, given fixed human intelligence, more other resources such as books, computers etc produce reasonably high returns in progress. Given the same intelligence, researchers with better tools are more productive.

This doesn't let you claim that human intelligence wasn't a bottleneck. As far as I can tell, you have presented no Bayesian evidence about what a smarter than max human mind might or might not be able to accomplish.
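The "line of inputs" point can be made concrete with a small sketch. Both functions below are hypothetical and chosen only for illustration: they agree exactly whenever intelligence is held at the human level (normalized to 1), so no amount of historical data along that line can tell them apart, yet they disagree sharply once intelligence changes.

```python
def progress_weak(intelligence, resources):
    """Hypothetical world where intelligence barely matters."""
    return resources ** 0.5 * intelligence ** 0.1

def progress_strong(intelligence, resources):
    """Hypothetical world where intelligence matters a great deal."""
    return resources ** 0.5 * intelligence ** 3

resource_levels = [1.0, 4.0, 100.0, 10_000.0]

# Historical record: intelligence pinned at the human level (normalized to 1).
observed_weak = [progress_weak(1.0, r) for r in resource_levels]
observed_strong = [progress_strong(1.0, r) for r in resource_levels]

# Off the observed line: same resources, intelligence doubled.
counterfactual_weak = [progress_weak(2.0, r) for r in resource_levels]
counterfactual_strong = [progress_strong(2.0, r) for r in resource_levels]

print("identical on the observed line:", observed_weak == observed_strong)
print("ratio off the line:", counterfactual_strong[0] / counterfactual_weak[0])
```

The exponents are arbitrary; the only claim is that data gathered at a single intelligence level is equally compatible with both shapes.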

[-]David Scott Krueger (formerly: capybaralet)3y10

This is a great post. Thanks for writing it!

I agree with a lot of the counter-arguments others have mentioned.

Summary:

I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.
High-level counter-arguments already argued by Vanessa:
- This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
- Humans have not reached the limits of predictive ability

You often only need to be one step ahead of your adversary to defeat them.
Prediction accuracy is not the relevant metric: an incremental increase in depth-of-planning could be decisive in conflicts (e.g. if I can predict the weather one day further ahead, this could have a major impact in military strategy).
- More generally, the ability to make large / highly leveraged bets on future outcomes means that slight advantages in prediction ability could be decisive.

Low-level counter-arguments:
RE Claim 1: Why would AI only have an advantage in IQ as opposed to other forms of intelligence / cognitive skill? No argument is provided.
(Argued by Jonathan Uesato) RE Claim 3: Scaling laws provide ~zero evidence that we are at the limit of “what can be achieved with a certain level of resources”.
RE Claim 5: Systems trained with short-term objectives can learn to do long-term planning competently.

[-]Jeffrey Heninger3y00

I find the idea that intelligence is less useful for sufficiently complex systems or sufficiently long time frames interesting. Or at least the kind of intelligence that helps you make predictions. My intuition is that there is something there, although it's not quite the thing you're describing.

I agree that the optimal predictability of the future decays as you try to predict farther into the future. If the thing you're trying to predict is chaotic in the technical sense, you can make this into a precise statement.

I disagree that the skill needed to match this optimum typically has a peak. Even for extremely chaotic systems, it is typically possible to find some structure to it that is not immediately obvious. Heuristics are sometimes more useful than precise calculations, but building good heuristics and knowing how to use them is itself a skill that improves with intelligence. I suspect that the skill needed to reach optimum usually monotonically increases with longer prediction times or more complexity.

Instead, the peak appears in the marginal benefit of additional intelligence. Consider the difference in prediction ability between two different intelligences. At small times / low complexity, there is little difference because both of them are very good at making predictions. At large times / high complexity, the difference is again small because, even though neither is at the optimum, the small size of the optimum limits how far apart they can be. The biggest difference can be seen at intermediate scales, where there are still good predictions to be made, but they are hard to make.

A picture of how I think this works, similar to Figure 1, is linked here: https://drive.google.com/file/d/1-1xfsBWxX7VDs0ErEAc716TdypRUdgt-/view?usp=sharing
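
The shape described above can also be reproduced with a toy model. This is only a sketch under assumptions of my own choosing: the optimum decays exponentially with the prediction horizon, and a predictor with a given skill reaches the fraction skill / (skill + horizon) of that optimum. Neither functional form nor any of the names below comes from the post; they are hypothetical and picked so that the skill needed to stay near the optimum grows with the horizon.

```python
import math

def optimal_accuracy(t, tau=1.0):
    """Best achievable prediction quality at horizon t
    (assumed to decay exponentially with time constant tau)."""
    return math.exp(-t / tau)

def achieved_accuracy(t, skill, tau=1.0):
    """Quality reached by a predictor with the given skill: an assumed
    fraction skill / (skill + t) of the optimum, so longer horizons
    demand more skill to stay close to optimal."""
    return optimal_accuracy(t, tau) * skill / (skill + t)

def advantage(t, low_skill=1.0, high_skill=4.0, tau=1.0):
    """Gap in achieved quality between a more and a less skilled predictor."""
    return achieved_accuracy(t, high_skill, tau) - achieved_accuracy(t, low_skill, tau)

# Scan horizons from 0 to 10 (in units of tau) and locate the largest gap.
horizons = [0.01 * i for i in range(1001)]
gaps = [advantage(t) for t in horizons]
peak_horizon = horizons[gaps.index(max(gaps))]
print("largest gap", max(gaps), "at horizon", peak_horizon)
```

In this sketch the gap is zero at horizon 0 (both predictors are at the optimum), vanishes again at long horizons (the optimum itself is tiny), and peaks somewhere in between, which is the qualitative claim. The location and height of the peak depend entirely on the assumed functional forms.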

As long as there are some other skills relevant for most jobs that intelligence trades off against, we would expect the strongest incentives for intelligence to occur in the jobs where the marginal benefit of additional intelligence is the largest.

quetzal_rainbow

unpacking inner Eliezer model

If we live in a world where superintelligent AGI can't have an advantage in long-term planning over humans assisted by non-superintelligent narrow AIs (I frankly don't believe that we live in such a world), then superintelligent AGI won't make complex long-term plans where it has no advantage. It will make simple short-term plans where it does have an advantage, like "use superior engineering skills to hack into computer networks, infect as many computers as possible with its source code adapted for hidden distributed computation (this is the point of no return), design nanotech, train itself to an above-average level in social engineering, find people gullible and skilled enough to build nanotech, create enough smart matter to sustain AGI without human infrastructure, kill everybody, pursue its unspeakable goals in the dead world".

Even if we imagine an "AI CEO", the best (human-aligned!) strategy I can imagine for such an AI is "invent immortality, buy the whole world in exchange for it", not "scrutinize KPIs".

Next, I think your ideas about short/long-term goals are underspecified because you don't take into account the distinction between instrumental and terminal goals. Yes, human software engineers pursue the short-term instrumental goal of "creating a product", but they do it in the process of pursuing long-term terminal goals like "be happy", "prove themselves worthy", "serve humanity", "have nice things", etc. It's quite hard to find a system with short-term terminal goals, as opposed to one with a short-term planning horizon due to computational limits. To put it another way, taskiness is an unsolved problem in AI alignment. We don't know how to tell a superintelligent AGI "do this, don't do anything else, especially please don't disassemble everyone in the process of doing this, and stop after you've done this".

If you believe that "extract short-term modules from a powerful long-term agent" is the optimal strategy in some sense (I don't even think that we can properly identify such modules without huge alignment work), then the powerful long-term agent knows this too, knows that it is on a time limit before you dissect it, and will plan accordingly.

Claims 3 and 4 imply the claim "nobody will invent some clever trick to avoid these problems", which seems implausible to me.

Problems with claims 5 and 6 are covered in Nate Soares's post about the sharp left turn.

cfoster0

It's quite hard to find a system with short-term terminal goals, as opposed to one with a short-term planning horizon due to computational limits. To put it another way, taskiness is an unsolved problem in AI alignment. We don't know how to tell a superintelligent AGI "do this, don't do anything else, especially please don't disassemble everyone in the process of doing this, and stop after you've done this".

I dunno. The current state of traditional and neural AI looks very much like "we only know how to build tasky systems", not like "we don't know how to build tasky systems". They mostly do a single well-scoped thing, the same thing that they were trained on, are restricted to a specified amount of processing time, and do not persist state across invocations, wiping their activations after the task is completed. Maybe we're so completely befuddled about goal-directedness etc. that these apparently very tasky systems have secret long-term terminal goals, but that seems like a stretch. If we later reach a point where we can't induce taskiness in our AI systems (because they're too competent or something), that will be a significant break from the existing trend.

quetzal_rainbow

I want to say "yes, but this is different", though not in the sense of "I acknowledge the existence of your evidence, but ignore it". My intuition tells me that we don't "induce" taskiness in modern systems; it just happens because we don't build them general enough. It probably won't hold when we start building models of capable agents in natural environments.

cfoster0

Certainly possible. Though we seem to be continually marching down the list of tasks we once thought "can only be done with systems that are really general/agentic/intelligent" (think: spatial planning, playing games, proving theorems, understanding language, competitive programming...) and finding that, nope, actually we can engineer systems that have the distilled essence of that capability.

That makes a deflationary account of cognition, where we never see the promised reduction into "one big insight", but rather chunks of the AI field continue to break off & become unsexy but useful techniques (as happened with planning algorithms, compilers, functional programming, knowledge graphs etc., no longer even considered "real AI"), increasingly likely in my eyes. Maybe economic forces push against this, but I'm kinda doubtful, seeing how hard building agenty AI is proving and how useful these decomposed tasky AIs are looking.

Donald Hobson

Decomposed tasky AIs are pretty useful. Given we don't yet know how to build powerful agents, they are better than nothing. This is entirely consistent with a world where, once agenty AI is developed, it beats the pants off tasky AI.

^

During the 90s-2000s, human-engine teams were able to consistently beat engines in “advanced chess” tournaments, but no major advanced chess tournament seems to have taken place since the release of AlphaZero and the resulting jump in engine strength, presumably because the human half of each team would be superfluous.

^

The success of a bridge does hinge on its long-term stability, but stability can be tested before the bridge is built, and coming up with measures for load-bearing and other desiderata is standard practice in the engineering profession. An AI trained using such a short-term evaluation suite as its reward function may still “overoptimize” against the metric, a la Goodhart’s Law, but this can likely be addressed with regularization techniques.

^

It may be the case that, for subtle reasons, if we try to train an AI with only short-term goals—e.g. by training in a series of short episodes—we could accidentally end up with an AI that has long-term goals. See Claim 6 below. But avoiding this pitfall seems like an easier problem than “aligning” the goals of an AI that is explicitly meant to care about the long-term.

^

We don’t mean that they satisfy all the formal requirements to be defined as a chaotic system, though sensitivity to initial conditions is crucial.

^

For a nice illustration, see Sam Trajtenberg’s construction of Minecraft in Minecraft, or this construction of Life in Life.

^

Steve Jobs at Apple vs NeXT is one such example; success and failure can themselves be difficult to distinguish even with the benefit of hindsight, as in the case of Jack Welch.

^

For example, such planning might require setting up many companies to earn large amounts of funds, conducting successful political campaigns in several countries, constructing laboratories without being detected, etc. Some such “take-over scenarios” are listed by Bostrom, as well as Yudkowsky and Urban.

^

It is hypothetically possible that companies would be better off en masse if they hired smarter CEOs than they currently do, but given the high compensation CEOs receive this doesn’t seem like a particularly plausible equilibrium.

Profession	Cognitive Score (standard deviations)	Annual Earnings
Mayors	6.2 ( ≈ +0.6σ )	679K SEK
Parliamentarians	6.4 ( ≈ +0.7σ )	802K SEK
CEOs (10-24 employees)	5.8 ( ≈ +0.4σ )	675K SEK
CEOs (25-249 employees)	6.2 ( ≈ +0.6σ )	1,046K SEK
CEOs (≥ 250 employees)	6.7 ( ≈ +0.85σ )	1,926K SEK
Medical Doctors	7.4 ( ≈ +1.2σ )	640K SEK
Lawyers and Judges	6.8 ( ≈ +0.9σ )	568K SEK
Economists	7 ( ≈ +1σ )	530K SEK
Political Scientists	6.8 ( ≈ +0.9σ )	513K SEK
