The Shortest Path Between Scylla and Charybdis

Thane Ruthenis

tl;dr: There are two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't generalize to AGI in time, and engaging in excessively abstract research whose findings won't connect to practical reality in time.

Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete.

The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two – and thereby arrives at the solution to alignment in as few steps as possible.

Introduction

Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from:

What behavioral properties do the current-best AIs exhibit?
Can we already augment our research efforts with the AIs that exist today?
How far can "straightforward" alignment techniques like RLHF get us?
Can an AGI be born out of an AutoGPT-like setup? Would our ability to see its externalized monologue suffice for nullifying its dangers?
Can we make AIs-aligning-AIs work?
What are the mechanisms by which the current-best AIs function? How can we precisely intervene on their cognition in order to steer them?
What are the remaining challenges of scalable interpretability, and how can they be defeated?
What features do agenty systems convergently learn when subjected to selection pressures?
Is there such a thing as "natural abstractions"? How do we learn them?
What is the type signature of embedded agents and their values? What about the formal description of corrigibility?
What is the "correct" decision theory that an AGI would follow? And what's up with anthropic reasoning?
Et cetera, et cetera.

So... How the hell do you pick what to work on?

The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable?

Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those?

Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy!

So what further objective criteria can you evaluate?

Regardless of one's model of AI Risk, there are two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract.

The approach to choose should be one that maximizes the distance from both failure modes.

The Scylla: Atheoretic Empiricism

One pitfall would be engaging in research that doesn't generalize to aligning AGI.

An ad-absurdum example: You pick some specific LLM, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response" pairs, with no overarching structure and no attempt to theorize about the model's internals.

A more realistic example: You've decided to build a detailed understanding of a specific LLM's functionality – i.e., you're exhaustively focusing on that one LLM. You're building itemized lists of its neurons, investigating which inputs seem to activate each one most strongly and what functions they implement; you're looking for quirks in its psychology, and trying to build a full understanding of it.

Now, certainly, you're uncovering some findings that'd generalize to all LLMs. But there'd be some point at which more time spent investigating this specific model wouldn't yield much data about other LLMs; only data about this one. Thus, inasmuch as you'd be spending time on that, you'd be wasting the time you could be spending actually working on alignment.

A fairly controversial take: Studying LLMs-in-general might, likewise, fall prey to that. Studying them reveals some information about AIs-in-general, and cognitive-systems-in-general. But if LLMs aren't already AGIs, there would be a point at which more time spent studying LLM cognition, instead of searching for a new research topic, would only yield you information about LLMs; not about AGIs.

A fairly implausible possibility: Likewise, it's not entirely certain that Deep Learning is AGI-complete. If we live in a world where it isn't, then studying DL is worthwhile inasmuch as it yields information about cognitive-systems-shaped-by-selection-pressures-in-general. But at some point, you'll have learned everything DL can teach you about whatever paradigm would be AGI-complete. So the additional time spent researching DL would only yield information about an irrelevant AGI-incomplete paradigm.

The Charybdis: Out-of-Touch Theorizing

The diametrically opposite pitfall is engaging in overly theoretical research that will never connect to reality.

Ad absurdum: You might decide to start with the fundamental philosophical problems. Why does anything exist? What is the metaphysical nature of reality? Is reductionism really true? What's up with qualia? That line of research will surely eventually meander down to reality! It aims to answer all questions it is possible to answer, after all, and "how can I align an AGI?" is a question. Hence, you'll eventually solve alignment.

More realistically: You might decide to work on formalizing the theory of algorithms-in-general. How can those be embedded into other algorithms? How can they interact, and interfere with each other?

Since AGI agents could be viewed as algorithms, once you have a gears-level model of this topic – once you properly understand what an "embedded algorithm" is, say – you'll be able to tell what the hell an "AGI" is, as well. You'll be able to specify it in your framework, define proper constraints on what an algorithm implementing an "aligned" "AGI" would look like, then just incrementally narrow down the space of algorithms. Eventually, you'll arrive at one that corresponds to an aligned AGI – and then it's just a matter of typing up the code.

Controversial example: Agency-foundations research might be this. Sure, the AGI we'll get on our hands might end up approximately isomorphic to an idealized game-theoretic agent. But that "approximately" might be doing a lot of heavy lifting. It might be that idealized-agent properties correspond to real-AGI properties so tenuously as to yield no useful information, such that you would have been better off studying LLMs.

Implausible example: Actually, GPT-5, stored deep inside OpenAI's data centers, has already reached AGI. It'll take off before this year is up. Everyone should focus on trying to align this specific model; aiming for a general understanding of agents or AI cognition or LLMs is excessive and wasteful.

The Shortest Path

As you can see, the failures lie on a spectrum, and they're model-dependent to boot.

That is: Depending on how you think AIs/cognition/AI risks work, the same approach could be either hopelessly non-generalizable, or concerned with generalities too vacuous to ever matter.

As an example, consider my own favoured agenda, building a theory of embedded world-models. If you think LLMs have already basically reached AGI, and just need to be scaled up in order to take off, I'm being out-of-touch: whatever results I'm arriving at will not connect to reality in time for the takeoff. Conversely, if you're skeptical that "train up a world-model and align it by retargeting the search" would suffice to yield robust alignment, if you think we'll need much more manual control over the design for alignment to hold, then I'm basically playing with toys.

I, however, obviously think that I'm striking just the right balance: an approach that is as concrete as possible while still being AGI Alignment-complete.

That's the target you should be aiming to hit, as well. A lowest-scope project that's nevertheless sufficient.

Let's take a step back. In theory, given unlimited time, basically-all approaches would actually converge to an AGI Alignment solution:

If you're starting bottom-up, from the most concrete problems, like studying a specific LLM... Well, eventually you'll have itemized all of its properties and grown bored, so you'll move on to a different LLM. Upon doing so, you'll discover that a lot of your previous findings generalize. You'll fully comprehend the second LLM much more quickly. Repeat a few times, and you'll have built up a solid understanding of the whole scope of what the LLM architecture permits. So you'll do the obvious next thing, and move on to studying some different architecture. That'll be easier, with your mastery of LLMs. Once you've iterated on this pattern some more, and gone through a few different architectures, and generalized from them – why, you'll likely end up understanding an AGI-complete architecture somewhere along the way as well.
If you're starting top-down, from the most abstract problems: Well, as I outlined in the ad-absurdum example above, you'll eventually reconnect to reality even if starting from fundamental philosophy. Existential questions to phenomena-in-general to cognition-in-general to AGI Alignment, say.

The issue? Choosing the wrong starting point would lengthen your journey immensely. And the timer's ticking.

Our goal isn't just to solve AGI Alignment, it's to solve it as quickly as possible.

So be sure to deeply consider all options available, and make your choice wisely. And once you've made it, stay ready to pivot at a moment's notice if you spy an even shorter pathway.

Vladimir Nesov

From a position of uncertainty, there is no optimal direction, only a model of a good distribution of global efforts among all directions. A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions. A choice of an actual researcher with specific preferences should give weight to those preferences, which might greatly improve productivity.

Thane Ruthenis

"A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions"

Inside-view convincingness of these directions still has to be weighted in. E.g., "study the Bible for alignment insights" is a relatively neglected direction (just Unsong on it, really?), but that doesn't mean it'd be sensible to focus on it just because it's neglected. And even if your marginal contributions to the correct approach would be minimal because so many other people are working on it, that may still have more expected impact than setting off on a neglected (and very likely incorrect) one.

"A choice of an actual researcher with specific preferences should give weight to those preferences"

Oh, I'm not saying you should entirely ignore your preferences/comparative advantages. But if you're looking at a bunch of plausible directions, you can pick between them not solely based on your comparative advantages.

Vladimir Nesov

"Inside-view convincingness of these directions still has to be weighted in."

I mean directions neglected relative to an estimated good distribution of global effort. If I estimate that a good distribution of effort would allot zero to searching The Silmarillion for insights relevant to mechanistic interpretability, then it's not a relatively neglected direction.

"if you're looking at a bunch of plausible directions, you can pick between them not solely based on your comparative advantages"

Sure, by "give weight" I mean take into account, not take as the sole basis for a decision. The other major factor is that relative neglectedness I mentioned (in the sense I hopefully now clarified).

Tsvi Benson-Tilsen

And we can go further and say that the failures lie in a high-dimensional space, and that the apparent tradeoff is more a matter of finding the directions in which to pull the rope sideways. Propagating constraints between concepts and propositions is a way to go that seems hopeworthy to me. One wants to notice commonalities in how each of one's plans is doomed, and then address the common blockers / missing ideas. In other words, recurse to the "abstract" as much as is called for, even if you get really abstract; but treat [abstracting more than what you can directly see/feel as being demanded by your thinking] as a risky investment with opportunity cost.

AI ALIGNMENT FORUM