Introduction: Epistemic Strategies Redux

This post examines the epistemic strategies at work in Steve Byrnes’ Safety-capabilities tradeoff dials are inevitable in AGI.

(If you want to skim this post, just read the Summary subsection, which displays the epistemic strategy as a design pattern.)

I introduced epistemic strategies in a recent post, but didn’t define them except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.

An example of an epistemic strategy, common in the natural sciences (and beyond), is:

  • Look at the data
  • Find a good explanation
  • Predict new things with that explanation
  • Get new data for checking your prediction
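
To push the post-as-program analogy a bit, here is a minimal Python sketch of that recipe written as a loop. Every function passed in (collect_data, fit_explanation, predict, matches) is a hypothetical placeholder for actual research work, not a real library call.

```python
# Minimal sketch of the strategy above, written as a loop; every function
# passed in is a hypothetical placeholder for actual research work.

def scientific_strategy(collect_data, fit_explanation, predict, matches, rounds=3):
    data = collect_data()                      # Look at the data
    explanation = fit_explanation(data)        # Find a good explanation
    for _ in range(rounds):
        prediction = predict(explanation)      # Predict new things with that explanation
        new_data = collect_data()              # Get new data for checking your prediction
        if not matches(prediction, new_data):  # A failed check sends us back to explaining
            explanation = fit_explanation(data + new_data)
        data = data + new_data
    return explanation
```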

More than just laying out an abstract recipe, the analysis serves to understand how each step is done, whether it makes sense, and how each step (and the whole strategy) might fail. Just as with a design pattern or an algorithm, it matters tremendously to know when to apply the strategy, when to avoid it, and which subtleties to be aware of.

Laying this underlying structure bare matters in three ways:

  • It clarifies the research’s purpose and value for newcomers and researchers from other fields, with minimal assumptions of shared approaches.
    • Just as a programmer switching to a new domain of problems gets up to speed faster and more reliably if they have access to the patterns/algorithms/tricks used in that domain.
  • It focuses feedback and criticism on the most important parts of the idea/proposition/argument.
    • Issues with an algorithm tend to focus on its core point rather than the details, whereas issues with an implementation of that algorithm can be as much about typos, optimization tricks, and bad structure as about the actual core (the algorithm).
  • It builds a library of such strategies for alignment in particular, a cookbook newcomers and senior researchers alike can browse for inspiration or a take on some new post/paper they don’t grok.

Thanks to Steve Byrnes for feedback on a draft of this post.

Defining Safety-Capabilities Tradeoffs

What sort of knowledge is Steve attempting to create in his post? He sets out explicitly to show that any alignment proposal must deal with one or more tradeoffs between safety and capabilities (which he calls safety-capabilities tradeoff dials).

I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”

This is in opposition to claims that some alignment proposals should be deemed less promising or insufficient because they would include such tradeoffs.

(Recent examples of the latter attitude, at least arguably: here, here.)

A good way of framing the difference between safety and capabilities is that safety is about worst-case reasoning (making the worst outcomes the AI might cause less bad) whereas capabilities is about best-case or average-case reasoning (improving the plans the AI might come up with). Nothing forbids a solution with great worst-case, average-case and best-case guarantees; yet it’s not incoherent to imagine a tradeoff between not failing too badly and succeeding as impressively as possible.
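
To make the worst-case/average-case distinction concrete, here is a toy Python sketch of my own: the same set of possible outcomes for a plan, scored once with capabilities-style reasoning and once with safety-style reasoning. The numbers and function names are invented purely for illustration.

```python
# Toy illustration (numbers and names invented): scoring the same set of
# possible outcomes with capabilities-style vs. safety-style reasoning.

def capability_score(outcomes):
    # Best-case / average-case reasoning: how well can things go?
    return max(outcomes), sum(outcomes) / len(outcomes)

def safety_score(outcomes):
    # Worst-case reasoning: how badly can things go?
    return min(outcomes)

plan_risky = [15, 14, 10, -10]  # more impressive best and average case
plan_safe = [6, 5, 4, 1]        # nothing catastrophic in the tail

print(capability_score(plan_risky), safety_score(plan_risky))  # (15, 7.25) -10
print(capability_score(plan_safe), safety_score(plan_safe))    # (6, 4.0) 1
```

Here plan_risky wins on best-case and average-case reasoning while plan_safe wins on worst-case reasoning; neither way of scoring dominates the other.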

The problem, then, is that if such tradeoffs exist, people will differ in their incentives, probability estimates, and preferences, in such a way that not everyone will agree on where to set the dial. Given that safety is restrictive, we should expect people favoring capabilities over safety to build more impressive and sellable systems, until existential risks kick in. Which is bad.

Showing the Inevitability of Safety-Capabilities Tradeoffs

Steve claims that any alignment proposal must include some safety-capabilities tradeoffs. What I’m interested in here is how he argues for his point, and whether his epistemic strategy makes sense.

Unfortunately, his section on exactly that is confusing. The section is called “Why do I say that these dials are inevitable?” (what we want, right?) and starts with this sentence:

Here are a few examples.

A list of examples sounds like a particularly bad way of showing that something is impossible to avoid. Hand-picking of examples comes to mind as a big risk, and more generally the use of non-representative examples.

Yet Steve actually makes a decent argument for the inevitability of safety-capabilities tradeoffs, just far too implicitly. His examples are not examples of alignment proposals and their corresponding tradeoffs, but of places where tradeoffs might appear in any alignment proposal.

  • (Testing before deployment) More testing improves the safety guarantees and reduces our uncertainty, but costs time and money.
  • (Human feedback and/or supervision) Humans being able to understand and correct the model helps with safety, but makes the model slower and constrains it to only propose plans it can justify to humans, all of which makes it less competitive and capable.
  • (Access to resources) Constrained access to resources (internet, money, compute…) makes the model safer, but makes it less capable.
  • (Human norms and laws) Following human norms, laws and customs helps with safety but puts additional constraints on the capabilities.
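
One way I find it useful to picture such a dial is as a single parameter where safety keeps improving with diminishing returns while capability peaks and then declines. This is my own toy model of the testing dial, with invented functional forms, not something from Steve’s post:

```python
# Toy model of a single dial (days of sandbox testing); the functional
# forms here are invented for illustration only.

def safety(testing_days):
    # More testing keeps reducing our uncertainty, with diminishing returns.
    return 1 - 0.9 ** testing_days

def capability(testing_days):
    # Some testing helps the team understand the system, but past an optimum
    # every extra day is pure time-to-market cost.
    understanding_gain = min(testing_days, 10)  # stops helping after ~10 days
    delay_cost = 0.3 * testing_days             # always costs time and money
    return understanding_gain - delay_cost

for days in [0, 5, 10, 20, 40]:
    print(f"{days:>2} days: safety={safety(days):.2f}, capability={capability(days):.1f}")
```

In this toy model the dial has a capability optimum around 10 days, and every turn past it buys safety only at a capability cost, which is exactly the kind of tradeoff being pointed at.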

That at least some of these tradeoffs must emerge in every alignment proposal is the (very) implicit last step of his epistemic strategy. And it’s unfortunately not so much argued for as stated. For example, on testing:

Some amount of sandbox testing would help capabilities, by helping the team better understand how things are going. But there’s an optimal amount of sandbox testing for capabilities, and doing further testing beyond that point is a safety-capabilities tradeoff.

How can we actually argue for this instead of simply saying it? Here I go one step further than the original post (while staying coherent with Steve’s points) by proposing that we adapt how impossibility results are proved in Theoretical Computer Science. Impossibility proofs tend to focus on the potential counterexamples, and get to the gist of why they don’t actually work. This involves the sort of back-and-forth between trying to create a counterexample and showing why it doesn’t work described by the great Nancy Lynch in her A Hundred Impossibility Proofs for Distributed Computing (yes, there are a hundred results, although many come for free by the same methods).

How does one go about working on an impossibility proof? The first thing to do is to try to avoid solving the problem, by using a reducibility to reduce some other unsolvable problem to it. If this fails, next consider your intuitions about the problem. This might not help much either: in my experience, my intuitions about which way the result will go have been wrong about 50% of the time.

Then it is time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternately on one direction and the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.

An interesting interplay often arises when you work alternately on both directions. The limitations you find in designing an algorithm - e.g., the reason a particular algorithm fails - may be generalizable to give a limitation on all algorithms. [...] Conversely, the reasons that mathematical impossibility proof fails can sometimes be exploited to devise counterexample algorithms. 

Although we have no hope of proving Steve’s claims in the near future (given our inability to formalize any of the relevant terms), this approach can be leveraged by looking for what would constitute a counterexample to each of Steve’s examples.

This means we’re looking for cases where there is no tradeoff between safety and capabilities: everyone agrees on what should be done. This amounts to saying that alignment people agree that there is nothing more to be done, which means one of two things:

  • The methods proposed (testing, human understanding…) are deemed useless because they cannot catch the relevant problems (maybe the model is superhumanly deceptive, and no test/supervision/constraints will change anything). In other words, problems are hidden in a way that our techniques cannot handle, and so there is no point in asking for more safety checks.
    • Yet this hides a higher-level tradeoff: alignment people would say that we shouldn’t create and/or release the model at all in these conditions!
  • The methods proposed (testing, human understanding...) are deemed useless because even alignment people are all completely certain that they got the right scheme and that it will work.
    • That sounds wildly improbable, and even if it were possible in principle, I don’t know anyone who would argue that it is probable in the near future.

Summary

The epistemic strategy at hand here is thus the following:

  • Arguing that a class of tradeoffs cannot be avoided in alignment proposals
    • Give a list of tradeoffs from this class.
      • If possible from different parts/points in proposals.
    • Argue that some of these tradeoffs appear for every proposal.
      • Extract different types of potential counterexamples.
      • Argue why each category of counterexamples can’t exist.
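
Leaning on the design-pattern analogy one last time, the same strategy can be written down as a reusable skeleton. The names below are mine and purely illustrative; each callable stands for a piece of actual argumentation, not something computable.

```python
# Skeleton of the strategy above; every callable is a placeholder for a piece
# of actual argumentation, not something that can actually be computed.

def argue_tradeoffs_unavoidable(list_tradeoffs, extract_counterexample_types,
                                argue_type_cannot_exist):
    # Step 1: give a list of tradeoffs from the class
    # (if possible from different parts/points in proposals).
    examples = list_tradeoffs()
    # Step 2: argue that some of them appear for every proposal, by extracting
    # the types of potential counterexamples and ruling each one out.
    counterexample_types = extract_counterexample_types(examples)
    return all(argue_type_cannot_exist(t) for t in counterexample_types)
```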

Breaking the Inevitability of Safety-Capabilities Tradeoffs

Recall that epistemic strategies are design patterns, blueprints — following one helps, but doesn’t ensure that the resulting argument will be correct. And epistemic strategies highlight where the meat of the reasoning is, thus where to focus attention and criticism.

So let’s take the summary strategy and propose ways of breaking it.

Arguing that a class of tradeoffs cannot be avoided in alignment proposals

  • Give a list of tradeoffs from this class.
    • If possible from different parts/points in proposals.
      • Argue that they are too clustered in proposal space, too focused on a specific kind of proposal.
  • Argue that some of these tradeoffs appear for every proposal.
    • Extract different types of potential counterexamples.
      • Argue that these are not all the possible types, for example by providing a counterexample that doesn’t fit in any.
    • Argue why each category of counterexamples can’t exist.
      • Break one of these arguments, by showing a failure of reasoning.
      • Break one of these arguments by providing an actual counterexample from the category.