A key AI safety skill is moving back and forth, as needed, between "could we solve problem X if we assume Y?" and "can we assume Y?".
It's easier to make your way to the supermarket than it is to compute the fastest route, which is yet easier than computing the fastest route for someone running backwards and doing two and a half jumping jacks every five seconds and who only follows the route p percent of the time. Sometimes, constraints are necessary. Constraints come with costs. Sometimes, the costs are worth it.
Aspiring researchers trying to think about AI alignment might[1] have a failure mode which goes something like… this:
The aspiring researcher slumps in frustration, mutters a curse under their breath, and hangs up their hat – "guess this whole alignment thing isn't for me...". And isn't that so? All their brain could do was pattern-match onto already-proposed solutions and cached thinking.
There's more than one thing going wrong here, but I'm just going to focus on one. Given that person's understanding of AI alignment, this problem is wildly overconstrained. Whether or not alignment research is right for them, there's just no way that anyone's brain is going to fulfill this insane solution request!
Sometimes, constraints are necessary. I think that the alignment community is pretty good at finding plausibly necessary constraints. Maybe some of the above aren't necessary – maybe there's One Clever Trick you come up with which obviates one of these concerns.
Constraints come with costs. Sometimes, the costs are worth it. In this context, I think the costs are very much worth it. Under this implicit framing of the problem, you're pretty hosed if you don't get even outer alignment right.
However, even if the real problem has crazy constraints, that doesn't mean you should immediately tackle the fully constrained problem. I think you should often relax the problem first: eliminate or weaken constraints until you reach a problem which is still a little confusing, but which you can get some traction on.
Historically, I tend to be too slow to relax research problems. On the flipside, all of my favorite research ideas were directly enabled by problem relaxation. Instead of just telling you what to do and then having you forget this advice in five minutes, I'm going to paint it into your mind using two stories.
Attainable Utility Preservation
It's spring of 2018, and I've written myself into a corner. My work with CHAI for that summer was supposed to be on impact measurement, but I'd inconveniently posted a convincing-to-me argument[2] that impact measurement cannot admit a clean solution.
Summer dawned, and I occupied myself with reading – lots and lots of reading. Eventually, enough was enough – I wanted to figure this out. I strode through my school's library, markers in hand and determination in my heart. I wasn't going to leave before understanding either a) exactly why impact measurement is impossible to solve cleanly, or b) how to solve it.
I reached the whiteboard, and then – with adrenaline pumping through my veins – I realized that I had no idea what this "impact" thing even is. Oops.
I'm staring at the whiteboard.
A minute passes.
59 more minutes pass.
I'd been thinking about how, in hindsight, it was so important that Shannon had first written a perfect chess-playing algorithm which required astronomically infeasible compute, and that Hutter had written an AGI algorithm (AIXI) which required infinite compute. I didn't know how to solve impact measurement under all the constraints, but what if I assumed something here?
The answer was almost trivially obvious. My first thought was that negative impact would be a decrease in true utility, but that wasn't quite right. I realized that an impact measure also needs to capture decreases in the ability to achieve utility – and the ability to achieve utility is exactly what an optimal value function measures. So the negative impact would be the decrease in attainable utility for human values![3]
I then wrote down The Attainable Utility Preservation Equation, more or less. Although it took me a few weeks to realize and believe it, that equation solved all of the impact measurement problems which had seemed so insurmountable to me just minutes before.[4]
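To give a flavor of the shape of the idea (an illustrative sketch, not the exact equation from that whiteboard): suppose the agent has a primary reward $R$ plus a set of auxiliary reward functions $R_1, \dots, R_n$ standing in for goals whose attainability we want to preserve. An AUP-style agent is then penalized for changing how well it could achieve those auxiliary goals, relative to doing nothing:

$$
R_{\text{AUP}}(s, a) \;=\; R(s, a) \;-\; \frac{\lambda}{n} \sum_{i=1}^{n} \bigl\lvert\, Q^*_{R_i}(s, a) - Q^*_{R_i}(s, \varnothing) \,\bigr\rvert,
$$

where $\varnothing$ is a no-op action, $Q^*_{R_i}$ is the optimal action-value function for auxiliary goal $R_i$, and $\lambda$ controls how heavily changes in attainable utility are penalized.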
Formalizing Instrumental Convergence
It's spring of 2019, and I've written myself into a corner. My first post on AUP was confusing – I'd failed to truly communicate what I was trying to say. Inspired by Embedded Agency, I was planning an illustrated sequence of my own.
I was working through a bit of reasoning on how your ability to achieve one goal interacts with your ability to achieve seemingly unrelated goals. Spending a lot of money on red dice helps you for the collecting-dice goal, but makes it harder to become the best juggler in the world. That's a weird fact, but it's an important fact which underlies much of AUP's empirical success. I didn't understand why this fact was true.
At an impromptu presentation in 2018, I'd remarked that "AUP wields instrumental convergence as a weapon against the alignment problem itself". I tried thinking about it using the formalisms of reinforcement learning. Suddenly, I asked myself: why is instrumental convergence even a thing – why does it happen at all?
I paused. I went outside for a walk, and I paced. The walk lengthened, and I still didn't understand why. Maybe it was just a "brute fact", an "emergent" phenomenon – nope, not buying that. There's an explanation somewhere.
I went back to the drawing board – to the whiteboard, in fact. I stopped trying to understand the general case and I focused on specific toy environments. I'm looking at an environment like this [a small state-transition diagram; image not reproduced here], and I'm thinking: most agents go from state 1 to state 3. "Why does my brain think this?", I asked myself. Unhelpfully, my brain decided not to respond.

I'm staring at the whiteboard.
A minute passes.
29 more minutes pass.
I'm reminded of a paper my advisor had me read for my qualifying exam. The paper talked about a dual formulation for reinforcement learning environments, where you consider the available trajectories through the future instead of the available policies. I take a picture of the whiteboard and head back to my office.
I run into a friend. We start talking about work. I say, "I'm about 80% sure I have the insight I need – this is how I felt in the past in situations like this, and I turned out to be right".
I turned out to be right. I started building up an entire theory of this dual formalism. Instead of asking myself about the general case of instrumental convergence in arbitrary computable environments, I considered small deterministic Markov decision processes. I started proving everything I could, building up my understanding piece by piece. This turned out to make all the difference.
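As a concrete illustration of the kind of toy calculation this enables (a hypothetical sketch – the environment, names, and code below are illustrative, not the actual environment or formalism from the post): in a tiny deterministic MDP where one successor state leads on to more reachable futures than the other, you can sample random reward functions and check which way optimal agents tend to move.

```python
import random

# A hypothetical toy MDP (not the exact environment from the post):
# state 1 can move to state 2 (a dead end) or to state 3, which in turn
# opens up states 4 and 5. More futures are reachable "through" state 3.
transitions = {
    1: {"left": 2, "right": 3},
    2: {"stay": 2},
    3: {"up": 4, "down": 5},
    4: {"stay": 4},
    5: {"stay": 5},
}
gamma = 0.9  # discount factor


def optimal_value(state, reward, horizon=50):
    """Finite-horizon optimal value; fine for a tiny deterministic MDP.

    Convention: reward is collected upon entering a state.
    """
    if horizon == 0:
        return 0.0
    return max(
        reward[nxt] + gamma * optimal_value(nxt, reward, horizon - 1)
        for nxt in transitions[state].values()
    )


def optimal_first_step(reward):
    """Which successor of state 1 does an optimal agent move to?"""
    return max(
        transitions[1].values(),
        key=lambda s: reward[s] + gamma * optimal_value(s, reward),
    )


random.seed(0)
counts = {2: 0, 3: 0}
for _ in range(1000):
    # Sample a random reward function over states.
    reward = {s: random.random() for s in transitions}
    counts[optimal_first_step(reward)] += 1

print(counts)  # typically lopsided toward state 3, which keeps more options open
```

Under most sampled reward functions, the optimal first move heads toward the state that keeps more options open – the "most agents go from state 1 to state 3" observation in miniature.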
Half a year later, I'd built up enough theory that I was able to explain a great deal (but not everything) about instrumental convergence.
Conclusion
Problem relaxation isn't always the right tactic. For example, if the problem isn't well-posed, it won't work well – imagine trying to "relax" the "problem" of free will! However, I think it's often the right move.
The move itself is simple: consider the simplest instance of the problem which is still confusing. Then, make a ton of simplifying assumptions while still keeping part of the difficulty present – don't assume away all of the difficulty. Finally, tackle the relaxed problem.
In general, this seems like a skill that successful researchers and mathematicians learn to use. MIRI does a lot of this, for example. If you're new to the research game, this might be one of the crucial things to pick up on. Even though I detailed how this has worked for me, I think I could benefit from relaxing more.
The world is going to hell. You might be working on a hard (or even an impossible) problem. We plausibly stand on the precipice of extinction and utter annihilation.
Just relax.
This is meant as a reference post. I'm not the first to talk about using problem relaxation in this way. For example, see The methodology of unbounded analysis.
This failure mode is just my best guess – I haven't actually surveyed aspiring researchers. ↩︎
The "convincing-to-me argument" contains a lot of confused reasoning about impact measurement, of course. For one, thinking about side effects is not a good way of conceptualizing the impact measurement problem. ↩︎
The initial thought wasn't as clear as "penalize decrease in attainable utility for human values" – I was initially quite confused by the AUP equation. "What the heck is this equation, and how do I break it?".
It took me a few weeks to get a handle on why it seemed to work so well. It wasn't until a month or two later that I began to understand what was actually going on, eventually leading to the Reframing Impact sequence. However, for the reader's convenience, I whitewashed my reasoning here a bit. ↩︎
At first, I wasn't very excited about AUP – I was new to alignment, and it took a lot of evidence to overcome the prior improbability of my having actually found something to be excited about. It took several weeks before I stopped thinking that my idea was probably secretly and horribly bad.
However, I kept staring at the strange equation – I kept trying to break it, to find some obvious loophole which would send me back to the drawing board. I never found it. Looking back over a year later, AUP does presently have loopholes, but they're not obvious, nor should they have sent me back to the drawing board.
I started to get excited about the idea. Two weeks later, my workday was wrapping up and I left the library.
I turned heel, descending back into the library's basement. I took the photograph. I'm glad that I did.
Discovering AUP was one of the happiest moments of my life. It gave me confidence that I could think, and it gave me some confidence that we can win – that we can solve alignment. ↩︎