Atonement for my sins towards deconfusion
I have argued that the deconfusion of goal-directedness should not follow what I dubbed “the backward approach”, that is, starting from the applications of the concept and reverse-engineering its coherent existence (or contradiction) from there. I have also argued that deconfusion should always start from and center on applications.
In summary, I was wrong about the former.
If deconfusion indeed starts at the applications, what about my arguments against the backward approach to goal-directedness?
My best answer is this post: an exploration of the applications of deconfusing goal-directedness, and of how they actually inform and constrain the deconfusion itself. The gist is that in discarding an approach that differed from what felt natural to me, I failed to notice all the ways in which applications do constrain and direct deconfusion. In this specific case, the most fruitful and important applications I’ve found are convergent subgoals, replacing optimal policies, grounding inner alignment, and separating approval-directed systems from pure maximizers.
Thanks to John S. Wentworth for pushing hard on the importance of starting at the applications.
Applications
Convergent subgoals
Convergent subgoals (self-preservation, resource acquisition, …) are often important ingredients in scenarios that start with a misspecified objective and end in catastrophic consequences. Without them, even an AGI would let itself be shut down, greatly reducing the related risks. Convergent subgoals are also clearly linked with goal-directedness, since the original argument proposes that most goals lead to them.
As an application for deconfusion, what does this entail? Goal-directed systems should be the whole class of systems that could have convergent subgoals. That doesn’t mean most goal-directed systems will actually have such subgoals, but a system with low goal-directedness shouldn’t have them at all. Hence high goal-directedness should be a necessary (though not necessarily sufficient) condition for having convergent subgoals; the sketch below states this compactly.
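Written out in placeholder notation of my own ($\mathrm{GD}(\pi)$ stands for whatever scalar measure of the goal-directedness of a policy $\pi$ the deconfusion eventually produces, and $\theta$ for some threshold):

$$\text{HasConvergentSubgoals}(\pi) \;\Longrightarrow\; \mathrm{GD}(\pi) \geq \theta$$

The implication only runs one way: high goal-directedness is required for convergent subgoals, but doesn’t by itself guarantee them.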
This constraint then points to concrete approaches for deconfusing goal-directedness that I’m currently pursuing.
Replacing optimal policies
When we talk about AGI having goals, we tend to use optimal policies as a stand-in. These policies do have a lot of nice properties: they are maximally constrained by the goal, they allow some reverse-engineering of goals without worrying about error models, and they make long-term behavior easy to predict, since it is simply optimal.
Yet as Richard points out in this comment, truly optimal policies for real-world tasks are probably incredibly complex and intractable. It’s fair to say that for any task we cannot simply solve by enumeration, we probably haven’t built an optimal policy. For example, AlphaGo and its successors are very good at Go, but they are not, strictly speaking, optimal.
The above point wouldn’t matter if optimal policies behaved pretty much like merely competent ones. But that’s not the case in general: the truly optimal strategy is usually something incredibly subtle, relying on tricks and details that we have no way of finding except through exhaustive search. Notably, the reason we expect quantilizers to be less catastrophic than pure maximizers is precisely this gap between optimal and merely competent behavior, as the toy sketch below illustrates.
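To make that gap concrete, here is a minimal toy sketch (entirely my construction: the utility function, the numbers, and the sampling scheme are illustrative assumptions, not anything from Richard’s comment or from the quantilizer proposal itself). Exhaustive maximization latches onto a brittle edge-case exploit, while a quantilizer in the style of Taylor’s proposal returns a typical, merely competent action:

```python
import random

# Toy setup (illustrative assumptions): the truly optimal action is a brittle
# edge-case exploit that only exhaustive search finds, while merely competent
# actions cluster around a sensible default.
def utility(action: float) -> float:
    if abs(action - 0.9999) < 1e-4:          # hidden edge-case exploit
        return 100.0
    return 10.0 - 10.0 * abs(action - 0.5)   # ordinary competent behavior

# A fine uniform grid over [0, 1], standing in for a uniform base
# distribution of "normal" actions.
actions = [i / 100_000 for i in range(100_001)]

def quantilize(candidates, q=0.1):
    """q-quantilizer: pick uniformly at random among the top q-fraction
    of the base distribution, ranked by utility."""
    ranked = sorted(candidates, key=utility, reverse=True)
    return random.choice(ranked[: max(1, int(q * len(ranked)))])

optimal = max(actions, key=utility)   # pure maximizer: exhaustive search
competent = quantilize(actions)       # quantilizer: a typical good action

print(f"maximizer:   {optimal:.4f} (utility {utility(optimal):.1f})")
print(f"quantilizer: {competent:.4f} (utility {utility(competent):.1f})")
# Typical run: the maximizer lands on ~0.9999, the quantilizer near 0.5.
```

The design point is that the quantilizer stays anchored to the base distribution: it can still pick the exploit, but only with probability at most the exploit’s base rate divided by q, whereas the maximizer picks it every time.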
Because of this, focusing on optimal policies when thinking about goal-directedness risks misleading us in dangerous ways.
What I take from this analysis is that we want to replace the optimality assumption with goal-directedness plus some competence assumption.
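One rough way to write this down (same placeholder notation as above, with $V^{\pi}$ the value attained by policy $\pi$ and $\pi^*$ the optimal policy): instead of analyzing $\pi^*$ alone, we analyze the whole set

$$\Pi_{g,\epsilon} = \left\{ \pi \;:\; \mathrm{GD}(\pi) \geq g \ \text{ and } \ V^{\pi} \geq (1-\epsilon)\, V^{\pi^*} \right\}$$

of policies that are strongly goal-directed and capture at least a $(1-\epsilon)$ fraction of the optimal value. Systems like AlphaGo plausibly live in such a set while being nowhere near $\pi^*$.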
Here we don’t really have a necessary condition, so unraveling what the constraint entails is slightly more involved. But we can still look at the problems with optimality and turn them into requirements for goal-directedness.
Grounding inner alignment
Risks from Learned Optimization introduced the concept of mesa-optimizers, or inner optimizers, to point to results of a search process that might themselves be doing internal search/optimization. The concept has been consistently confusing, and people regularly complain about it. Abram has a recent post that looks at different ways of formalizing it.
Beyond that confusion, I believe that focusing on inner optimizers as currently defined underestimates the range of models that would be problematic to build, because I expect some goal-directed systems to neither run an internal search nor represent an explicit goal. I also expect goal-directedness to be easier to define than inner optimization, even if that expectation probably comes from my own bias. The toy sketch below illustrates the first point.
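Here is a minimal toy illustration of that first point (my construction, not anything from Risks): two policies with identical goal-directed behavior, exactly one of which runs an internal search. A definition centered on inner optimization catches the first but misses the second; a behavioral notion of goal-directedness catches both.

```python
# Toy 1-D gridworld (illustrative assumptions throughout).
STATES = range(11)
ACTIONS = (-1, +1)
GOAL = 7

def step(s: int, a: int) -> int:
    return min(max(s + a, 0), 10)

def goal_score(s: int) -> int:
    return -abs(s - GOAL)

def searching_policy(s: int) -> int:
    """Explicit internal optimization: searches over actions at runtime."""
    return max(ACTIONS, key=lambda a: goal_score(step(s, a)))

# Distill the search into a plain table, as training might do by imitation:
LOOKUP = {s: searching_policy(s) for s in STATES}

def lookup_policy(s: int) -> int:
    """No runtime search, no goal represented anywhere, same behavior."""
    return LOOKUP[s]

assert all(searching_policy(s) == lookup_policy(s) for s in STATES)
```

Distilling a search into cheap reflexes like this is plausibly a common training outcome, which is why such search-free but goal-directed systems seem like more than a curiosity.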
Rephrasing the application: goal-directedness should be a sufficient condition for the arguments in Risks to apply. The implications for deconfusion follow quite directly.
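In the same placeholder notation as before, this is the mirror image of the convergent-subgoals constraint:

$$\mathrm{GD}(\pi) \geq \theta \;\Longrightarrow\; \text{the arguments of Risks apply to } \pi$$

Goal-directedness appears here as a sufficient condition, where for convergent subgoals it was only a necessary one.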
Approval-directed systems as less goal-directed
This application is definitely more niche than the others, but it seems quite important to me. Both Paul, in his post on approval-directed agents, and Rohin, in this comment on one of his posts on goal-directedness, have proposed that approval-directed systems are inherently less goal-directed than pure maximizers.
Why? Because approval-directed systems would have a more flexible goal, and also wouldn’t have the same convergent subgoals that we expect from competent goal-directed systems.
I share this intuition, but I haven’t been able to find a way to articulate it convincingly. Hence this constraint: approval-directed systems should have low goal-directedness (or at least lower than pure maximizers). The toy sketch below captures the structural difference that drives the intuition.
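Here is a minimal sketch of that structural difference (entirely my toy construction; the approval and return functions are made-up stand-ins, not Paul’s actual proposal): the maximizer optimizes a long-horizon objective, which is where the pressure toward convergent subgoals comes from, while the approval-directed agent myopically takes the single action its overseer rates best.

```python
# Toy decision problem (all functions are made-up stand-ins).
ACTIONS = range(-10, 11)

def approval(state: int, action: int) -> float:
    """Stand-in for an overseer's rating of this single action."""
    return -abs(action)                      # the overseer favors conservative moves

def long_run_return(state: int, action: int, horizon: int = 100) -> float:
    """Stand-in for the total return of a long rollout."""
    return horizon * action - action ** 2    # extreme actions pay off long-term

# Pure maximizer: optimizes a long-horizon objective. The long horizon is
# where the pressure toward convergent subgoals comes from.
maximizer_choice = max(ACTIONS, key=lambda a: long_run_return(0, a))

# Approval-directed agent: myopically takes the single action the overseer
# rates best; no long-horizon objective is ever optimized.
approval_choice = max(ACTIONS, key=lambda a: approval(0, a))

print(maximizer_choice)   # 10: pushes as far as the action space allows
print(approval_choice)    # 0: the conservative action the overseer endorses
```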
Since the constraint itself is quite obvious, the interesting work lies in the approaches to goal-directedness it suggests.
Conclusion: all that is left is work
In refusing to focus on the applications, I slowed myself down.
I attempted to correct that mistake in this post, by looking at the most important applications for deconfusing goal-directedness (convergent subgoals, replacing optimal policies, inner optimization, and approval-directedness), and extracting from them constraints and questions to investigate.
My work on the topic is thus cut out for me; if you find yourself interested or excited by any of the research ideas proposed in this post, send me a message so we can talk!