Looks interesting and ambitious! But I am sensing some methodological obstacles here, which I would like to point out and explore. You write:

Within those projects, I'm aiming to work on subprojects that are:

Posed in terms that are familiar to conventional ML;

interesting to solve from the conventional ML perspective;

and whose solutions can be extended to the big issues in AI safety.

Now, take the example of the capture the cube game from the Deepmind blog post. This is a game where player 1 tries to move a cube into the white zone. and player 2 tries to move it into the blue zone on the other end of the board. If the agents learn to betray each other here, how would you fix this?

We'll experiment with ways of motivating the agents to avoid betrayals, or getting anywhere near to them, and see if these ideas scale.

There are three approaches to motivating agents to avoid betrayals in capture the cube that I can see:

change the physical reality of the game: change the physics of the game world or the initial state of the game world
change the reward functions of the players
change the ML algorithms inside the players, so that they are no longer capable of finding the optimal betrayal-based strategy.

Your agenda says that you want to find solutions that are interesting from the conventional ML perspective. However, in the conventional ML perspective:

tweaking the physics of the toy environment to improve agent behavior is out of scope. It is close to cheating on the benchmark.
any consideration of reward function design is out of scope. Tweaking it to improve learned behavior is again close to cheating.
introducing damage into your ML algorithms so that they will no longer find the optimal policy is just plain weird, out of scope, and close to cheating.

So I'd argue that you have nowhere to move if you want to solve this problem while also pleasing conventional ML researchers. Conventional ML researchers will always respond by saying that your solution is trivial, problem-specific, and therefore uninteresting.

OK, maybe I am painting too much of a hard-core bitter lesson picture of conventional ML research here. I could make the above observations disappear by using a notion of conventional ML research that is more liberal in what it will treat as in-scope, instead of as cheating.

What I would personally find exiting would be a methodological approach where you experiment with 1) and 2) above, and ignore 3).

In the capture the cube game, you might experiment with reward functions that give more points for a fast capture followed by a fast move to a winning zone, which ends the game, and less for a slow one. If you also make this an iterated game (it may already be a de-facto iterated game depending on the ML setup), I would expect that you can produce robust collaborative behavior with this time-based reward function. The agents may learn to do the equivalent of flipping a coin at the start to decide who will win this time: they will implicitly evolve a social contract about sharing scarce resources.

You might also investigate a game scoring variant with different time discount factors, factors which more heavily or more lightly penalize wins which take longer to achieve. I would expect that with higher penalties for taking a longer time to win, collaborative behavior under differences between player intelligence and ability will remain more robust, because even a weaker player can always slow down a stronger player a bit if they want to. This penalty approach might then generalize to other types of games.

The kind of thing I have in mind above could also be explored in much more simple toy worlds than those offered by XLand. I have been thinking of a game where we drop two players on a barren planet, where one has the reward function to maximize paperclips, and one to maximize staples. If the number of paperclips and staples is discounted, e,g, the time-based reward functions are and $s t a p l e s^{0.1}$ , this might produce more collaborative/sharing behavior, and suppress a risky fight to capture total dominance over resources.

Potentially, some branch of game theory has already produced a whole body of knowledge that examines this type of approach to turning competitive games into collaborative games, and has come up with useful general results and design principles. Do not know. I sometimes wonder about embarking on a broad game theory literature search to find out. The methodological danger of using XLand to examine these game theoretical questions is that by spending months working in the lab, you will save hours in the library.

These general methodological issues have been on my mind recently. I have been wondering if AI alignment/safety researchers should spend less time with ML researchers and their worldview, and more time with game theory people.

I would be interested in your thoughts on these methodological issues, specifically your thoughts about how you will handle them in this particular subproject. One option I did not discuss above is transfer learning which primes the agents on collaborative games only, to then explore their behavior on competitive games.