+1 for interesting investigations. I want to push back on your second point, though - the framing of the problem as one of high-level distributional shift. I don't think this actually captures the core thing we're worried about. For example, we can imagine a model that remains in the same environment but becomes increasingly intelligent during training, until it realises that it has the option of performing a treacherous turn. Or consider the case of humans - the core skills and goals that make us dangerous to other species developed in our ancestral environment, and then led us to change our own environments. So the distributional shift was downstream of the underlying problem.
Also, in the real world, everything undergoes distributional shift all the time, so the concept doesn't narrow things down.
I've been clarifying my own understanding of the alignment problem over the past few months, and wanted to share my first writeups with folks here in case they're useful:
https://www.danieldewey.net/risk/
The site currently has 3 pages:
None of the ideas on the site are particularly new, and as I note there, they're not consensus views. But the version of the basic case I lay out is very short, doesn't have many outside dependencies, and is built from nuts-and-bolts arguments that I think will be useful as a starting point for alignment work. I'm particularly hoping to avoid semantic arguments about "what counts as" inner vs. outer alignment, optimization, agency, etc., in favor of more mechanical statements of how models could behave in different situations.
I think some readers on this forum will already have been thinking about alignment this way, and won't get much that's new out of the site; some (like me) will find it to be a helpful distillation of some of the major arguments that have come out over the past ~5 years; and some will have disagreements (which I'm curious to hear about).
I thought about posting all of this directly on the Alignment Forum / LessWrong, but ultimately decided I wanted a dedicated home for these ideas.
Out of everything on the site, the part I'm most hoping will be helpful to you is my (re)statement of two main problems in AI alignment. These map roughly onto outer and inner alignment, though different people use those terms differently, so not everyone will agree:
What's next? Ultimately, I'm hoping to figure out what kinds of research projects are most likely to produce progress towards training methods that avoid evaluation breakdown and high-level distribution shift. A world where we're making clear year-over-year progress towards those goals looks achievable to me.