Zbetna Fvapynver [rot13]
Whoops, mea culpa on that one! Deleted and changed to:
the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
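To make the "trivially modeled as a utility maximizer" point concrete, here's a minimal sketch of the usual construction: given any policy at all, define a utility function that pays 1 for whatever the policy does and 0 otherwise. All names here (`utility_from_policy`, `maximizer`, `twitcher`) are hypothetical and just for illustration, not from the linked posts.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


def utility_from_policy(
    policy: Callable[[State], Action],
) -> Callable[[State, Action], float]:
    """Build a utility function that `policy` maximizes by construction."""
    def utility(state: State, action: Action) -> float:
        # Reward exactly the action the policy would have taken anyway.
        return 1.0 if action == policy(state) else 0.0
    return utility


def maximizer(
    utility: Callable[[State, Action], float],
    actions: Sequence[Action],
) -> Callable[[State], Action]:
    """Return an agent that picks the utility-maximizing action in each state."""
    def act(state: State) -> Action:
        return max(actions, key=lambda a: utility(state, a))
    return act


# A hypothetical "agent" with no obvious goal: it twitches left on even
# steps and right on odd steps.
def twitcher(step: int) -> str:
    return "left" if step % 2 == 0 else "right"


ACTIONS = ["left", "right", "stay"]
rationalized = maximizer(utility_from_policy(twitcher), ACTIONS)
```

The "maximizer" reproduces the twitcher's behavior exactly, which is why the utility-maximizer framing alone doesn't tell you much about whether a system is goal-directed.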
[copying from my comment on the EA Forum x-post]
For reference, some other lists of AI safety problems that can be tackled by non-AI people:
Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"
AI Impacts has made several lists of research problems
Wei Dai's "Problems in AI Alignment that philosophers could potentially contribute to"
Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psychological research)
*begins drafting longer proposal*
Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there are potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it.)
In any case, I appreciate the feedback, Mr. Entworth.
(8)
In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presumably) had to be developed to make the finished product. I get that this will be hard, but I think this can be feasibly done for some of the (mostly easier) concepts, and if done really well, it could even be a better way for people to learn those concepts than actually reading about them.
(7)
A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.
(6)
An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?
(5)
A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]
(4)
A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it:
1) “adversarial” seems too broad to be that useful as a category
2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad
3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)
But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.
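For what it's worth, here's a small simulation of what I take "regressional" to mean statistically: select hard on a proxy that equals the true value plus independent noise, and the selected items' true values regress back toward the mean. This is a sketch under my own assumptions (Gaussian values and noise, top-k selection, the hypothetical function name `regressional_goodhart`), not something taken from the taxonomy post itself.

```python
import random
from statistics import mean


def regressional_goodhart(n=10_000, top_k=100, noise_sd=1.0, seed=0):
    """Select the top_k items by proxy and compare proxy vs. true value.

    Assumed model: proxy = true value + independent Gaussian noise.
    Returns (mean proxy of selected, mean true value of selected).
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    proxies = [v + rng.gauss(0.0, noise_sd) for v in true_values]

    # Indices of the top_k items when ranked by the proxy, highest first.
    selected = sorted(range(n), key=lambda i: proxies[i], reverse=True)[:top_k]

    return (
        mean(proxies[i] for i in selected),
        mean(true_values[i] for i in selected),
    )


mean_proxy, mean_true = regressional_goodhart()
```

The selected items still beat the population average on the true value, but by noticeably less than their proxy scores suggest. Note there's no agent or incentive anywhere in this setup, which is part of why it doesn't feel to me like what the original law was about.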
(3)
“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.
Nitpick: "transfer learning" is the standard term, no? It has a Wiki page and seems to get a more coherent batch of search results than googling "robustness to data shift".