A little late to the party, but I'm confused about the minimax strategy.
The first thing I was confused about was what sorts of rules could constrain Murphy based on my actions. For example, in a bit-string environment, the rule "every other bit is a 0" constrains Murphy (he can't reply with "111..."), but not based on my actions: no matter which bits I flip, Murphy can always reply with the maximally bad environment that has 0s in every other bit. Another example: if the only rule is "the environment must be a valid chess board," then whatever move I make, Murphy can return the environment whose rule is "if you make that move, the next board state has you in checkmate." After all, me being in checkmate is a valid chess board, so it meets the only rule I know, and I can't know what other rules Murphy plays by. You can't really run minimax on that, because all of Murphy's moves look like "set the state to the worst allowable state."
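To make that concrete, here's a toy sketch (the environment, utility, and names like ALLOWED_STATES are my own made-up illustration, not anything from the post): when the known rule doesn't depend on my action, every action gets the same minimax value, because Murphy always answers with the same worst allowable state.

```python
# Toy model (purely illustrative): a 3-bit environment where the only rule I
# know is "every other bit is a 0", i.e. bits 0 and 2 are fixed at 0.
ALLOWED_STATES = [(0, b, 0) for b in (0, 1)]

def utility(state):
    # Toy preference: I just want as many 1s as possible.
    return sum(state)

def minimax_value(action):
    # The rule doesn't mention my action, so Murphy ignores it and simply
    # picks the worst allowable state.
    return min(utility(s) for s in ALLOWED_STATES)

for action in ("flip bit 0", "flip bit 1", "do nothing"):
    print(action, minimax_value(action))  # every action scores the same: 0
```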
So, what kind of rules actually constrain Murphy based on my actions? My first take was "rules involving time." For instance, with the rule "only one bit can be flipped per timestep," you can constrain Murphy: if you flip a bit, then for the next timestep you've eliminated some possibilities (reaching them would require flipping that bit back and doing something else), so you can run a meaningful minimax over which action to take.
This didn't feel like the whole story, though, so I talked it over with a friend, and we eventually generalized it to "rules that consume resources." For example, with the rule "for every bit you flip, you must also flip one of the first 4 bits from a 1 to a 0," we can constrain Murphy: if I flip any bit, that leaves one fewer bit for Murphy to use to mess with me.
But then the minimax strategy started looking worrying to me. If the only rules you can use to constrain Murphy are ones that consume resources, then wouldn't a minimax strategy have some positive preference for destroying resources just to keep Murphy from using them? It seems like a good way to limit how much damage Murphy's best response can do.
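Here's the same kind of toy sketch for that worry (the rule is a simplification of mine: every flip, whether mine or Murphy's, costs one bit from a shared resource pool, and the function names are hypothetical): my action now genuinely changes Murphy's options, and the minimax winner is to burn a resource on an otherwise pointless flip.

```python
# Toy model again (rule simplified by me): every flip, mine or Murphy's,
# must be paid for by spending one bit from a shared resource pool.

def utility(payload):
    # I want 1s in the payload bits; Murphy wants 0s.
    return sum(payload)

def murphy_best_response(payload, resources_left):
    # Murphy flips as many 1s to 0s as his remaining resources let him pay for.
    bits = list(payload)
    for i, b in enumerate(bits):
        if resources_left == 0:
            break
        if b == 1:
            bits[i] = 0
            resources_left -= 1
    return utility(bits)

START = (1, 1, 1)  # the payload bits I actually care about
RESOURCES = 2      # resource bits left in the shared pool

# Option 1: do nothing, leaving Murphy the whole pool.
print("do nothing:", murphy_best_response(START, RESOURCES))           # -> 1
# Option 2: make a pointless flip elsewhere, purely to burn a resource.
print("burn a resource:", murphy_best_response(START, RESOURCES - 1))  # -> 2
```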
I'm glad to hear that the question of what hypotheses produce actionable behavior is on people's minds.
I modeled Murphy as an actual agent, because I figured a hypothesis like "a cloaked superintelligence is operating in the area and will react to your decision to do X by doing Y" is always on the table, and it's basically a template for letting Murphy perform an arbitrary action Y.
I feel like I didn't quite grasp what you meant by "a constraint on Murphy is picked according to this probability distribution/prior, then Murphy chooses from the available options of the hypothesis they picked."
But based on your explanation afterward, it sounds like you can essentially ignore hypotheses that don't constrain Murphy, because they act as a uniform expected-utility drop on all states: you just end up comparing -1,000,000 and -999,999 instead of 0 and 1. For example, there's a whole host of hypotheses of the form "a cloaked superintelligence converts all local usable energy into a hellscape if you do X," and since there's one of those for every X, no particular action X is graded lower than the others by their existence.
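A quick sanity check of that point, with made-up numbers of my own: subtracting the same huge penalty from every outcome doesn't change which action comes out on top.

```python
# Made-up numbers: worst-case values of two actions, ignoring the
# "cloaked superintelligence hell-ifies everything regardless" hypothesis...
values = {"action_A": 0, "action_B": 1}
PENALTY = 1_000_000  # ...and the flat drop that hypothesis applies to every outcome.

shifted = {a: v - PENALTY for a, v in values.items()}

# The hypothesis doesn't discriminate between actions, so the choice is unchanged.
assert max(values, key=values.get) == max(shifted, key=shifted.get)
print(shifted)  # {'action_A': -1000000, 'action_B': -999999}; action_B still wins
```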
That example is what got me thinking in the first place, though. Such hypotheses don't actually lower everything equally, because, given the other laws of physics, the superintelligence would need energy to hell-ify things. So arbitrarily consuming energy would reduce how bad the outcomes could be if a perfectly misaligned superintelligence were operating in the area. And since I'm positing a perfectly misaligned superintelligence, we should both expect it to exist in the environment Murphy chooses (what could be worse?) and expect any reduction in its available actions to be as positive a change as a perfectly aligned superintelligence's actions could be, since preventing a maximally detrimental action should match, in terms of utility, enabling a maximally beneficial one. Therefore, entropy-bombs.
Thinking about it more, assuming I'm not still making a mistake, this might just be a broader problem, not specific to this in any way. Aren't I basically positing Pascal's Mugging?
Anyway, thank you for replying. It helped.