All of ViktoriaMalyasova's Comments + Replies

Thank you for explaining this! But then how can this framework be used to model humans as agents?  People can easily imagine outcomes worse than death or destruction of the universe.

2Vanessa Kosoy
The short answer is, I don't know. The long answer is, here are some possibilities, roughly ordered from "boring" to "weird": 1. The framework is wrong. 2. The framework is incomplete, there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not. 3. Humans are just not physicalist agents, you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans took so much time coming up with science. 4. Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us. 5. Because the universe is effectively finite (since it's asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact the universe is finite is unsurprising since infinite universes tend to have all possible computations running which makes them roughly irrelevant hypotheses for a physicalist. 6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a "dead copy" of you, so that if something worst-than-death happens to the apparent-you, your subjective experiences go into the "dead copy".

Then,  is considered to be a precursor of  in universe  when there is some -policy  s.t. applying the counterfactual " follows " to  (in the usual infra-Bayesian sense) causes  not to exist (i.e. its source code doesn't run).

A possible complication is, what if  implies that  creates  / doesn't interfere with the creation of ? In this case  might conceptually be a precursor, but the definition would not detect it.

Can you plea... (read more)

2Vanessa Kosoy
The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn't create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining "soft counterfactuals" where you consider coarsenings of Θ in addition to Θ itself.

So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.

In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don't think I follow. So, suppose that R(A_blue),R(B_blue),R(C_blue)>max(R(A_red),R(B_red),R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into it. Then we need to add states A_black and C_black, and arro... (read more)

2Alex Turner
Although I didn't make this explicit, one problem is that manipulation is still weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post. Note that AUP doesn't have this problem.