Bunthut - AI Alignment Forum

Superrational Agents Kelly Bet Influence!

Maybe I'm missing something, but it seems to me that all of this is straightforwardly justified through simple selfish Pareto improvements.

Take a look at Critch's cake-splitting example in section 3.5. Now imagine varying the utility of splitting. How high does it need to get before [red->Alice;green->Bob] is no longer a Pareto improvement over [(split)] from both players' selfish perspectives before the observation? It's 27, and that's also exactly where the decision flips when weighing Alice 0.9 and Bob 0.1 in red, and Alice 0.1 and Bob 0.9 in green.
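The threshold can be checked with a few lines of arithmetic. The sketch below is only illustrative: the whole-cake utility of 30 is an assumption chosen to be consistent with the threshold of 27 quoted above, and the function names are hypothetical.

```python
# Assumed numbers: a whole cake is worth 30 to whoever gets it, 0 to the other.
WHOLE_CAKE = 30.0
P_OWN_COLOUR = 0.9  # each player's credence in the colour they bet on


def bet_beats_split(split_utility):
    """True if [red->Alice; green->Bob] beats splitting for BOTH players ex ante."""
    alice_expected = P_OWN_COLOUR * WHOLE_CAKE  # Alice gets the cake iff red
    bob_expected = P_OWN_COLOUR * WHOLE_CAKE    # Bob gets the cake iff green
    return alice_expected > split_utility and bob_expected > split_utility


def choice_after_red(split_utility, w_alice=0.9, w_bob=0.1):
    """Maximise the weighted sum of utilities once the cake is seen to be red."""
    give_alice = w_alice * WHOLE_CAKE            # Bob gets nothing in this branch
    split = (w_alice + w_bob) * split_utility    # both get the split utility
    return "Alice" if give_alice > split else "split"


for s in (26.0, 28.0):
    print(s, bet_beats_split(s), choice_after_red(s))
```

Under these assumed numbers, both the ex-ante Pareto comparison and the weighted post-observation choice switch between a split utility of 26 and one of 28.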

Intuitively, I would say that the reason you don't bet influence all-or-nothing, or with some other strategy, is precisely because influence is not money. Influence can already be all-or-nothing all by itself, if one player never cares that much more than the other. The influence the "losing" bettor retains in the world where he lost is not some kind of direct benefit to him, the way money would be: it functions instead as a reminder of how bad a treatment he was willing to risk in the unlikely world, and that is of course proportional to how unlikely he thought it was.

So I think all this complicated strategizing you envision in influence betting actually just comes out exactly to Critch's results. It's true that there are many situations where this leads to influence bets that don't matter to the outcome, but they also don't hurt. The theorem only says that actions must be describable as following a certain policy; it doesn't exclude that they can be described by other policies as well.

FixDT

Bunthut7mo30

On reflection, I didn't quite understand this exploration business, but I think I can save a lot of it.

>You can do exploration, but the problem is that (unless you explore into non-fixed-point regions, violating epistemic constraints) your exploration can never confirm the existence of a fixed point which you didn't previously believe in.

I think the key here is in the word "confirm". It's true that unless you believe p is a fixed point, you can't just try out p and see the result. However, you can change your beliefs about p based on your results from exploring things other than p. (This is why I call the thing I'm objecting to Humean trolling.) And there is good reason to think that the available fixed points are usually pretty dense in the space. For example, outside of the rule that binarizes our actions, there should usually be at least one fixed point for every possible action. Plus, as you explore, your beliefs change, creating new believed-fixed-points for you to explore.

>I think your idea for how to find repulsive fixed-points could work if there's a trader who can guess the location of the repulsive point exactly rather than approximately

I don't think that's needed. If my net beliefs have a closed surface in probability space on which they push outward, then necessarily those beliefs have a repulsive fixed point somewhere inside that surface. I can then explore that believed fixed point. Then if it's not a true fixed point, and I still believe in the closed surface, there's a new fixed point inside that surface that I can again explore (generally more in the direction I just got pushed away from). This should converge on a true fixed point. The only thing that can go wrong is that I stop believing in the closed surface, and it seems like I should leave open that possibility. Even then, I might believe in it again after I do some checking along the outside.

>However, the wealth of that trader will act like a martingale; there's no reliable profit to be made (even on average) by enforcing this fixed point.

This I don't understand at all. If you're in a certain fixed point, shouldn't the traders that believe in it profit from the ones that don't?

FixDT

Bunthut8mo3-2

I don't think the learnability issues are really a problem. I mean, if doing a handstand with a burning 100 riyal bill between your toes under the full moon is an exception to all physical laws and actually creates utopia immediately, I'll never find out either. Assuming you agree that that's not a problem, why is the scenario you illustrate one? In both cases, it's not like you can't find out, you just don't, because you stick to what you believe is the optimal action.

I don't think this would be a significant problem in practice any more than other kinds of Humean trolling are. It always seems much more scary in these extremely barebones toy problems, where the connection between the causes and effects we create really is kind of arbitrary. I especially don't think it will be possible to learn the counterfactuals of FDTish cooperation and such in these small settings, no matter the method.

Plus you can still do value-of-information exploration. The repulsive fixed points are not that hard to find if you're looking for them. If you've encircled one and found repulsion all around the edge, you know there must be one in there, and can get there with a procedure that just reverses your usual steps. Combining this with simplicity priors over a larger setting into which the problem is integrated, I don't think it's any more worrying than the handstand thing.
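A one-dimensional toy version of "encircle it, then home in" might look like the sketch below. Everything in it is an assumption made for illustration: the belief-to-outcome map `toy_map`, its fixed point at 0.3, and the use of plain bisection as the "reversed" procedure. It also assumes the map is continuous.

```python
def toy_map(p):
    """Hypothetical map from believed probability to realised probability.

    It has a repulsive fixed point at 0.3: points nearby get pushed away.
    """
    return min(1.0, max(0.0, p + 0.5 * (p - 0.3)))


def iterate(f, p, steps=50):
    """Ordinary updating: repeatedly move to where the map sends you."""
    for _ in range(steps):
        p = f(p)
    return p


def find_repulsive_fixed_point(f, lo, hi, tol=1e-9):
    """Bisection between two points that are both pushed outward."""
    if not (f(lo) < lo and f(hi) > hi):
        raise ValueError("the endpoints must both be repelled outward")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < mid:   # pushed down, so the fixed point lies above mid
            lo = mid
        else:              # pushed up (or exactly fixed), so it lies at or below mid
            hi = mid
    return (lo + hi) / 2


print(iterate(toy_map, 0.31))                         # drifts away from 0.3
print(find_repulsive_fixed_point(toy_map, 0.1, 0.9))  # homes in on 0.3
```

Ordinary iteration started near the point runs away from it, while the bracketing search only needs the outward push at the two endpoints to locate it.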

People care about each other even though they have imperfect motivational pointers?

Bunthut2y2-2

>This prediction seems flatly wrong: I wouldn’t bring about an outcome like that. Why do I believe that? Because I have reasonably high-fidelity access to my own policy, via imagining myself in the relevant situations.

This seems like you're confusing two things here, because the thing you would want is not knowable by introspection. What I think you're introspecting is that if you'd noticed that the-thing-you-pursued-so-far was different from what your brother actually wants, you'd do what he actually wants. But the-thing-you-pursued-so-far doesn't play the role of "your utility function" in the Goodhart argument. All of you plays into that. If the goodharting were to play out, your detector for differences between the-thing-you-pursued-so-far and what-your-brother-actually-wants would simply fail to warn you that it was happening, because it too can only use a proxy measure for the real thing.

My Current Take on Counterfactuals

Bunthut4y10

>Why is the price of the un-actualized bet constant? My argument in the OP was to suppose that PCH is the dominant hypothesis, so, mostly controls market prices.

Thinking about this in detail, it seems like the influence traders have on the market price depends on a lot more of their inner workings than just their beliefs. I was thinking in a way where each trader only had one price for the bet, below which they bought and above which they sold, no matter how many units they traded (this might contradict "continuous trading strategies" because of finite wealth), in which case there would be a range of prices that could be the "market" price, and it could stay constant even with one end of that range shifting. But there could also be an outcome like yours, if the agents demand better and better prices to trade one more unit of the bet.

My Current Take on Counterfactuals

Bunthut4y00

>So I don't see how we can be sure that PCH loses out overall. LCH has to exploit PCH -- but if LCH tries it, then we're seemingly in a situation where LCH has to sell for PCH's prices, in which case it suffers the loss I described in the OP.

So I've reread the logical induction paper for this, and I'm not sure I understand exploitation. Under 3.5, it says:

>On each day, the reasoner receives 50¢ from T, but after day t, the reasoner must pay $1 every day thereafter.

So this sounds like before day t, T buys a share every day, and those shares never pay out - otherwise T would receive $t on day t in addition to everything mentioned here. Why?

In the version that I have in my head, there's a market with PCH and LCH in it that assigns a constant price to the unactualised bet, so neither of them gains or loses anything with their trades on it, and LCH exploits PCH on the actualised ones.

>But the special bundled contract doesn't go to zero like this, because the conditional contract only really pays out when the condition is satisfied or refuted.

So if I'm understanding this correctly: The conditional contract on (a|b) pays if a&b is proved, if a&~b is proved, and if ~a&~b is proved.

Now I have another question: how does logical induction arbitrage against contradiction? The bet on a pays $1 if a is proved. The bet on ~a pays $1 if not-a is proved. But the bet on ~a isn't "settled" when a is proved, so why can't the market just go on believing it's .7? (Likely this is related to my confusion with the paper).

>My proposal is essentially similar to that, except I am trying to respect logic in most of the system, simply reducing its impact on action selection. But within my proposed system, I think the wrong 'prior' (ie distribution of wealth for traders) can make it susceptible again.

>I'm not blocking Troll Bridge problems, I'm making the definition of rational agent broad enough that crossing is permissible. But if I think the Troll Bridge proof is actively irrational, I should be able to actually rule it out. IE, specify an X which is inconsistent with PA.

What makes you think that there's a "right" prior? You want a "good" learning mechanism for counterfactuals. To be good, such a mechanism would have to learn to make the inferences we consider good, at least with the "right" prior. But we can't pinpoint any wrong inference in Troll Bridge. It doesn't seem like what's stopping us from pinpointing the mistake in Troll Bridge is a lack of empirical data. So, a good mechanism would have to learn to be susceptible to Troll Bridge, especially with the "right" prior. I just don't see what would be a good reason for thinking there's a "right" prior that avoids Troll Bridge (other than "there just has to be some way of avoiding it"), that wouldn't also let us tell directly how to think about Troll Bridge, no learning needed.

My Current Take on Counterfactuals

Bunthut4y00

>The payoff for 2-boxing is dependent on beliefs after 1-boxing because all share prices update every market day and the "payout" for a share is essentially what you can sell it for.

If a sentence is undecidable, then you could have two traders who disagree on its value indefinitely: one would have a highest price to buy that's below the other's lowest price to sell. But then anything between those two prices could be the "market price", in the classical supply and demand sense. If you say that the "payout" of a share is what you can sell it for... well, the "physical causation" trader is also buying shares on the counterfactual option that won't happen. And if he had to sell those, he couldn't sell them at a price close to where he bought them. He could only sell them at how much the "logical causation" trader values them, and so both would be losing "payout" on their trades with the unrealized option. That's one interpretation of "sell". If there's a "market maker" in addition to both traders, it depends on what prices he makes. As outlined above, there is a wide range of prices that would be consistent for him to offer as a market maker, including ones which are very close to the logical trader's valuations, in which case the logical trader is gaining on the physical one.
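To make the two readings of "sell" concrete, here is a small sketch with made-up numbers. The prices 0.25 and 0.75, the share count, and both function names are hypothetical, and the model is much cruder than a logical-induction market.

```python
def no_trade_interval(highest_bid, lowest_ask):
    """Prices strictly between bid and ask at which neither trader trades.

    Returns None if the bid reaches the ask, i.e. the traders would trade.
    """
    if highest_bid < lowest_ask:
        return (highest_bid, lowest_ask)
    return None


def resale_loss(purchase_price, resale_price, shares):
    """Loss realised if shares bought at one price must be sold at another."""
    return (purchase_price - resale_price) * shares


# Hypothetical valuations of the bet on the unrealized option:
logical_bid = 0.25    # the most the "logical causation" trader would pay
physical_ask = 0.75   # the least the "physical causation" trader would accept

# Any quote in this interval is consistent with both traders standing still.
print(no_trade_interval(logical_bid, physical_ask))

# Reading 1: "payout" means selling to the other trader at his valuation.
print(resale_loss(purchase_price=0.75, resale_price=0.25, shares=4))
```

With these numbers, a forced sale to the other trader loses 0.5 per share, while a market maker is free to quote anything inside the interval, including prices close to either trader's valuation.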

Trying to communicate a vague intuition here: There is a set of methods which rely on there being a time when "everything is done", to then look back from there and do credit assignment for everything that happened before. They characteristically use backwards induction to prove things. I think markets fall into this: the argument for why ideal markets don't have bubbles is that eventually, the real value will be revealed, and so the bubble has to pop, and then someone holds the bag, and you don't want to be that someone, and people predicting this and trying to avoid it will make the bubble pop earlier, in the idealised case instantly. I also think these methods aren't going to work well with embedding. They essentially use "after the world" as a substitute for "outside the world".

>My claim is: eventually, if you observe enough cases of "crossing" in similar circumstances, your expectation for "cross" should be consistent with the empirical history

My question was more "how should this roughly work" rather than "what conditions should it fulfill", because I think thinking about this illuminates my next point.

>The hope is that we can block the troll argument completely if proving B->A does not imply cf(A|B)=1

This doesn't help against what I'm imagining, I'm not touching indicative B->A. So, standard Troll Bridge:

Reasoning within PA (ie, the logic of the agent):
- Suppose the agent crosses.
  - Further suppose that the agent proves that crossing implies U=-10.
    - Examining the source code of the agent, because we're assuming the agent crosses, either PA proved that crossing implies U=+10, or it proved that crossing implies U=0.
    - So, either way, PA is inconsistent -- by way of 0=-10 or +10=-10.
    - So the troll actually blows up the bridge, and really, U=-10.
  - Therefore (popping out of the second assumption), if the agent proves that crossing implies U=-10, then in fact crossing implies U=-10.
  - By Löb's theorem, crossing really implies U=-10.
  - So (since we're still under the assumption that the agent crosses), U=-10.
- So (popping out of the assumption that the agent crosses), the agent crossing implies U=-10.
Since we proved all of this in PA, the agent proves it, and proves no better utility in addition (unless PA is truly inconsistent). On the other hand, it will prove that not crossing gives it a safe U=0. So it will in fact not cross.
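
The Löbian core of the argument above can be written compactly. This is only a sketch, and the symbols are assumptions introduced here for illustration: $C$ abbreviates "the agent crosses", $\Box$ is provability in PA, and $P$ names the sentence "crossing implies U=-10".

```latex
\begin{align*}
P &:\equiv (C \to U = -10) \\
\mathrm{PA} &\vdash \Box P \to P
  && \text{(inner argument: given $C$ and $\Box P$, PA is inconsistent, so the bridge blows up)} \\
\mathrm{PA} &\vdash P
  && \text{(L\"ob's theorem applied to the previous line)}
\end{align*}
```

Löb's theorem is exactly the step from the second line to the third: whenever PA proves $\Box P \to P$, it proves $P$.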

But now, say the agent's counterfactual reasoning comes not from PA, but from system X. Then the argument fails: "suppose the agent proves crossing->U=-10 in PA" doesn't go any further, because examining the source code of the agent doesn't say anything about PA anymore, and "suppose the agent proves crossing->U=-10 in X" doesn't show that PA is inconsistent, so the bridge isn't blown up. But let's have a troll that blows up the bridge if X is inconsistent. Then we can make an argument like this:

Reasoning within X (ie, the logic of counterfactuals):
- Suppose the agent crosses.
  - Further suppose that the agent proves in X that crossing implies U=-10.
    - Examining the source code of the agent, because we're assuming the agent crosses, either X proved that crossing implies U=+10, or it proved that crossing implies U=0.
    - So, either way, X is inconsistent -- by way of 0=-10 or +10=-10.
    - So the troll actually blows up the bridge, and really, U=-10.
  - Therefore (popping out of the second assumption), if the agent proves that crossing implies U=-10, then in fact crossing implies U=-10.
  - By Löb's theorem, crossing really implies U=-10.
  - So (since we're still under the assumption that the agent crosses), U=-10.
- So (popping out of the assumption that the agent crosses), the agent crossing implies U=-10.
Since we proved all of this in X, the agent proves it, and proves no better utility in addition (unless X is truly inconsistent). On the other hand, it will prove that not crossing gives it a safe U=0. So it will in fact not cross.

Now, this argument relies on X and counterfactual reasoning having a lot of the properties of PA and normal reasoning. But even a system that doesn't run on proofs per se could still end up implementing something a lot like logic, and then it would have a property that's a lot like inconsistency, and then the troll could blow up the bridge conditionally on that. Basically, it still seems reasonable to me that counterfactual worlds should be closed under inference, up to our ability to infer. And I don't see which of the rules for manipulating logical implications wouldn't be valid for counterfactual implications in their own closed system, if you formally separate them. If you want your X to avoid this argument, then it needs to not-do something PA does. "Formal separation" between the systems isn't enough, because the results of counterfactual reasoning still really do affect your actions, and if the counterfactual reasoning system can understand this, Troll Bridge returns. And if there was such a something, we could just use a logic that doesn't do this in the first place, no need for the two-layer approach.

>a convincing optimality result could

I'm also sceptical of optimality results. When you're doing subjective probability, any method you come up with will be proven optimal relative to its own prior - the difference between different subjective methods is only in their ontology, and the optimality results don't protect you against mistakes there. Also, when you're doing subjectivism, and it turns out the methods required to reach some optimality condition aren't subjectively optimal, you say "Don't be a stupid frequentist and do the subjectively optimal thing instead". So, your bottom line is written. If the optimality condition does come out in your favour, you can't be more sure because of it - that holds even under the radical version of expected evidence conservation. I also suspect that as subjectivism gets more "radical", there will be fewer optimality results besides the one relative to prior.

My Current Take on Counterfactuals

Bunthut4y00

>Because we have a “basic counterfactual” proposition for what would happen if we 1-box and what would happen if we 2-box, and both of those propositions stick around, LCH’s bets about what happens in either case both matter. This is unlike conditional bets, where if we 1-box, then bets conditional on 2-boxing disappear, refunded, as if they were never made in the first place.

I don't understand this part. Your explanation of PCDT at least didn't prepare me for it; it doesn't mention betting. And why is the payoff for the counterfactual-2-boxing determined by the beliefs of the agent after 1-boxing?

And what I think is mostly independent of that confusion: I don't think things are as settled.

I'm more worried about the embedding problems with the trader in Dutch book arguments, so the one against CDT isn't as decisive for me.

>In the Troll Bridge hypothetical, we prove that [cross]->[U=-10]. This will make the conditional expectations poor. But this doesn’t have to change the counterfactuals.

But how is the counterfactual supposed to actually think? I don't think just having the agent unrevisably believe that crossing is counterfactually +10 is a reasonable answer, even if it doesn't have any instrumental problems in this case. I think it ought to be possible to get something like "whether to cross in troll bridge depends only on what you otherwise think about PA's consistency" with some logical method. But even short of that, there needs to be some method to adjust your counterfactuals if they fail to really match your conditionals. And if we had an actual concrete model of counterfactual reasoning instead of a list of desiderata, it might be possible to make a troll based on the consistency of whatever is inside this model, as opposed to PA.

I also think there is a good chance the answer to the Cartesian boundary problem won't be "here's how to calculate where your boundary is", but something else of which boundaries are an approximation, and then something similar would go for counterfactuals, and then there won't be a counterfactual theory which respects embedding.

These latter two considerations suggest the leftover work isn't just formalisation.

Identifiability Problem for Superrational Decision Theories

Bunthut4y00

>are the two players physically precisely the same (including environment), at least insofar as the players can tell?

In the examples I gave, yes. Because that's the case where we have a guarantee of equal policy, from which people try to generalize. If we say players can see their number, then the twins in the prisoner's dilemma needn't play the same way either.

>But this is one reason why correlated equilibria are, usually, a better abstraction than Nash equilibria.

The "signals" players receive for correlated equilibria are already semantic. So I suspect that they seem better because using them calls on our intuition more, with the risks that implies. For example, I remember reading about a result to the effect that correlated equilibria are easier to learn. This is not something we would expect from your explanation of the differences: If we explicitly added something (like the signals) into the game, it would generally get more complicated.

Identifiability Problem for Superrational Decision Theories

Bunthut4y00

>Well, if I understand the post correctly, you're saying that these two problems are fundamentally the same problem

No. I think:

>...the reasoning presented is correct in both cases, and the lesson here is for our expectations of rationality...

As outlined in the last paragraph of the post. I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory.

>Actually it could.

I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and then you know you would have just gotten that message from the copy as well.

>I can use my knowledge of distributed computing to look at the sort of decision problems where you play with copies

I'd be interested. I think even just more solved examples of the reasoning we want are useful currently.
