User Comment Replies — AI Alignment Forum

Oh, I'm sorry; you're right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem.

However, it still seems to me that agents using your technique would also be concerning likely to fail to cross, or otherwise suffer from other problems. Like last time, suppose $⊢ (A =^{'} C r o s s^{'} ⟹ U = - 10)$ and that $A =^{'} C r o s s^{'}$ . So if the agent decides to cross, it's either because of the chicken rule, because not crossing counterfactually results in utility $\leq$ -10, or because crossing counterfactually results in utility greater than -10... (read more)

2Abram Demski3y

Right. This is precisely the sacrifice I'm making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you'll fall prey to the Troll Bridge! In fact, your "stop thinking"/"rollback" proposals have precisely the same feature: you're trying to construct expectations which don't respect the entailment. So I think if you reject this, you just have to accept Troll Bridge. Well, this is precisely not what I mean when I say that the counterfactuals line up with reality. What I mean is that they should be empirically grounded, so, in cases where the condition is actually met, we see the predicted result. Rather than saying this AI's counterfactual expectations are "wrong in reality", you should say they are "wrong in logic" or something like that. Otherwise you are sneaking in an assumption that (a) counterfactual scenarios are real, and (b) they really do respect entailment. We can become confident in my strange counterfactual by virtue of having seen it play out many times, eg, crossing similar bridges many times. This is the meat of my take on counterfactuals: to learn them in a way that respects reality, rather than trying to deduce them. To impose empiricism on them, ie, the idea that they must make accurate predictions in the cases we actually see. And it simply is the case that if we prefer such empirical beliefs to logic, here, we can cross. So in this particular example, we see a sort of evidence that respecting entailment is a wrong principle for counterfactual expectations. The 5&10 problem can also be thought of as evidence against entailment as counterfactual. You have to realize that reasoning in this way amounts to insisting that the correct answer to Troll Bridge is not crossing, because the troll bridge variant you are proposing just punishes anyone whose reasoning differs from entailment. And again, you were also propo

Troll Bridge

Chantiel3y00

If we define "bad reasoning" as "crossing when there is a proof that crossing is bad" in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn't line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT).

I'm concerned that may not realize that your own current take on counterfactuals respects logical to some extent, and that, if I'm reasoning correctly, could res... (read more)

2Abram Demski3y

If your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem. Ordinary Bayesian EDT has to finish its computation (of its probabilistic expectations) in order to proceed. What you are suggesting is to halt those calculations midway. I think you are imagining an agent who can think longer to get better results. But vanilla EDT does not describe such an agent. So, you can't start with EDT; you have to start with something else (such as logical induction EDT) which does already have a built-in notion of thinking longer. Then, my concern is that we won't have many guarantees for the performance of this system. True, it can stop thinking if it knows thinking will be harmful. However, if it mistakenly thinks a specific form of thought will be harmful, it has no device for correction. This is concerning because we expect "early" thoughts to be bad -- after all, you've got to spend a certain amount of time thinking before things converge to anything at all reasonable. So we're between a rock and a hard spot here: we have to stop quite early, because we know the proof of troll bridge is small. But we con't stop early, because we know things take a bit to converge. So I think this proposal is just "somewhat-logically-updateless-DT", which I don't think is a good solution. Generally I think rollback solutions are bad. (Several people have argued in their favor over the years; I find that I'm just never intrigued by that direction...) Some specific remarks: * Note that if you literally just roll back, you would go forward the same way again. So you need to somehow modify the rolled back state, creating a "pseudo-ignorant" belief states where you're not really uninformed, but rather, reconstruct something merely similar to an uninformed state. * It is my impression that this causes problems. * You mig

2Abram Demski3y

To elaborate a little, one way we could think about this would be that "in a broad variety of situations" the agent would think this property sounded pretty bad. For example, the hypothetical "PA proves ⊥" would be evaluated as pretty bad by a proof-based agent, in many situations; it would not expect its future self to make decisions well, so, it would often have pretty poor performance bounds for its future self (eg the lowest utility available in the given scenario). So far so good -- your condition seems like one which a counterfactual reasoner would broadly find concerning. It also passes the sniff test of "would I think the agent is being dumb if it didn't cross for this reason?" The fact that there's a troll waiting to blow up a bridge if I'm empirically incorrect about that very setup should not, in itself, make me too reluctant to cross a bridge. If I'm very confident that the situation is indeed as described, then intuitively, I should confidently cross. But it seems that, if I believe your proof, I would not believe this any more. You don't prove whether the agent crosses or not, but you do claim to prove that if the agent crosses, it in fact gets blown up. It seems you think the correct counterfactual (for such an agent) is indeed that it would get blown up if it crosses: So if the proof is to be believed, it seems like the philosophical argument falls flat? If the agent fails to cross for this reason, then it seems you think it is reasoning correctly. If it crosses and explodes, then it fails because it had wrong counterfactuals. This also does not seem like much of an indictment of how it was reasoning -- garbage in, garbage out. We can concern ourselves with achieving more robust reasoners, for sure, so that sometimes garbage in -> treasure out. But that's a far cry from the usual troll bridge argument, where the agent has a 100% correct description of the situation, and nonetheless, appears to mishandle it. To summarize: * The usual troll bri

Troll Bridge

Chantiel3y60

I'm not entirely sure what you consider to be a "bad" reason for crossing the bridge. However, I'm having a hard time finding a way to define it that both causes agents using evidential counterfactuals to necessarily fail while not having other agents fail.

One way to define a "bad" reason is an irrational one (or the chicken rule). However, if this is what is meant by a "bad" reason, it seems like this is an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.

To illustrate, consider what I would ... (read more)

2Abram Demski3y

Ok. This threw me for a loop briefly. It seems like I hadn't considered your proposed definition of "bad reasoning" (ie "it's bad if the agent crosses despite it being provably bad to do so") -- or had forgotten about that case. I'm not sure I endorse the idea of defining "bad" first and then considering the space of agents who pass/fail according to that notion of "bad"; how this is supposed to work is, rather, that we critique a particular decision theory by proposing a notion of "bad" tailored to that particular decision theory. For example, if a specific decision theorist thinks proofs are the way to evaluate possible actions, then "PA proves ⊥" will be a convincing notion of "bad reasoning" for that specific decision theorist. If we define "bad reasoning" as "crossing when there is a proof that crossing is bad" in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn't line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT). If I had to make a stab at a generic notion of "bad", it would be "the agent's own way of evaluating consequences says that the consequences of its actions will be bad". But this is pretty ambiguous in some cases, such as chicken rule. I think a more appropriate way to generally characterize "bad reasoning" is just to say that proponents of the decision theory in question should agree that it looks bad. This is an open question, even for the examples I gave! I've been in discussions about Troll Bridge where proponents of proof-based DT (aka MUDT) argue that it makes perfect sense for the agent to think its action can control the consistency of PA in this case, so the reasoning isn't "bad", so the problem is unfair. I think it's correct to identify this as the crux of the argument -- whether I think the troll bridge argument incriminates proof

The ground of optimization

Chantiel4y00

You said your definition would not classify a bottle cap with water in it as an optimizer. This might be really nit-picky, but I'm not sure it's generally true.

I say this because the water in the bottle cap could evaporate. Thus, supposing there is no rain, from a wide range of possible states of the bottle cap, it would tend towards no longer having water in it.

I know you said you make an exception for tendencies towards increased entropy being considered optimizers. However, this does not increase the entropy of the bottlecap, It could potentially increa... (read more)

1Alex Flint4y

Thank you for this comment Chantiel. Yes, a container that engineered to evaporate water poured anywhere into it and condense it into a central area would be an optimizing system by my definition. That is a bit like a ball rolling down a hill, which is also an optimizing system and also has nothing resembling agency. I am The bottle cap example was actually about putting a bottle cap onto a bottle and asking whether, since the water now stays inside the bottle, it should be considered an optimizer. I pointed out that this would not qualify as an optimizing system because if you moved a water molecule from the bottle and place it outside the bottle, the bottle cap would not act to put it back inside.

AI ALIGNMENT FORUM
AF

All of Chantiel's Comments + Replies