Chantiel - AI Alignment Forum

Oh, I'm sorry; you're right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem.

However, it still seems to me that agents using your technique would also be concerningly likely to fail to cross, or otherwise suffer from other problems. Like last time, suppose $⊢ (A = \text{'Cross'} ⟹ U = -10)$ and that $A = \text{'Cross'}$. So if the agent decides to cross, it's either because of the chicken rule, because not crossing counterfactually results in utility $\leq -10$, or because crossing counterfactually results in utility greater than -10.

If the agent crosses because of the chicken rule, then this is a bad reason, so the bridge will blow up.

I had already assumed that not crossing counterfactually results in utility greater than -10, so it can't be the middle case.

Suppose instead that the crossing counterfactual results in utility greater than -10. This seems very strange. By assumption, it's provable using the AI's proof system that $(A = \text{'Cross'} ⟹ U = -10)$. And the AI's counterfactual environment is supposed to line up with reality.

So, in other words, the AI has decided to cross and has already proven that crossing entails it will get -10 utility. And if the counterfactual environment assigns greater than -10 utility, then that counterfactual environment provably, within the agent's proof system, doesn't line up with reality. So how do you get an AI to both believe it will cross, believe crossing entails -10 utility, and still counterfactually think that crossing will result in greater than -10 utility?

In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > -10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI already proved that it's wrong. I'm concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, then that would really make me question that agent's ability to use counterfactual environments that line up with reality in other situations.

So it seems to me that for an agent using your take on counterfactuals to cross, it would need to either think that not crossing counterfactually results in utility $\leq -10$, or to ignore conclusive evidence that the counterfactual environment it's using for its chosen action would in fact not line up with reality. Both of these options seem rather concerning to me.

Also, even if you do decide to let the AI ignore conclusive evidence (to the AI) that crossing makes utility be -10, I'm concerned the bridge would get blown up anyways. I know we haven't formalized "a bad reason", but we've taken it to mean something like, "something that seems like a bad reason to the AI". If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, it doesn't line up with reality, then this seems like a "bad" reason to me.

Troll Bridge

Chantiel3y00

If we define "bad reasoning" as "crossing when there is a proof that crossing is bad" in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn't line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT).

I'm concerned that you may not realize that your own current take on counterfactuals respects logic to some extent, and that, if I'm reasoning correctly, this could result in agents using it failing the troll bridge problem.

You said in "My current take on counterfactuals" that counterfactuals should line up with reality. That is, the action the agent actually takes should result in the utility it was said to have in its counterfactual environment.

You say that a "bad reason" is one that the agent's procedure would think is bad. The counterfactuals in your approach are supposed to line up with reality, so if an AI's counterfactuals don't line up with reality, then this seems like a "bad" reason according to the definition you gave. Now, if you let your agent think "I'll get < -10 utility if I don't cross", then it could potentially cross and not get blown up. But this seems like a very unintuitive and seemingly ridiculous counterfactual environment. Because of this, I'm pretty worried it could result in an AI with such counterfactual environments malfunctioning somehow. So I'll assume the AI doesn't have such a counterfactual environment.

Suppose acting using a counterfactual environment that doesn't line up with reality counts as a "bad" reason for agents using your counterfactuals. Also suppose that in the counterfactual environment in which the agent doesn't cross, the agent counterfactually gets more than -10 utility. Then:

Suppose $⊢ A = \text{'Cross'} ⟹ U = -10$.
Suppose $A = \text{'Cross'}$. Then if the agent crosses, it must be because either it used the chicken rule or because its counterfactual environment doesn't line up with reality in this case. Either way, this is a bad reason for crossing, so the bridge gets blown up. Thus, the AI gets -10 utility.
Thus, $⊢ (⊢ A = \text{'Cross'} ⟹ U = -10) ⟹ (A = \text{'Cross'} ⟹ U = -10)$.
Thus, by Löb's theorem, $⊢ A = \text{'Cross'} ⟹ U = -10$.

Thus, either the agent doesn't cross the bridge or it does and the bridge explodes. You might just decide to get around this by saying it's okay for the agent to think it would get less than -10 utility if it didn't cross. But I'm rather worried that this would cause other problems.
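
For reference, the appeal to Löb's theorem in the steps above can be spelled out as follows. This is only a sketch: it assumes $\Box$ abbreviates provability in the agent's proof system (written $⊢$ inside the implication above), and the letter $P$ is an abbreviation introduced here, not something from the original argument.

```latex
% Löb's theorem: for any sentence P,
\text{if } \vdash \Box P \rightarrow P \text{, then } \vdash P.

% Instantiate with P := (A = \text{'Cross'} \rightarrow U = -10).
% The two "Suppose" steps are meant to establish the premise:
\vdash \Box\,(A = \text{'Cross'} \rightarrow U = -10) \rightarrow (A = \text{'Cross'} \rightarrow U = -10)

% so Löb's theorem would yield the conclusion:
\vdash A = \text{'Cross'} \rightarrow U = -10
```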

You seem to be assuming that the agent's architecture has solved the problem of logical updatelessness, IE, of applying reasoning only to the (precise) extent to which it is beneficial to do so. But this is one of the problems we would like to solve! So I object to the "stop thinking about it" step w/o more details of the decision theory which allows you to do so.

I'll talk about some ways I thought of potentially formalizing, "stop thinking if it's bad".

One simple way to try to do so is to have an agent using regular evidential decision theory but with a special "stop thinking about this thing" action that it can take. Every so often, the agent considers taking this action using regular evidential decision theory. So, in the troll bridge case, it could potentially see that the path of reasoning it's following is potentially dangerous, and thus decide to stop. Also, the agent needs to avoid thinking too many thoughts before considering taking the "stop thinking about this thing" action. Otherwise, it could think all sorts of problematic thoughts before being able to stop itself.
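
To make that a bit more concrete, here is a minimal toy sketch, not a real decision theory. It assumes the agent's thoughts are a list of reasoning steps and that its ordinary expected-utility estimates are available as two callables. The names `think_with_stop_action`, `eu_if_continue`, `eu_if_stop`, and `check_every` are all made up for illustration.

```python
from typing import Callable, List


def think_with_stop_action(
    steps: List[str],
    eu_if_continue: Callable[[List[str]], float],
    eu_if_stop: Callable[[List[str]], float],
    check_every: int = 1,
) -> List[str]:
    """Run reasoning steps, periodically considering a 'stop thinking' action.

    The two callables are hypothetical stand-ins for the agent's ordinary
    evidential expected-utility estimate of continuing this line of thought
    versus abandoning it, given the thoughts it has had so far.
    """
    thoughts: List[str] = []
    for i, step in enumerate(steps):
        # Consider the meta-action before taking the next step, and do so
        # often (small check_every), so few thoughts slip through unchecked.
        if i % check_every == 0 and eu_if_stop(thoughts) > eu_if_continue(thoughts):
            break
        thoughts.append(step)
    return thoughts


# Toy usage: continuing looks bad once the agent notices it is heading
# toward a proof about its own action.
steps = [
    "the troll punishes bad reasons",
    "consider proofs about my own action",
    "apply Lob's theorem",
]


def danger(ts: List[str]) -> bool:
    return "consider proofs about my own action" in ts


result = think_with_stop_action(
    steps,
    eu_if_continue=lambda ts: -10.0 if danger(ts) else 5.0,
    eu_if_stop=lambda ts: 5.0 if danger(ts) else 0.0,
)
```

In this toy run the agent stops before the third step, but only after it has already had the second thought, which is the weakness discussed next.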

This simple technique might actually be enough to solve the problem, especially if the AI has the ability to choose its own inference algorithm to find one that makes the AI able to realize, "thinking about this is bad" before it finds the concrete bad thing. And, for what it's worth, it's enough for me personally to get across the bridge.

Now, this might not work reliably, because the AI might stumble upon a problematic thought before realizing it's bad. So, instead, consider this. The AI has an ability to "roll back" its belief state and then adjust its search path. For example, suppose the AI thinks through the Troll Bridge proof and then realizes that it really messed up in doing so. Then it could have a hard-coded way of rolling back its belief state to what it was before it thought about the troll bridge thing, marking the nodes it expanded as verboten, and then going on to expand some other node.
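
A rough sketch of what such a rollback could look like, assuming beliefs are a simple dictionary and "nodes" are just string labels for lines of thought. The `Reasoner` class and all of its fields are hypothetical and only illustrate the checkpoint, rollback, and verboten-marking steps described above.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Reasoner:
    """Toy reasoner with checkpointed beliefs. Every name here is hypothetical."""

    beliefs: Dict[str, bool] = field(default_factory=dict)
    verboten: Set[str] = field(default_factory=set)    # nodes never to expand again
    expanded: List[str] = field(default_factory=list)  # nodes expanded since checkpoint
    _snapshot: Dict[str, bool] = field(default_factory=dict)

    def checkpoint(self) -> None:
        # Save the current belief state so it can be restored later.
        self._snapshot = copy.deepcopy(self.beliefs)
        self.expanded = []

    def expand(self, node: str, conclusion: str) -> bool:
        # Refuse to follow a line of thought that was previously rolled back.
        if node in self.verboten:
            return False
        self.beliefs[conclusion] = True
        self.expanded.append(node)
        return True

    def rollback(self) -> None:
        # Restore the saved beliefs and forbid the nodes that led here.
        self.beliefs = copy.deepcopy(self._snapshot)
        self.verboten.update(self.expanded)
        self.expanded = []


r = Reasoner(beliefs={"I am generally reasonable": True})
r.checkpoint()
r.expand("troll_bridge_proof", "crossing yields -10")
r.rollback()  # beliefs are restored; "troll_bridge_proof" is now verboten
```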

When should it roll back? Well, it shouldn't roll back every time it comes across any evidence that decreases its expected utility. Doing so would result in the AI being very poorly calibrated by ignoring all negative evidence.

One option is to manually come up with a rollback method that rolls back whenever doing so increases expected utility, without resulting in poor calibration.

However, this sounds sort of hard. One cheap trick to get around doing this is to just give the AI a special action to modify its rollback method, and then to select the new rollback method using its regular decision procedure. For example, I think it's clear, using evidential decision theory, that choosing the rollback method "Roll back any thought that decreases my expected utility estimate" would result in low utility, because that would just mean the AI would delude itself. But I bet a rollback method that says "Roll back if you find any troll bridge proofs" would work okay.
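
Here is a toy illustration of choosing between those two rollback methods with an ordinary expected-utility comparison. Everything in it is an assumption made for the example: the thought records, the `useful_evidence` and `harmful` flags, and the cost numbers (4 for discarding useful evidence, 10 for keeping a harmful thought) are invented, and the scoring function is only a stand-in for whatever the agent's real decision procedure would compute.

```python
from typing import Callable, Dict, List

Thought = Dict[str, object]
RollbackRule = Callable[[Thought], bool]  # True means "roll this thought back"


def rollback_any_bad_news(thought: Thought) -> bool:
    # "Roll back any thought that decreases my expected utility estimate."
    return thought["eu_change"] < 0


def rollback_troll_bridge_proofs(thought: Thought) -> bool:
    # "Roll back if you find any troll bridge proofs."
    return thought["kind"] == "troll_bridge_proof"


def expected_utility_of_rule(rule: RollbackRule, thoughts: List[Thought]) -> float:
    """Toy score for a rollback rule (assumed costs, not derived from anything)."""
    total = 0.0
    for t in thoughts:
        rolled_back = rule(t)
        if rolled_back and t["useful_evidence"]:
            total -= 4.0   # miscalibration: the agent deluded itself
        if not rolled_back and t["harmful"]:
            total -= 10.0  # the problematic thought was kept
    return total


def choose_rollback_rule(rules: Dict[str, RollbackRule], thoughts: List[Thought]) -> str:
    # Pick the rule name with the highest toy expected utility.
    return max(rules, key=lambda name: expected_utility_of_rule(rules[name], thoughts))


thoughts = [
    {"kind": "observation", "eu_change": -2.0, "useful_evidence": True, "harmful": False},
    {"kind": "observation", "eu_change": -1.0, "useful_evidence": True, "harmful": False},
    {"kind": "observation", "eu_change": 3.0, "useful_evidence": True, "harmful": False},
    {"kind": "troll_bridge_proof", "eu_change": -10.0, "useful_evidence": False, "harmful": True},
]
rules = {
    "any_bad_news": rollback_any_bad_news,
    "troll_bridge_proofs": rollback_troll_bridge_proofs,
}
best = choose_rollback_rule(rules, thoughts)
```

Under these made-up costs the narrow rule wins, because the broad rule also throws away the two accurate but unwelcome observations.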

This trick might not be perfect, since the AI could potentially think about something problematic before getting a rollback procedure good enough to roll it back. But as long as the AI is smart enough to realize that it should try to get a really good rollback function before doing much of anything else, then I bet it would work okay.

Also, don't forget that we still need to do something about the agent-simulates-predictor problem. In the agent-simulates-predictor problem, agents are penalized for thinking about things in too much detail. And in whatever counterfactual environment you use, you'll need a way to deal with the agent-simulates-predictor problem. I think the most obvious approach is controlling what the AI thinks about. And if you've already done that, then you can pass the troll bridge problem for free.

Also, I think it's important to note that just the fact that the AI is trying to avoid thinking of crossing-is-bad proofs makes the proofs (potentially) not go through. For example, in the proof you originally gave, you supposed there is a proof that crossing results in -10 utility, and thus concluded the agent must have crossed because of the chicken rule. But if the AI is trying to avoid these sorts of "proofs", then if it does cross, it simply could have been because the AI decided to avoid following whatever train of thought would prove that it would get -10 utility. This is considered a reasonable thing to do by the AI, so it doesn't seem like a "bad" reason.

There may be possible alternative proofs that apply to an AI that tries to steer its reasoning away from problematic areas. I'm not sure, though. I also suspect that any such proofs would be more complicated and thus harder to find.

Troll Bridge

Chantiel3y60

I'm not entirely sure what you consider to be a "bad" reason for crossing the bridge. However, I'm having a hard time finding a way to define it that both causes agents using evidential counterfactuals to necessarily fail while not having other agents fail.

One way to define a "bad" reason is an irrational one (or the chicken rule). However, if this is what is meant by a "bad" reason, it seems like this is an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.

To illustrate, consider what I would do if I was in the troll bridge situation and used evidential counterfactuals. Then I would reason, "I know the troll will only blow up the bridge if I cross for a bad reason, but I'm generally pretty reasonable, so I think I'll do fine if I cross". And then I'd stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I'd just make sure I didn't do that.

Another way that you might have had in mind is that a "bad" reason is one such that the action the AI takes results in a provably bad outcome despite the AI thinking the action would result in a good outcome, or the reason being the chicken rule. However, if this is the case, it seems to me that no agent would be able to cross the bridge without it being blown up, unless the agent's counterfactual environment in which it didn't cross scored less than -10 utility. But this doesn't seem like a very reasonable counterfactual environment.

To see why, consider an arbitrary agent with the following decision procedure. Let counterfactual be an arbitrary specification of what would happen in some counterfactual world.

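def act():
    # `counterfactual` and `expected_utility` are deliberately left abstract:
    # any specification of the counterfactual world and any scoring of it.
    cross_eu = expected_utility(counterfactual('A = Cross'))
    stay_eu = expected_utility(counterfactual('A = Stay'))
    # Cross only if crossing looks strictly better than staying.
    if cross_eu > stay_eu:
        return 'Cross'
    return 'Stay'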

The chicken rule can be added, too, if you wish. I'll assume the expected utility of staying is greater than -10.
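
For anyone who wants to run the procedure above, here is one possible self-contained version with the chicken rule bolted on. It is only a sketch: `make_agent`, the `proves` callable (a stand-in for the agent's proof search), and the toy counterfactual environments at the bottom are all hypothetical and chosen just to make the code execute.

```python
from typing import Callable, Optional


def make_agent(
    counterfactual: Callable[[str], object],
    expected_utility: Callable[[object], float],
    proves: Optional[Callable[[str], bool]] = None,
) -> Callable[[], str]:
    """Build the `act` procedure from its (arbitrary) ingredients.

    `proves` is a hypothetical stand-in for proof search and is used only for
    the optional chicken rule: if the agent proves it will not take an
    action, it takes that action.
    """

    def act() -> str:
        if proves is not None:
            if proves("A != Cross"):
                return "Cross"
            if proves("A != Stay"):
                return "Stay"
        cross_eu = expected_utility(counterfactual("A = Cross"))
        stay_eu = expected_utility(counterfactual("A = Stay"))
        # Cross only if crossing looks strictly better than staying.
        return "Cross" if cross_eu > stay_eu else "Stay"

    return act


# Toy counterfactual environments in which a "world" is just its utility.
optimistic = make_agent(lambda a: {"A = Cross": 10.0, "A = Stay": 0.0}[a], float)
pessimistic = make_agent(lambda a: {"A = Cross": -10.0, "A = Stay": 0.0}[a], float)
chicken = make_agent(
    lambda a: {"A = Cross": -10.0, "A = Stay": 0.0}[a],
    float,
    proves=lambda sentence: sentence == "A != Cross",
)
```

Here `optimistic()` crosses because its crossing counterfactual scores higher, `pessimistic()` stays, and `chicken()` crosses purely because of the chicken rule, which is the "bad reason" case in the argument.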

Then it seems you can adapt the proof you gave for your agent to show that an arbitrary agent satisfying the above description would also get -10 utility if it crossed. Specifically,

Suppose $⊢ A = \text{Cross} ⟹ U = -10$. Suppose $A = \text{Cross}$. Then the agent crossed either because of the chicken rule, because the counterfactual environment in which the agent crossed had utility greater than -10, or because the counterfactual environment in which the agent didn't cross had less than -10 utility. We assumed the counterfactual environment in which the agent doesn't cross has more than -10 utility. Thus, it must be either the chicken rule or because crossing had more than -10 utility in expectation.

If it's because of the chicken rule, then this is a "bad" reason, so the troll will destroy the bridge just like in the original proof. Thus, utility would equal -10.

Suppose instead the agent crosses because expected_utility(counterfactual('A = Cross')) > -10. However, by the assumption, $⊢ A = \text{Cross} ⟹ U = -10$. Thus, since the agent actually crosses, this in fact provably results in -10 utility, and the AI is thus wrong in thinking it would get a good outcome. Thus, the AI's action results in a provably bad outcome. Therefore, the troll destroys the bridge. Thus, utility would equal -10.

Thus, $A = \text{Cross} ⟹ U = -10$. Thus, $(⊢ A = \text{Cross} ⟹ U = -10) ⟹ (A = \text{Cross} ⟹ U = -10)$. Thus, by Löb's theorem, $A = \text{Cross} ⟹ U = -10$.

As I said, you could potentially avoid getting the bridge destroyed by assigning expected utility less than -10 to the counterfactual environment in which the AI doesn't cross. This seems like a "silly" counterfactual environment, so it doesn't seem like something we would want an AI to think. Also, since it seems like a silly thing to think, a troll may consider the use of such a counterfactual environment to be a bad reason to cross the bridge, and thus destroy it anyways.

The ground of optimization

Chantiel4y00

You said your definition would not classify a bottle cap with water in it as an optimizer. This might be really nit-picky, but I'm not sure it's generally true.

I say this because the water in the bottle cap could evaporate. Thus, supposing there is no rain, from a wide range of possible states of the bottle cap, it would tend towards no longer having water in it.

I know you said you make an exception for tendencies towards increased entropy being considered optimizers. However, this does not increase the entropy of the bottle cap. It could potentially increase the entropy of the water that was in the bottle cap, but this is not necessarily the case. For example, if the bottle cap is kept in a sealed container, the water vapor could potentially condense into a small puddle with the same entropy as it had in the bottle cap.

If my memory of physics is correct, water evaporating would still increase the total entropy of the system in which the bottle cap is located, by virtue of releasing some heat into the environment. However, note that humans and robots also, merely by doing mechanical work and thus forming heat which is then dispersed into the environment, result in increased entropy of the system they're in. So you can't rule out any system that makes its environment tend towards increased entropy from being an optimizer, because that's what humans and robots do, too.

That said, if you clarify that the bottle cap is not in any such contained system, I think the water would result in a higher-entropy state.
