TurnTrout's shortform feed

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.

If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will augment, in which case this is the only stable state.

If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.

AI companies are unlikely to make high-assurance safety cases if timelines are short

Martín Soto2mo4-3

Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel so closely tied that you can't even spare that much? My guess would be no. That said, I still don't expect certain lab leaders to want to do this.

The same is not true of security though, that's a tough one.

Eliciting bad contexts

Martín Soto2mo55

See our recent work (especially section on backdoors) which opens the door to directly asking the model. Although there are obstacles like Reversal Curse and it's unclear if it can be made to scale.

Automation collapse

Martín Soto5mo44

Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:

Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they reject delete everything". Whatever Process X is, you have only two ways to obtain evidence that Process X has a particular property (like "safety"): looking a priori at the spec of Process X (without running it), or running (parts of) Process X and observing its outputs a posteriori. In the former case, you clearly need an argument for why this particular spec has the property. But in the latter case, you also need an argument for why observing those particular outputs ensures the property for this particular spec. (Pedantically speaking, this is just Kuhn's theory-ladenness of observations.)

Of course, the above reasoning doesn't rule out the possibility that the required arguments are pretty trivial to make. That's why you summarize some well-known complications of automation, showing that the argument will not be trivial when Process X contains a lot of automation, and in fact it'd be simpler if we could do away with the automation.

It is also the case that the outputs observed from Process X might themselves be human-readable arguments. While this could indeed alleviate the burden of human argument-generation, we still need a previous (possibly simpler) argument for why "a human accepting those output arguments" actually ensures the property (especially given those arguments could be highly out-of-distribution for the human).

In Defense of Open-Minded UDT

Martín Soto8mo1-2

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it's misleading to imply we can surmount it. It's great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I've felt this was obscured in many relevant conversations.

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

My point is that the theoretical work you are shooting for is so general that it's closer to "what sorts of AI designs (priors and decision theories) should always be implemented", rather than "what sorts of AI designs should humans in particular, in this particular environment, implement".
And I think we won't gain insights on the former, because there are no general solutions, due to fundamental trade-offs ("no-free-lunchs").
I think we could gain many insights on the former, but that the methods better fit for that are less formal/theoretical and way messier/"eye-balling"/iterating.

In Defense of Open-Minded UDT

Martín Soto8mo10

Excellent explanation, congratulations! Sad I'll have to miss the discussion.

Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.

You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don't, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully-general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things). This is kind of studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?

Me: I'm not sure if that's exactly the condition, but at least it motivates the idea that there's some condition differentiating when we should be updateful vs updateless. I think uncertainty about "our own beliefs" is subtly wrong; it seems more like uncertainty about which beliefs we endorse.

This was very though-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.

Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism". For example:

Conceptual or empirical work on which are the robust and safe ways to extract information from humans (Suddenly LLM pre-training becomes safety work)
Conceptual or empirical work on which actions or reasoning are more likely to unearth traps under different assumptions (although this work could unearth traps)
Compilation or observation of properties of our environment (our physical reality) that could have some weak signal on which kinds of moves are safe
- Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.

Meaning & Agency

Martín Soto1y80

(I will not try to prove transitivity here, since my goal is to get the overall picture across; I have not checked it, although I expect it to hold.)

Transitivity doesn't hold, here's a counterexample.

The intuitive story is: X's action tells you whether Z failed, Y fails sometimes, and Z fails more rarely.

The full counterexample (all of the following is according to your beliefs ): Say available actions are 0 and 1. There is a hidden fair coin, and your utility is high if you manage to match the coin, and low if you don't. Y peeks at the coin, and takes the correct action, except when it fails, which has a 1/4 chance. Z does the same, but it only fails with a 1/100 chance. X plays 1 iff Z has failed.
Given X's and Y's action, you always go with Y's action, since X tells you nothing about the coin, and Y gives you some information. Given Z's and Y's actions, you always go with Z's, because it's less likely to have failed (even when they disagree). But given Z's and X's, there will be some times (1/100), in which you see X played 1, and then you will not play the same as Z.

The same counterexample works for beliefs (or continuous actions) instead of discrete actions (where you will choose a probability $p \in [0, 1]$ to believe, instead of an action $a \in {0, 1}$ ), but needs a couple small changes. Now both Z and Y fail with 1/4 probability (independently). Also, Y outputs its guess as 0.75 or 0.25 (instead of 1 or 0), because YOU (that is, $P_{1}$ ) will be taking into account the possibility that it has failed (and Y better output whatever you will want to guess after seeing it). Instead of Z, consider A as the third expert, which outputs 0.5 if Z and Y disagree, 15/16 if they agree on yes, and 1/16 if they agree on no. X still tells you whether Z failed. Seeing Y and X, you always go with Y's guess. Seeing A and Y, you always go with A's guess. But if you see A = 15/16 and X = 1, you know both failed, and guess 0. (In fact, even when you see X = 0, you will guess 1 instead of 15/16.)

How LDT helps reduce the AI arms race

Martín Soto1y11

I think this has a fix-point selection problem: If one or both of them start with a different prior under which the other player punishes them for not racing / doesn't reward them enough (maybe because they have very little faith in the other's rationality, or because they think it's not within their power to decide that, and also there's not enough evidential correlation in their decisions), then they'll race.

Of course, considerations about whether the other player normatively endorses something LDT-like also enter the picture. And even if individual humans would endorse it (and that's already a medium-big if), I worry our usual decision structures (for example in AI labs) don't incentivize it (and what's the probability some convincing decision theorist cuts through them? not sure).

P₂B: Plan to P₂B Better

Martín Soto2y30

we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.

Hmm, given your general definition of planning, shouldn't it include realizations (and their corresponding guided actions) of the form "further thinking about this plan is worse than already acquiring some value now", so that P2B itself already includes acquiring the terminal goal (and optimizing solely for P2B is thus optimal)?

I guess your idea is "plan to P2B better" means "plan with the sole goal of improving P2B", so that it's a "non-value-laden" instrumental goal.

Thinking About Filtered Evidence Is (Very!) Hard

Martín Soto2y80

Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations.

I was confused about why this is the case. I now think I've got an answer (please anyone confirm):
The description length of the Turing Machine enumerating theorems of PA is constant. The description length of any Turing Machine that enumerates theorems of PA up until time-step n and the does something else grows with n (for big enough n). Since any probability prior over Turing Machines has an implicit simplicity bias, no matter what prior we have, for big enough n the latter Turing Machines will (jointly) get arbitrarily low probability relative to the first one. Thus, after enough time-steps, given all observations are PA theorems, our listener will assign arbitrarily higher probability to the first one than all the rest, and thus the first one will be over 50%.

Edit: Okay, I now saw you mention the "getting over 50%" problem further down:

I don't know if the argument works out exactly as I sketched; it's possible that the rich hypothesis assumption needs to be "and also positive weight on a particular enumeration". Given that, we can argue: take one such enumeration; as we continue getting observations consistent with that observation, the hypothesis which predicts it loses no weight, and hypotheses which (eventually) predict other things must (eventually) lose weight; so, the updated probability eventually believes that particular enumeration will continue with probability > 1/2.

But I think the argument goes through already with the rich hypothesis assumption as initially stated. If the listener has non-zero prior probability on the speaker enumerating theorems of PA, it must have non-zero probability on it doing so in a particular enumeration. (unless our specification of the listener structure doesn't even consider different enumerations? but I was just thinking of their hypothesis space as different Turing Machines the whole time) And then my argument above goes through, which I think is just your argument + explicitly mentioning the additional required detail about the simplicity prior.