The distinction between "accidental" and "negligent" is always a bit political. It's a question of assignment of credit/blame for hypothetical worlds, which is pretty much impossible in any real-world causality model.
I do agree that in most discussions, "accident" often implies a single unexpected outcome, rather than a repeated risk profile and multiple moves toward the bad outcome. Even so, if it doesn't reach the level of negligence for any one actor, Eliezer's term "inadequate equilibrium" may be more accurate.
Which means that using a...
Interesting take, but I'll note that these are not acausal, just indirect-causal. Voting is a good example - counts are public, so future voters KNOW how many of their fellow citizens take it seriously enough to participate.
In all of these examples, there is a signaling path to future impact - something humans are perhaps over-evolved to focus on.
I really wish you'd included the outside-of-game considerations. The example of what to eat for dinner is OVERWHELMINGLY about the future relationship between the diners, not about the result itself. This is true of all real-world bargaining (where you're making commitments and compromises) - you're giving up some immediate value in order to make future interactions way better.
Agreed. The bargaining solution for the entire game can be very different from adding up the bargaining solutions for the subgames. If there's a subgame where Alice cares very much about victory in that subgame (interior decorating choices) and Bob doesn't care much, and another subgame where Bob cares very much about it (food choice) and Alice doesn't care much, then the bargaining solution of the entire relationship game will end up being something like "Alice and Bob get some relative weights on how important their preferences are, and in all the subgam...
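A toy worked example of that point (the numbers and the pure-Nash-product shortcut are mine, purely illustrative): bargain the two subgames separately and the best either player can get is a coin flip in each; bargain the whole relationship at once and the obvious trade dominates.

```python
# Hypothetical payoffs as (Alice, Bob), disagreement point (0, 0).
from itertools import product

decorating = {"alice_pick": (10, 0), "bob_pick": (0, 1)}
food       = {"alice_pick": (1, 0),  "bob_pick": (0, 10)}

def nash_best(options):
    """Pure outcome maximizing the Nash product (gains over the disagreement point)."""
    return max(options.items(), key=lambda kv: kv[1][0] * kv[1][1])

# Per-subgame bargaining: every pure outcome zeroes one player out, so the Nash
# solution needs lotteries; a 50/50 flip in each subgame totals about (5.5, 5.5).

# Whole-game bargaining: the subgames can be traded against each other.
whole_game = {
    (d, f): (decorating[d][0] + food[f][0], decorating[d][1] + food[f][1])
    for d, f in product(decorating, food)
}
print(nash_best(whole_game))
# (('alice_pick', 'bob_pick'), (10, 10)) - Alice gets decorating, Bob gets food,
# which beats anything reachable by solving the subgames independently.
```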
Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way? The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical.
Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?
I don't follow the half-universe argument. Are you somehow sending the AGI outside of your light-cone? Or have you crafted the AGI utility function and altered your own to not care about the other half? I don't get the model of utility that works for this.
The only information you have about the other half is your utility.
My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.
Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked into my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences...
I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here. You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision. It's not clear whether they include any non-mo...
I think this is backward. The game's payout matrix determines the alignment. Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.
When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then "alignment" starts to make sense as a distinct measurement - it's how much the players' utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.
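One crude way to see the two extremes (my own toy metric, not anything from the post): look at how correlated the players' payoffs are across the cells of the matrix.

```python
import numpy as np

def payoff_correlation(payoffs_a, payoffs_b):
    """Correlation between the two players' payoffs across all outcomes."""
    return np.corrcoef(np.ravel(payoffs_a), np.ravel(payoffs_b))[0, 1]

# Matching pennies: fixed-sum, so payoffs are perfectly anti-correlated.
zero_sum = (np.array([[1, -1], [-1, 1]]), np.array([[-1, 1], [1, -1]]))
# Pure coordination: common-payoff, so payoffs are perfectly correlated.
common_payoff = (np.array([[2, 0], [0, 1]]), np.array([[2, 0], [0, 1]]))

print(payoff_correlation(*zero_sum))       # -1.0: maximally unaligned
print(payoff_correlation(*common_payoff))  #  1.0: maximally aligned
```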
In everyday life, "notion" implies low confidence. It's often derogatory - low confidence for the speaker, and a further implication that the holder/object of discussion doesn't even have the idea of confidence.
You might just use "proposition" or "claim" to mean the specific thing that a probability belief applies to.
Depending on your threat modeling of a given breach, this could be comforting or terrifying.
The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.
If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans do) is much higher, that changes the "economic incentives" about this. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more in line with today's incentives, then it's a fine thing.
It's worth being careful to acknowledge that this set of assumptions is far more limited than the game-theoretical underpinnings. Because it requires interpersonal utility summation, you can't normalize in the same ways, and you need to do a LOT more work to show that any given situation fits this model. Most situations and policies don't even fit the more general individual-utility model, and I suspect even fewer will fit this extension.
That said, I like having it formalized, and I look forward to the extension to multi-coalition situations. A spy can benefit Russia and the world more than they hurt the average US resident.
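To make the normalization point concrete, here's a toy case (my numbers): rescaling one person's utility - which changes nothing about their own preferences - flips which policy the interpersonal sum recommends.

```python
# (utility_A, utility_B) for two hypothetical policies.
outcomes = {"policy_X": (1.0, 3.0), "policy_Y": (2.0, 1.0)}

def best_by_sum(outs, scale_a=1.0):
    """Policy with the highest summed utility after rescaling player A's units."""
    return max(outs, key=lambda o: scale_a * outs[o][0] + outs[o][1])

print(best_by_sum(outcomes))                # policy_X (1 + 3 > 2 + 1)
print(best_by_sum(outcomes, scale_a=10.0))  # policy_Y (10 + 3 < 20 + 1)
```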
"kill all humans, then shut down" is probably the action that most minimizes change. Leaving those buggers alive will cause more (and harder to predict) change than anything else the agent might do.
There's no way to talk about this in the abstract sense of change - it has to be differential from a counterfactual (aka: causal), and can only be measured by other agents' evaluation functions. The world changes for lots of reasons, and an agent might have most of its impact by PREVENTING a change, or by FAILING to change something t...
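One crude formalization of what I mean (mine, not the post's): impact only exists as a difference against a counterfactual, evaluated through some particular agent's own function.

```python
def impact(world_if_acted, world_if_not, evaluator):
    """Impact of an action *for this evaluator*: the shift in their evaluation
    relative to the counterfactual where the agent hadn't acted (which is why
    prevention and inaction can carry most of the impact)."""
    return evaluator(world_if_acted) - evaluator(world_if_not)
```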
[note: this is bugging me more than it should. I really don't get why this is worth so much repetition of examples that don't show anything new.]
I'll admit I'm one of those who doesn't see CDT as hopeless. It takes a LOT of hypothetical setup to show cases where it fails, and neither Newcomb nor this seems to be as much about decision theory as about free will.
Part of this is my failing. I keep thinking CDT is "classical decision theory", and it means "make the best conditional predictions you can, and then maximiz...
To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.
Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are differen...
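A minimal numeric sketch (toy numbers of my own) of where "conditional on my doing A" and "intervening on my doing A" come apart, using the Newcomb setup mentioned above:

```python
ACCURACY = 0.99            # assumed predictor accuracy
BIG, SMALL = 1_000_000, 1_000

# EDT-style: treat your own action as evidence about what the predictor already did.
edt_one_box = ACCURACY * BIG
edt_two_box = ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

# CDT-style: the boxes are already filled, so intervening on your action can't
# change their contents; for any fixed prior p, two-boxing just adds SMALL.
p = 0.5                    # any prior gives the same ranking
cdt_one_box = p * BIG
cdt_two_box = p * BIG + SMALL

print(edt_one_box, edt_two_box)  # 990000.0 vs 11000.0 -> conditioning favors one-boxing
print(cdt_one_box, cdt_two_box)  # 500000.0 vs 501000.0 -> intervening favors two-boxing
```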
I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-effecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.
Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even ...
Based on other comments, I realize I'm making an assumption for something you haven't specified. How is u_y chosen? If it's random and independent, then my assertion holds; if it's selected by an adversary who knows the players' full strategies somehow, then R is just a way of keeping a secret from the adversary - sequence doesn't matter, but knowledge does.
u_y and R are independently chosen from well-defined distributions. Regardless of sequence, neither knows the other and CANNOT be chosen based on the other. I'll see if I can find time tonight to figure out whether I'm saying your claim 1 is wrong (it dropped epsilon too soon from the floor value, but I'm not sure if it's more fundamentally problematic than that) or that your claim 2 is misleading.
My current expectation is that I'll find that your claim 2 results are available in situation 1, by using your given function with a pre-agreed value rather than a random one.
I'm missing something (and I haven't digested the math, so maybe it's obvious but just missing from the narrative description). Is epsilon the same for both players, in that they see the same V and it just may not exactly match u? Or is it different for each player, meaning for the same u, they have different V? From your analysis (risk of 0), it sounds like the latter.
In that case, I don't see how additional shared knowledge helps coordinate them, nor why it needs to be random rather than just a fixed value they agree on in advance. A...
Thanks for this - I'm in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for "automate virtually all remote workers". It'll be a fair bit of time (in AI frame - a year or ten) between "labs show generality sufficient to automate most remote work" and "most remote work is actually performed by AI".