Just this guy, you know?
The distinction between "accidental" and "negligent" is always a bit political. It's a question of assignment of credit/blame for hypothetical worlds, which is pretty much impossible in any real-world causality model.
I do agree that in most discussions, "accident" often implies a single unexpected outcome, rather than a repeated risk profile and multiple moves toward the bad outcome. Even so, if it doesn't reach the level of negligence for any one actor, Eliezer's term "inadequate equilibrium" may be more accurate.
Which means that using a different word will be correctly identified as a desire to shift responsibility from "it's a risk that might happen" to "these entities are bringing that risk on all of us".
Interesting take, but I'll note that these are not acausal, just indirect-causal. Voting is a good example - counts are public, so future voters KNOW how many of their fellow citizens take it seriously enough to participate.
In all of these examples there is a signaling path to future impact, which humans are perhaps over-evolved to focus on.
I really wish you'd included the outside-of-game considerations. The example of what to eat for dinner is OVERWHELMINGLY about the future relationship between the diners, not about the result itself. This is true of all real-world bargaining (where you're making commitments and compromises) - you're giving up some immediate value in order to make future interactions way better.
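A toy sketch of that point, with made-up numbers rather than anything from the post: once you discount-and-sum the future interactions, a strategy that concedes some immediate value can easily beat the one that maximizes tonight's outcome.

```python
# Toy illustration (my own invented numbers): immediate payoff plus a
# discounted stream of future-relationship payoffs.

def total_value(immediate, future_per_round, discount, rounds):
    """Immediate payoff plus the discounted value of future interactions."""
    return immediate + sum(future_per_round * discount**t for t in range(1, rounds + 1))

# "Win" the dinner argument: best result tonight, strained future interactions.
hardball = total_value(immediate=10, future_per_round=2, discount=0.9, rounds=20)

# Concede a bit tonight: worse result now, much smoother future bargaining.
concede = total_value(immediate=6, future_per_round=5, discount=0.9, rounds=20)

print(hardball, concede)  # conceding dominates once the future relationship is priced in
```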
Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way? The examples I see confuse me because they explore what seem like edge cases, and I'm missing the underlying model that makes those cases critical.
Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?
I don't follow the half-universe argument. Are you somehow sending the AGI outside of your light-cone? Or have you crafted the AGI's utility function, and altered your own, to not care about the other half? I don't get the model of utility that works for "The only information you have about the other half is your utility."
My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.
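A minimal sketch of that conception (hypothetical names, just to pin down what I mean): utility is computed from observed features of the world, so anything that never shows up in the observations can't move the number.

```python
# Sketch of utility as a synthetic calculation over observed world-state,
# not a free-standing channel that carries information about unobserved regions.
from typing import Mapping

def utility(observed_state: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum over observed features; anything unobserved contributes nothing."""
    return sum(weights[k] * v for k, v in observed_state.items() if k in weights)

# If the other half of the universe never appears in observed_state, it can't
# change this number -- so "your utility tells you about the other half"
# seems to need a different model of utility than this one.
print(utility({"paperclips_here": 3.0}, {"paperclips_here": 1.0, "paperclips_there": 1.0}))
```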
Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked into my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player).
So the questions are: "what DO these payout numbers represent?" and "what other factors go into an agent's decision of which row/column to choose?"
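For concreteness, here's the interpretation I'm working from (a sketch using a standard Prisoner's Dilemma, not your example): each cell holds (row player's payoff, column player's payoff), the numbers matter only as a ranking of outcomes for that player, and a choice of row/column is a best response given those rankings.

```python
# Payoff matrix read as preference rankings: higher number = more preferred
# outcome for that player, nothing more.

# Prisoner's Dilemma, cells are (row payoff, column payoff):
payoffs = {
    ("cooperate", "cooperate"): (2, 2),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(column_move: str) -> str:
    """Row player's best move against a fixed column move, using only the ranking."""
    return max(["cooperate", "defect"], key=lambda row: payoffs[(row, column_move)][0])

print(best_response("cooperate"), best_response("defect"))  # defect, defect
```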
Thanks for this - I'm in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for "automate virtually all remote workers". It'll be a fair bit of time (in AI frame - a year or ten) between "labs show generality sufficient to automate most remote work" and "most remote work is actually performed by AI".