All of Dagon's Comments + Replies

Thanks for this - I'm in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for "automate virtually all remote workers".  It'll be a fair bit of time (in AI frame - a year or ten) between "labs show generality sufficient to automate most remote work" and "most remote work is actually performed by AI".

5Ryan Greenblatt
A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here. So, you might very quickly (1-2 years) go from "the AIs are great, fast, and cheap software engineers speeding up AI R&D" to "wildly superhuman AI that can achieve massive technical accomplishments".

The distinction between "accidental" and "negligent" is always a bit political.  It's a question of assignment of credit/blame for hypothetical worlds, which is pretty much impossible in any real-world causality model.

I do agree that in most discussions, "accident" often implies a single unexpected outcome, rather than a repeated risk profile and multiple moves toward the bad outcome.  Even so, if it doesn't reach the level of negligence for any one actor, Eliezer's term "inadequate equilibrium" may be more accurate. 

Which means that using a... (read more)

3David Scott Krueger
I think inadequate equilibrium is too specific and insider jargon-y.

Interesting take, but I'll note that these are not acausal, just indirect-causal.  Voting is a good example - counts are public, so future voters KNOW how many of their fellow citizens take it seriously enough to participate.  

In all of these examples, there is a signaling path to future impact.  Which humans are perhaps over-evolved to focus on.

1Gordon Seidoh Worley
Right. Nothing that happens in the same Hubble volume can really be said to not be causally connected. Nonetheless I like the point of the OP even if it's made in an imprecise way.
3Adam Jermyn
Hmmmm. I agree that there is a signal path to future impact (at least in voting). Two responses there:
1. There isn't such a signal in recycling. I have no idea how much my town recycles. Ditto for carbon offsets. How many of my closest friends offset the carbon from their flights? I have no idea.
2. Counts being public tells me how many people voted, but there's something a little funny there. There's almost no signal from my vote in there (concretely, I don't think my vote changes the number from one that tells other people "voting isn't worth it" to "voting is worth it").

I notice I'm confused how to think about this though, and maybe you can clarify/expand on your indirect signal point?

I really wish you'd included the outside-of-game considerations.  The example of what to eat for dinner is OVERWHELMINGLY about the future relationship between the diners, not about the result itself.  This is true of all real-world bargaining (where you're making commitments and compromises) - you're giving up some immediate value in order to make future interactions way better.

Agreed. The bargaining solution for the entire game can be very different from adding up the bargaining solutions for the subgames. If there's a subgame where Alice cares very much about victory in that subgame (interior decorating choices) and Bob doesn't care much, and another subgame where Bob cares very much about it (food choice) and Alice doesn't care much, then the bargaining solution of the entire relationship game will end up being something like "Alice and Bob get some relative weights on how important their preferences are, and in all the subgam... (read more)
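To make the whole-game-vs-subgames point concrete, here's a toy numerical sketch (hypothetical utilities of my own construction, assuming Nash bargaining over lotteries with a (0, 0) disagreement point): bargaining each subgame separately leaves value on the table that bargaining over the combined game captures.

```python
import numpy as np

# Hypothetical utilities. Decor subgame: option X (Alice's favorite) vs Y (Bob's).
# Food subgame: option P (Bob's favorite) vs Q (Alice's). Cells are (Alice, Bob).
decor = {"X": (10, 0), "Y": (0, 2)}
food = {"P": (0, 10), "Q": (2, 0)}

def nash_solution(options, grid=1001):
    """Nash bargaining over lotteries between two options, with a (0, 0)
    disagreement point: maximize the product of the players' utilities."""
    (a1, b1), (a2, b2) = options.values()
    _, p = max(
        ((p * a1 + (1 - p) * a2) * (p * b1 + (1 - p) * b2), p)
        for p in np.linspace(0, 1, grid)
    )
    return (p * a1 + (1 - p) * a2, p * b1 + (1 - p) * b2)

# Bargain each subgame separately, then add up the results.
separate = tuple(x + y for x, y in zip(nash_solution(decor), nash_solution(food)))

# Bargain over the whole game: consider every joint (decor, food) choice.
joint_outcomes = [
    (decor[d][0] + food[f][0], decor[d][1] + food[f][1]) for d in decor for f in food
]
joint = max(joint_outcomes, key=lambda u: u[0] * u[1])

print("separate subgame bargains, summed:", separate)  # ~(6.0, 6.0)
print("whole-game bargain:", joint)  # (10, 10): each player wins the subgame they care about
```

In this toy setup both players strictly prefer the whole-game bargain, which is the "relative weights across subgames" effect described above.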

Thanks for your patience with this. I am still missing some fundamental assumption or framing about why this is non-obvious (IMO, either the Oracle is wrong, or the choice is illusory).  I'll continue to examine the discussions and examples in hopes that it will click.
 

1Chris_Leong
I presume Vladimir and I are likely discussing this from within the determinist paradigm in which "either the Oracle is wrong, or the choice is illusory" doesn't apply (although I propose a similar idea in Why 1-boxing doesn't imply backwards causation).
0Vladimir Nesov
This is similar to determinism vs. free will, and suggests the following example. The Oracle proclaims: "The world will follow the laws of physics!". But in the counterfactual where an agent takes a decision that won't actually be taken, the fact of taking that counterfactual decision contradicts the agent's cognition following the laws of physics. Yet we want to think about the world within the counterfactual as if the laws of physics are followed.

Hmm.  So does this only apply to CDT agents, who foolishly believe that their decision is not subject to predictions?

2Vladimir Nesov
No, I suspect it's a correct ingredient of counterfactuals, one I didn't see discussed before, not an error restricted to a particular decision theory. There is no contradiction in considering each of the counterfactuals as having a given possible decision made by the agent and satisfying the Oracle's prediction, as the agent doesn't know that it won't make this exact decision. And if it does make this exact decision, the prediction is going to be correct, just like the possible decision indexing the counterfactual is going to be the decision actually taken. Most decision theories allow explicitly considering different possible decisions, and adding correctness of the Oracle's prediction into the mix doesn't seem fundamentally different in any way, it's similarly sketchy.

Is there an ELI5 doc about what's "normal" for Oracles, and why they're constrained in that way?  The examples I see confuse me in that they are exploring what seem like edge cases, and I'm missing the underlying model that makes these cases critical.

Specifically, when you say "It's only guaranteed to be correct on the actual decision", why does the agent not know what "correct" means for the decision?

1Vladimir Nesov
The agent knows what "correct" means, correctness of a claim is defined for the possible worlds that the agent is considering while making its decision (which by local tradition we confusingly collectively call "counterfactuals", even though one of them is generated by the actual decision and isn't contrary to any fact). In the post Chris_Leong draws attention to the point that since the Oracle knows which possible world is actual, there is nothing forcing its prediction to be correct on the other possible worlds that the agent foolishly considers, not knowing that they are contrary to fact. And my point in this thread is that despite the uncertainty it seems like we have to magically stipulate correctness of the Oracle on all possible worlds in the same way that we already magically stipulate the possibility of making different decisions in different possible worlds, and this analogy might cast some light on the nature of this magic.

Sure, that's a sane Oracle.  The Weird Oracle used in so many thought experiments doesn't say "The taxi will arrive in one minute!"; it says "You will grab your coat in time for the taxi."

2Vladimir Nesov
No, this is an important point: the agent normally doesn't know the correctness scope of the Oracle's prediction. It's only guaranteed to be correct on the actual decision, and can be incorrect in all other counterfactuals. So if the agent knows the boundaries of the correctness scope, they may play chicken and render the Oracle wrong by enacting the counterfactual where the prediction is false. And if the agent doesn't know the boundaries of the prediction's correctness, how are they to make use of it in evaluating counterfactuals? It seems that the way to reason about this is to stipulate correctness of the prediction in all counterfactuals, even though it's not necessarily correct in all counterfactuals, in the same way as the agent's decision that is being considered is stipulated to be different in different counterfactuals, even though the algorithm forces it to be the same. So it's a good generalization of the problem of formulating counterfactuals, it moves the intervention point from agent's own decisions to correctness of powerful predictors' claims. These claims act on the counterfactuals generated by the agent's own decisions, not on the counterfactuals generated by delivery of possible claims, so it's not about merely treating predictors as agents, it's a novel setup.

I don't follow the half-universe argument.  Are you somehow sending the AGI outside of your light-cone?  Or have you crafted the AGI utility function and altered your own to not care about the others' half?  I don't get the model of utility that works for 

The only information you have about the other half is your utility.

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

3Alex Turner
Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment. Well, maybe here's a better way of communicating what I'm after: Suppose that you have beliefs about the initial state of the right (AGI) half, and you know how it's going to evolve; this gives you a distribution over right-half universe histories - you have beliefs about the AGI's initial state, and you can compute the consequences of those beliefs in terms of how the right half of the universe will end up.  In this way, you can take expected utility over the joint universe history, without being able to observe what's actually happening on the AGI's end. This is similar to how I prefer "start a universe which grows to be filled with human flourishing" over "start a universe which fills itself with suffering", even though I may not observe the fruits of either decision. Is this clearer?
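A minimal sketch of that last point (hypothetical states, probabilities, and utilities; none of this is from the comment): the expected utility over the unobservable half is just beliefs about its initial state pushed through the known evolution rule.

```python
# Hypothetical states, beliefs, and utilities (not from the comment).
belief = {"aligned_seed": 0.7, "misaligned_seed": 0.3}  # beliefs about the AGI half's initial state

def evolve(initial_state):
    """Assumed-known, deterministic evolution of the unobservable half."""
    return "flourishing" if initial_state == "aligned_seed" else "suffering"

utility = {"flourishing": 1.0, "suffering": -1.0}  # my utility over right-half histories

# Expected utility over the joint history, with no observation of the right half.
print(sum(p * utility[evolve(s0)] for s0, p in belief.items()))  # 0.7 - 0.3 = 0.4
```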

Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates.   http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked in to my understanding to the point that I don't know where I first saw it.  I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences... (read more)

2Alex Turner
Right, thanks! 1. I think I agree that payout represents player utility. 2. The agent's decision can be made in any way. Best response, worst response, random response, etc. I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile P1: rock, P2: scissors, I'm not assuming that P2 would respond to this by playing paper.  And when I talk about 'responses', I do mean 'response' in the 'best response' sense; the same way one can reason about Nash equilibria in non-iterated games, we can imagine asking "how would the player respond to this outcome?". Another point for triangulating my thoughts here is Vanessa's answer, which I think resolves the open question.  

I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.  You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision.  It's not clear whether they include any non-mo... (read more)

0Alex Turner
Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?

Payout correlation IS the metric of alignment.  A player who isn't trying to maximize their (utility) payout is actually not playing the game you've defined.    You're simply incorrect (or describing a different payout matrix than you state) that a player doesn't "have to select a best response".

2Alex Turner
Do you have a citation? You seem to believe that this is common knowledge among game theorists, but I don't think I've ever encountered that. Jacob and I have already considered payout correlation, and I agree that it has some desirable properties. However,
* it's symmetric across players,
* it's invariant to player rationality
  * which matters, since alignment seems to not just be a function of incentives, but of what-actually-happens and how that affects different players
* it equally weights each outcome in the normal-form game, ignoring relevant local dynamics. For example, what if part of the game table is zero-sum, and part is common-payoff? Correlation then can be controlled by zero-sum outcomes which are strictly dominated for all players. For example:

  1 / 1    || 2 / 2
  -.5 / .5 || 1 / 1

and so I don't think it's a slam-dunk solution. At the very least, it would require significant support. Why? I suppose it's common to assume (a kind of local) rationality for each player, but I'm not interested in assuming that here. It may be easier to analyze the best-response case as a first start, though.
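To make the correlation objection concrete, here's a quick sketch (my own, reading "payout correlation" as the Pearson correlation of the two players' payouts over the equally weighted cells of the normal-form game), using the 2x2 table above:

```python
import numpy as np

# Payoff cells from the 2x2 table above, as (P1 payout, P2 payout).
cells = [(1, 1), (2, 2), (-0.5, 0.5), (1, 1)]
p1 = np.array([c[0] for c in cells])
p2 = np.array([c[1] for c in cells])

# Pearson correlation over all four equally weighted outcomes.
print(np.corrcoef(p1, p2)[0, 1])  # ~0.93

# Drop the (-0.5, 0.5) cell, which is only reached via strictly dominated strategies.
print(np.corrcoef(p1[[0, 1, 3]], p2[[0, 1, 3]])[0, 1])  # exactly 1.0
```

The dominated cell alone moves the metric from 1.0 to roughly 0.93, which is the "controlled by strictly dominated outcomes" worry.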
1gjm
I think "X and Y are playing a game of stag hunt" has multiple meanings. The meaning generally assumed in game theory when considering just a single game is that the outcomes in the game matrix are utilities. In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.) But there are some other situations that might be described by saying that X and Y are playing stag hunt. Maybe we are playing an iterated stag hunt. Then (by definition) what I care about is still some sort of aggregation of per-round outcomes, and (by definition) each round's outcome still has (stag,stag) best for me, etc. -- but now I need to strategize over the whole course of the game, and e.g. maybe I think that on a particular occasion choosing "hare" when you chose "stag" will make you understand that you're being punished for a previous choice of "hare" and make you more likely to choose "stag" in future. Or maybe we're playing an iterated iterated stag hunt. Now maybe I choose "hare" when you chose "stag", knowing that it will make things worse for me over subsequent rounds, but hoping that other people looking at our interactions will learn the rule Don't Fuck With Gareth and never, ever choose anything other than "stag" when playing with me. Or maybe we're playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many dollars we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or $10 and figure I might as well try to maximize your payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial s
Answer by Dagon20

I think this is backward.  The game's payout matrix determines the alignment.  Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.  

When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then "alignment" starts to make sense as a distinct measurement - it's how much the players' utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.

2Alex Turner
I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?  Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned with you" than the partner who responds 'hare' to stag/stag.
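A small numerical sketch of that response-policy point (the payoff numbers are conventional stag hunt values I picked, not from the thread): the matrix is fixed, but partners with different policies leave the row player with very different outcomes.

```python
# Conventional stag hunt numbers (my choice): (my_move, partner_move) -> (my payout, partner payout).
payoffs = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

# Each partner is a policy from a strategy profile to a response.
partner_a = {("stag", "stag"): "hare"}  # not a best response for them; minimizes my payout
partner_b = {("stag", "stag"): "stag"}

profile = ("stag", "stag")
for name, partner in [("A", partner_a), ("B", partner_b)]:
    response = partner[profile]
    mine, theirs = payoffs[("stag", response)]
    print(f"partner {name} responds {response!r}: I get {mine}, they get {theirs}")
# Partner A leaves me with 0 (taking 3); partner B gives us both 4.
# Same payoff matrix, very different relationship -- the policy matters, not just the matrix.
```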

In everyday life, "notion" implies low confidence.  It's often derogatory - low confidence for the speaker, and a further implication that the holder/object of discussion doesn't even have the idea of confidence.

You might just use "proposition" or "claim" to mean the specific thing that a probability belief applies to.

1David Scott Krueger
True, but it seems the meaning I'm using it for is primary:  

Depending on your threat modeling of a given breach, this could be comforting or terrifying.

The economic incentives to attack and to defend are usually similar. Systems get broken sometimes but not always.

If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans do) is much higher, that changes the "economic incentives" about this. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more in line with today's incentives, it's a fine thing.

1Donald Hobson
In today's banking systems, the amount of money the hacker gains is about what the bank loses. Therefore, the current balance of effort should have about as much money going into defending the bank and attacking it.  So I generalize to say that attacking is about as hard as defending in computer security: if the time and intellect doing both are similar, the attacker wins about half the time (i.e. between 10% and 90% or something). When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be economically viable. (Or you would pay the security team to do the task directly.) For a special purpose AI that only recognises images, this is fine. For a significantly superhuman AI, it will rip through all your security. For a general AI of about human level, if the amount of work it is doing is significantly greater than the amount of work your security team did, then it has probably put more thought into your security vulnerabilities than the humans have.  And yes, I am operating on the background assumption that the default behaviour of an escaped AI is to destroy the world. Especially if it has access to loads of compute, and the humans have given it an intelligence dial.

It's worth being careful to acknowledge that this set of assumptions is far more limited than the game-theoretical underpinnings. Because it requires interpersonal utility summation, you can't normalize in the same ways, and you need to do a LOT more work to show that any given situation fits this model. Most situations and policies don't even fit the more general individual-utility model, and I suspect even fewer will fit this extension.

That said, I like having it formalized, and I look forward to the extension to multi-coalition situations. A spy can benefit Russia and the world more than they hurt the average US resident.
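To illustrate why interpersonal summation blocks the usual normalization (hypothetical numbers of my own): each player's utility function is only defined up to a positive affine transform, but rescaling one player changes which policy maximizes the sum.

```python
# Hypothetical utilities of two policies for two people: (person_a, person_b).
policy_1 = (1.0, 5.0)
policy_2 = (3.0, 2.0)

def total(policy, scale_a=1.0, scale_b=1.0):
    a, b = policy
    return scale_a * a + scale_b * b

# With the raw numbers, policy_1 has the higher total:
print(total(policy_1), total(policy_2))  # 6.0 vs 5.0

# Rescale person_a's utility by 3 -- the same individual preferences
# (positive affine transform), but the summed ranking flips:
print(total(policy_1, scale_a=3.0), total(policy_2, scale_a=3.0))  # 8.0 vs 11.0
```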

1Alex Turner
I very much agree that interpersonal utility comparability is a strong assumption. I'll add a note.

"kill all humans, then shut down" is probably the action that most minimizes change. Leaving those buggers alive will cause more (and harder to predict) change than anything else the agent might do.

There's no way to talk about this in the abstract sense of change - it has to be differential from a counterfactual (aka: causal), and can only be measured by other agents' evaluation functions. The world changes for lots of reasons, and an agent might have most of its impact by PREVENTING a change, or by FAILING to change something t... (read more)

1Alex Turner
I agree it doesn't make sense to talk about this kind of change as what we want impact measures to penalize, but I think you could talk about this abstract sense of change. You could have an agent with beliefs about the world state, and some distance function over world states, and then penalize change in observed world state compared to some counterfactual. This kind of change isn't the same thing as perceived impact, however.
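Here's a minimal toy sketch of that construction (my own, not any particular published proposal; it assumes a known transition model, feature-vector states, and Euclidean distance):

```python
import numpy as np

def transition(state, action):
    """Assumed-known world model: state is a feature vector, actions add to it."""
    return state + action

def penalized_value(state, action, reward_fn, noop, weight=1.0):
    """Reward minus a penalty on distance from the no-op counterfactual state."""
    next_state = transition(state, action)
    counterfactual = transition(state, noop)  # what the world would have looked like anyway
    penalty = np.linalg.norm(next_state - counterfactual)  # distance function over world states
    return reward_fn(next_state) - weight * penalty

state = np.zeros(2)
noop = np.zeros(2)
reward_fn = lambda s: s[0]  # rewarded for pushing up the first feature

big_action = np.array([5.0, 5.0])    # high reward, large side effects
small_action = np.array([1.0, 0.0])  # modest reward, small change

print(penalized_value(state, big_action, reward_fn, noop))    # 5 - sqrt(50) ≈ -2.07
print(penalized_value(state, small_action, reward_fn, noop))  # 1 - 1 = 0.0
```

As the comment notes, this penalizes abstract change relative to a counterfactual, which is not the same thing as perceived impact.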

[note: this is bugging me more than it should. I really don't get why this is worth so much repetition of examples that don't show anything new.]

I'll admit I'm one of those who doesn't see CDT as hopeless. It takes a LOT of hypothetical setup to show cases where it fails, and neither Newcomb nor this seems to be as much about decision theory as about free will.

Part of this is my failing. I keep thinking CDT is "classical decision theory", and it means "make the best conditional predictions you can, and then maximiz... (read more)

3Stuart Armstrong
You may like this, then: https://www.lesswrong.com/posts/9m2fzjNSJmd3yxxKG/acdt-a-hack-y-acausal-decision-theory
To summarize my confusion, does CDT require that the agent unconditionally believe in perfect free will independent of history (and, ironically, with no causality for the exercise of will)? If so, that should be the main topic of dispute - the frequency of actual cases where it makes bad predictions, not that it makes bad decisions in ludicrously-unlikely-and-perhaps-impossible situations.

Sorta, yes. CDT requires that you choose actions not by thinking "conditional on my doing A, what happens?" but rather by some other method (there are differen... (read more)
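A worked version of that distinction on the standard Newcomb setup (the usual illustrative numbers: a 99%-accurate predictor, $1M in the opaque box, $1k in the transparent one) shows where "conditioning on my action" and "holding the world fixed" come apart:

```python
# EDT-style: condition on my action, using the predictor's accuracy.
acc = 0.99
ev_one_box_conditional = acc * 1_000_000                                # 990,000
ev_two_box_conditional = (1 - acc) * (1_000_000 + 1_000) + acc * 1_000  # 11,000

# CDT-style: the box contents are causally fixed before I act, so for any
# probability p that the opaque box is full, two-boxing adds exactly $1,000.
p = 0.5  # the comparison is the same for every p
ev_one_box_causal = p * 1_000_000
ev_two_box_causal = p * 1_000_000 + 1_000

print(ev_one_box_conditional, ev_two_box_conditional)  # conditioning favors one-boxing
print(ev_one_box_causal, ev_two_box_causal)            # holding the world fixed favors two-boxing
```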

I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-effecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our prediction and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even ... (read more)

3Daniel Kokotajlo
I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situations arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate. I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not. I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, if I don't know anything and decide to commit to plan X which benefits me, else war, and you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly, then you'll go along with my plan.

Nope. Random choice gives a specific value for R each game. The outcome for that iteration is IDENTICAL to the outcome if that R was chosen intentionally. Randomness only has game value as a mechanism to keep information from an adversarial actor.

1Jessica Taylor
To be clear, by "worst-case guarantee" I mean "the expected utility is guaranteed to be pretty good regardless of u_y", which is unattainable without shared randomness (claim 1). I think you are either misunderstanding or disagreeing with a lot of the terminology on randomized algorithms and worst-case guarantees that are commonly used in CS and statistics. This article is a decent introduction to this topic.

Sure, but in non-adversarial cases (really, any cases where u is determined independently of the strategies chosen), the players can just choose R as a fixed part of the strategy, rather than a random shared component determined later.

3Jessica Taylor
That's right, but getting the worst-case guarantee requires this initial choice to be random.

Based on other comments, I realize I'm making an assumption for something you haven't specified. How is u_y chosen? If it's random and independent, then my assertion holds; if it's selected by an adversary who knows the players' full strategies somehow, then R is just a way of keeping a secret from the adversary - sequence doesn't matter, but knowledge does.

1Jessica Taylor
Claim 1 says there exists some u_y value for which the algorithm gets high regret, so we might as well assume it's chosen to maximize regret. Claim 2 says the algorithm has low regret regardless of u_y, so we might as well assume it's chosen to maximize regret.

u_y and R are independently chosen from well-defined distributions. Regardless of sequence, neither knows the other and CANNOT be chosen based on the other. I'll see if I can find time tonight to figure out whether I'm saying your claim 1 is wrong (it dropped epsilon too soon from the floor value, but I'm not sure if it's more fundamentally problematic than that) or that your claim 2 is misleading.

My current expectation is that I'll find that your claim 2 results are available in situation 1, by using your given function with a pre-agreed value rather than a random one.

0Laszlo Treszkai
True, they will fail to cooperate for some R, but the values of such R have a low probability. (But yeah, it's also required that u_y and R are chosen independently—otherwise an adversary could just choose either so that it results in the players choosing different actions.) The smoothness comes in from marginalising a random R. The coordination comes from making R and ε common knowledge, so they cooperate using the correlation in their observations—an interesting phenomenon. (How can I write LaTeX in the comments?)
2Rohin Shah
The theorems are of the form "For all u_y, you get good outcomes" or "There exists a u_y that causes bad outcomes". When you want to prove statements of this form, u_y is chosen adversarially, so it matters whether it is chosen before or after R. What distribution is u_y chosen from? That's not specified anywhere in the post.
R ∼ Uniform([0,1])

How can it possibly matter whether R is chosen before or after uy? R is completely independent of u, right? It's not a covert communication mechanism about the players' observations, it's a random value.

1Jessica Taylor
If u_y is chosen after R then it might be chosen to depend on R in such a way that the algorithm gets bad performance, e.g. using the method in the proof of Claim 1.

Sure, but that goes for a randomly-chosen R too. For every possible R, there is a u value for which they get bad outcomes. It doesn't get better by randomly choosing R.

1Jessica Taylor
The assumption is that R is chosen after u_y. So for every u_y the pair of policies gets a good expected utility. See the point on Bayesian algorithms in the conclusion for more on why "get a high expected utility regardless of u_y" might be a desirable goal.

I'm missing something (and I haven't digested the math, so maybe it's obvious but just missing from the narrative description). Is epsilon the same for both players, in that they see the same V, which just may not exactly match u? Or is it different for each player, meaning that for the same u, they have different V? From your analysis (risk of 0), it sounds like the latter.

In that case, I don't see how additional shared knowledge helps coordinate them, nor why it needs to be random rather than just a fixed value they agree on in advance. A... (read more)

1Donald Hobson
If you assume a fixed probability distribution over possible u_y that both players know when coordinating, then they can set up the rules they choose to make sure that they probably win. The extra random information is only useful because of the implicit "for all u_y". If some malicious person had overheard their strategy, and was allowed to choose u_y, but didn't have access to the random number source, then the random numbers are useful.
2Jessica Taylor
ϵ is the same for both players, but V_1 and V_2 (the players' observations of u_y) are different, both sampled independently and uniformly from [u_y − ϵ, u_y + ϵ]. If they decide on R = 0.5 then there exists some u_y value for which they get a bad expected utility (see Claim 1).
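To see both claims in one place, here's a toy simulation (my own reconstruction, not the post's exact protocol; it assumes the simple shared-threshold rule "act iff V_i ≥ R"): with a pre-agreed fixed R there is always some u_y near the threshold that makes the players miscoordinate about half the time, while a shared random R drawn after u_y keeps the miscoordination probability small for every u_y.

```python
import random

def miscoordination_rate(u_y, threshold_fn, eps=0.05, trials=100_000):
    """Fraction of rounds where the two players pick different actions."""
    bad = 0
    for _ in range(trials):
        r = threshold_fn()                    # shared threshold: fixed, or drawn after u_y
        v1 = u_y + random.uniform(-eps, eps)  # player 1's noisy observation of u_y
        v2 = u_y + random.uniform(-eps, eps)  # player 2's noisy observation of u_y
        if (v1 >= r) != (v2 >= r):            # they chose different actions
            bad += 1
    return bad / trials

# Pre-agreed fixed threshold R = 0.5: fine for most u_y ...
print(miscoordination_rate(u_y=0.8, threshold_fn=lambda: 0.5))  # ~0.0
# ... but a u_y sitting right on the threshold breaks it (Claim 1):
print(miscoordination_rate(u_y=0.5, threshold_fn=lambda: 0.5))  # ~0.5

# Shared random R ~ Uniform([0,1]), drawn after u_y: small for every u_y (Claim 2).
print(miscoordination_rate(u_y=0.5, threshold_fn=lambda: random.random()))  # ~0.03, at most 2*eps
```

Under this assumed rule, the players only disagree when R lands between their two observations, and the chance of that is proportional to |V_1 − V_2| ≤ 2ϵ regardless of u_y, which is where the smoothness from marginalising over R comes from.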