AI ALIGNMENT FORUM
AF

All of Scott Garrabrant's Comments + Replies

Here are the most interesting things about these objects to me that I think this post does not capture.

Given a distribution over non-negative non-identically-zero infrafunctions, up to a positive scalar multiple, the pointwise geometric expectation exists, and is an infra function (up to a positive scalar multiple).

(I am not going to give all the math and be careful here, but hopefully this comment will provide enough of a pointer if someone wants to investigate this.)

This is a bit of a miracle. Compare this with arithmetic expectation of utility fun... (read more)

Infrafunctions and Robust Optimization

Scott Garrabrant2y71

I have been thinking about this same mathematical object (although with a different orientation/motivation) as where I want to go with a weaker replacement for utility functions.

I get the impression that for Diffractor/Vanessa, the heart of a concave-value-function-on-lotteries is that it represents the worst case utility over some set of possible utility functions. For me, on the other hand, a concave value function represents the capacity for compromise -- if I get at least half the good if I get what I want with 50% probability, then I have the capacity... (read more)

Concave Utility Question

Scott Garrabrant2y40

Then it is equivalent to the thing I call B2 in edit 2 in the post (Assuming A1-A3).

In this case, your modified B2 is my B2, and your B3 is my A4, which follows from A5 assuming A1-A3 and B2, so your suspicion that these imply C4 is stronger than my Q6, which is false, as I argue here.

However, without A5, it is actually much easier to see that this doesn't work. The counterexample here satisfies my A1-A3, your weaker version of B2, your B3, and violates C4.

Concave Utility Question

Scott Garrabrant2y20

Your B3 is equivalent to A4 (assuming A1-3).

Concave Utility Question

Scott Garrabrant2y62

Your B2 is going to rule out a bunch of concave functions. I was hoping to only use axioms consistent with all (continuous) concave functions.

2Vanessa Kosoy2y

Oops. What if instead of "for any p" we go with "there exists p"?

Concave Utility Question

Scott Garrabrant2y20

I am skeptical that it will be possible to salvage any nice VNM-like theorem here that makes it all the way to concavity. It seems like the jump necessary to fix this counterexample will be hard to express in terms of only a preference relation.

Concave Utility Question

Answer by Scott GarrabrantApr 15, 202330

The answers to Q3, Q4 and Q6 are all no. I will give a sketchy argument here.

Consider the one dimensional case, where the lotteries are represented by real numbers in the interval $L = [0, 1]$ , and consider the function $u : L \to [0, 1]$ given by $u (x) = \frac{1}{2} - (x - \frac{1}{3})^{3} (x - \frac{2}{3})$ . Let $⪰$ be the preference order given by $x ⪰ y$ if and only if $u (x) \geq u (y)$ .

$u$ is continuous and quasi-concave, which means $⪰$ is going to satisfy A1, A2, A3, A4, and B2. Further, since $u$ is monotonically increasing up to the unique argmax, and ... (read more)

2Scott Garrabrant2y

Concave Utility Question

Scott Garrabrant2y30

You can also think of A5 in terms of its contrapositive: For all $A, B \in L$ , if $A ≻ B$ , then for all $0 < p \leq 1$ $A ≻ p A + (1 - p) B$

This is basically just the strict version of A4. I probably should have written it that way instead. I wanted to use $⪰$ instead of $≻$ , because it is closer to the base definition, but that is not how I was natively thinking about it, and I probably should have written it the way I think about it.

Concave Utility Question

Scott Garrabrant2y30

Alex's counterexample as stated is not a counterexample to Q4, since it is in fact concave.

I believe your counterexample violates A5, taking $B = \neg X$ , $A = X$ , and $p = \frac{1}{2}$ .

1James Payor2y

Seems right, oops! A5 is here saying that if any part of my u is flat it had better stay flat! I think I can repair my counterexample but looks like you've already found your own.

Concave Utility Question

Scott Garrabrant2y52

That does not rule out your counterexample. The condition is never met in your counterexample.

4Alex Mennen2y

Oh, derp. You're right.

Concave Utility Question

Answer by Scott GarrabrantApr 15, 202320

The answer to Q1 is no, using the same counter example here. However, the spirit of my original question lives on in Q4 (and Q6).

Concave Utility Question

Answer by Scott GarrabrantApr 15, 202320

Claim: A1, A2, A3, A5, and B2 imply A4.

Proof: Assume we have a preference ordering that satisfies A1, A2, A3, A5, and B2, and consider lotteries $A, B \in L$ , and $p \in [0, 1]$ , with $A ⪰ B$ . Let $C = p A + (1 - p) B$ . It suffices to show $C ⪰ B$ . Assume not, for the purpose of contradiction. Then (by axiom A1), $B ≻ C$ . Thus by axiom B2 there exists a $D \in L$ such that $B ≻ D ≻ C$ . By axiom A3, we may assume $D = q B + (1 - q) C$ for some $q \in [0, 1]$ . Observe that $C = r A + (1 - r) D$ where $r = \frac{p q}{1 - p + p q} \in [0, 1]$ . $r$ is positive, since otherwise... (read more)

Concave Utility Question

Scott Garrabrant2y40

Oh, nvm, that is fine, maybe it works.

Concave Utility Question

Scott Garrabrant2y40

Oh, no, I made a mistake, this counterexample violates A3. However, the proposed fix still doesn't work, because you just need a function that is decreasing in probability of $x$ , but does not hit 0, and then jumps to 0 when probability of $x$ is 1.

4Scott Garrabrant2y

Oh, nvm, that is fine, maybe it works.

Concave Utility Question

Scott Garrabrant2y20

I haven't actually thought about whether A5 implies A4 though. It is plausible that it does. (together with A1-A3, or some other simple axioms,)

When $A ≻ B$ , we get A4 from A5, so it suffices to replace A4 with the special case that $A \sim B$ . If $A \sim B$ , and $A, B ≻ X$ , a mixture of $A$ and $B$ , then all we need to do is have any Y such that $A ≻ Y ≻ X$ , then we can get $Y^{'}$ between $A$ and $X$ by A3, and then $X$ will also be a mixture of $Y^{'}$ and $B$ , contradicting A5, since $B ≻ Y^{'}$ .

A1,A2,A3,A5 do ... (read more)

Concave Utility Question

Scott Garrabrant2y20

(and everywhere you say "good" and "bad", they are the non-strict versions of the words)

1James Payor2y

yep!

Concave Utility Question

Scott Garrabrant2y30

Your understanding of A4 is right. In A5, "good" should be replaced with "bad."

1James Payor2y

Okay, I now think A5 implies: "if moving by Δ is good, then moving by any negative multiple −nΔ is bad". Which checks out to me re concavity.

2Scott Garrabrant2y

(and everywhere you say "good" and "bad", they are the non-strict versions of the words)

Concave Utility Question

Scott Garrabrant2y30

You have the inequality backwards. You can't apply A5 when the mixture is better than the endpoint, only when the mixture is worse than the endpoint.

1James Payor2y

Got it, thanks!

Concave Utility Question

Scott Garrabrant2y40

That proposed axiom to add does not work. Consider the function on lotteries over ${x, y, z}$ that gives utility 1 if $z$ is supported, and otherwise gives utility equality to the probability of $x$ . This function is concave but not continuous, satisfies A1-A5 and the extra axiom I just proposed, and cannot be made continuous.

4Scott Garrabrant2y

Oh, no, I made a mistake, this counterexample violates A3. However, the proposed fix still doesn't work, because you just need a function that is decreasing in probability of x, but does not hit 0, and then jumps to 0 when probability of x is 1.

Concave Utility Question

Scott Garrabrant2y40

I edited the post to remove the continuity assumption from the main conclusion. However, my guess is that if we get a VNM-like result, we will want to add back in another axiom that gives us continuity,

Concave Utility Question

Scott Garrabrant2y40

I meant the conclusions to all be adding to the previous one, so this actually also answers the main question I stated, by violating continuity, but not the main question I care about. I will edit the post to say that I actually care about concavity, even without continuity.

4Scott Garrabrant2y

Concave Utility Question

Scott Garrabrant2y40

Nice! This, of course, seems like something we should salvage, by e.g. adding an axiom that if A is strictly preferred to B, there should be a lottery strictly between them.

3Alex Mennen2y

I think the way I would rule out my counterexample is by strengthening A3 to if A≻B and B≻C then there is p∈(0,1)...

4Scott Garrabrant2y

That proposed axiom to add does not work. Consider the function on lotteries over {x,y,z} that gives utility 1 if z is supported, and otherwise gives utility equality to the probability of x. This function is concave but not continuous, satisfies A1-A5 and the extra axiom I just proposed, and cannot be made continuous.

Concave Utility Question

Scott Garrabrant2y*20

To see why A1-A4 is not enough to prove C4 on its own, consider the preference relation on the space of lotteries between two outcomes X and Y such that all lotteries are equivalent if $P (X) \leq \frac{1}{2}$ , and if $P (X) \geq \frac{1}{2}$ , higher values of $P (X)$ are preferred. This satisfies A1-A4, but cannot be expressed with a concave function, since we would have to have $u (\frac{X + Y}{2}) = u (X) < \frac{u (X) + u (Y)}{2}$ , contradicting concavity. We can, however express it with a quasi-concave function: $U (p X + (1 - p) Y) = max (0, p - \frac{1}{2})$ .

Concave Utility Question

Scott Garrabrant2y30

I believe using A4 (and maybe also A5) in multiple places will be important to proving a positive result. This is because A1, A2, and A3 are extremely week on their own.

A1-A3 is not even enough to prove C1. To see a counterexample, take any well ordering on $R / Q$ , and consider the preference ordering over the space of lotteries on a two element set of deterministic outcomes. If two lotteries have probabilities of the first outcome that differ by a rational number, they are equivalent, otherwise, you compare them according to your well ordering. Th... (read more)

Finite Factored Sets

Scott Garrabrant2y22

This underrated post is pretty good at explaining how to translate between FFSs and DAGs.

Logical induction for software engineers

Scott Garrabrant2y133

I believe this post is (for the most part) accurate and demonstrates understanding of what is going on with logical induction. Thanks for writing (and coding) it!

3Alex Flint2y

Thanks Scott

The Least Controversial Application of Geometric Rationality

Scott Garrabrant2y20

I think your numbers are wrong, and the right column on the output should say 20% 20% 20%.

The output actually agrees with each of the components on every event in that component's sigma algebra. The input distributions don't actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn't disagree with either.

I agree that the 0s are a bit unfortunate.

I think the best way to think of the type of the object you get out is not a probability distribution on $W,$ but what I am calling a partial probability distribut... (read more)

1Vivek Hebbar2y

Yeah, the right column should obviously be all 20s. There must be a bug in my code[1] :/ Take the following hypothesis h3: If I add this into P with weight 10−9, then the middle column is still nearly zero. But I can now ask for the probablity of the event in h3 corresponding to the center square, and I get back an answer very close to zero. Where did this confidence come from? I guess I'm basically wondering what this procedure is aspiring to be. Some candidates I have in mind: 1. Extension to the coarse case of regular hypothesis mixing (where we go from P(w) and Q(w) to aP(w)+(1−a)Q(w)) 2. Extension of some kind of Bayesian update-flavored thing where we go to P(w)Q(w) then renormalize 1. ETA: P(w)aQ(w)1−a seems more plausible than P(w)Q(w) 3. Some kind of "aggregation of experts who we trust a lot unless they contradict each other", which isn't cleanly analogous to either of the above Even in case 3, the near-zeros are really weird. The only cases I can think of where it makes sense are things like "The events are outcomes of a quantum process. Physics technique 1 creates hypothesis 1, and technique 2 creates hypothesis 2. Both techniques are very accurate, and the uncertainity they express is due to fundamental unknowability. Since we know both tables are correct, we can confidently rule out the middle column, and thus rule out certain events in hypothesis 3." But more typically, the uncertainity is in the maps of the respective hypotheses, not in the territory, in which case the middle zeros seem unfounded. And to be clear, the reason it seems like a real issue[2] is that when you add in hypothesis 3 you have events in the middle which you can query, but the values can stay arbitrarily close to zero if you add in hypothesis 3 with low weight. 1. ^ ETA: Found the bug, it was fixable by substituting a single character 2. ^ Rather than "if a zero falls in the forest and no hypothesis is around to hear it, does it real

Counterfactability

Scott Garrabrant2y20

Yeah, remember the above is all for updateless agents, which are already computationally intractable. For updateful agents, we will want to talk about conditional counterfactability. For example, if you and I are in a prisoners dilemma, we could would conditional on all the stuff that happened prior to us being put in separate cells, and given this condition, the histories are much smaller.

Also, we could do all of our reasoning up to a high level world model that makes histories more reasonably sized.

Also, if we could think of counterfactability as a... (read more)

Counterfactability

Scott Garrabrant2y20

I agree, this is why I said I am being sloppy with conflating the output and our understanding of the output. We want our understanding of the output to screen off the history.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y20

I mean, the definition is a little vague. If your meaning is something like "It goes in A if it is more accurately described as controlled by the viscera, and it goes in P if it is more accurately described as controlled by the environment," then I guess you can get a bijection by definition, but it is not obvious these are natural categories. I think there will be parts of the boundary that feel like they are controlled by both or neither, depending on how strictly you mean "controlled by."

Boundaries vs Frames

Scott Garrabrant2y34

My default plan is to not try to rename Cartesian frames, mostly because the benefit seems small, and I care more about building up the FFS ontology over the Cartesian frame one.

Boundaries vs Frames

Scott Garrabrant2y37

I agree completely. I am not really happy with any of the language in this post, and I want it to have scope limited to this post. I will for the most part say boundary for both the additive and multiplicative variants.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y20

More of my thoughts here.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y30

To be clear, everywhere I say “is wrong,” I mean I wish the model is slightly different, not that anything is actually is mistaken. In most cases, I don’t really have much of an idea how to actually implement my recommendation.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y21

Forcing the AxP bijection is an interesting idea, but it feels a little too approximate to my taste.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y41

Oh yeah, oops, that is what it says. Wasn’t careful, and was responding to reading an old draft. I agree that the post is already saying roughly what I want there. Instead, I should have said that the B=AxP bijection is especially unrealistic. Sorry.

1Andrew Critch2y

Why is it unrealistic? Do you actually mean it's unrealistic that the set I've defined as "A" will be interpretable at "actions" in the usual coarse-grained sense? If so I think that's a topic for another post when I get into talking about the coarsened variables Vc,Ac,Pc,Ec...

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets

Scott Garrabrant2y1413

Overall, this is my favorite thing I have read on lesswrong in the last year.

Agreements:

I agree very strongly with most of this post, both in the way you are thinking about boundaries, and in the scale and scope of applications of boundaries to important problems.

In particular on the applications, I think that boundaries as you are defining them are crucial to developing decision theory and bargaining theory (and indeed are already helpful for thinking about bargaining and fairness in real life), but I also agree with your other potential applications.

I pa... (read more)

2Scott Garrabrant2y

More of my thoughts here.

3Scott Garrabrant2y

4Andrew Critch2y

Thanks, Scott! Are you sure? The informal description I gave for A and P allow for the active boundary to be a bit passive and the passive boundary to be a bit active. From the post: There's a question of how to factor B into a zillion fine-grained features in the first place, but given such a factorization, I think we can define A and P fairly straightforwardly using Shapley value to decide how much V versus E is controlling each feature, and then A and P won't overlap and will cover everything.

Boundaries vs Frames

Scott Garrabrant2y30

Note that I wrote this post a month ago, while seeing an earlier draft of the sequence post 3a (before active/passive distinction was central) and was waiting to post it until after that post. I am posting it now unedited, so some of the thoughts here might be outdated. In particular, I think this post does not respect enough the sense in which the FFS ontology is wrong in that it does not have space for expressing the direction of entanglement.

Why Agent Foundations? An Overly Abstract Explanation

Scott Garrabrant3y160

I mostly agree with this post.

Figuring out the True Name of a thing, a mathematical formulation sufficiently precise that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen.

Precision feels pretty far from the true name of the important feature of true names, I am not quite sure what precision means, but on one definition, precision is the opposite of generality, and true names seem anti-precise. I am not saying precision is not a virtue, and it does seem like precision is involved. (lik... (read more)

4johnswentworth3y

You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thankyou. Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.

The Plan

Scott Garrabrant3y30

I agree with this asymmetry.

One thing I am confused about is whether to think of the e-coli as qualitatively different from the human. The e-coli is taking actions that can be well modeled by an optimization process searching for actions that would be good if this optimization process output them, which has some reflection in it.

It feels like it can behaviorally be well modeled this way, but is mechanistically not shaped like this, I feel like the mechanistic fact is more important, but I feel like we are much closer to having behavioral definitions of agency than mechanistic ones.

3johnswentworth3y

I would say the e-coli's fitness function has some kind of reflection baked into it, as does a human's fitness function. The qualitative difference between the two is that a human's own world model also has an explicit self-model in it, which is separate from the reflection baked into a human's fitness function. After that, I'd say that deriving the (probable) mechanistic properties from the fitness functions is the name of the game. ... so yeah, I'm on basically the same page as you here.

The Plan

Scott Garrabrant3y80

Which isn't *that* large an update. The average number of agent foundations researchers (That are public facing enough that you can update on their lack of progress) at MIRI over the last decade is like 4.

Figuring out how to factor in researcher quality is hard, but it seems plausible to me that the amount of quality adjusted attention directed at your subgoal over the next decade is significantly larger than the amount of attention directed at your subgoal over the last decade. (Which would not all come from you. I do think that Agent Foundations today is... (read more)

3johnswentworth3y

This all sounds right. In particular, for folks reading, I symmetrically agree with this part: ... i.e. I endorse Scott's research program, mine is indeed similar, I wouldn't be the least bit surprised if we disagree about what comes next but we're pretty aligned on what to do now. Also, I realize now that I didn't emphasize it in the OP, but a large chunk of my "50/50 chance of success" comes from other peoples' work playing a central role, and the agent foundations team at MIRI is obviously at the top of the list of people whose work is likely to fit that bill. (There's also the whole topic of producing more such people, which I didn't talk about in the OP at all, but I'm tentatively optimistic on that front too.)

The Plan

Scott Garrabrant3y80

To operationalize, I claim that MIRI has been directed at a close enough target to yours that you probably should update on MIRI's lack of progress at least as much as you would if MIRI was doing the same thing as you, but for half as long.

Scott Garrabrant3y80

The Plan

Scott Garrabrant3y80

Hmm, yeah, we might disagree about how much reflection(self-reference) is a central part of agency in general.

It seems plausible that it is important to distinguish between the e-coli and the human along a reflection axis (or even more so, distinguish between evolution and a human). Then maybe you are more focused on the general class of agents, and MIRI is more focused on the more specific class of "reflective agents."

Then, there is the question of whether reflection is going to be a central part of the path to (F/D)OOM.

Does this seem right to you?

5johnswentworth3y

That does seem right. I do expect reflection to be a pretty central part of the path to FOOM, but I expect it to be way easier to analyze once the non-reflective foundations of agency are sorted out. There are good reasons to expect otherwise on an outside view - i.e. all the various impossibility results in logic and computing. On the other hand, my inside view says it will make more sense once we understand e.g. how abstraction produces maps smaller than the territory while still allowing robust reasoning, how counterfactuals naturally pop out of such abstractions, how that all leads to something conceptually like a Cartesian boundary, the relationship between abstract "agent" and the physical parts which comprise the agent, etc. If I imagine what my work would look like if I started out expecting reflection to be the taut constraint, then it does seem like I'd follow a path a lot more like MIRI's. So yeah, this fits.

Scott Garrabrant3y80

The Plan

Scott Garrabrant3y220

I want to disagree about MIRI.

Mostly, I think that MIRI (or at least a significant subset of MIRI) has always been primarily directed at agenty systems in general.

I want to separate agent foundations at MIRI into three eras. The Eliezer Era (2001-2013), the Benya Era (2014-2016), and the Scott Era(2017-).

The transitions between eras had an almost complete overhaul of the people involved. In spite of this, I believe that they have roughly all been directed at the same thing, and that John is directed at the same thing.

The proposed mechanism behi... (read more)

0johnswentworth3y

Main response is in another comment; this is a tangential comment about prescriptive vs descriptive viewpoints on agency. I think viewing agency as "the pipeline from the prescriptive to the descriptive" systematically misses a lot of key pieces. One central example of this: any properties of (inner/mesa) agents which stem from broad optima, rather than merely optima. (For instance, I expect that modularity of trained/evolved systems mostly comes from broad optima.) Such properties are not prescriptive principles; a narrow optimum is still an optimum. Yet we should expect such properties to apply to agenty systems in practice, including humans, other organisms, and trained ML systems. The Kelly criterion is another good example: Abram has argued that it's not a prescriptive principle, but it is still a very strong descriptive principle for agents in suitable environments. More importantly, I think starting from prescriptive principles makes it much easier to miss a bunch of the key foundational questions - for instance, things like "what is an optimizer?" or "what are goals?". Questions like these need some kind of answer in order for many prescriptive principles to make sense in the first place. Also, as far as I can tell to date, there is an asymmetry: a viewpoint starting from prescriptive principles misses key properties, but I have not seen any sign of key principles which would be missed starting from a descriptive viewpoint. (I know of philosophical arguments to the contrary, e.g. this, but I do not expect such things to cash out into any significant technical problem for agency/alignment, any more than I expect arguments about solipsism to cash out into any significant technical problem.)

johnswentworth3y30

I generally agree with most of this, but I think it misses the main claim I wanted to make. I totally agree that all three eras of MIRI's agent foundations research had some vision of the general theory of agency behind them, driving things. My point of disagreement is that, for most of MIRI's history, elucidating that general theory has not been the primary optimization objective.

Let's go through some examples.

The Sequences: we can definitely see Eliezer's understanding of the general theory of agency in many places, especially when talking about Bayes an... (read more)

Countably Factored Spaces

Scott Garrabrant4y60

Note that the title is misleading. This is really countable dimension factored spaces, which is much better, since it allows for the possibility of something kind of like continuous time, where between any two points in time, you can specify a time strictly between them.

Finite Factored Sets: Polynomials and Probability

Scott Garrabrant4y30

Fixed, Thanks.

Finite Factored Sets: Inferring Time

Scott Garrabrant4y40

Yeah, also note that the history of $X$ given $Y$ is not actually a well defined concept. There is only the history of $X$ given $y$ for $y \in Y$ . You could define it to be the union of all of those, but that would not actually be used in the definition of orthogonality. In this case $h^{F} (X | y)$ , $h^{F} (V | y)$ , and $h^{F} (Z | y)$ are all independent of choice of $y \in Y$ , but in general, you should be careful about that.

7Rohin Shah4y

Yeah, fair point. (I did get this right in the summary; turns out if you try to explain things from first principles it becomes blindingly obvious what you should and shouldn't be doing.)

Finite Factored Sets: Inferring Time

Scott Garrabrant4y20

I think that works, I didn't look very hard. Yore histories of X given Y and V given Y are wrong, but it doesn't change the conclusion.

2Rohin Shah4y

Yeah, both of those should be {X,V}, if I'm not mistaken (a second time).

Finite Factored Sets: Orthogonality and Time

Scott Garrabrant4y20

I could do that. I think it wouldn't be useful, and wouldn't generalize to sub partitions.

Garrabrant and Shah on human modeling in AGI

Scott Garrabrant4y30

I don't know, the negation of the first thing? A system that can freely model humans, or at least perform computation indistinguishable from modeling humans.

1Ben Pace4y

Not modeling vs modeling. Thx.