All of Shmi's Comments + Replies

Well written. Do you have a few examples of pivoting when it becomes apparent that the daily grind no longer optimizes for solving the problem?

2Adam Shimi
In a limited context, the first example that comes to mind is high performers in competitive sports and games: if they truly only give a shit about winning (and the best generally do), they will throw away their legacy approaches when they find a new one, however much it pains them.

I know very little about this area, but I suspect that a writeup like this classic explanation of Gödel incompleteness might be a step in the right direction.

1Andrew Critch
That thing is hilarious and good! Thanks for sharing it. As for the relevance, it explains the statement of Gödel's theorem, but not the proof of it. So, it could be pretty straightforwardly reworked to explain the statement of Löb's theorem, but not so easily the proof of Löb's theorem. With this post, I'm in the business of trying to find a proof of Löb that's really intuitive/simple, rather than just a statement of it that's intuitive/simple.
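For reference, the bare statement being contrasted with its proof, in standard notation (added here for convenience; \Box P reads "P is provable in the theory, e.g. PA"):

```latex
% Löb's theorem, statement only (no proof):
% if the theory proves "provability of P implies P", then it proves P.
\[
  \text{If } \vdash \Box P \rightarrow P, \text{ then } \vdash P.
\]
% Internalized form, itself provable within PA:
\[
  \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P
\]
```

Taking P to be a contradiction recovers Gödel's second incompleteness theorem as a special case.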

I meant this:

Shard Question: How does the human brain ensure alignment with its values, and how can we use that information to ensure the alignment of an AI with its designers' values?

which does indeed beg the question in the standard meaning of the phrase.

My point is that there is very much no alignment between different values! They are independent at best and contradictory in many cases. There is an illusion of coherent values that is a rationalization. The difference in values sometimes leads to catastrophic Fantasia-like outcomes on the margins (e.g. people w... (read more)

2Alex Turner
Hm. I think you can dissolve the perceived question-begging by replacing "values" with its substance:  We're then asking a relatively well-defined question with a guaranteed-to-exist answer. 

That seems like a useful decomposition! Point 2 seems to beg the question: why does it assume that the brain can "ensure alignment with its values", as opposed to, say, synthesizing an illusion of values by aggregating data from various shards?

1Jacy Reese Anthis
Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned in which brains would fall into either category. For example, insofar as our values are of a certain evolutionary sort (e.g., valuing reproduction), human brains have misaligned mesa-optimization like craving sugar. If sugar craving itself is the value, then arguably we're well-aligned. In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate's recent post: The example I like is just getting an AI to fill a container of water, which human brains are able to do, but in Fantasia, the sorcerer's apprentice Mickey Mouse was not able to do! So that's a basic sense in which brains are aligned, but again I'm not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.

Just a small remark

Open a blank google doc, set a one hour timer, and start writing out your case for why AI Safety is the most important problem to work on

Not "why", but "whether" is the first step. Otherwise you end up being a clever arguer.

No, "why" is correct. See the rest of the sentence:

Write out all the counter-arguments you can think of, and repeat

It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.

Value extrapolation is thus necessary for AI alignment. It is also almost sufficient, since it allows AIs to draw correct conclusions from imperfectly defined human data.

I am missing something... The idea of correctly extrapolating human values is basically the definition of Eliezer's original proposal, CEV. In fact, it's right there in the name. What is the progress over the last decade?

5Stuart Armstrong
CEV is based on extrapolating the person; the values are what the person would have had, had they been smarter, known more, had more self-control, etc. Once you have defined the idealised person, the values emerge as a consequence. I've criticised this idea in the past, mainly because the process to generate the idealised person seems vulnerable to negative attractors (Eliezer's most recent version of CEV has less of this problem). Value extrapolation and model splintering are based on extrapolating features and concepts in models, to other models. This can be done without knowing human psychology or (initially) anything about humans at all, including their existence. See for example the value extrapolation partially resolves symbol grounding post; I would never write "CEV partially resolves symbol grounding". On the contrary, CEV needs symbol grounding.
1Evan R. Murphy
I wanted to look up CEV after reading this comment. Here's a link for anyone else looking: https://intelligence.org/files/CEV.pdf That acronym stands for "Coherent Extrapolated Volition", not "Coherent Extrapolated Values". But from skimming the paper just now, I think I agree with shminux that it's basically the same idea.

I'm confused... What you call the "Pure Reality" view seems to work just fine, no? (I think you had a different name for it, pure counterfactuals or something.) What do you need counterfactuals/Augmented Reality for? Presumably making decisions thanks to "having a choice" in this framework, right? In the pure reality framework, for the "student and the test" example one would dispassionately calculate what kind of student algorithm passes the test, without talking about making a decision to study or not to study. Same with Newcomb's, of course: one just looks at what kind of agents end up with a given payoff. So... why pick an AR view over the PR view, what's the benefit?
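To make the Pure Reality framing concrete, here is a minimal toy sketch (my own construction, purely illustrative): rather than modelling a "decision", we enumerate candidate student algorithms and simply compute which kind ends up passing.

```python
# Toy "pure reality" view of the student-and-the-test example (illustrative only).
# There is no "choice" anywhere below: each agent is a fixed algorithm, and we
# just compute what happens to each kind of agent.

def outcome(agent):
    studied = agent()                 # run the agent; it either studies or not
    return "pass" if studied else "fail"

agents = {
    "studier":     lambda: True,
    "non-studier": lambda: False,
}

for name, agent in agents.items():
    print(name, "->", outcome(agent))
# studier -> pass
# non-studier -> fail
```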

1Chris_Leong
Excellent question. Maybe I haven't framed this well enough. We need a way of talking about the fact that both your outcome and your action are fixed by the past. We also need a way of talking about the fact that we can augment the world with counterfactuals (of course, since we don't have complete knowledge of the world, we typically won't know which is the factual and which are the counterfactuals). And these are two distinct ways of looking at the world. I'll try to think about a cleaner way of framing this, but do you have any suggestions? (For the record, the term I used before was Raw Counterfactuals, meaning consistent counterfactuals, and that's a different concept from looking at the world in a particular way.) (Something that might help: if we are looking at multiple possible pure realities, then we've already introduced counterfactuals, since only one is true and "possible" is determined by the map rather than the territory.)

First, I really like this shift in thinking, partly because it moves the needle toward an anti-realist position, where you don't even need to postulate an external world (you probably don't see it that way, despite saying "Everything is a subjective preference evaluation").

Second, I wonder if you need an even stronger restriction: not just computable, but efficiently computable, given that it's the agent that is doing the computation, not some theoretical AIXI. This would probably also change "too easily" in "those e... (read more)

0Abram Demski
I definitely see it as a shift in that direction, although I'm not ready to really bite the bullets -- I'm still feeling out what I personally see as the implications. Like, I want a realist-but-anti-realist view ;p Right, that's very much what I'm thinking.

Feel free to let me know either way, even if you find that the posts seem totally wrong or missing the point.

1Gordon Seidoh Worley
Okay, so now that I've had more time to think about it, I do really like the idea of thinking of "decisions" as the subjective expression of what it feels like to learn what universe you are in, and this holds true for the third-person perspective of considering the "decisions" of others: they still go through the whole process that feels from the inside like choosing or deciding, but from the outside there is no need to appeal to this to talk about "decisions". Instead, to outside observers, "decisions" are just resolutions of uncertainty about what will happen to a part of the universe modeled as another agent. This seems quite elegant for my purposes, as I don't run into the problems associated with formalizing UDT (at least, not yet), and it lets me modify my model for understanding human values to push "decisions" outside of it or into the after-the-fact part.
Answer by Shmi*30

My answer is a rather standard compatibilist one: the algorithm in your brain produces the sensation of free will as an artifact of an optimization process.

There is nothing you can do about it (you are executing an algorithm, after all), but your subjective perception of free will may change as you interact with other algorithms, like me or Jessica or whoever. There aren't really any objective intentional "decisions", only our perception of them. Therefore the decision theories are just byproducts of all these algorithms executing. It... (read more)

2Gordon Seidoh Worley
Thanks, I'll revisit these. They seem like they might be pointing towards a useful resolution I can use to better model values.

According to this SSC book review, "the secret of our success" is the ability to learn culture + the accumulation of said culture, which seems a bit broader than the ability to learn language + the language itself that you describe.

3Alex Zhu
That's one of the "unique intellectual superpowers" that I think language confers on us: (I do think it helps to explicitly name our ability to learn culture as something that sets us apart, and wish I'd made that more front-and-center.)

Right, that's the question. Sure, it is easy to state that "the metric must be a faithful representation of the target", but it never is, is it? From the point of view of double inversion, optimizing the target is a hard inverse problem, because, like in your pizza example, the true "values" (pizza is a preference on the background of an otherwise balanced diet) are not easily observable. What would be a double inverse in this case? Maybe something like trying various amounts of pizza and getting feedback on enjoyment? That would match the long division pattern. I'm not sure.

Looking for "functions that don't exhibit Goodhart effects under extreme optimization" might be a promising direction. What does it mean for a function to behave as expected under extreme optimization? Can you give a toy example?

1Gordon Seidoh Worley
I'm actually not really sure. We have some vague notion that, for example, my preference for eating pizza shouldn't result in attempts at unbounded pizza eating maximization, and I would probably be unhappy from my current values if a maximizing agent saw I liked pizza the best of all foods and then proceeded to feed me only pizza forever, even if it modified me such that I would maximally enjoy the pizza each time and not get bored of it. Thinking more in terms of regressional Goodharting, maybe something like not deviating from the true target because of optimizing for the measure of it. Consider the classic rat extermination example of Goodharting. We already know collecting rat tails as evidence of extermination is a function that leads to weird effects. Does there exist a function that measures rat exterminations that, when optimized for, produces the intended effect (extermination of rats) without doing anything "weird", e.g. generating unintended side effects or maximizing rat reproduction so we can exterminate more of them, and instead just straightforwardly leads to the extinction of rats and nothing else?
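As a rough illustration of the regressional case (a toy sketch with made-up numbers, added for concreteness, not a claim about what the right measurement function looks like): when the measure is the true target plus noise, selecting hard on the measure predictably picks options whose true value falls short of the true optimum.

```python
import random

random.seed(0)

# Regressional Goodhart, toy version (illustrative only).
# True value of each option, and a noisy proxy measurement of it.
true_values = [random.gauss(0, 1) for _ in range(10_000)]
proxies = [v + random.gauss(0, 1) for v in true_values]   # proxy = true value + noise

best_by_proxy = max(range(len(true_values)), key=lambda i: proxies[i])
best_by_truth = max(range(len(true_values)), key=lambda i: true_values[i])

print("true value of the proxy-optimal option:", round(true_values[best_by_proxy], 2))
print("true value of the truly optimal option:", round(true_values[best_by_truth], 2))
# The harder the selection on the proxy, the more of the top score is explained
# by noise rather than by the true target, so the proxy-optimal pick is
# systematically worse than the true optimum.
```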

I agree that 4 needs to be taken seriously, as 1 and 2 are hard to succeed at without making a lot of progress on 4, and 3 is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people are working on it though, no? There is also a chance of proving that "An actual grounded definition of human preferences" is impossible in a self-consistent way, and we would have to figure out what to do in that case. The latter feels like a real possibility to me.

2Gordon Seidoh Worley
My impression continues to be that (4) is neglected. Stuart has been the most prolific person I can think of to work on this question, and it's a fast-falling power distribution after that, with myself having done some work and then not much else that comes to mind in terms of work to address (4) in a technical manner that might lead to solutions useful for AI safety. I have no doubt others have done things (Alexey has thought (and maybe published?) some on this), and others could probably forget my work or Stuart's as easily as I've forgotten theirs, because we don't have a lot of momentum on this problem right now to keep it fresh in our minds. Or so is my impression of things now. I've had some good conversations with folks and a few seem excited about working on (4) and they seem qualified in ways to do it, but no one but Stuart has yet produced very much published work on it. (Yes, there is Eliezer's work on CEV, which is more like a placeholder and wishful thinking than anything more serious, and it has probably accidentally been the biggest bottleneck to work on (4), because so many people I talk to say things like "oh, we can just do CEV and be done with this, so let's worry about the real problems".) I agree there is a risk it is an impossible problem, and I actually think it's quite high, in that we may not be able to adequately aggregate human preferences in ways that result in something coherent. In that case I view safety and alignment as more about avoiding catastrophe and cutting down the aligned-AI solution space to remove the things that clearly don't work, rather than building towards things that clearly do. I hope I'm being too pessimistic.
0David Xu
It's not obvious to me why this ought to be the case. Could you elaborate?

I still don't understand the whole deal about counterfactuals, exemplified as "If Oswald had not shot Kennedy, then someone else would have". Maybe MIRI means something else by counterfactuals?

If it's the counterfactual conditionals, then the approach is pretty simple, as discussed with jessicata elsewhere: there is the macrostate of the world (i.e. a state known to a specific observer, which consists of many possible substates, or microstates), one of these micr

... (read more)
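A minimal sketch of that macrostate/microstate picture (my own toy construction, assuming a toy deterministic physics; all names are illustrative): a conditional is answered by restricting to the microstates compatible with the stated macrostate and evolving them forward, with no separate counterfactual machinery.

```python
from collections import Counter

# Toy deterministic world: a microstate is (position, velocity) on a line,
# and "physics" just adds velocity to position each step.
def step(micro):
    pos, vel = micro
    return (pos + vel, vel)

def evolve(macro, steps=1):
    # "State evolution" of a macrostate: evolve every compatible microstate forward.
    for _ in range(steps):
        macro = [step(m) for m in macro]
    return macro

# A macrostate, to a given observer, is the set of microstates consistent with
# what that observer knows (here: "position is 0, velocity unknown").
macrostate = [(0, v) for v in (-1, 0, 1)]

# "If the velocity had been +1, the position would be 3 three steps later":
# restrict to the compatible microstates and run them forward.
conditioned = [m for m in macrostate if m[1] == 1]
print(Counter(evolve(conditioned, steps=3)))   # Counter({(3, 1): 1})
```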

When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.

This doesn't give very useful answers when the state evolution is nearly deterministic, such as an agent made of computer code.

For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrosta... (read more)