This is a special post for quick takes by Gordon Seidoh Worley.

I get worried about things like this article that showed up on the Partnership on AI blog. Reading it, there's nothing I can really object to in the body of the post: it's mostly about narrow AI alignment and promotes a positive message of targeting things that benefit society rather than narrowly maximizing a simple metric. But it's titled "Aligning AI to Human Values means Picking the Right Metrics", and that implies to me a normative claim that reads in my head something like "to build aligned AI it is necessary and sufficient to pick the right metrics", which is something I think few would agree with. Yet if I were a casual observer just reading the title of this post, I might come away with the impression that AI alignment is as easy as optimizing for something prosocial, not that there are lots of hard problems to be solved to even get AI to do what you want, let alone to pick something beneficial to humanity for it to do.

To be fair, this article has a standard "not necessarily the views of PAI, etc." disclaimer, but then again the author is a research fellow at PAI.

This makes me a bit nervous about PAI's effect on promoting AI safety in industry, especially if it effectively downplays safety or makes it seem easier than it is in ways that either encourage or fail to curtail risky uses of AI in industry.

I recently watched all 6 seasons of HBO's "Silicon Valley", and the final episode (or really the 4 episodes leading up to and including the finale) did a really great job of hitting on some important ideas we talk about in AI safety.

The show played with the idea of AI in earlier seasons, with things like an obvious parody of Ben Goertzel and Sophia, a discussion of Roko's Basilisk, and of course AI that Goodharts. In fact, Goodharting is a pivotal plot point in how the show ends, along with a Petrov-esque finale where hard choices have to be made under uncertainty to protect humanity, and where it all has to be kept secret due to an information hazard.

Goodhart, Petrov, and information hazards are not mentioned by name in the show, but the topics are clearly present. The show was/is popular with folks in the SF Bay Area tech scene because it does such a good job of mirroring back what it's like to live in that scene, even if the portrayal is hyperbolic. So I wonder if, and hope that, it will helpfully nudge folks towards normalizing taking AI safety seriously and seeing it as virtuous to forgo personal gain in exchange for safeguarding humanity.

I don't expect things to change dramatically because of the show, but on the margin it might be working to make us a little bit safer. For that reason I think it's likely a good idea to encourage folks not already dedicated to AI safety to watch the show, so long as the effort involved is minimal.

As I work towards becoming less confused about what we mean when we talk about values, I find that it feels a lot like working on a jigsaw puzzle where I don't know what the picture is. Worse, all the pieces have been scattered around the room, so I have to find them first, digging between couch cushions and looking under the rug and behind the bookcase, before I can even figure out how they fit together or what they fit together to describe.

Yes, we have some pieces already, and others think they know (infer, guess) what the picture is from those (it's a bear! it's a cat! it's a woman in a fur coat!). As I work I find it helpful to keep updating my own guess, because even when it's wrong it sometimes helps me think of new ways to try combining the pieces or tells me what pieces might be missing that I should go look for. But it also often feels like I'm failing all the time, because I'm updating rapidly based on new information and that keeps changing my best guess.

I suspect this is a common experience for folks working on problems in AI safety and many other complex problems, so I figured I'd share this metaphor I recently hit on for making sense of what it is like to do this kind of work.

I'm fairly pessimistic about our ability to build aligned AI. My take is roughly that it's theoretically impossible, and at best we might build AI that is aligned well enough that we don't lose. I've not written anything up to really summarize or prove this, though.

My take comes from two facts:

  1. Goodharting is robust. That is, the mechanism of Goodharting seems impossible to overcome. Goodharting is just a fact of any control system.
  2. It's impossible to infer the inner experience (and thus values) of another being perfectly without making normative assumptions.

Stuart Armstrong has made a case for (2) with his no free lunch theorem. I've not seen anyone formally make the case for (1), though.

Is this something worth trying to prove? That Goodharting is unavoidable and at most we can try to contain its effects?

I'm many years out from doing math full time, so I'm not sure I could make a rigorous proof of it. But this seems to be something people sometimes disagree about (arguing that Goodharting can be overcome), and I think most of those discussions don't get very precise about what that means.

This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection $v_1, \dots, v_n$ of things which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of each $v_i$ there is in the world, subject only to resource constraints that make the $v_i$ trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all the $v_i$'s which its proxy utility function doesn't take into account. If you impose a further condition which ensures that you can't get too much utility by only maximizing some strict subset of the $v_i$'s (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human's true utility function.
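To make the setup concrete, here is a minimal sketch of that result (the toy numbers, attribute names, and crude random-search "optimizer" are mine, not the paper's): the human values three attributes with diminishing returns, the robot's proxy omits one of them, and a fixed budget forces the attributes to trade off.

```python
import numpy as np

# Toy instance of the model summarized above; all specifics are illustrative choices.
rng = np.random.default_rng(0)
BUDGET = 9.0  # total resources; spending on one attribute trades off against the others

def true_utility(v):
    return np.sqrt(v).sum()      # the human values all three attributes, with diminishing returns

def proxy_utility(v):
    return np.sqrt(v[:2]).sum()  # the proxy omits the third attribute entirely

def best_allocation(utility, samples=50_000):
    best_v, best_u = None, -np.inf
    for _ in range(samples):
        v = BUDGET * rng.dirichlet(np.ones(3))  # a random split of the budget across attributes
        if utility(v) > best_u:
            best_v, best_u = v, utility(v)
    return best_v

v_proxy = best_allocation(proxy_utility)
v_human = best_allocation(true_utility)
print("proxy-optimal allocation:", v_proxy.round(2), "-> true utility", round(true_utility(v_proxy), 2))
print("human-optimal allocation:", v_human.round(2), "-> true utility", round(true_utility(v_human), 2))
```

The proxy optimizer spends the whole budget on the two measured attributes, driving the third toward zero and landing at a point that is strictly worse under the human's true utility (roughly 4.2 vs. 5.2 in this toy run).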

That said, I wasn't super-impressed by this paper -- the above is pretty obvious and the mathematical model doesn't elucidate anything, IMO.

Moreover, I think this model doesn't interact much with the skeptical take about whether Goodhart's Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn't take into account:

(1) Lots of the things we value are correlated with each other over "realistically attainable" distributions of world states. Or in other words, for many pairs $v_i, v_j$ of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of $v_i$ without also increasing the amount of $v_j$.

(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.

If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won't be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)

This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.

I actually don't think that model is general enough. Like, I think Goodharting is just a fact of any control system that relies on observation.

Suppose we have a simple control system with output $O$ and a governor $G$ that takes a measurement $M$ (an observation) of $O$. So long as $M$ is not error free (and I think we can agree that no real world system can be actually error free), then $M = O + E$ for some error factor $E$. Since $G$ uses $M$ to regulate the system to change $O$, we now have error influencing the value of $O$. Now applying the standard reasoning for Goodhart, in the limit of optimization pressure (i.e. $G$ regulating the value of $O$ for long enough), $E$ comes to dominate the value of $O$.

This is a bit handwavy, but I'm pretty sure it's true, which means in theory any attempt to optimize for anything will, under enough optimization pressure, become dominated by error, whether that's human values or something else. The only interesting question is whether we can control the error enough, either through better measurement or less optimization pressure, that we still get enough signal to be happy with the output.
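Here is one toy numerical rendering of that argument (my own construction, not from the comment above): treat "optimization pressure" as the number of candidate actions the governor $G$ gets to choose among on the basis of the noisy measurement $M = O + E$, and look at how much of the winning measurement is error rather than true output.

```python
import numpy as np

# Toy rendering of the argument: G picks whichever candidate looks best
# according to M = O + E; "pressure" is how many candidates it chooses among.
rng = np.random.default_rng(0)

def error_in_winner(n_candidates, trials=2_000):
    gaps = []
    for _ in range(trials):
        O = rng.normal(size=n_candidates)   # true output of each candidate action
        E = rng.normal(size=n_candidates)   # measurement error
        M = O + E                           # what G actually observes
        best = np.argmax(M)                 # G regulates toward the best-looking candidate
        gaps.append(M[best] - O[best])      # how much of that "best" score is error
    return np.mean(gaps)

for n in [1, 10, 100, 1_000, 10_000]:
    print(f"{n:>6} candidates: average error in the selected measurement = {error_in_winner(n):+.2f}")
```

With a single candidate the expected error in the chosen measurement is zero; as the number of candidates grows, the gap between the score the governor sees and the output it actually gets keeps growing, which is the sense in which unchecked optimization lets the error term take over the signal.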

Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.

For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.
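For what it's worth, a quick toy simulation of the mosquito example (all numbers invented) behaves this way: each round we spray the counties with the highest estimated counts, and noisy but unbiased estimates only make the spraying somewhat less efficient.

```python
import numpy as np

# Toy version of the mosquito example: spray the 10 counties with the highest
# *estimated* counts each round and halve their mosquitos, comparing noisy
# estimates against perfect ones.
def run(noise_sd, counties=100, rounds=50, sprays_per_round=10, kill_frac=0.5):
    rng = np.random.default_rng(1)                      # same starting counts for both runs
    counts = rng.uniform(1_000, 10_000, size=counties)
    for _ in range(rounds):
        estimates = counts + rng.normal(0, noise_sd, size=counties)
        targets = np.argsort(estimates)[-sprays_per_round:]
        counts[targets] *= (1 - kill_frac)
    return counts.sum()

print("total mosquitos with noise-free estimates:", int(run(noise_sd=0)))
print("total mosquitos with very noisy estimates:", int(run(noise_sd=2_000)))
```

In this toy run the noisy version ends with somewhat more mosquitos than the noise-free one, but both drive the totals steadily down; the noise costs efficiency without taking over the objective.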

One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.

If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.

"Error" here is all sources of error, not just error in the measurement equipment. So bribing surveyors is a kind of error in my model.

Can you explain where there is an error term in AlphaGo, or where an error term might appear in a hypothetical model similar to AlphaGo trained much longer with many more parameters and more computational resources?

AlphaGo is fairly constrained in what it's designed to optimize for, but it still has the standard failure mode of "things we forgot to encode". So for example AlphaGo could suffer the error of instrumental power grabbing in order to get better at winning Go, because we misspecified what we asked it to measure. This is a kind of failure introduced into the system by humans failing to make $M$ adequately evaluate $O$ as we intended: we cared about winning Go games while also minimizing side effects, but maybe when we constructed $M$ we forgot about minimizing side effects.

At least one person here disagrees with you on Goodharting. (I do.)

You've written before on this site, if I recall correctly, that Eliezer's 2004 CEV proposal is unworkable because of Goodharting. I'm granting myself the luxury of not looking up your previous statement, since you can contradict me if my recollection is incorrect.

I believe that the CEV proposal is probably achievable by humans if those humans had enough time and enough resources (money, talent, protection from meddling) and that if it is not achievable, it is because of reasons other than Goodhart's law.

(Sadly, an unaligned superintelligence is much easier for humans living in 2022 to create than a CEV-aligned superintelligence is, so we are probably all going to die IMHO.)

Perhaps before discussing the CEV proposal we should discuss a simpler question, namely, whether you believe that Goodharting inevitably ruins the plans of any group setting out intentionally to create a superintelligent paperclip maximizer.

Another simple goal we might discuss is a superintelligence (SI) whose goal is to shove as much matter as possible into a black hole, or an SI that "shuts itself off" within 3 months of its launch, where "shuts itself off" means it stops trying to survive or to affect reality in any way.