All of Isnasene's Comments + Replies

So, silly question that doesn't really address the point of this post (this may very well be just a point of clarity thing but it would be useful for me to have an answer due to earning-to-give related reasons off-topic for this post) --

Here you claim that CDT is a generalization of decision-theories that includes TDT (fair enough!):

Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sen
... (read more)
3Abram Demski
Ah, yeah, I'll think about how to clear this up. The short answer is that, yes, I slipped up and used CDT in the usual way rather than the broader definition I had set up for the purpose of this post. On the other hand, I also want to emphasize that EDT two-boxes (and defects in twin PD) much more easily than I see commonly supposed. And, thus, to the extent one wants to apply the arguments of this post to TDT, TDT would also. Specifically, an EDT agent can only see something as correlated with its action if that thing has more information about the action than the EDT agent itself. Otherwise, the EDT agent's own knowledge about its action screens off any correlation. This means that in Newcomb with a perfect predictor, EDT one-boxes. But in Newcomb where the predictor is only moderately good, in particular knows as much or less than the agent, EDT two-boxes. So, similarly, TDT must two-box in these situations, or be vulnerable to the Dutch Book argument of this post.
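To make the screening-off point concrete, here is a minimal sketch of a toy Newcomb setup. Everything in it is an assumption made for illustration, not something from the post: the binary "disposition" variable standing in for what the agent knows about itself, the 0.9/0.1 action probabilities, the standard $1,000,000 / $1,000 payoffs, and all function names. The "weak" predictor sees only the disposition (no more than the agent knows); the "perfect" predictor sees the action itself.

```python
from fractions import Fraction
from itertools import product

ONE_BOX, TWO_BOX = "one-box", "two-box"

def joint(predictor):
    """Joint distribution over (disposition, action, prediction).

    `predictor(d, a)` is a hypothetical function returning the prediction.
    The disposition d is information the agent itself has access to.
    """
    dist = {}
    for d, a in product((0, 1), (ONE_BOX, TWO_BOX)):
        p_d = Fraction(1, 2)
        # Assumed: disposition 1 one-boxes 90% of the time, disposition 0 10%.
        p_one = Fraction(9, 10) if d == 1 else Fraction(1, 10)
        p_a = p_one if a == ONE_BOX else 1 - p_one
        key = (d, a, predictor(d, a))
        dist[key] = dist.get(key, 0) + p_d * p_a
    return dist

def edt_value(dist, action, disposition):
    """E[payoff | action, disposition]: EDT conditioning on what the agent knows."""
    mass = Fraction(0)
    value = Fraction(0)
    for (d, a, pred), p in dist.items():
        if d == disposition and a == action:
            payoff = (1_000_000 if pred == ONE_BOX else 0) + (1_000 if a == TWO_BOX else 0)
            mass += p
            value += p * payoff
    return value / mass

def edt_choice(dist, disposition):
    return max((ONE_BOX, TWO_BOX), key=lambda a: edt_value(dist, a, disposition))

# Predictor that knows only the disposition: the agent's own knowledge
# screens off the correlation between action and prediction.
weak = joint(lambda d, a: ONE_BOX if d == 1 else TWO_BOX)
# Predictor that knows the action itself: the correlation survives conditioning.
perfect = joint(lambda d, a: a)

weak_choices = [edt_choice(weak, d) for d in (0, 1)]
perfect_choices = [edt_choice(perfect, d) for d in (0, 1)]
```

In this toy model, once the agent conditions on its disposition, the weak predictor's output no longer depends on the action, so two-boxing is worth exactly $1,000 more either way. Against the perfect predictor the action still carries information about the prediction, and one-boxing wins.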

Thanks! This is great.

A year ago, Joaquin Phoenix made headlines when he appeared on the red carpet at the Golden Globes wearing a tuxedo with a paper bag over his head that read, "I am a shape-shifter. I can't change the world. I can only change myself."

-- GPT-3 generated news article humans found easiest to distinguish from the real deal.

... I haven't read the paper in detail but we may have done it; we may be on the verge of superhuman skill at absurdist comedy! That's not even completely a joke. Look at the sentence "I am a shape-shifter. I c... (read more)

6William Saunders
Google's Meena (2.6 billion parameters, February 2020) creates original puns in "Conversation D", and I think "Cross-turn Repetition Example 2" is absurdist comedy, but maybe more as a result of the way the model fails.

I thought about this for longer than expected so here's an elaboration on inverse-inverse problems in the examples you provided:

Partial Differential Equations

Finding solutions to partial differential equations with specific boundary conditions is hard and often impossible. But we know a lot of solutions to differential equations with particular boundary conditions. If we match up those solutions with the problem at hand, we can often get a decent answer.

The direct problem: you have a function; figure out what relationships its derivatives have and it... (read more)

Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.
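A minimal sketch of the two replacement losses in the question above, assuming a toy one-parameter loss l(theta) = theta^2 and plain (non-stochastic) gradient descent. The function names, the toy loss, the learning rate, and the step count are all hypothetical choices for illustration; this does not reproduce any actual training setup.

```python
def absolute_target_loss(loss_value, target):
    """|l(theta) - l*|: zero exactly when the loss sits at the target."""
    return abs(loss_value - target)

def squared_target_loss(loss_value, target):
    """(l(theta) - l*)^2: smooth variant of the same idea."""
    return (loss_value - target) ** 2

def toy_loss(theta):
    # Assumed stand-in for the real training loss.
    return theta ** 2

def toy_loss_grad(theta):
    return 2.0 * theta

def descend_to_target(theta, target, lr=0.01, steps=500):
    """Gradient descent on (l(theta) - l*)^2 for the toy loss."""
    for _ in range(steps):
        residual = toy_loss(theta) - target
        # Chain rule: d/dtheta (l - l*)^2 = 2 * (l - l*) * dl/dtheta
        grad = 2.0 * residual * toy_loss_grad(theta)
        theta -= lr * grad
    return theta

theta_final = descend_to_target(theta=2.0, target=1.0)
```

Ordinary descent on `toy_loss` would head toward theta = 0. Under the squared-target loss the gradient vanishes once l(theta) = l*, so in this toy case the parameter settles where the loss equals the target instead of continuing downhill.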

Interesting! Given that stochastic gradient descent (SGD) does provide an inductive bias towards models that generalize better, it does seem like changing the loss function in this way could enhance generalization performance. Broadly speaking, SGD's bias only provides a benefit when it is searching over many possible models: it performs ba... (read more)

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything. 
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

When I think of the fragility argument, I usually think in terms of Goodhart's Taxonomy. In ... (read more)

If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?

Yeah, that statement is wrong. I was trying to make a more subtle point about how an AI that learns long-term planning on a shorter time-frame is not necessarily going to be able to generalize to longer time-frames (but in the context of superintelligent AIs capable of doing human-level tasks, I do think it will generalize--so that point is kind of irrelevant). I agree with Rohin's response.




Thanks for replying!

This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example).

Yeah, I intended that statement to be more of an elaboration on my own perspective than to imply that it represented your beliefs. I also agree that it's wrong in the context of the superintelligent AI we are discussing.

Should "I don't think" be "I do
... (read more)

Thanks for recording this conversation! Some thoughts:

AI development will be relatively gradual and AI researchers will correct safety issues that come up.

I was pretty surprised to read the above--most of my intuitions about AI come down to repeatedly hearing the point that safety issues are very unpredictable and high variance, and that once a major safety issue happens, it's already too late. The arguments I've seen for this (many years of Eliezer-ian explanations of how hard it is to come out on top against superintelligent agents who care a... (read more)

3Rohin Shah
I enjoyed this comment, thanks for thinking it through! Some comments: This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example). Obviously this depends on how good the heuristics are, but I do think that heuristics will get to the point where they do super-long-term planning, and my belief that we'll be safe by default doesn't depend on assuming that AI won't do long-term planning. Yup, that's correct. Should "I don't think" be "I do think"? Otherwise I'm confused. With that correction, I basically agree. I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years, and really I want to say < 1% that this is the first way we get AGI (no matter when), but I can't actually be that confident. My impression is that many researchers at MIRI would qualitatively agree with me on this, though probably with less confidence.
3Erick Ball
I've seen various people make the argument that this is not how AI works and it's not how AGI will work--it's basically the old "tool AI" vs "agent AI" debate. But I think the only reason current AI doesn't do this is because we can't make it do this yet: the default customer requirement for a general intelligence is that it should be able to do whatever task the user asks it to do. So far the ability of AI to understand a request is very limited (poor natural language skills). But once you have an agent that can understand what you're asking, of course you would design it to optimize new objectives on request, bounded of course by some built-in rules about not committing crimes or manipulating people or seizing control of the world (easy, I assume). Otherwise, you'd need to build a new system for every type of goal, and that's basically just narrow AI. If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t

Just making sure I can parse this... When I say that they're anti-correlated, I mean that the policy of maximizing X is akin to the policy of minimizing X to the extent that X and not X will at some point compete for the same instrumental resources. I will agree with the statement that an agent maximizing X who possesses many instrumental ... (read more)

2Alex Turner
I am using #2, but I'm aware that there's a separate #1 meaning (and thank you for distinguishing between them so clearly, here!).

Oh I see where you're coming from now. I'll admit that, when I made my earlier post, I forgot about the full implications of instrumental convergence. Specifically, the part where:

Maximizing X minimizes all Not X insofar as they both compete for the same resource pool.

Even if your resources are unusually low relative to where you're positioned in the universe, an AI will still take that away from you. Optimizing one utility function doesn't just randomly affect the optimization of other utility functions; they are anti-correlated in g... (read more)

2Alex Turner
Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t. (This will be in the next part of my sequence on impact).

-------------------------------------Part 1: I Respond to Your Actual Comment----------------------------------------

The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us

I don't think this explanation is in conflict with mine. Much of my explanation (ie, the "optimizing a proxy too aggressively will invalidate the assumptions that the proxy was built on") is focused on explaining why we expect proxies to become mis-specified. In the... (read more)

2Alex Turner
Let me clarify the distinction I'm trying to point at: First, Goodhart's law applies to us when we're optimizing a goal for ourselves, but we don't know the exact goal. For example, if I'm trying to make myself happy, I might find a proxy of dancing, even though dancing isn't literally the global optimum. This uses up time I could have used on the actual best solution. This can be bad, but it doesn't seem that bad. I'm pretty corrigible to myself. Second, Goodhart's law applies to other agents who are instructed to maximize some proxy of what we want. This is bad. If it's maximizing the proxy, then it's ensuring it's most able to maximize the proxy, which means it's incentivized to stop us from doing things (unless the proxy specifically includes that - which safeguard is also vulnerable to misspecification; or is somehow otherwise more intelligently designed than the standard reward-maximization model). The agent is pursuing the proxy from its own perspective, not from ours. I don't think the initial value has much to do with what you label the "AIS version" of Goodhart (neither does the complexity of human values in particular). Imagine we had a reward function that gave one point of reward for each cone detecting red; reward is dispensed once per second. Imagine that the universe is presently low-value; for whatever reason, red stimulation is hard to find. Goodhart's law still applies to agents we build to ensure we can see red forever, but it doesn't apply to us directly - we presumably deduce our true reward function, and no longer rely on proxies to maximize it. The reason it applies to agents we build is that not only do you have to encode the reward function, but we have to point to people! This does not have a short description length. With respect to hard maximizers, a single misstep means the agent is now showing itself red, or something. How proxies interact is worth considering, but (IMO) it's far from the main reason for Goodhart's law being really

[Retracted my other reply due to math errors]

This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.

While I generally agree that there can be utility functions that aren't subject to Goodhart, I don't think that this strictly pertains to humans. I expect that when the vast majority of agents (human or not) use scientific methods to develop a proxy for the thing they want to optimize, they will find that proxy to break down upon intense optimizatio... (read more)

2Alex Turner
The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us. Therefore, any universe where we have an effective maximizer of something misspecified is a universe where we're no longer able to get what we want. That's why instrumental convergence is so bad.

Let me see if I have this...

1. Agents blindly maximize the proxies they pick if the expected value of maximizing the proxy is higher than doing anything else.

2. Goodhart's Law tells us that, in general, blindly maximizing the proxy has lower expected value than other methods that involve not doing that.

3. Because of this, we expect the difference between what we want and what we get to be bigger if we're optimizing the proxy instead of following some non-optimizing default strategy. Thus, there's a lower bound on how bad optimizing the pro... (read more)

2Stuart Armstrong
This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.

Don't mind me; just trying to summarize some of the stuff I just processed.

If you're choosing a strategy of predicting the future based on how accurate it turns out to be, the strategy whose output influences the future in ways that make its prediction more likely will outperform a strategy that doesn't (all else being equal). Thus, one might think that the strategy you choose will be the strategy that most effectively balances its prediction between a) how accurate that prediction is (unconditioned on the prediction being given) and b) how... (read more)

I'm actually trying to be somewhat agnostic about the right conclusion here. I could have easily added another chapter discussing why the maximizing-surprise idea is not quite right. The moral is that the questions are quite complicated, and thinking vaguely about 'optimization processes' is quite far from adequate to understand this. Furthermore, it'll depend quite a bit on the actual details of a training procedure!