All of Jeremy Gillen's Comments + Replies

I'd be curious why it isn't changing the picture quite a lot, maybe after you've chewed on the ideas. From my perspective it makes the entire non-reflective-AI-via-training pathway not worth pursuing, at least for large-scale thinking.

2james.lucassen
It doesn't change the picture a lot, because the proposal for preventing misaligned goals from arising via this mechanism was to try to get control over when the AI does/doesn't step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you'll have more attempts at dangerous stepping-back that you have to catch, but it doesn't break the strategy.

The strategy does break if, when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered useless for anything else. There being more baseline attempts probably raises the chance of that, or of some other problem that makes prolonged censorship while maintaining capabilities impossible. But again, that just makes it harder; it doesn't break it.

I don't think you need to have that pile-on property to be useful. Consider MTTR(n), the mean time an LLM takes to realize it's made a mistake, parameterized by how far up the stack the mistake was. By default you'll want short MTTR for all n. But if you can get your MTTR short enough for small n, you can afford to have long MTTR for large n. Basically, this agent tends to get stuck/rabbit-hole/nerd-snipe, but only when the mistake that caused it to get stuck was made a long time ago.

Imagine a capabilities scheme where you train MTTR using synthetic data with an explicit stack and intentionally introduced mistakes. If you're worried about this destabilization threat model, there's a pretty clear recommendation: only train for small-n MTTR, treat large-n MTTR as a dangerous capability, and pay some alignment tax in the form of inefficient MTTR training and occasionally rebooting your agent when it does get stuck in a non-dangerous case.

Figured I should get back to this comment, but unfortunately the chewing continues. Hoping to get a short post out soon with my all-things-considered thoughts on whether this direction has any legs.
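The MTTR(n) idea above can be sketched as a toy simulation. Everything here is an assumption for illustration: the geometric detection model (the agent notices a mistake each step with probability 1/base^n) is a made-up stand-in for whatever the real relationship between stack depth and recovery time is, not anything measured.

```python
import random
from statistics import mean

def time_to_recover(n, trials=1000, base=2.0, seed=0):
    """Toy model of MTTR(n): a mistake planted n frames up the planning
    stack is noticed each step with probability p(n) = 1 / base**n, so
    the mean time to realize it (MTTR) is roughly base**n steps."""
    rng = random.Random(seed)
    p = 1.0 / base ** n
    times = []
    for _ in range(trials):
        t = 1
        while rng.random() >= p:  # keep working until the mistake is noticed
            t += 1
        times.append(t)
    return mean(times)

# Small-n MTTR stays short while large-n MTTR blows up -- the regime the
# comment suggests treating as a dangerous capability.
print(time_to_recover(1))  # roughly 2 steps
print(time_to_recover(5))  # roughly 32 steps
```

Under this toy model, "only train for small-n MTTR" amounts to only applying training pressure to the small-n end of this curve and tolerating the long tail at large n.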

Extremely underrated post, I'm sorry I only skimmed it when it came out.

I found 3a,b,c to be strong and well written, a good representation of my view. 

In contrast, 3d I found to be a weak argument that I didn't identify with. In particular, I don't think internal conflicts are a good way to explain the source of goal misgeneralization. To me it's better described as just overfitting or misgeneralization.[1] Edge cases in goals are clearly going to be explored by a stepping back process, if initial attempts fail. In particular if attempted pathwa... (read more)

2james.lucassen
Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of "transformative human level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.
4james.lucassen
Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure, total failure forces you to step back far, but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to 'spill upward' to questioning whether or not I should be doing alignment research at all". But uh, then I realized that isn't actually true :/

On consideration, yup, this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka "too difficult". Can't believe I missed this, thanks for pointing it out.

I don't think this changes the picture too much, besides increasing my estimate of how much optimization we'll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I'm interested in chewing on.

To me it seems like one important application of this work is to understanding and fixing the futarchy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.

Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?

I'm generally confused about this. Do you have thoughts? 
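The duplicate-trader idea can be illustrated with a toy sketch. This is not the logical inductor formalism, just an assumption-laden illustration of what "duplicate traders whose trades are zero-sum" would mean: the duplicate takes the exact mirror of the original's position, so their combined payoff is zero however the claim resolves.

```python
def zero_sum_payoffs(bet, price, outcome):
    """Toy market: a trader buys `bet` shares of a claim at `price`;
    a duplicate trader takes the exact opposite position. A share pays
    out 1 if the claim resolves true, else 0."""
    trader = bet * ((1.0 if outcome else 0.0) - price)
    duplicate = -trader  # mirrored position: payoffs cancel exactly
    return trader, duplicate

# Whatever the outcome, the pair's combined profit is zero, so the pair
# can move prices without extracting wealth from the rest of the market.
a, b = zero_sum_payoffs(bet=10, price=0.4, outcome=True)
print(a, b)  # 6.0 -6.0
```

Whether this pairing can be added to the LI trading formalism without breaking the inductor's guarantees is exactly the open question in the comment; the sketch only shows the accounting identity.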

2Rubi Hudson
Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.
2Rubi Hudson
I'll take a look at the linked posts and let you know my thoughts soon!

What task? All the tasks I know of that are sufficient to reduce x-risk are really hard.

1Thomas Kwa
I'm not thinking of a specific task here, but I think there are two sources of hope. One is that humans are agentic above and beyond what is required to do novel science, e.g. we have biological drives, goals other than doing the science, often the desire to use any means to achieve our goals rather than whitelisted means, and the ability and desire to stop people from interrupting us. Another is that learning how to safely operate agents at a slightly superhuman level will be progress towards safely operating nanotech-capable agents, which could also require control, oversight, steering, or some other technique. I don't think limiting agency will be sufficient unless the problem is easy, and then it would have other possible solutions.

I think the term is very reasonable and basically accurate, even more so with regard to most RL methods. It's a good way of describing a training process without implying that the evolving system will head toward optimality deliberately. I don't know a better way to communicate this succinctly, especially while not being specific about what local search algorithm is being used.

Also, evolutionary algorithms can be used to approximate gradient descent (with noisier gradient estimates), so it's not unreasonable to use similar language about both.

I'm not a huge fan of the way you imply that it was chosen for rhetorical purposes.

2Alex Turner
Without commenting on the rest for now— To be clear, I'm not alleging mal-intent or anything. I'm more pointing out memetic dynamics. The situation can look as innocent as "You genuinely believe X, and think it's important for people to get X, and so you iterate over explanations until you find an effective one." And maybe that explanation just happens to involve analogizing that ML "selects for low loss." 

I read about half of this post when it came out. I didn't want to comment without reading the whole thing, and reading the whole thing didn't seem worth it at the time. I've come back and read it because Dan seemed to reference it in a presentation the other day.

The core interesting claim is this:

My conclusion will be that most of the items on Bostrom's laundry list are not 'convergent' instrumental means, even in this weak sense. If Sia's desires are randomly selected, we should not give better than even odds to her making choices which promote her own su

... (read more)
4J. Dmitri Gallow
Thanks for the read and for the response.

> None of your models even include actions that are analogous to the convergent actions on that list.

I'm not entirely sure what you mean by "model", but from your use in the penultimate paragraph, I believe you're talking about a particular decision scenario Sia could find herself in. If so, then my goal wasn't to prove anything about a particular model, but rather to prove things about every model.

> The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on.

Sure. I started with the easy cases to get the main ideas out. Section 4 then showed how those initial results extend to the case of sequential decision making.

> Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don't see why you didn't just remove Newcomb-like situations from the model?

I used the Newcomb problem to explain the distinction between sophisticated and resolute choice. I wasn't assuming that Sia was going to be facing a Newcomb problem; I just wanted to help the reader understand the distinction. The distinction is important, because it makes a difference to how Sia will choose. If she's a resolute chooser, then sequential decisions reduce to a single non-sequential decision: she just chooses a contingency plan at the start, and then sticks to that contingency plan. Whereas if she's a sophisticated chooser, then she'll make a series of non-sequential decisions. In both cases, it's important to understand how she'll choose in non-sequential decisions, which is why I started off thinking about that in section 3.

> It seems clear to me that for the vast majority of the random utility functions, it's very valuable to have more control over the future world state. So most sampled agents will take the instrument

Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I'll remove that paragraph. Thanks for the links, I hadn't read those, and I appreciate the pseudocode.

I think most likely I still don't understand what you mean by grader-optimizer, but it's probably better to discuss on your post after I've spent more time going over your posts and comments.

My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)?
And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?

2Alex Turner
Thanks for registering a guess! I would put it as: a grader-optimizer is something which is trying to optimize the outputs of a grader as its terminal end (either de facto, via argmax, or via intent alignment, as in "I wanna search for plans which make this function output a high number"). Like, the point of the optimization is to make the number come out high.

(To help you checksum: it feels important to me that "is good at achieving its goals" is not tightly coupled to "approximating argmax", as I'm talking about those terms. I wish I had fast ways of communicating my intuitions here, but I'm not thinking of something more helpful to say right now; I figured I'd at least comment what I've already written.)