All of cfoster0's Comments + Replies

If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you're actually measuring the effect of scale alone, rather than scale confounded by performance.
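
A minimal sketch of what that comparison could look like, assuming each training run has saved a series of (step, validation loss, checkpoint path) records; the names here are illustrative, not from any particular codebase:

```python
# Compare models of different sizes at (approximately) the same loss value,
# so that measured differences reflect scale rather than performance.

def checkpoint_at_loss(run_checkpoints, target_loss):
    """Pick the checkpoint whose validation loss is closest to target_loss.

    run_checkpoints: list of (step, val_loss, path) tuples for one model size.
    """
    return min(run_checkpoints, key=lambda ckpt: abs(ckpt[1] - target_loss))

def compare_at_matched_loss(checkpoints_by_size, target_loss, evaluate):
    """Run `evaluate` on each model size's loss-matched checkpoint.

    checkpoints_by_size: dict mapping model size -> list of (step, val_loss, path).
    evaluate: whatever measurement you care about (e.g. a safety-relevant probe).
    """
    return {
        size: evaluate(checkpoint_at_loss(ckpts, target_loss)[2])
        for size, ckpts in checkpoints_by_size.items()
    }
```

The target loss has to be one that every model size actually reaches, so in practice it would be chosen near the final loss of the smallest model.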

1Sam Bowman
That makes sense, though what's at stake with that question? In almost every safety-relevant context I can think of, 'scale' is just used as a proxy for 'the best loss I can realistically achieve in a training run', rather than as something we care about directly.

Agreed that this (or something near it) appears to be a relatively central difference between people's models, and probably at the root of a lot of our disagreement. I think this disagreement is quite old; you can see bits of it crop up in Hanson's posts on the "AI foom" concept way back when. I would put myself in the camp of "there is no such binary intelligence property left for us to unlock". What would you expect to observe, if a binary/sharp threshold of generality did not exist?

A possibly-relevant consideration in the analogy to computation is that ... (read more)

3Thane Ruthenis
Great question! I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize poorer, and for the world overall to be more incomprehensible to us. E. g., there'd be things like, we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening.

As it is, sure, some animals have specialized cognitive algorithms that may be better than human ones in their specific niches, but we seem to always be able to comprehend them. We can always figure out why they decide to execute various plans, based on what evidence, and how these plans lead to whatever successes they achieve. A human can model any animal's cognition; a human's cognition is qualitatively more capable than any animal's. If true generality didn't exist, I'd expect that not to be true.

Scaling it up, the universe as a whole would be more incomprehensible. I'd referred to ontologically complex processes when discussing that in Section 3 — processes such that there are no cognitive features in our minds that would allow us to emulate them. That'd be the case all over the place: we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

The fact that we observe neither (and that this state of affairs is even hard/impossible for us to imagine) suggests that we're fully general, in the sense outlined in the post.

Yup. But I think there are some caveats here. General intelligence isn't just "some cognitive system that has a Turing-complete component inside it", it's "a Turing-complete system for manipulating some specific representations". I think general intelligence happens when we amass some critical mass of shards/heuristics + world-model concepts they're defined over, then some component of that system (planner? shard-bid resolver? cross-heuristic communication channel? rules... (read more)

It would be great if you're able to comment on more directional takeaways for the biological anchors framework. Reading Section 5.4 it's hard to tell at a glance whether each of the points weighs toward an upward revision of long-horizon anchor estimates or a downward one.

3Jacob Hilton
I think the direction depends on what your expectations were – I'll try to explain.

First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of "effective horizon length" in this comment.

The paper studies the effect of the horizon length on the scaling multiplier in a toy MNIST setting. One key takeaway is that the scaling multiplier is not simply proportional to the horizon length, as one might have naively expected. Instead, the number of samples required is the sum of two components, one that is inherent to the task and independent of the horizon length, and one that is proportional to the horizon length. Compared to the naive expectation, this means that training compute requirements are lower. On the other hand, this ignores reward sparsity, so you might expect training compute requirements to be higher once both horizon length and reward sparsity are accounted for.

The paper also lends some support to the modeling assumptions of the neural network anchor, by validating the hypotheses that (a) training compute requirements still scale as a power law in model size for reinforcement learning, and with a similar exponent, and (b) the scaling multiplier can indeed vary a lot between environments. This might make you put more weight on the neural network anchor, which could again have either directional effect.

The other takeaways are more methodological and I don't think have much of a directional effect.
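
In symbols, the additive relationship described above — with N the number of samples required, H the horizon length, and N_task, k task-dependent constants (notation introduced here, not necessarily the paper's):

```latex
N(H) \;=\; N_{\text{task}} \;+\; k \cdot H
\qquad \text{rather than the naive expectation} \qquad
N(H) \propto H
```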

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

Agreed. Generally, whenever I talk about the agent being smart/competent, I am a... (read more)

2Thane Ruthenis
Alright, seems we're converging on something.

How would this machinery appear, then? I don't see how it'd show up without being built into the agent by the optimization algorithm, and the optimization algorithm will only build it if it serves the goal of achieving R. Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)

Consider the opposite case, where the heuristics generator creates heuristics that are pointing away from R. Inasmuch as the agent would be making use of them, it would be engaging in anti-R activities (or, at best, R-orthogonal activities), spending resources on non-R-serving agendas. There'd be no incentive for the optimization algorithm to build something like this; and if it does show up somehow, there should be gradients for destroying the heuristics generator, or (better yet) re-aiming it to point in a direction closer to R.

Mm, disagree. Homo sapiens sapiens spent the majority of their history as pre-agricultural hunter-gatherers, very much in the evolution's loop. Similarly, I don't expect an AGI to instantly become situationally aware and gradient-hacking-capable — the sharp left turn won't be quite this sharp. I agree that it'd break out/gradient-hack at some point, but there'd be some time for the optimization algorithm to work on the heuristics generator, re-aiming it in R's direction.

(And, just to illustrate, that's a point where we can make the training process more "idealized", by e. g. homomorphically encrypting the training process so that the AI can't hack its way out of the software implementing the training. It'd still be able to hack the gradients, but that's a more difficult task, requiring stronger general capabilities. So this would prolong the interval in which it's subjected to the SGD, and therefore give the SGD more opportunities to make its heuristics generator marginally more aligned with R. (Note that I'm not saying it'd be d... (read more)

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

It is not essentially an R-pursuing wrapper-mind. It is essentially an X-pursuing wrapper-mind that ... (read more)

4Thane Ruthenis
Yeah, okay, maybe that wasn't the right frame to use. Allow me to pivot:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

This is what I mean by a "sufficiently diverse" environment — an environment that forces the greedy optimization process to build not only contextual heuristics into the agent, but also some generator of such heuristics. And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction — or, at least, that's how the greedy optimization process would attempt to build it.

That generator would, in addition, need to be higher in hierarchy than any given heuristic — it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

* I'm ambivalent on the structure of the heuristic-generator. It may be a fixed wrapper, it may be some emergent property of a shard economy, and my actual expectation is that it'll be even more convoluted than that.
* I emphatically agree that inner misalignment and deceptive alignment would remain a thing — that the SGD would fail at perfectly aligning the heuristic-generator, and it would end up generating heuristics that point at a proxy of R.
* I agree with nostalgebraist's post that autonomy is probably the missing component of AGI. On the flipside, that means I'm arguing that AGI is impossible without autonomy, i. e. a training environment tha... (read more)

Yeah I disagree pretty strongly with this, though I am also somewhat confused what the points under contention are.

I think that there are two questions that are separated in my mind but not in this post:

  1. What will the motivational structure of the agent that a training process produces be? (a wrapper-mind? a reflex agent? a bundle of competing control loops? a hierarchy of subagents?)
  2. What will the agent that a training process produces be motivated towards? (the literal selection criterion? a random correlate of the selection criterion? a bunch of correl
... (read more)
4Thane Ruthenis
... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

I suppose you can instead reframe this post as making a claim about target behavior, not structure. But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure. Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R. No?

I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R.

An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation. The actual training processes we get are only approximations of that ideal — they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output a... (read more)

Broadly on board with many of your points.

We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about how that decision-making algorithm should be structured, not about whether it should be optimized/optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like "for most people, the actually consequentialist-correct choice is NOT to tr... (read more)

4Rohin Shah
I agree that's what the post does, but part of my response is that the thing we care about is both A and B, and the problems that arise for grader-optimization in A (highlighted in this post) also arise for value-instilling in B in slightly different form, and so if you actually want to compare the two proposals you need to think about both.

I'd be on board with a version of this post where the conclusion was "there are some problems with grader-optimization, but it might still be the best approach; I'm not making a claim on that one way or the other".

I didn't actually mention this in my comment, but I don't buy this argument:

Case 1: no meta cognition. Grader optimization only "works at cross purposes with itself" to the extent that the agent thinks that the grader might be mistaken about things. But it's not clear why this is the case: if the agent thinks "my grader is mistaken" that means there's some broader meta-cognition in the agent that does stuff based on something other than the grader. That meta-cognition could just not be there and then the agent would be straightforwardly optimizing for grader-outputs.

As a concrete example, AIXI seems to me like an example of grader-optimization (since the reward signal comes from outside the agent). I do not think AIXI would "do better according to its own interests" if it "discarded" its grader-optimization.

You can say something like "from the perspective of the human-AI system overall, having an AI motivated by grader-optimization is building a system that works at cross purposes with itself", but then we get back to the response "but what is the alternative".

Case 2: with meta cognition. If we instead assume that there is some meta cognition reflecting on whether the grader might be mistaken, then it's not clear to me that this failure mode only applies to grader optimization; you can similarly have meta cognition reflecting on whether values are mistaken. Suppose you instill diamond-values into an AI. Now the... (read more)

Certainly possible. Though we seem to be continually marching down the list of tasks we once thought "can only be done with systems that are really general/agentic/intelligent" (think: spatial planning, playing games, proving theorems, understanding language, competitive programming...) and finding that, nope, actually we can engineer systems that have the distilled essence of that capability.

That makes a deflationary account of cognition, where we never see the promised reduction into "one big insight", but rather chunks of the AI field continue to break ... (read more)

2Donald Hobson
Decomposed tasky AIs are pretty useful. Given we don't yet know how to build powerful agents, they are better than nothing. This is entirely consistent with a world where, once agenty AI is developed, it beats the pants off tasky AI.

It's quite hard to find a system with short-term terminal goals, as opposed to a short-term planning horizon due to computational limits. To put it another way, taskiness is an unsolved problem in AI alignment. We don't know how to tell a superintelligent AGI "do this, don't do anything else, especially please don't disassemble everyone in the process of doing this, and stop after you've done this".

I dunno. The current state of traditional and neural AI looks very much like "we only know how to build tasky systems", not like "we don't know how to build tasky systems". They ... (read more)

1quetzal_rainbow
I want to say "yes, but this is different", but not in the sense "I acknowledge existence of your evidence, but ignore it". My intuition tells me that we don't "induce" taskiness in the modern systems, it just happens because we build them not general enough. It probably won't hold when we start buliding models of capable agents in natural environment.

Object-level comments below.

Clearing up some likely misunderstandings:

Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.

I am fairly confident that this is not the part TurnTrout/Quintin were disagreeing with you on. Such an agent plausibly will be doing at least human-level hypothesis generation. The question is what goals will be driving the agent. A monk may be able to generate the hypothesis that narcotics would feel intensely rewarding, more rewarding than... (read more)

1michaelcohen
The assumption says "will do" not "will be able to do".  And the dynamics of the unknown environment includes the way it outputs rewards. So the assumption was not written in a way that clearly flags its entailment of the agent deliberately modeling the origin of reward, and I regret that, but it does entail that. So that was why engage with the objection that reward is not the optimization target under this section. There is no need to recruit the concept of "terminal" here for following the argument about the behavior of a policy that performs well according to the RL objective. If the video game playing agent refines its understanding of "success" according to how much reward it observes, and then pursues success, but it does all this because of some "terminal" reason X, that still amounts to deliberate reward optimization, and this policy still satisfies Assumptions 1-4. If I want to analyze what would probably happen if Edward Snowden tried to enter the White House, there's lots I can say without needing to understand what deep reason he had for trying to do this. I can just look at the implications of his attempt to enter the White House: he'd probably get caught and go to jail for a long time. Likewise, if an RL agent is trying to maximize is reward, there's plenty of analysis we can do that is independent of whether there's some other terminal reason for this.
  1. Peer review is not a certification of validity, even in more rigorous venues. Not even close.
  2. I am used to seeing questionable claims forwarded under headlines like "new published study says XYZ".
  3. That XYZ was peer reviewed is one of the weaker arguments one could make in its favor, so when someone uses that as a selling point, it indicates to me that there aren't better reasons to believe in XYZ. (Analogously, when I see an ML paper boast that their new method is "competitive with" the SOTA, I immediately think "That means they tried to beat the SOTA, but found their method was at least a little worse. If it was better, they would've said so.")
2michaelcohen
Do you think the peer reviewers and the editors thought the argument was valid? Peer review can definitely issue certificates mistakenly, but validity is what it aims to certify.

It didn't strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.

1Noosphere89
Quick question, but why do you have that reaction?

I really wish this post took a different rhetorical tack. Claims like, for example, the one that the reader should engage with your argument because "it has been certified as valid by professional computer scientists" do the post a real disservice. And they definitely made me disinclined to continue reading.

1michaelcohen
Not trying to be arrogant. Just trying to present readers who have limited time with a quickly digestible bit of evidence about the likelihood that the argument is a shambles.

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
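
A minimal sketch of the zero-latent trick for the GAN case, assuming some pretrained generator mapping Gaussian latents to images (loading one is model-specific and omitted here):

```python
import torch

def sample_at_latent_mode(generator, latent_dim=512):
    """Decode the mode (all-zeros point) of the Gaussian latent prior.

    `generator` is assumed to be a pretrained GAN generator mapping
    z ~ N(0, I) latents to image tensors; `latent_dim` is illustrative.
    """
    z_mode = torch.zeros(1, latent_dim)   # highest-density point of N(0, I)
    with torch.no_grad():
        return generator(z_mode)          # in practice: an ordinary, non-Eldritch face
```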

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier &... (read more)

3David Scott Krueger
(sort of nitpicking): I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use the change-of-variables formula). I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. normalizing flow) and see if it actually does.
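
For reference, the change-of-variables formula being referred to — for an invertible generator x = f(z) with latent prior p_Z, the pixel-space density is

```latex
p_X(x) \;=\; p_Z\!\left(f^{-1}(x)\right)\,\left|\det J_{f^{-1}}(x)\right|
```

which normalizing flows are built to evaluate exactly.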

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predic... (read more)
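
A rough sketch of that discriminator-side optimization (hypothetical names; plain gradient ascent on the discriminator's score, in the spirit of deep-dream-style activation maximization):

```python
import torch

def maximize_faceness(discriminator, steps=500, lr=0.05, shape=(1, 3, 64, 64)):
    """Gradient-ascend a 'faceness' score with respect to the pixels themselves.

    `discriminator` is assumed to map an image tensor to a scalar face-ness score.
    """
    image = torch.randn(shape, requires_grad=True)   # start from noise
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = discriminator(image).sum()           # scalar score to maximize
        (-score).backward()                          # minimizing the negative = ascending the score
        optimizer.step()
        with torch.no_grad():
            image.clamp_(-1.0, 1.0)                  # keep pixels in a valid range
    return image.detach()
```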

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
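
In symbols (notation mine): with Q as the essay-quality function and p as the distribution of human essays, the relevant object is

```latex
s^{*} \;=\; \arg\max_{s}\, Q(s) \qquad \text{rather than} \qquad \arg\max_{s}\, p(s)
```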

Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.

Not the OP but this jumped out at me:

If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn't (and the reverse will not happen, or at least will happen less often and only due to random noise).

This failure mode seems plausible ... (read more)

3johnswentworth
Yup, that's a valid argument. Though I'd expect that gradient hacking to the point of controlling the reinforcement on one's own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).

At the risk of reading too much into wording, I think the phrasing of the above two comments contains an interesting difference.

The first comment (TurnTrout) talks about reward as the thing providing updates to the agent's cognition, i.e. "reward schedules produce ... cognitive updates", and expresses confusion about a prior quote that mentioned implementing our wishes through reward functions.

The second comment (paulfchristiano) talks about picking "rewards that would implement human wishes" and strategies for doing so.

These seem quite different. If I try... (read more)

2Paul Christiano
I don't particularly object to that framing, it's just a huge gap from "Rewards have unpredictable effects on an agent's cognition, not necessarily causing it to want reward" to "we have a way to use RL to interpret and implement human wishes."