All of davidad's Comments + Replies

Paralysis of the form "AI system does nothing" is the most likely failure mode. This is a "de-pessimizing" agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and that only talk to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

"Locked into some least-harmful path" is ... (read more)

1Joe Collman
(understood that you'd want to avoid the below by construction through the specification)

I think the worries about a "least harmful path" failure mode would also apply to a "below 1 catastrophic event per millennium" threshold. It's not obvious to me that the vast majority of ways to [avoid significant risk of catastrophe-according-to-our-specification] wouldn't be highly undesirable outcomes.

It seems to me that "greatly penalize the additional facts which are enforced" is a two-edged sword: we want various additional facts to be highly likely, since our acceptability specification doesn't capture everything that we care about.

I haven't thought about it in any detail, but doesn't using time-bounded utility functions also throw out any acceptability guarantee for outcomes beyond the time-bound?

It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system's pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications a... (read more)

2Joe Collman
[again, the below is all in the spirit of "I think this direction is plausibly useful, and I'd like to see more work on it"]

Sure, but this seems to say "Don't worry, the malicious superintelligence can only manipulate your mind indirectly". This is not the level of assurance I want from something calling itself "Guaranteed safe".

This is one mechanism by which such a system could cause great downstream harm. Suppose that we have a process to avoid this. What assurance do we have that there aren't other mechanisms to cause harm?

I don't yet buy the description complexity penalty argument (as I currently understand it - but quite possibly I'm missing something). It's possible to manipulate by strategically omitting information. Perhaps the "penalise heavily biased sampling" is intended to avoid this (??). If so, I'm not sure how this gets us more than a hand-waving argument. I imagine it's very hard to do indirect manipulation without adding much complexity. I imagine that ASL-4+ systems are capable of many very hard things.

Similar reasoning leads me to initial skepticism of all [safety guarantee by penalizing some-simple-x] claims. This amounts to a claim that reducing x necessarily makes things safer - which I expect is untrue for any simple x. I can buy that there are simple properties whose reduction guarantees safety if it's done to an extreme degree - but then I'm back to expecting the system to do nothing useful.

As an aside, I'd note that such processes (e.g. complexity penalties) seem likely to select out helpful behaviours too. That's not a criticism of the overall approach - I just want to highlight that I don't think we get to have both [system provides helpful-in-ways-we-hadn't-considered output] and [system can't produce harmful output]. Allowing the former seems to allow the latter.

That's probably a good idea, but this kind of approach doesn't seem in keeping with a "Guaranteed safe" label. More of a "We haven't yet found a way in which this is

Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

3Vanessa Kosoy
Sorry, that footnote is just flat wrong, the order actually doesn't matter here. Good catch! There is a related thing which might work, namely taking the downwards closure of the affine subspace w.r.t. some cone which is somewhat larger than the cone of measures. For example, if your underlying space has a metric, you might consider the cone of signed measures which have non-negative integral with all positive functions whose logarithm is 1-Lipschitz.

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).
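For concreteness, a spec of the kind gestured at here might be a pLTL formula along the following lines; the atomic predicates, probability thresholds, and time bound are hypothetical illustrations, not drawn from OAA itself:

P_{≥0.999}[ G ¬(catastrophe ∨ boundary_violation) ]  ∧  P_{≥0.95}[ F_{≤T} task_complete ]

i.e., with probability at least 0.999 under the world model, no catastrophe or boundary violation ever occurs (G = "globally"), and with probability at least 0.95 the task completes within the time bound T (F = "finally").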

There is still a misuse version: someon... (read more)

5Wei Dai
Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion: I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation combined with "high degree of Knightian uncertainty about human decisions and behaviour" actually cause the AI to "not interfere" but also still accomplish the goals that we give it? In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in o... (read more)

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to... (read more)
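As a toy illustration of "sound but not necessarily tight": suppose some coarse-graining step (not shown) has already produced per-transition upper bounds over a handful of abstract cells; a finite backward induction then yields a guaranteed upper bound on the probability of reaching the unsafe region within a horizon. A looser coarse-graining gives a looser bound, never an unsound one. (All numbers here are made up.)

```python
import numpy as np

# Hypothetical output of a coarse-graining step: upper bounds on transition
# probabilities between 4 abstract cells (rows may sum to more than 1).
P_upper = np.array([
    [0.90, 0.15, 0.05, 0.01],
    [0.20, 0.80, 0.10, 0.02],
    [0.05, 0.30, 0.70, 0.10],
    [0.00, 0.00, 0.00, 1.00],   # cell 3 is the "unsafe" absorbing region
])
unsafe = np.array([False, False, False, True])

def reach_upper_bound(P_upper, unsafe, horizon):
    """Conservative upper bound on P(reach unsafe within `horizon` steps),
    sound whenever P_upper dominates the true transition probabilities."""
    ub = unsafe.astype(float)                # at horizon 0 the bound is 1 on unsafe cells
    for _ in range(horizon):
        ub = np.minimum(1.0, P_upper @ ub)   # cap at 1: probabilities can't exceed 1
        ub = np.maximum(ub, unsafe.astype(float))
    return ub

print(reach_upper_bound(P_upper, unsafe, horizon=50))  # one bound per starting cell
```

The finiteness of the computation and the soundness of the bound are compatible; tightness is where the cleverness (and the compute) goes.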

3davidad (David A. Dalrymple)
That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
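A minimal sketch of that accounting rule (the run names and FLOP figures are hypothetical): a run's effective compute is its own compute plus that of every transitive post-cutoff dependency, each counted once.

```python
# Hypothetical dependency graph: each training run lists its own compute (FLOPs)
# and the prior post-September-2021 runs whose outputs it builds on.
runs = {
    "X": {"flops": 8e24, "deps": []},      # foundation model pretraining
    "Y": {"flops": 5e23, "deps": ["X"]},   # fine-tune producing Model Y
    "Z": {"flops": 1e23, "deps": ["Y"]},   # fine-tune of Model Y
}

def effective_compute(run, runs, seen=None):
    """Compute counted against the size limit: the run's own compute plus that
    of all transitive dependencies, each counted once."""
    seen = set() if seen is None else seen
    if run in seen:
        return 0.0
    seen.add(run)
    return runs[run]["flops"] + sum(
        effective_compute(dep, runs, seen) for dep in runs[run]["deps"]
    )

print(effective_compute("Z", runs))  # 8e24 + 5e23 + 1e23 all count against Z's limit
```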

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

I think formal verification belongs in the "requires knowing what failure looks like" category.

For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".
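The properties in question have roughly this shape (stated generically; x₀ is a particular input, y₀ its label, ε a perturbation radius, and f the network's logit map):

∀x: ‖x − x₀‖_∞ ≤ ε  ⟹  argmax_i f(x)_i = y₀

A proof of such a statement is about the function the network computes, not about why it computes it.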

1Stephen Casper
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.

This pattern is instantiated in the Open Agency Architecture twice:

  1. Step 2 generates top-level agents which are time-bounded at a moderate timescale (~days), with the deliberation about whether to redeploy a top-level agent being carried out by human operators.
  2. In Step 4, the top-level agent dispatches most tasks by deploying narrower low-level agents with much tighter time bounds, with the deliberation a
... (read more)

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
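A minimal sketch of that recipe (not the code from the post; the layer index and injection coefficient are arbitrary illustrative choices): derive a steering vector from a single contrastive pair of forward passes and add it into the residual stream with a hook during generation.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
LAYER = 6  # illustrative choice of block whose output we steer

def resid_after_layer(prompt):
    # One forward pass; hidden_states[LAYER + 1] is the output of block LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1, :].detach()

# Steering vector from a single contrastive pair: no dataset, no backward passes.
steer = resid_after_layer("Love") - resid_after_layer("Hate")

def hook(module, inputs, output):
    # Add the steering vector into the residual stream during generation.
    h = output[0] if isinstance(output, tuple) else output
    h = h + 4.0 * steer  # 4.0 is an arbitrary injection coefficient
    return (h,) + output[1:] if isinstance(output, tuple) else h

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I think you are", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```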

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

Dan H

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

 

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).
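A toy illustration of the distinction: the two functions below are extensionally equivalent (identical input/output behaviour) but intensionally different (one iterates, one uses a closed form), so a purely behavioural (extensional) property holds of either both or neither.

```python
def sum_to_n_loop(n: int) -> int:
    # Intensionally: iterates and accumulates.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_to_n_formula(n: int) -> int:
    # Intensionally: closed-form arithmetic.
    return n * (n + 1) // 2

# Extensionally equivalent: the same mapping from inputs to outputs.
assert all(sum_to_n_loop(n) == sum_to_n_formula(n) for n in range(1000))
```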

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

  1. I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally repres

... (read more)

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those poi... (read more)

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

  • you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
  • your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorre
... (read more)
2Rohin Shah
Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.
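A minimal sketch of the kind of filtering meant here, assuming IEEE-754 doubles: canonicalize every NaN to one fixed bit pattern so that no payload bits can carry information.

```python
import struct

# Canonical quiet NaN for IEEE-754 binary64: sign 0, exponent all ones, top mantissa bit set.
CANONICAL_QNAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]

def scrub_nan_payload(x: float) -> float:
    """Replace any NaN, whatever its sign and payload bits, with one canonical quiet NaN."""
    return CANONICAL_QNAN if x != x else x  # NaN is the only float unequal to itself

print(struct.pack("<d", scrub_nan_payload(float("nan"))).hex())  # always the same bit pattern
```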

1Donald Hobson
I think each little decision is throwing another few bits of info. A few bits for deciding how big the mantissa and exponent should be. A few bits for it being a 64-bit float. A few bits for subnormals. A few bits for inf and NaN. A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info.

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

2Donald Hobson
Sure, binary is fairly natural, but there are a lot of details of IEEE floats that aren't. https://en.wikipedia.org/wiki/Subnormal_number

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) such that it conservatively implies that some of the most important bad things are limited and that there's some baseline minimum of good things (e.g. everyone has access to reso... (read more)

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model?

Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.

In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

  1. Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire net
... (read more)

I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up potentially exploitable holes where AI can deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.)

Engineering... (read more)

For most s I agree, but the existence of any adversarial examples for would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)

I find Eric Drexler's arguments convincing about how it seems possible to make collusion very unlikely. On the other hand, I do think it requires nontrivial design and large ensembles; in the case of an unconstrained 2-player game (like Safety via Debate), I side with Eliezer that the probability of collusion probably converges toward 1 as capabilities get more superintelligent.

Another key principle that I make use of is algorithms (such as branch-and-bound and SMT solvers) whose performance—but not their correctness—depends on extremely clever heuristics.... (read more)

2davidad (David A. Dalrymple)
I find Eric Drexler's arguments convincing about how it seems possible to make collusion very unlikely. On the other hand, I do think it requires nontrivial design and large ensembles; in the case of an unconstrained 2-player game (like Safety via Debate), I side with Eliezer that the probability of collusion probably converges toward 1 as capabilities get more superintelligent. Another key principle that I make use of is algorithms (such as branch-and-bound and SMT solvers) whose performance—but not their correctness—depends on extremely clever heuristics. Accelerating the computation of more accurate and useful bounds seems to me like a pretty ineffectual causal channel for the AIs playing those heuristic roles to coordinate with each other or to seek real-world power.
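A toy version of that principle: below, the branching heuristic can be arbitrarily clever or arbitrarily unhelpful, and it only affects how quickly the search converges, never whether the returned interval actually brackets the true maximum. (The objective, Lipschitz constant, and tolerance are all made up for illustration.)

```python
import math

def f(x):                       # objective to maximize on [0, 4]
    return math.sin(3 * x) + 0.5 * math.cos(7 * x)

def upper_bound(lo, hi):
    # Sound bound from a crude Lipschitz argument: |f'| <= 3 + 3.5 = 6.5.
    mid = (lo + hi) / 2
    return f(mid) + 6.5 * (hi - lo) / 2

def branch_and_bound(heuristic, tol=1e-3):
    best = f(0.0)                           # incumbent lower bound on the max
    boxes = [(0.0, 4.0)]
    while boxes:
        # The heuristic only chooses *which* box to split next.
        lo, hi = boxes.pop(heuristic(boxes))
        if upper_bound(lo, hi) <= best + tol:
            continue                        # soundly pruned: can't beat the incumbent
        mid = (lo + hi) / 2
        best = max(best, f(mid))
        boxes += [(lo, mid), (mid, hi)]
    return best, best + tol                 # interval guaranteed to contain the true max

print(branch_and_bound(lambda boxes: 0))                # breadth-first-ish heuristic
print(branch_and_bound(lambda boxes: len(boxes) - 1))   # depth-first heuristic: same answer
```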

Inner misalignment is a story for why one might expect capable but misaligned out-of-distribution behaviour, which is what's actually bad. Model-checking could rule that out entirely (relative to the formal specification)— whether it's "inner misalignment" or "goal misgeneralization" or "deceptive alignment" or "demons in Solmonoff induction" or whatever kind of story might explain such output. Formal verification is qualitatively different from the usual game of debugging whack-a-mole that software engineers play to get software to behave acceptably.


Nate [replying to Eric Drexler]: I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there. Like, I see this plan as basically saying "yep, that hard problem is in fact too hard, let's try to dodge it, by having humans + narrow AI services perform the pivotal act". Setting aside how I don't particularly expect this to work, we can at lea

... (read more)
2Eric Drexler
GPT-Nate is confusing the features of the AI services model with the argument that "Collusion among superintelligent oracles can readily be avoided". As it says on the tin, there's no assumption that intelligence must be limited. It is, instead, an argument that collusion among (super)intelligent systems is fragile under conditions that are quite natural to implement.

if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever.

[I at most skimmed the post, but] IMO this is a more ambitious goal than the IMO central problem. IMO the central problem (phrased with more assumptions than strictly necessary) is more like "building system that's gaining a bunch of understanding you don't already have, in whatever domains are necessary for achieving some impressive real-world task, without killing you". So I'd guess that's supposed to happen in step 1. It's debata... (read more)

GPT-3.5-Nate [prompted using 1061 tokens of Nate's criticism of related ideas]: This is a really interesting proposal! It seems to me that you're trying to solve the hard problem of AI alignment, by proposing a system that would be able to identify the concepts that an AGI should be optimizing for, and then use a combination of model-checking and time-bounded optimization to ensure that the AGI's behavior is robustly directed at those concepts. This is a really ambitious project, and I'm excited to see how it develops.

🙃

How would you respond to predicted objections from Nate Soares?

3davidad (David A. Dalrymple)
I think, in an open agency architecture, the silo that gets "enough AGI" is in step 2, and it is pointed at the desired objective by having formal specifications and model-checking against them. But I also wouldn't object to the charge that an open agency architecture would "route around the central problem," if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever. In the long run, something like more ambitious value learning (or value discovery) will be needed, on pain of astronomical waste. This would be, in a sense, a compromise (or, if you're optimistic, a contingency plan), motivated by short timelines and insufficient theoretical progress toward full normative alignment.

There's a lot of similarity. People (including myself in the past) have criticized Russell on the basis that no formal model can prove properties of real-world effects, because the map is not the territory, but I now agree with Russell that it's plausible to get good enough maps. However:

  • I think it's quite likely that this is only possible with an infra-Bayesian (or credal-set) approach to explicitly account for Knightian uncertainty, which seems to be a difference from Russell's published proposals (although he has investigated Halpern-style probability l
... (read more)

Is this basically Stuart Russell's provably beneficial AI?

3davidad (David A. Dalrymple)
There's a lot of similarity. People (including myself in the past) have criticized Russell on the basis that no formal model can prove properties of real-world effects, because the map is not the territory, but I now agree with Russell that it's plausible to get good enough maps. However:

  • I think it's quite likely that this is only possible with an infra-Bayesian (or credal-set) approach to explicitly account for Knightian uncertainty, which seems to be a difference from Russell's published proposals (although he has investigated Halpern-style probability logics, which have some similarities to credal sets, he mostly gravitates toward frameworks with ordinary Bayesian semantics).
  • Instead of an IRL or CIRL approach to value learning, I propose to rely primarily on linguistic dialogues that are grounded in a fully interpretable representation of preferences. A crux for this is that I believe success in the current stage of humanity's game does not require loading very much of human values.

As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between and seem inconsistent?).

I think a "minimal latent," as you have defined it equationally, is a categorical product (of the observed variables) in the coslice category under the implicit sample space with respect to which all the random variables are defined, where the ambient category is the category of Markov kernels.

I think it's too easy for someone to skim this entire post and still completely miss the headline "this is strong empirical evidence that mesa-optimizers are real in practice".

2tailcalled
I don't think so. Like technically yes, it shows that there is an internal optimization process that is running in the networks, but much of the meat of optimization such as instrumental convergence/power-seeking depends on the structure of the function one is optimizing over. If the function is not consequentialist - if it doesn't attempt to compute the real-world consequences of different outputs and grade things based on those consequences - then much of the discussion about optimizers does not apply.
2Lawrence Chan
Sure, edited the post to clarify.

This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough.

I feel slightly better about RLHF now, but not much.

It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first a... (read more)

1Buck Shlegeris
Firstly, a clarification: I don't want to claim that RL-with-KL-penalty policies are the same as the results of conditioning. I want to claim that you need further assumptions about the joint distribution of (overseer score, true utility) in order to know which produces worse Goodhart problems at a particular reward level (and so there's no particular reason to think of RL as worse).

I thought that using a linear mixture of reward and a global KL penalty is (because of a Lagrange multiplier argument) the same as having a constraint on reward while minimizing KL penalty? Maybe the point you're making is that the KL between the policy and the original generative model is different on different inputs? I agree that this means that the RL policy is different than the best-of-n policy, but I don't see why either has predictably worse Goodhart problems.
2Lawrence Chan
It's also true that maximizing Reward - KL is Bayesian updating as the linked post shows, and it's true that maximizing reward subject to a KL constraint is also equivalent to Bayesian updating (by Lagrangian multipliers). You see similar results with Max Ent RL (where you maximize Reward + Entropy, which is equal to a constant minus the KL relative to a uniform distribution), for example.

Sounds like you need to increase the KL penalty, then! I don't see why this argument doesn't also apply to the conditioning case -- if you condition on a proxy reward being sufficiently high, you run into the exact same issues as with KL-regularized RL with binarized reward.

This seems like a misunderstanding of the post (and the result in general) -- it shows that doing RL with KL constraints is equivalent to Bayesian updating the LM prior with an e^{r(x)/β} likelihood (and the update people use in practice is equivalent to variational inference). You wouldn't do this updating explicitly, because computing the normalizing factor Z is too hard (as usual); instead you just optimize Reward - KL as you usually would. (Or use a decoding scheme to skip the training entirely; I'm pretty sure you can just do normal MCMC or approximate it with weighted decoding/PPLM.)
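For reference, the standard derivation behind the Bayesian-updating claim, with π₀ the initial LM, r the reward, and β the KL coefficient:

π* = argmax_π E_{x∼π}[r(x)] − β·KL(π ‖ π₀)   ⟹   π*(x) = π₀(x)·e^{r(x)/β} / Z,   where Z = Σ_x π₀(x)·e^{r(x)/β}.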

That’s not the case when using a global KL penalty—as (I believe) OpenAI does in practice, and as Buck appeals to in this other comment. In the paper linked here a global KL penalty is only applied in section 3.6, because they observe a strictly larger gap between proxy and gold reward when doing so.

[This comment is no longer endorsed by its author]
1Lawrence Chan
This doesn't seem to be what Gao et al found: Figure 9 shows that the KL between RL and initial policy, at a given proxy reward score, still is significantly larger than the equivalent KL for a BoN-policy, as shown in Figure 1.

In RLHF there are at least three different (stochastic) reward functions:

  1. the learned value network
  2. the “human clicks 👍/👎” process, and
  3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.

I think the first two correspond to what that paper calls “proxy” and “gold” but I am instead concerned with the ways in which 2 is a proxy for 3.

Extremal Goodhart relies on a feasibility boundary in (proxy, true-goal)-space that lacks orthogonality, in such a way that a maximal proxy value logically implies a non-maximal true value. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won't come up in either near-term or long-term applications.... (read more)

1Buck Shlegeris
Extremal Goodhart is not differentially a problem for RL vs conditioning, right?
2Erik Jenner
Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because they're more confident in their evaluations of those), the maximally approved answers might be bad. This is similar to the truths humans can't be convinced of you mention, but with the difference that it's just a matter of how convinced humans are by different answers.

We could then be in a situation where both filtering and RLHF suffer a lot from Goodhart's law, and while RLHF might technically be even worse, the difference wouldn't matter in practice since we'd need a solution to the fundamental problem anyway.

I feel like the key question here is how much selection pressure we apply. My sense is that for sufficient amounts of selection pressure, we do quite plausibly run into extremal Goodhart problems like that. But it also seems plausible we wouldn't need to select that hard (e.g. we don't need the single most compelling answer), in which case I agree with what you said.

Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world.

Humans could then look at what happens in the simulations and say "gee, that doesn't look good," and specify better goals instead, and the policy won't experience gradient pressure to make those evaluations systematically wrong.
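A deliberately trivial, self-contained toy of that separation (everything here is made up for illustration): the optimisation signal is the formal goal evaluated in simulation, and human judgment enters only by revising the goal between runs, never as a training signal.

```python
import random

# Toy "world model": a 1-D random walk whose drift is the policy parameter.
def rollout(step_up_prob, steps=20, seed=0):
    rng, pos, traj = random.Random(seed), 0, []
    for _ in range(steps):
        pos += 1 if rng.random() < step_up_prob else -1
        traj.append(pos)
    return traj

# Intelligible, formally specified goal: reach position 5 without exceeding 8.
def formal_goal(traj):
    return max(traj) >= 5 and max(traj) <= 8

def train(goal):
    # Optimise the policy against the formal goal, evaluated purely in simulation.
    candidates = [i / 10 for i in range(11)]
    scores = {p: sum(goal(rollout(p, seed=s)) for s in range(200)) for p in candidates}
    return max(scores, key=scores.get)

best = train(formal_goal)
print(best, rollout(best, seed=123))
# Human review happens on the simulated trajectories above; if they "don't look good",
# the humans revise formal_goal and rerun train(). The policy's score never depends on
# the human judgment itself, so there is no pressure on the policy to make that
# judgment systematically wrong.
```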

This isn't the place where I want to make a case for the "competitiveness" or tractability of that kind of approach, but what I want to claim he... (read more)

4Robert Kirk
Thanks for the answer! I feel uncertain whether that suggestion is an "alignment" paradigm/method though - either these formally specified goals don't cover most of the things we care about, in which case this doesn't seem that useful, or they do, in which case I'm pretty uncertain how we can formally specify them - that's kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it's further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.

Here's my steelman of this argument:

  1. There is some quantity called a "level of performance".
  2. A certain level of performance is necessary to assist humans in ending the acute risk period.
  3. A higher level of performance is necessary for a treacherous turn.
  4. Any given alignment strategy A is associated with a factor λ_A, such that it can convert an unaligned model with performance P into an aligned model with performance λ_A·P.
  5. The maximum achievable performance of unaligned models increases somewhat gradually as a function of time.
... (read more)
3Rohin Shah
You can weaken the premises a lot:

  1. Instead of having a single "level of performance", you can have different levels of performance for "ending the acute risk period" vs "treacherous turn". This allows you to get rid of the "higher" part of premise 3, which does seem pretty sketchy to me. It also allows you to talk about λ^A factors for particular tasks on particular models rather than for "overall performance" in premise 4, which seems much more realistic (since λ^A could vary for different tasks or different models).
  2. You can get rid of Premise 5 entirely, if you have Premise 6 say "at least as likely" rather than "more likely".

I'd rewrite as:

Premise 1: There is some quantity called a "level of performance at a given task".

Definition 1: Let P_S be the level of performance at whatever you want the model to do (e.g. ending the acute risk period), and P*_S is the level of performance you want.

Definition 2: Let P_T be the level of performance at executing a treacherous turn, and P*_T be the level of performance required to be successful.

Definition 3: Consider a setting where we apply an alignment strategy A to a particular model. Suppose that the model has latent capabilities sufficient to achieve performance (P_S, P_T). Then, the resulting model must have actual capabilities (λ^A_S·P_S, λ^A_T·P_T) for some factors λ^A_S, λ^A_T ∈ [0, ∞]. (If you assume that A can only elicit capabilities rather than creating new capabilities, then you have λ^A_* ≤ 1.)

Lemma 1: Consider two alignment strategies A and B applied to a model M, where you are uncertain about the latent capabilities (P_S, P_T), but you know that λ^A_S > λ^B_S and λ^A_T = λ^B_T. Then, it is at least as likely that λ^A_S·P_S ≥ P*_S ∧ λ^A_T·P_T < P*_T than that λ^B_S·P_S ≥ P*_S ∧ λ^B_T·P_T < P*_T.

Premise 2: λ^RLHF_S > λ^prompt_S.

Premise 3: λ^RLHF_T = λ^prompt_T. This corresponds to Buck's point, which I agree with, here:

Conclusion: RLHF is at least as likely to allow you to [reach the desired level of performance at the task while avoiding a treacherous turn] as prompt eng

Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?

  • IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.

This seems like the key! It’s probably what people actually mean by the question “is RLHF a promising alignment strategy?”

Most of this post is laying out thoughtful reasoning about related but relatively uncontroversial questions like “is RLHF, narrowly construed, plausibly sufficie... (read more)

2Buck Shlegeris
  It isn't my intention to do this kind of motte and bailey; as I said, I think people really do conflate these questions, and I think that the things I said in response to some of these other questions are actually controversial to some. Hopefully people don't come away confused in the way you describe.

I’m not really aware of any compelling alternatives to this class of plan–“training a model based on a reward signal” is basically all of machine learning

I think the actual concern there is about human feedback; you phrased the question as about overseer feedback, but then your answer (quoted) is about any reward signal at all.

Is next-token prediction already “training a model based on a reward signal”? A little bit—there’s a loss function! But is it effectively RL on next-token-prediction reward/feedback? Not really. Next-token prediction, by contrast... (read more)

1Buck Shlegeris
  I think that some people actually have the concern I responded to there, rather than the concern you say that they might have instead. I agree that I conflated between overseer feedback and any reward signal at all; I wondered while writing the post whether this conflation would be a problem. I don't think it affects the situation much but it's reasonable for you to ask me to justify that.
3Robert Kirk
I still don't think you've proposed an alternative to "training a model with human feedback". "maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function" sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is use AI-assisted humans as overseers, then that doesn't seem to be a real difference with what Buck is saying. So even if he actually had written that he's not aware of an alternative to "training a model with human/overseer feedback", I don't think you've refuted that point.