All of peterbarnett's Comments + Replies

Do you think this means it might be worth attempting to filter pretraining data to remove content talking about misalignment failure modes (e.g., deceptive alignment, clippy, reward hacking, treacherous turns, etc)?

I think ideally we'd have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that's knowledgeable about that stuff, you use the knowledgeable version.

Related: https://docs.google.com/document/d/14M2lcN13R-FQVfvH55DHDuGlhVrnhyOo8O0YvO1dXXM/edit?tab=t.0#heading=h.21w31kpd1gl7

Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais

I think this comment might be more productive if you described why you expect this approach to fail catastrophically when dealing with powerful systems (in a way that doesn't provide adequate warning). Linking to previous writing on this could be good (maybe this comment of yours on debate/scalable oversight).

Joe Collman
Sure, linking to that seems useful, thanks. That said, I'm expecting that the crux isn't [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?]. For something like the latter, it's not clear to me that it's not useful. Mainly my pessimism is about:

* Debate seeming not to address the failure modes I'm worried about - e.g. scheming.
* Expecting [systems insufficiently capable to cause catastrophe] not to radically (>10x) boost the most important research on alignment. (hopefully I'm wrong!)
* As a result, expecting continued strong pressure to make systems more capable, making [understand when a given oversight approach will fail catastrophically] very important.
* No research I'm aware of seeming likely to tell us when debate would fail catastrophically. (I don't think the Future work here seems likely to tell us much about catastrophic failure.)
* No research I'm aware of making a principled case for [it's very unlikely that any dangerous capability could be acquired suddenly]. (I expect such thresholds to be uncommon, but to exist.)
* Seeing no arguments along the lines of [We expect debate to give us clearer red flags than other approaches, and here's why...] or [We expect debate-derived red flags are more likely to lead to a safe response, rather than an insufficiently general fix that leaves core problems unaddressed].
  * This is not to say that no such arguments could exist.
  * I'm very interested in the case that could be made here.

Of course little of this is specific to debate. Nor is it clear to me that debate is worse than alternatives in these respects - I just haven't seen an argument that it's better (on what assumptions; in which contexts). I understand that it's hard to answer the questions I'd want answered. I also expect that working on debate isn't the way to answer them - so I think it's fine to say [I currently expect debate to be a safer approach t…

An example which I'm not sure is likely, but which I think would provide an adequate warning shot, is a model being caught clearly self-exfiltrating its weights. This might happen too late, such that by the time a model could do this, it would have already caused catastrophe in another way. There are versions of this that look more benign and so wouldn't be adequate warning shots; but I think there are scenarios where a model is very clearly taking actions to subvert security measures in a way that is really only described by the model being misaligned and attemp... (read more)

Charbel-Raphael Segerie
Coming back to this comment: we got a few clear examples, and nobody seems to care: "In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn't currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper. This time we caught it. Next time, maybe we won't be able to.

I'm confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it "aligned", and certainly the alignment is not stable (because it almost never takes "good" actions). Although this thing is also not robustly "misaligned" either.

Alex Turner
Fine. I'm happy to assume that, in my hypothetical, we observe that it's always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask "but how is this relevant to future models?"

(I don't mean to dogpile)
I think that "selection" is the correct word here, and that it doesn't smuggle in incorrect connections to evolution.

We could imagine finding a NN that does well according to a loss function by simply randomly initializing many, many NNs, and then keeping the one that does best according to the loss function. I think this process would accurately be described as selection; we are literally selecting the model which does best.

I'm not claiming that SGD does this[1], just giving an example of a method to find a... (read more)
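The process described above (not SGD) could be sketched as best-of-k sampling over random initializations. This is my own illustrative sketch: the names (`select_best_of_k`) and the toy quadratic loss are hypothetical, standing in for "randomly initialize many NNs and keep the lowest-loss one".

```python
import random

def loss(params):
    # Toy stand-in for a loss function: squared distance from an
    # arbitrary target parameter vector.
    target = [0.5, -1.0, 2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def select_best_of_k(k, dim=3, seed=0):
    # "Selection" in the comment's sense: draw k random candidate
    # parameter vectors, evaluate each, keep the single best one.
    # No gradients are involved -- this is not a claim about SGD.
    rng = random.Random(seed)
    candidates = [[rng.uniform(-3, 3) for _ in range(dim)]
                  for _ in range(k)]
    return min(candidates, key=loss)

best = select_best_of_k(k=10_000)
print(round(loss(best), 3))
```

Note that with a fixed seed, increasing k can only improve (or match) the selected candidate's loss, since the larger pool contains the smaller one; that's the sense in which this process "selects for low loss".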

Alex Turner
Actually, I agreed too quickly. Words are not used in a vacuum. Even though this method isn't related to evolution, and even though a naive person might call it "selection" (and have that be descriptively reasonable), that doesn't mean it's best described as "selection." The reason is that the "s-word" has lots of existing evolutionary connotations. And on my understanding, that's the main reason you want to call it "selection" to begin with—in order to make analogical claims about the results of this process compared to the results of evolution. But my whole point is that the analogy is only valid if the two optimization processes (evolution and best-of-k sampling) share the relevant causal mechanisms. So before you start using the s-word and especially before you start using its status as "selection" to support analogies, I want to see that argument first. Else, it should be called something more neutral.
Alex Turner
Sure. But I think that's best described as "best-of-k sampling", which is still better because it avoids implicitly comparing selection-over-learning-setups (i.e. genotypes) with selection-over-parameterizations.  But let's just say I concede this particular method can be non-crazily called "selection." AFAICT I think you're arguing: "There exist ML variants which can be described as 'selection'." But speculation about "selecting for low loss" is not confined to those variants, usually people just lump everything in as that. And I doubt that most folks are on the edge of their seats, ready to revoke the analogy if some paper comes out that convincingly shows that ML diverges from "selecting for low loss"...[1]  1. ^ To be clear, that evidence already exists.

So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?

I think there's a disconnect between the question Scott is asking and the question Eliezer is answering.

I think Scott is asking: "Supposing an AI engineer could create something that was effectively a copy of a human brain, and give it the same training data, could this thing learn the 'don't steal' instinct over the 'don't get caught' instinct?"
Eliezer is answering: "Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the 'don't steal' instinct?"

Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."