All of kave's Comments + Replies

kave*122

You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.

evaluating AI outputs during training, labeling neurons in the context of mechanistic interpretability, monitoring AI chains of thought for reward-hacking behaviors, identifying which transcripts in an experiment contain alignment-faking behaviors, classifying problematic inputs and outputs for the purpos

... (read more)
kave*30

I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots.

To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help t... (read more)

kave44

I expect moderately sized warning shots to increase the chances humanity as a whole takes serious actions and, for example, steps up efforts to align the frontier labs.

It seems naïvely evil to knowingly let the world walk into a medium-sized catastrophe. To be clear, I think that sometimes it is probably evil to stop the world from walking into a catastrophe, if you think that increases the risk of bad things like extinctions. But I think the prior of not diagonalising against others (and of not giving yourself rope with which to trick yourself) is strong.

1Jan_Kulveit
The quote is somewhat out of context. Imagine a river with some distribution of flood sizes. Imagine this proposed improvement: a dam which is able to contain 1-year, 5-year and 10-year floods. It is too small for 50-year floods or larger, and may even burst and make the flood worse. I think such a device is not an improvement, and may make things much worse - because of the perceived safety, people may build houses close to the river, and when the large flood hits, the damages could be larger. I have a hard time parsing what you want to say relative to my post. I'm not advocating for people to deliberately create warning shots.
kave32

I think you train Claude 3.7 to imitate the paraphrased scratchpad, but I'm a little unsure because you say "distill". Just checking that Claude 3.7 still produces CoT (in the style of the paraphrase) after training, rather than being trained to perform the paraphrased-CoT reasoning in one step?

4Fabien Roger
By distillation, I mean training to imitate. So in the distill-from-paraphrased setting, the only model involved at evaluation time is the base model fine-tuned on paraphrased scratchpads, and it generates an answer from beginning to end.
kave62

I think TLW's criticism is important, and I don't think your responses are sufficient. I also think the original example is confusing; I've met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.

Here is my attempt to expand your argument.

We're trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we've given up on that (for example, radio shielding might not be practical or... (read more)

4johnswentworth
I think that's basically right, and good job explaining it clearly and compactly. I would also highlight that it's not just about adversaries. One of the main powers of proof-given-assumptions is that it allows one to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows one to detect previously-unknown unknowns.

I think 2023 was perhaps the peak for discussing the idea that neural networks have surprisingly simple representations of human concepts. This was the year of Steering GPT-2-XL by adding an activation vector, cheese vectors, and the slightly weird lie detection paper, and was just after Contrast-consistent search.

This is a pretty exciting idea, because if it’s easy to find human concepts we want (or don’t want) networks to possess, then we can maybe use that to increase the chance that systems are honest, kind, loving (and can ask them... (read more)

kave4-1

I don't think this distinction is robust enough to rely on as much of a defensive property. I think it's probably not that hard to think "I probably would have tried something in direction X, or direction Y", and then gather lots of bits about how well the clusters X and Y work.

kave22

I'm curious about Claude's coauthorship of this post. How did it help?

5Buck Shlegeris
For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me)
kave21

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

4Buck Shlegeris
Yep. I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").
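To make the bit-counting intuition above concrete, here is a minimal sketch. It assumes, purely for illustration, that the model's possible actions are a finite set of equally likely candidates; the function names and the numbers used are hypothetical and do not come from the post.

```python
import math


def bits_to_identify_action(num_candidate_actions: int) -> float:
    """Bits needed to pin down one action among equally likely candidates.

    Assumption (not from the post): candidates are uniform and finite.
    """
    return math.log2(num_candidate_actions)


def bits_in_yes_no_answer(p_yes: float) -> float:
    """Entropy in bits of a yes/no question answered 'yes' with probability p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))


if __name__ == "__main__":
    # "What action did I take?" among 1024 equally likely candidates.
    print(bits_to_identify_action(1024))
    # "Did I take this action?" carries at most one bit.
    print(bits_in_yes_no_answer(0.5))
```

Under these assumptions, "did I take this action?" never carries more than one bit, while "what action did I take?" grows with the logarithm of the number of candidates. This only counts information; it says nothing about how hard either guess is for a real model.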
kave52

It seems like there's a general principle here, that it's hard to use pure empiricism to bound behaviour over large input and action spaces. You either need to design the behaviour, or understand it mechanistically.

kave112

ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".

I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.
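One way to read the O-ring versus swiss-cheese distinction is as two ways of combining per-component reliabilities. The sketch below is an assumed formalisation, not something stated in the comment: it treats components as independent, takes "O-ring" to mean every component must succeed, and takes "swiss-cheese" to mean a failure gets through only if every layer fails. The probabilities are made up.

```python
from math import prod


def o_ring_success(component_success_probs):
    """Overall success when every component must succeed (independence assumed)."""
    return prod(component_success_probs)


def swiss_cheese_success(layer_success_probs):
    """Overall success when a failure must slip through every layer (independence assumed)."""
    return 1 - prod(1 - p for p in layer_success_probs)


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.9]  # hypothetical per-component reliabilities
    print(o_ring_success(probs))        # the weakest link dominates
    print(swiss_cheese_success(probs))  # redundancy compounds
```

On this reading, an O-ring process rewards concentrating quality in one place, because a single weak component drags the product down. A swiss-cheese process rewards adding one more imperfect layer anywhere.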

kave917

It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)

One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.

kave62

Why do you vehemently disagree?

kave40

Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.

 

I have lots of confusions and questions, like

so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenti

... (read more)
kave*10

Yep, I noted you said "update as if" rather than "update that". I also expect this will make it pretty hard to say for sure which of us was right, because it's pretty hard to tell if someone updated as if X vs updated that X.

kave32

I think that predictably, people will update as if they saw actual deceptive alignment

Thanks for predicting this! I'll go on the record as predicting not-this. Look forward to us getting some data (though it may be a little muddied by the fact that you've already publicly pushed back, making people less likely to make that mistake).

3Alex Turner
See more elaboration here.
kave135

This paper also seems dialectically quite significant. I feel like it's a fairly well-delineated claim that can be digested by mainstream ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to".

phrases like "the evidence suggests that if the current ML systems were trying to deceive us, we wouldn't be able to change them not to".

This feels like a misleading description of the result. I would have said: "the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn't be able to train away the treachery".

Like, the models in this experiment don't clearly spend much time "trying" to deceive except in some very broad implicit sense.

kave22

The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]

Would you be willing to rephrase this as something like

The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]

?

The hope here is to avoid something like "well this system doesn't have autonomous self-replication ability/... (read more)

kave20

Two quick thoughts (that don't engage deeply with this nice post).

  1. I'm worried about some cases where the goal is not consistent across situations. For example, if prompted to pursue some goal, it then does it seriously with convergent instrumental goals.
  2. I think it seems pretty likely that future iterations of transformers will have bits of powerful search in them, but people who seem very worried about that search seem to think that once that search is established enough, gradient descent will cause the internals of the model to be organised mostly arou
... (read more)
kave55

Curated. I feel like over the last few years my visceral timelines have shortened significantly. This is partly from contact with LLMs, particularly their increased coding utility, and a lot downstream of Ajeya's and Daniel's models and outreach (I remember spending an afternoon on an arts-and-crafts 'build your own timeline distribution' that Daniel had nerdsniped me with). I think a lot of people are in a similar position and have been similarly influenced. It's nice to get more details on those models and the differences between them, as well as to hear E... (read more)

kave1413

As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.

What are some examples of work that is most largeness-loaded and most risk-preventing? My understanding is that interpretability work doesn't need large models (though I don't know about things like influence functions). I imagine constitutional AI does. Is that the central example, or are there other pieces that are further in this direction?

kave51

Curated. I am excited about many more distillations and expositions of relevant math on the Alignment Forum. There are a lot of things I like about this post as a distillation:

  • Exercises throughout. They felt like they were simple enough that they helped me internalise definitions without disrupting the flow of reading.
  • Pictures! This post made me start thinking of finite factorisations as hyperrectangles, and histories as dimensions that a property does not extend fully along.
  • Clear links from Finite Factored Sets to Pearl. I think these are roughly the same
... (read more)
2Ben Pace
(for those wondering: kave has been a LWer for many years and works full-time with the lightcone team)
kave53

If you assume the human brain was trained roughly optimally, then requiring more data, at a given parameter number, to be optimal pushes timelines out. If instead you had a specific loss number in mind, then a more efficient scaling law would pull timelines in.
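A rough sketch of the first half of this argument, under stated assumptions: it uses the common approximation that training compute is about 6 × parameters × tokens, and it treats the "scaling law" as nothing more than a fixed optimal tokens-per-parameter ratio. The parameter count and the two ratios below are illustrative placeholders, not estimates from the thread.

```python
def training_flops(num_params: float, tokens_per_param: float) -> float:
    """Approximate training compute, using the rule of thumb C ≈ 6 * N * D.

    N is the parameter count and D = tokens_per_param * N is the token count.
    """
    num_tokens = tokens_per_param * num_params
    return 6.0 * num_params * num_tokens


if __name__ == "__main__":
    brain_scale_params = 1e14  # hypothetical fixed parameter count P

    # Two hypothetical scaling laws that differ only in the optimal data ratio.
    old_law = training_flops(brain_scale_params, tokens_per_param=2.0)
    new_law = training_flops(brain_scale_params, tokens_per_param=20.0)

    # With P held fixed, a more data-hungry optimum needs proportionally more compute.
    print(new_law / old_law)
```

With the parameter count held fixed, required compute scales linearly with the optimal data ratio, which is the sense in which a more data-hungry law pushes timelines out. The second half of the comment (fixing a target loss instead) would need an actual loss formula and is not modelled here.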

2Lauro Langosco
Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model on the optimal scaling, where P is fixed. Thanks!

...though now I'm confused about why we would assume that. Surely that assumption is wrong?

  • Humans are very constrained in terms of brain size and data, so we shouldn't assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
  • Anyhow we don't need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate brain-parameter count.

To move to a more general complaint about the bio anchors paradigm: it never made much sense to assume that current scaling laws would hold; clearly scaling will change once we train on new data modalities; we know that human brains have totally different scaling laws than DL models; and an AGI architecture will again have different scaling laws. Going with the GPT-3 scaling law is a very shaky best guess.

So it seems weird to me to put so much weight on this particular estimate, such that someone figuring out how to scale models much more cheaply would update one in the direction of longer timelines! Surely the bio anchor assumptions cannot possibly be strong enough to outweigh the commonsense update of 'whoa, we can scale much more quickly now'?

The only way that update makes sense is if you actually rely mostly on bio anchors to estimate timelines (rather than taking bio anchors to be a loose prior, and update off the current state and rate of progress in ML), which seems very wrong to me.