kave

Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements I can't leak much metadata about (like who they are with). I have made no non-disparagement agreements.

Posts

Sorted by New

Wiki Contributions

Comments

Sorted by
kave32

I don't think this distinction is robust enough to rely on as much of a defensive property. I think it's probably not that hard to think "I probably would have tried something in direction X, or direction Y", and then gather lots of bits about how well the clusters X and Y work.

kave22

I'm curious about Claude's coauthorship of this post. How did it help?

kave21

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R? If it can tell that, how come it can't guess it specifically would have produced A?

kave52

It seems like there's a general principle here, that it's hard to use pure empiricism to bound behaviour over large input and action spaces. You either need to design the behaviour, or understand it mechanistically.

kave112

ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".

I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.

kave917

It is especially frustrating when I hear junior people interchange "AI Safety" and "AI Alignment." These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the "Alignment Forum" does not help with this confusion)

One issue is there's also a difference between "AI X-Safety" and "AI Safety". It's very natural for people working on all kinds of safety from and with AI systems to call their field "AI safety", so it seems a bit doomed to try and have that term refer to x-safety.

kave62

Why do you vehemently disagree?

kave40

Curated. It's nice to see a return to the problems of yore, and I think this is a nice incremental proposal. Bringing in causal counterfactuals seems like a neat trick (with lots of problems, as discussed in the post and the comments), and so does bringing in some bargaining theory.

 

I have lots of confusions and questions, like

so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting

doesn't make sense to me yet, as it seems easy for the utility functions / belief states to all prefer killing humans quickly, even if the humans don't affect the shutdown button exactly. Or the aside on bargaining with non-causally-counterfacting agents. But they're confusions and questions that afford some mulling, which is pretty cool!

kave10

Yep, I noted you said "update as if" rather than "update that". I also expect this will make it pretty hard to say for sure which of us was right, because it's pretty hard to tell if someone updated as if X vs updated that X.

kave32

I think that predictably, people will update as if they saw actual deceptive alignment

Thanks for predicting this! I'll go on the record as predicting not-this. Look forward to us getting some data (though it may be a little muddied by the fact that you've already publically pushed back, making people less likely to make that mistake). 

Load More