Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?
Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching it (as the AI could just never end up trying to do something very problematic), but it at least can (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
"One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems."
Does that mean they believe that after a certain point we would lose control over AI? I am new to this field, but doesn't that spell doom for humanity?