Satron

Your idea of “using instruction following AIs to implement a campaign of persuasion” relies (I claim) on the assumption that the people using the instruction-following AIs to persuade others are especially wise and foresighted people, and are thus using their AI powers to spread those habits of wisdom and foresight.

It’s fine to talk about that scenario, and I hope it comes to pass! But in addition to the question of what those wise people should do, if they exist, we should also be concerned about the possibility that the people with instruction-following AIs will not be spreading wisdom and foresight in the first place.

I don't think that whoever is using these AI powers (let's call him Alex) needs to be especially wise, beyond the wisdom of the average person who could get their hands on a powerful AI, which is probably already higher than average.

Alex doesn't need to come up with @Noosphere89's proposed solution of persuasion campaigns all by himself. He merely needs to ask his AI what the best solutions for preventing existential risks are. If Noosphere's proposal is indeed wise, the AI would suggest it, and Alex could then implement it.

Alex doesn't necessarily need to want to spread wisdom and foresight in this scheme. He merely needs to want to prevent existential risks.

Satron

Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?

Satron

Very interesting results, and something that I, unfortunately, was expecting to see as LLMs got better.

Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?

Satron

"One thing I appreciate about Buck/Ryan's comms around AI control is that they explicitly acknowledge that they believe control will fail for sufficiently intelligent systems."

Does that mean they believe that, past a certain level of intelligence, we would lose control over AI? I am new to this field, but doesn't that spell doom for humanity?