FireStormOOO


WRT non-manipulation, I don't suppose there's an easy way to have the AI track how much potentially manipulative influence it's "supposed to have" in a given context, and avoid exercising more than that?
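As a toy sketch of what tracking an influence budget might even look like (every name here, `InfluenceBudget` included, is a hypothetical placeholder, and the per-action influence estimates stand in for a quantity nobody currently knows how to measure):

```python
# Toy sketch only: assumes we had some way to estimate, in consistent units,
# how much a candidate action would shift the principal's values.

from dataclasses import dataclass


@dataclass
class InfluenceBudget:
    """Tracks how much manipulative influence the AI is 'supposed to have'."""
    allowed: float       # influence appropriate to this context/instruction
    spent: float = 0.0   # influence already exercised

    def permits(self, estimated_influence: float) -> bool:
        return self.spent + estimated_influence <= self.allowed

    def record(self, estimated_influence: float) -> None:
        if not self.permits(estimated_influence):
            raise PermissionError("action would exceed the influence budget")
        self.spent += estimated_influence


# Before each action, check the budget and skip (or flag) anything over it.
budget = InfluenceBudget(allowed=1.0)
for estimated_influence in (0.2, 0.3, 0.9):  # stand-in estimates per action
    if budget.permits(estimated_influence):
        budget.record(estimated_influence)
    else:
        print("skipping: would exceed the influence budget")
```

The hard part is obviously the estimation step, which the sketch just assumes away.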

Or, possibly better: compare simple implementations of the principal's instructions, and penalize interpretations that have a large or unusual influence on the principal's values. Preferably without prejudicing interventions that straightforwardly protect the principal's safety and communication channels.

The principal should, for example, be able to ask the AI to "teach me about philosophy" without it either going out of its way to ensure the principal doesn't change their mind about anything as a result of the instruction, or unduly influencing them with subtly chosen explanations or framing. The AI should exercise an "ordinary" amount of influence, typical of the ways an AI could go about implementing the instruction.

Presumably there's a distribution over how manipulative or anti-manipulative (value-preserving) any given implementation of the instruction is, and we may want the AI to prefer central implementations rather than extremely value-preserving ones.

Ideally the AI should also worry that it's contemplating exercising more or less influence than desired, and clarify that with the principal as it would any other ambiguous aspect of the task.
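To make "prefer central implementations" concrete, here's a minimal sketch. The `influence_score` function is a pure assumption (some measure of how strongly a candidate implementation would shift the principal's values), as is the idea that candidate implementations can be enumerated and scored at all:

```python
# Toy sketch: given candidate implementations of an instruction, prefer the
# *central* one by influence score rather than the minimum-influence one,
# and ask for clarification when the candidates disagree too much.

import statistics
from typing import Callable, Sequence


class AmbiguousInstruction(Exception):
    """Raised when implementations imply very different amounts of influence."""


def pick_central_implementation(
    candidates: Sequence[str],
    influence_score: Callable[[str], float],  # hypothetical; doesn't exist
    spread_threshold: float = 1.0,
) -> str:
    scores = {c: influence_score(c) for c in candidates}
    median = statistics.median(scores.values())
    # A wide spread means the instruction is ambiguous about how much
    # influence is appropriate: clarify with the principal, don't guess.
    if statistics.pstdev(scores.values()) > spread_threshold:
        raise AmbiguousInstruction("candidates imply very different influence")
    # Pick the candidate nearest the median: an "ordinary" amount of
    # influence, neither maximally manipulative nor extremely value-preserving.
    return min(candidates, key=lambda c: abs(scores[c] - median))
```

The design choice is to treat both tails as suspect: an extremely value-preserving implementation gets passed over the same way an extremely manipulative one does, and a wide spread triggers clarification rather than a unilateral guess.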

That does seem worth looking at, and there are probably ideas worth stealing from biology. I'm not sure you can call that a robustly aligned system being bootstrapped, though. Existing in a society of (roughly) peers, where there's no huge power disparity between any given person and the rest of humanity, is analogous to the AGI that can't take over the world yet. Humans who acquire significant power do not seem aligned with respect to what a typical person would profess to, and outwardly seem to, care about.

I think your point still mostly follows despite that: even though humans can be deceptive and power-seeking, there's an astounding amount of regularity in what we end up caring about.