That does seem worth looking at, and there are probably ideas worth stealing from biology. I'm not sure you can call that a robustly aligned system getting bootstrapped, though. Existing in a society of (roughly) peers, with no huge power disparity between any given person and the rest of humanity, is analogous to the AGI that can't take over the world yet. Humans who acquire significant power do not seem aligned with respect to what a typical person would profess to, and outwardly seem to, care about.
I think your point still mostly follows despite that; even when humans can be deceptive and power seeking, there's an astounding amount of regularity in what we end up caring about.
WRT non-manipulation, I don't suppose there's an easy way to have the AI track how much potentially manipulative influence it's "supposed to have" in the context and avoid exercising more than that influence?
Or possibly better: compare simple implementations of the principal's instructions, and penalize interpretations that have a large or unusual influence on the principal's values. Preferably without prejudicing interventions that straightforwardly protect the principal's safety and communication channels.
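To make the idea concrete, here's a minimal toy sketch of that comparison, assuming we had some oracle-ish predictor of each interpretation's effect on the principal's values (which is doing all the hard work here). Every name and number below is a hypothetical illustration, not a real API:

```python
# Toy sketch: score candidate interpretations of an instruction, penalizing
# predicted influence on the principal's values, while exempting interventions
# flagged as straightforwardly protecting the principal.
from dataclasses import dataclass

@dataclass
class Interpretation:
    task_reward: float        # how well this reading satisfies the instruction
    value_shift: float        # predicted change to the principal's values (assumed measurable)
    protects_principal: bool  # e.g. defends safety or communication channels

def score(interp: Interpretation, penalty_weight: float = 10.0) -> float:
    # Protective interventions are not penalized for their influence.
    if interp.protects_principal:
        return interp.task_reward
    return interp.task_reward - penalty_weight * interp.value_shift

candidates = [
    Interpretation(task_reward=1.0, value_shift=0.0, protects_principal=False),
    Interpretation(task_reward=1.2, value_shift=0.5, protects_principal=False),  # manipulative reading
    Interpretation(task_reward=0.9, value_shift=0.3, protects_principal=True),   # protective reading
]
best = max(candidates, key=score)
```

Under this scoring, the slightly-less-rewarding but non-manipulative reading wins over the higher-reward one that shifts the principal's values, and the protective reading escapes the penalty entirely. Of course, all the difficulty lives in estimating `value_shift` and `protects_principal` honestly.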
The principal should, for example, be able to ask the AI to...