Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
My (low confidence) understanding of the proposal is something like:
"The AI takes an action A if and only if {long-term future-self if the AI takes action a | a in the action space} on aggregate like A"
where "long-term future self" is defined by some recursive process where you locally choose what entity counts as your near-term future self (it can be some other entity that you trust more - e.g. a future aligned AI), where these future selves all have access to an AI that honestly answers questions that are already meaningful to the human when the right an...
We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random halfspace intersections, random #3-SAT and random permanents. In this post, we will give a high-level introduction to these methods before sharing some more detailed notes. This is intended as an interim technical update and will be relatively light on motivation: for a broader discussion of this line of research, see our prior post.
All of the problems discussed in this post can be thought of particular choices of "architecture"
Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, Turntrout argues we should focus more on studying how human intelligence implements alignment properties, as it is a real "existence proof" of aligned intelligence.
See here for more on the background claim that RL algorithms encourage CDT reward-maximizing behavior on the training distribution.