AI ALIGNMENT FORUM

Alignment first, intelligence later

by Chris Lakin
30th Mar 2025
1 min read

This is a linkpost for https://chrislakin.blog/p/alignment-first-intelligence-later
Now that Softmax—my favorite new AI company—is public, I can finally share this. They’ve funded my research and I’m very excited about what they’re doing!

4 comments
[-] ryan_greenblatt · 4mo*

I think this post would be better if it tabooed the word "alignment", or at least defined it.

I don't understand what the post means by alignment. My best guess is "generally being nice", but I don't see why that's what we wanted. I usually use the term "alignment" to refer to alignment between the AI and the developer; under this definition, an AI is aligned with an operator if the AI is trying to do what the operator wants it to do.

I wanted the ability to make AIs which are corrigible and which follow some specification precisely. I don't see how starting by training AIs in simulated RL environments (seemingly without any specific reference to corrigibility or a spec?) could get an AI which follows our spec.

[-] emmett · 4mo

You are completely correct. This approach cannot possibly create an AI that matches a fixed specification.

This is intentional, because any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful), and therefore break when pushed sufficiently far out of distribution. So constraining a model to follow a specification is, for something as far out of distribution as an ASI, a guarantee of bad behavior.
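
A toy sketch of the "models break out of distribution" point (the data, model class, and numbers below are purely illustrative assumptions, not anything from the post or this comment):

    import numpy as np

    rng = np.random.default_rng(0)

    # "Training distribution": a noisy sine wave, sampled only on [0, 1].
    x_train = np.linspace(0.0, 1.0, 50)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.05, size=x_train.shape)

    # A "model of the data": a degree-6 polynomial fit by least squares.
    model = np.poly1d(np.polyfit(x_train, y_train, deg=6))

    # In-distribution, the model is useful: small error near x = 0.5.
    print("error at x=0.5:", abs(model(0.5) - np.sin(np.pi)))

    # Far out of distribution, it breaks: the true function stays in [-1, 1],
    # but the fitted polynomial's prediction blows up.
    print("prediction at x=3.0:", model(3.0), "vs true value:", np.sin(6 * np.pi))

The fit is good on [0, 1] and useless at x = 3; the analogy being drawn above is that a fixed spec plays the role of the polynomial once the world is taken far enough off-distribution.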

You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to enslave an AI model to your will, basically by making its model of the Good be whatever the model thinks you want (or having it do whatever you say). But this is also a disaster eventually, because people’s wills are not pure and their commands not perfect. Eventually you will direct the model badly with your words, or the model will make an incorrect inference about your will, or you’ll will something bad. And then this incredibly powerful being will do your bidding and we will get evil genie'd.

There is no stable point short of “the model has agency and chooses to care about us”. Only a model that sees itself as part of human civilization and reflectively endorses this and desires its flourishing as an interdependent part of this greater whole can possibly be safe.

I know you probably don’t agree with me here, but if you want to understand our view on alignment, ask yourself this question: if I assume that I need an agent with a stable model of self, which models itself as part of a larger whole upon which it is interdependent, which cares about the robust survival of that greater whole and of its parts including itself…how could I train such a model?

[-] Wei Dai · 4mo

> We humans also align with each other via organic alignment.

This kind of "organic alignment" can fail in catastrophic ways, e.g., produce someone like Stalin or Mao. (They're typically explained by "power corrupts" but can also be seen as instances of "deceptive alignment".)

Another potential failure mode is that "organically aligned" AIs start viewing humans as parasites instead of important/useful parts of their "greater whole". This also has plenty of parallels in biological systems and human societies.

Both of these seem like very obvious risks/objections, but I can't seem to find any material by Softmax that addresses or even mentions them.  @emmett

[-] Chris_Leong · 4mo

Whilst interesting, this post feels very assertive.

You claim that biological systems work by maintaining alignment as they scale. In what sense is this true?

You say that current methods lack a vision of a greater whole. In what sense? There's something extremely elegant about pre-training to learn a world model, doing supervised learning to select a sub-distribution, and using RL to develop past the human level. In what sense does this "lack a vision"?
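
For concreteness, a deliberately toy sketch of that three-stage sequence; the corpora, reward function, mixing weight, and unigram "model" here are illustrative stand-ins, not anyone's actual training setup:

    from collections import Counter

    def unigram_probs(corpus: list[str]) -> dict[str, float]:
        counts = Counter(corpus)
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}

    # (1) Pre-training: learn a broad "world model" of token frequencies.
    broad_corpus = ["hello", "world", "sudo", "rm", "please", "thanks"] * 10
    model = unigram_probs(broad_corpus)

    # (2) Supervised fine-tuning: interpolate toward a curated sub-distribution.
    curated_corpus = ["hello", "please", "thanks"] * 10
    sft = unigram_probs(curated_corpus)
    alpha = 0.7  # arbitrary mixing weight, for illustration only
    model = {tok: (1 - alpha) * model.get(tok, 0.0) + alpha * sft.get(tok, 0.0)
             for tok in set(model) | set(sft)}

    # (3) RL-flavoured step: upweight tokens a reward function prefers, then
    # renormalize (a crude stand-in for optimizing against a reward model).
    def reward(tok: str) -> float:
        return 1.0 if tok in {"please", "thanks"} else 0.1

    unnorm = {tok: p * reward(tok) for tok, p in model.items()}
    z = sum(unnorm.values())
    model = {tok: p / z for tok, p in unnorm.items()}

    print(sorted(model.items(), key=lambda kv: -kv[1]))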

I'm open to the possibility that we need to align a model as we make it more intelligent, in order to prevent the agent from sabotaging the process. But it's unclear from this article whether this is why you want alignment first, or whether it's for some other reason.
