User Comment Replies — AI Alignment Forum

AI as a science, and three obstacles to alignment strategies

As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:

on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

In that there is an unsta... (read more)

Thomas Kwa1y*83

This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative A... (read more)

Alex Turner1y87

The fact that constitutional AI works at all, that we can point at these abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.
"It understands but it doesn't care!"
There is this bizarre motte-and-bailey people seem to do around this subject.

I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:

Getting a sh

... (read more)

Mysteries of mode collapse

John David Pressman2y112

While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party.
In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a s

... (read more)

AI ALIGNMENT FORUM
AF

All of jdp's Comments + Replies