...While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party.
In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a s
As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:
In that there is an unsta... (read more)
Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative A... (read more)
I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:
... (read more)