In §3.1–3.3, you look at the main known ways that altruism between humans has evolved — direct and indirect reciprocity, as well as kin and group selection[1] — and ask whether we expect such altruism from AI towards humans to be similarly adaptive.
However, as observed in R. Joyce (2007). The Evolution of Morality (p. 5),
Evolutionary psychology does not claim that observable human behavior is adaptive, but rather that it is produced by psychological mechanisms that are adaptations. The output of an adaptation need not be adaptive.
This is a subtle distinction...
A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:
...Reward-seeking reinforcement-learning agents can in some instances serve as models of utility-maximizing, self-modifying agents, but in current practice, RL systems are typically distinct from the agents they produce … In multi-task RL systems, for example, RL “rewards” serve not as sources of value to agents, but as signals that guide training …
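To make that concrete, here is a minimal sketch of my own (a two-armed bandit trained with a REINFORCE-style update, not anything from Drexler's report): reward appears only inside the training loop, as a signal that nudges the parameters, and the artifact that comes out the other end is just a parameter vector with no representation of reward in it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit: the "environment" exists only at training time.
TRUE_MEANS = np.array([0.2, 0.8])   # expected reward of each arm

# The produced "agent" is nothing but this parameter vector (softmax policy).
theta = np.zeros(2)

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Training: reward enters only here, as a signal that shapes theta.
for _ in range(2000):
    p = policy(theta)
    a = rng.choice(2, p=p)
    r = rng.normal(TRUE_MEANS[a], 0.1)   # sampled reward
    grad_log_p = np.eye(2)[a] - p        # grad of log pi(a|theta) for a softmax policy
    theta += 0.05 * r * grad_log_p       # REINFORCE update

# "Deployment": the trained policy neither computes nor stores reward;
# it is just a mapping to action probabilities.
print("learned action probabilities:", policy(theta))
```

The trained object is just `theta`; whether it deserves to be called "reward-seeking" is a question about what behavior training happened to instill, not about any reward term it carries around.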
Ever since the discovery that the mammalian dopamine system implements temporal-difference learning via reward prediction errors, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?
Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?
A simple reflection shows that happiness...
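For readers who want the quantity in question spelled out, here is a minimal tabular TD(0) sketch (my own toy chain, nothing specific to the dopamine literature) showing the reward prediction error delta = r + gamma·V(s') − V(s) as the transient training signal it is:

```python
import numpy as np

# Tabular TD(0) on a 3-state chain with a single reward at the end.
# `delta` below is the reward prediction error in question.
gamma, alpha = 0.9, 0.1
V = np.zeros(4)              # values for states 0..2 plus a terminal state 3
rewards = [0.0, 0.0, 1.0]    # reward received on leaving each state

for episode in range(200):
    for s in range(3):
        delta = rewards[s] + gamma * V[s + 1] - V[s]   # reward prediction error
        V[s] += alpha * delta                          # value update driven by delta
    # As V converges, delta shrinks toward zero even though the same
    # reward keeps arriving every episode.

print("learned values:", V[:3])   # approaches [0.81, 0.9, 1.0]
```

That vanishing-once-predicted behavior, while reward itself keeps flowing, is one reason the correspondence asked about above is not obvious.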
I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.
In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to...
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
Optimizing for the outcome metric alone on some training distribution, without any insight into the process producing that outcome, runs the risk that the system won’t behave as desired when out-of-distribution. This is probably a serious concern to the system maintainers, even ignoring (largely externalized) X-risks.
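As a self-contained toy illustration of that failure mode (my own spurious-correlation setup, not anything from the paper): a model fit purely to the in-distribution outcome happily leans on a feature that merely co-occurs with the label during training, and degrades once that correlation goes away.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, spurious_corr):
    # The true cause of the label is x1; x2 is a spurious feature that
    # tracks the label with probability `spurious_corr` in this distribution.
    y = rng.integers(0, 2, n)
    x1 = y + 0.5 * rng.normal(size=n)
    x2 = np.where(rng.random(n) < spurious_corr, y, 1 - y) + 0.1 * rng.normal(size=n)
    return np.column_stack([x1, x2]), y

def fit_linear(X, y):
    # Plain least-squares classifier: optimizes the training outcome only.
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def accuracy(w, X, y):
    Xb = np.column_stack([X, np.ones(len(X))])
    return np.mean((Xb @ w > 0.5) == y)

X_tr, y_tr = make_data(5000, spurious_corr=0.95)   # training distribution
X_te, y_te = make_data(5000, spurious_corr=0.05)   # shifted distribution
w = fit_linear(X_tr, y_tr)
print("in-distribution accuracy:    ", accuracy(w, X_tr, y_tr))
print("out-of-distribution accuracy:", accuracy(w, X_te, y_te))
```

Nothing in the fitted weights distinguishes the causal feature from the spurious one; only insight into the process (or evaluation off-distribution) reveals the difference.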
[Edit Jan 19, 2023: I no longer think the below is accurate. My argument rests on an unstated assumption: that when weight decay kicks in, the counter-pressure against it is stronger for the 101st weight (the "bias/generalizer") than for the other weights (the "memorizers"), since the gradient is stronger in that direction. In fact, this mostly isn't true, for the same reason Adam(W) moved towards the solution to begin with before weight decay strongly kicked in: each dimension of the gradient is normalized relative to its typical magnitudes i...
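To spell out that last point, here is the standard Adam update written out by hand (a toy with a constant per-coordinate gradient, not the actual setup discussed above): once the moment estimates warm up, each coordinate's step converges to lr · sign(gradient), so a 100x larger raw gradient does not buy a 100x larger step.

```python
import numpy as np

# Adam normalizes each coordinate's step by a running RMS of that
# coordinate's own gradients, so a 100x larger raw gradient does not
# translate into a 100x larger step once the moment estimates warm up.
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8
m = np.zeros(2)
v = np.zeros(2)
theta = np.zeros(2)

for t in range(1, 1001):
    g = np.array([100.0, 1.0])            # coordinate 0's gradient is 100x larger
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta -= step

print("per-coordinate step size:", step)   # ~1e-3 in both coordinates
print("total movement:", theta)            # roughly equal despite the 100x gradients
```

This is the sense in which the gradient being "stronger" in the generalizing direction does not, by itself, give that weight extra protection against weight decay under Adam(W).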
Thanks for doing this! I think a lot of people would be very interested in the debate transcripts if you posted them on GitHub or something.