User Comment Replies — AI Alignment Forum

Reward Hacking from a Causal Perspective

One definition of manipulation is intentional and covert influence. Content recommenders can satisfy this definition, as they are typically trained to influence the user by any means, including “covert” ones like appealing to the user’s biases and emotions.

I don't think that "covert" is a coherent thing an (e.g.) content recommender could optimize against. For example, everything could appeal to the biases and emotions of the wrong person. Anything can be rude/triggering/bias-inducing to the right person. In which case, how do you classify w... (read more)

Davidad's Bold Plan for Alignment: An In-Depth Explanation

Chipmonk2y00

For the deontic feasibility hypothesis, do you have any expectation for whether formalizing the moral desiderata (specifically: «boundaries») will or should ultimately be done by 1) humans; or 2) automated AI alignment assistants? @davidad

1davidad (David A. Dalrymple)2y

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

AI ALIGNMENT FORUM
AF

All of Chipmonk's Comments + Replies