All of mattmacdermott's Comments + Replies

Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.

Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.

5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).

5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that remove

... (read more)
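One way to sketch the kind of score function 5.1.4 gestures at (my own notation; the task term $U_{\text{task}}$, the weight $\lambda$, and the divergence $d$ are illustrative stand-ins rather than anything specified in the post):

$$\text{score}(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\big[U_{\text{task}}(\tau)\big] \;-\; \lambda\,\mathbb{E}_{\tau \sim \pi}\big[d(\tau,\ \tau_{\text{status quo}})\big]$$

where $\tau_{\text{status quo}}$ is the trajectory the world would follow if no powerful AI system acted, and $d$ is some divergence between world-state trajectories. The 5.1.5 worry is that a planner can game this penalty by seizing control of the "irrelevant" variables and pinning them to its own prediction of the status quo; evaluating the score pessimistically over a whole set of candidate dynamics (which is roughly what the infrabayesian move amounts to, as I understand it) is the component meant to remove that incentive.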
davidad (David A. Dalrymple)
Nice, thanks for the pointer!

Nice, I'd read the first but didn't realise there were more. I'll digest later.

I think agents vs optimisation is definitely reality-carving, but I'm not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards certain states, whereas an agent tries to move the world towards certain states, i.e. chooses actions based on how much they move the world towards those states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it's not obviou... (read more)
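To make the "weighting over states" idea concrete, here is a toy sketch (entirely my own illustration, with made-up states and utilities): the optimisation process is just dynamics that drift towards a state, while the agent explicitly scores successor states with a utility function and picks actions accordingly, which is the thing a utility function or preference ordering quantifies.

```python
import random

# Toy contrast, purely illustrative: an "optimisation process" is dynamics
# that happen to drift towards some state without representing preferences,
# while an "agent" chooses actions by scoring where they lead with an
# explicit utility function -- an explicit weighting over states.

STATES = ["A", "B", "C"]

def optimisation_process(state):
    """Dynamics that tend towards 'C' without any internal notion of preference."""
    return "C" if random.random() < 0.8 else state

def transition(state, action):
    """Deterministic toy environment: move left/right along the state list."""
    i = STATES.index(state)
    if action == "left":
        return STATES[max(i - 1, 0)]
    if action == "right":
        return STATES[min(i + 1, len(STATES) - 1)]
    return state

def agent_step(state, utility):
    """Pick the action whose successor state gets the highest utility weight."""
    return max(["left", "stay", "right"], key=lambda a: utility[transition(state, a)])

utility = {"A": 0.0, "B": 0.2, "C": 1.0}  # the explicit weighting over states
print(agent_step("B", utility))  # -> "right" (towards the most-weighted state)
```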

Nice, thanks. It seems like the distinction the authors make between 'building agents from the ground up' and 'understanding their behaviour and predicting roughly what they will do' maps to the distinction I'm making, but I'm not convinced by the claim that the second one is a much stronger version of the first.

The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built ... (read more)

I've been thinking about whether these results could be interpreted pretty differently under different branding.

The current framing, if I understand it correctly, is something like, 'Power-seeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of power-seeking. Therefore we should expect RL agents to seek power, which is bad.'

An alternative framing would be, 'Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we sho... (read more)

Alex Turner
Consider an agent navigating a tree MDP, with utility on the leaf nodes. At any internal node in the tree, ~most utility functions will have the agent retain options by going towards the branch with the most leaves. But all policies use up all available options -- they navigate to a leaf with no more power.

I agree that we shouldn't update too hard for other reasons. E.g. this post's focus on optimal policies seems bad because reward is not the optimization target.
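A quick numerical sanity check of that intuition (my own toy setup, not from the paper): a root with one branch leading to 3 leaves and another leading to 7, with iid uniform utility on the leaves. The optimal first move goes towards whichever branch contains the best leaf, so the bigger branch should be preferred for roughly 70% of sampled utility functions.

```python
import random

# Root node with two branches: n_left leaves on one side, n_right on the other.
# For a utility function sampled iid uniform over leaves, the optimal policy's
# first move is towards the branch containing the highest-utility leaf.
def optimal_first_move(n_left=3, n_right=7):
    leaves = [random.random() for _ in range(n_left + n_right)]
    return "right" if leaves.index(max(leaves)) >= n_left else "left"

samples = 10_000
right_wins = sum(optimal_first_move() == "right" for _ in range(samples))
print(f"bigger branch optimal for {right_wins / samples:.1%} of sampled utilities")
# Typically ~70%: most utility functions initially retain options by heading
# towards the branch with more leaves, even though every optimal policy
# eventually ends up at a leaf with no options left.
```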