All of Maxwell Clarke's Comments + Replies

This is a good post - it definitely shows that these concepts are confused. In a sense, both examples are failures of both inner and outer alignment:

  • Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
  • The model develops within the possibilities allowed by the under-specified goal and ends up with behaviours misaligned with the goal we intended - a failure of inner alignment.

Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.

Hey, I wanted to chip in here because the comments are disappointingly negative.

I think your paper and this post are extremely good work. They won't push forward the all-things-considered viewpoint, but they surely push forward the lower-bound (or adversarial) viewpoint. Also, because Open Phil and Future Fund use some fraction of lower-end risk in their estimates, this should hopefully wipe that out. Together they lay out the classic x-risk arguments much more rigorously.

I think that getting the prior work peer reviewed is also a massive win at least in a ...

michaelcohen
Thank you very much for saying that. I was feeling disappointed about the lack of positive comments, and I realized recently I should probably go around commenting on posts that I think are good, since right now, I mostly only comment on posts when I feel I have an important disagreement. So it's hard to complain when I'm on the receiving end of that dynamic.

Hey - I recommend looking at this paper: https://arxiv.org/abs/1807.07306

It shows a more elegant way than KL regularization to bound the bit-rate of an auto-encoder bottleneck. This can be used to find the representations that are most important at a given level of information.
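
To make the comparison concrete, here is a minimal sketch of the KL-regularization baseline I'm contrasting it with: a VAE-style bottleneck where the KL term upper-bounds the information carried by the code. The layer sizes and the beta weight are illustrative assumptions, not the paper's method.

    import torch
    import torch.nn as nn

    class KLBottleneck(nn.Module):
        """VAE-style bottleneck: KL(q(z|x) || N(0, I)) upper-bounds the
        bit-rate of the code z."""
        def __init__(self, d_in=784, d_z=16):
            super().__init__()
            self.enc = nn.Linear(d_in, 2 * d_z)   # outputs mean and log-variance
            self.dec = nn.Linear(d_z, d_in)

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            # KL to the isotropic Gaussian prior, in nats per example
            kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)
            return self.dec(z), kl

    model = KLBottleneck()
    x = torch.randn(32, 784)
    recon, kl = model(x)
    beta = 1.0   # raising beta tightens the effective bit-rate of the bottleneck
    loss = ((recon - x) ** 2).sum(-1).mean() + beta * kl.mean()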

I think we can get additional information from the topological representation. We can look at the relationships between the different level sets under different cumulative probabilities, although this requires evaluating the model over the whole dataset.
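
As a rough sketch of the kind of computation I have in mind (my reading: super-level sets of the model density that contain a given probability mass, estimated from log-densities evaluated over the whole dataset; the numbers below are illustrative):

    import numpy as np

    def level_set_threshold(log_probs, mass):
        """Density threshold whose super-level set covers `mass` of the data,
        estimated from log-densities evaluated over the whole dataset."""
        order = np.sort(log_probs)[::-1]          # highest density first
        cutoff = int(np.ceil(mass * len(order))) - 1
        return order[cutoff]

    # Stand-in for model.log_prob(dataset); in practice this is the expensive part.
    log_probs = np.random.randn(10_000)
    thresholds = {m: level_set_threshold(log_probs, m) for m in (0.5, 0.9, 0.99)}
    # Comparing which points fall inside each threshold shows how the level
    # sets nest or split as the cumulative probability changes.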

Let's say we've trained a continuous normalizing flow model (these models are equivalent to ordinary differential equations). These kinds of models require that the input and output dimensionality are the same, but we can narrow the model as the depth increases by directing many of those dimensions to isotropic Gaus...
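
A minimal sketch of the kind of model I mean, assuming PyTorch plus the torchdiffeq package; the dynamics network, the dimensions, and the split at the end are illustrative, and a real CNF would also track the log-determinant of the flow:

    import torch
    import torch.nn as nn
    from torchdiffeq import odeint   # assumes torchdiffeq is installed

    class Dynamics(nn.Module):
        """Velocity field f(t, x); a flow defined by dx/dt = f(t, x) has to keep
        the input and output dimensionality equal."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, dim))

        def forward(self, t, x):
            return self.net(x)

    dim, keep = 8, 2
    f = Dynamics(dim)
    x0 = torch.randn(32, dim)                          # batch of inputs
    xT = odeint(f, x0, torch.tensor([0.0, 1.0]))[-1]   # same dimensionality out
    # "Narrowing with depth": train so the trailing dimensions end up as
    # isotropic Gaussian noise, leaving only the first `keep` dimensions
    # carrying data-specific structure.
    signal, noise = xT[:, :keep], xT[:, keep:]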

(Edited a lot from when originally posted)

(For more info on consistency see the diagram here: https://jepsen.io/consistency )

I think that the prompt to think about partially ordered time naturally leads one to think about consistency levels - but when thinking about agency, it makes more sense to just think about DAGs of events, not reads and writes. Low-level reality doesn't really have anything that looks like key-value memory. (Although maybe brains do?) And I think there's no maintaining of invariants in low-level reality, just cause and effect...
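
To make the contrast concrete, a minimal sketch of the events-as-a-DAG picture (the event names and edges are made up):

    # Events partially ordered by causal edges: "a" -> "b" means "a can affect b".
    edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

    def causal_future(start, graph):
        """All events reachable from `start`, i.e. strictly after it in the partial order."""
        seen, stack = set(), [start]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # "b" and "c" are incomparable: neither lies in the other's causal future,
    # so there is no single total order of events, just cause and effect.
    print(causal_future("b", edges), causal_future("c", edges))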