User Comment Replies — AI Alignment Forum

ARC's first technical report: Eliciting Latent Knowledge

Great report — I found the argument that ELK is a core challenge for alignment quite intuitive/compelling.

To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:

Suppose we train an augmen

...

Paul Christiano · 3y

I think AZELK is a fine model for many parts of ELK. The baseline approach is to jointly train a system to play Go and answer questions about board states, using human answers (or human feedback). The goal is to get the system to answer questions correctly if it knows the answer, even if humans wouldn't be able to evaluate that answer. Some thoughts on this setup:

* I'm very interested in empirical tests of the baseline and simple modifications (see this post). The ELK writeup is mostly focused on what to do in cases where the baseline fails, but it would be great to (i) check whether that actually happens and (ii) have an empirical model of a hard situation so that we can do applied research rather than just theory.
* There is some subtlety where AZ invokes the policy/value network a bunch of times in order to make a single move. I don't think this is a fundamental complication, so from here on out I'll just talk about ELK for a single value-function invocation. I don't think the problem is very interesting unless the AZ value function itself is much stronger than your humans.
* Many questions about Go can be easily answered with a lot of compute, and for many of these questions there is a plausible, straightforward approach based on debate/amplification. I think this is also interesting to do experiments with, but I'm most worried about the cases where this is not possible (e.g. the ontology identification case, which probably arises in Go but is a bit more subtle).
* If a human doesn't know anything about Go, then AZ may simply not have any latent knowledge that is meaningful to them. In that case we aren't expecting/requiring ELK to do anything at all. So we'd like to focus on cases where the human does understand concepts that they can ask hard questions about. (And ideally they'd have a rich web of concepts so that the question feels analogous to the real-world case, but I think it's interesting as long as they have anything.) We never expect it to walk us through pedag
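To make "jointly train" concrete, here is one possible formalization, offered as a sketch under stated assumptions rather than the objective from the ELK report. Assume a single network $f_\theta$ with a Go head and a question-answering head, a hypothetical weighting coefficient $\lambda > 0$, and a dataset of board states $s$, questions $q$, and human answers $y_h$. Writing $\mathcal{L}_{\text{Go}}$ for the usual AlphaZero policy/value loss and $\ell$ for any per-answer loss (e.g. cross-entropy), the baseline would minimize

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{Go}}(\theta) + \lambda \, \mathbb{E}_{(s,\,q,\,y_h)}\left[\ell\big(f_\theta(s, q),\, y_h\big)\right]$$

Because the question-answering term is only ever scored against $y_h$, it can only be computed on questions humans can answer; the hope is that the learned answering behavior generalizes to questions humans wouldn't be able to evaluate.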

Inductive biases stick around

Sam McCandlish · 5y

Evan's response (copied from a direct message, before I was approved to post here):

It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point which is interesting re double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets, but according to the double descent story there's also a meaningful sense in which bigger models have better inductive biases.

The idea I'm objecting to is that there's a sharp change from one re...

Inductive biases stick around

Sam McCandlish · 5y

One caveat worth noting about double descent: it only appears if you train far longer than necessary, i.e. "train forever".

If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.

To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
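For readers who haven't implemented it, here is a minimal Python sketch of the patience-based early stopping described above ("stop when the performance on some validation set stops improving"). The comment names no framework, so the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are illustrative assumptions. It replays a list of per-epoch validation losses and returns the epoch whose checkpoint would be kept.

```python
from typing import Sequence


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 3) -> int:
    """Return the epoch index whose checkpoint early stopping would keep.

    `val_losses[i]` is the validation loss measured after epoch i. Training
    stops once `patience` consecutive epochs fail to beat the best loss seen
    so far; the best checkpoint up to that point is what gets kept.
    (Hypothetical helper for illustration, not from any library.)
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    if patience < 1:
        raise ValueError("patience must be at least 1")

    best_epoch = 0
    best_loss = val_losses[0]
    epochs_without_improvement = 0

    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] < best_loss:
            # New best checkpoint: remember it and reset the patience counter.
            best_loss = val_losses[epoch]
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs would never be run

    return best_epoch


# Made-up validation curve: improves until epoch 2, then overfits.
curve = [1.0, 0.8, 0.6, 0.65, 0.7, 0.75, 0.8]
print(early_stopping_epoch(curve, patience=3))  # → 2
```

With `patience=3`, three non-improving epochs in a row (3, 4, 5) trigger the stop, so the kept model is the epoch-2 checkpoint rather than whatever the network would become at zero training loss.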

All of samsamoa's Comments + Replies