Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Reducing Goodhart

Wiki Contributions

Comments

Nice. I tried to do something similar (except making everything leaky with polynomial tails, so 

y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4

where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold.

But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)

Wouldn't other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?

I don't think people working on alignment research assistants are planning to just turn it on and leave the building, they on average (weighted by money) seem to be imagining doing things like "explain an experiment in natural language and have an AI help implement it rapidly."

So I think both they and this post are describing the strategy of "building very generally useful AI, but the good guys will be using it first." I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.

Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P

Oh well, I'll just leave it as is as a monument to bad comments.

I think it's pretty tricky, because what matters to real networks is the cost difference between storing features pseudo-linearly (in superposition), versus storing them nonlinearly (in one of the host of ways it takes multiple nn layers to decode), versus not storing them at all. Calculating such a cost function seems like it has details that depend on the particulars of the network and training process, making it a total pain to try to mathematize (but maybe amenable to making toy models).

Neat, thanks. Later I might want to rederive the estimates using different assumptions - not only should the number of active features L be used in calculating average 'noise' level (basically treating it as an environment parameter rather than a design decision), but we might want another free parameter for how statistically dependent features are. If I really feel energetic I might try to treat the per-layer information loss all at once rather than bounding it above as the sum of information losses of individual features.

I feel like there's a somewhat common argument about RL not being all that dangerous because it generalizes the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution, it'll probably just fall back on familiar, safe behavior.

To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.

Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.

I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"

One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.

But if you want the best evaluation of a research assistant's capabilities, I agress using it as a research assistant is more reliable.

A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.

Relative to string theory, getting an AI to help use do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kinda way. So I disagree with the assumption in the first place.

And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.

Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.

Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.

Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)

Load More