Thanks for the detailed post!
I personally would have liked to see some mention of the classic 'outer' alignment questions that are subproblems of robustness and ELK. E.g. What counts as 'generalizing correctly'? -> How do you learn how humans want the AI to generalize? -> How do you model humans as systems that have preferences about how to model them?
Just riffing on this rather than starting a different comment chain:
If alignment-1 is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment-2 is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), I basically don't care about anyone's safety plan to get alignment-1 except insofar as it's part of a plan to get alignment-2.
Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for "nebulous future safety case for alignment-1," which is totally fine. I just also want a space for "nebulous future safety case for alignment-2" at the least (some earlier items explicitly about progressing towards that safety case can be extra credit). Different people might have different ideas about what form a plan for alignment-2 takes (will it focus on the structure of the institution using an aligned AI, or will it focus on the AI and its training procedure directly?), and about where having such a plan should come in the timeline, but I think it should be somewhere.
Part of what makes the corrupting effect of power insidious is that it seems obvious to humans that we can make everything work out best so long as we have power - so obvious that we don't even need to plan for how to get from having control to actually getting the good things that control was supposed to be an instrumental goal for.
The fact that latents are often related to their neighbors definitely seems to support your thesis, but it's not clear to me that you couldn't train a smaller, somewhat-lossy meta-SAE even on an idealized SAE, so long as the data distribution had rare events or rare properties you could throw away cheaply.
You could also play a similar game showing that latents in a larger SAE are "merely" compositions of latents in a smaller SAE.
So basically, I was left wanting a more mathematical perspective on what kinds of properties you're hoping SAEs (or meta-SAEs) and their latents will have.
It would be interesting to meditate on the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on a mathematical specification of what you want.
When you showed the decomposition of 'einstein', I also kinda wanted to see what the closest latents were in the object-level SAE to the components of 'einstein' in the meta-SAE.
Did you ever read Lara Buchak's book? Seems related.
Also, I'm not really intuition-pumped by the repeated mugging example. It seems similar to a mugging where Omega only shows up once, but asks you for a recurring payment.
A related issue might be asking whether UDT-ish agents who use a computable approximation to the Solomonoff prior are reflectively stable - will they want to "lock out" certain hypotheses that involve lots of computation (e.g. universes provably trying to simulate you via a search for simple universes that contain agents who endorse Solomonoff induction)? And probably the answer is going to be "it depends," and you can do verbal argumentation for either option.
I worry about erasing self-other distinction of values. If I want an AI to get me a sandwich, I don't want the AI to get itself a sandwich.
It's easy to say "we'll just train the AI to have good performance (and thereby retain some self-other distinctions), and getting itself a sandwich would be bad performance so it won't learn to do that." But this seems untrustworthy for any AI that's learning human values and generalizing them to new situations. In fact the entire point is that you hope it will affect generalization behavior.
I also worry that the instant you try this on sufficiently interesting and general domains, the safety benefit doesn't last - sort of like optimality is the tiger. If some AI needs to learn general problem-solving strategies to deal with tricky and novel tasks, the application of those strategies to a problem where deception is incentivized will rediscover deception, without needing it to be stored in the representation of the model.
Fun read!
This seems like it highlights that it's vital for current fine-tuned models to change the output distribution only a little (e.g. small KL divergence between base model and finetuned model). If they change the distribution a lot, they'll run into unintended optima, but the base distribution serves as a reasonable prior / reasonable set of underlying dynamics for the text to follow when the fine-tuned model isn't "spending KL divergence" to change its path.
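To make that concrete, here's a minimal sketch of measuring that drift as per-token KL divergence between a fine-tuned model and its base model. The checkpoint names "base-model" and "finetuned-model" are placeholders, and this measures KL on one fixed prompt rather than over sampled rollouts:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names - swap in a real base / fine-tuned pair.
base = AutoModelForCausalLM.from_pretrained("base-model")
tuned = AutoModelForCausalLM.from_pretrained("finetuned-model")
tok = AutoTokenizer.from_pretrained("base-model")

ids = tok("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)

# KL(tuned || base) at each token position, averaged: how much next-token
# probability mass the fine-tuned model has moved away from the base model.
kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1).mean()
print(f"mean per-token KL: {kl.item():.4f} nats")
```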
Except it's still weird how bad the reward model is - it's not like the reward model was trained based on the behavior it produced (the way humans' genetic code was); it's just supervised learning on human reviews.
This was super interesting. I hadn't really thought about the tension between SLT and superposition before, but this is in the middle of it.
Like, there's nothing logically inconsistent with the best local basis for the weights being undercomplete while the best basis for the activations is overcomplete. But if both are true, it seems like the relationship to the data distribution has to be quite special (and potentially fragile).
Nice! There's definitely been this feeling with training SAEs that activation penalty+reconstruction loss is "not actually asking the computer for what we want," leading to fragility. TopK seems like it's a step closer to the ideal - did you subjectively feel confident when starting off large training runs?
Confused about section 5.3.1:
> To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however that training with Multi-TopK does slightly worse than TopK at k). Training with the baseline ReLU only gives a progressive code up to a value that corresponds to using all positive latents.
Why would we want a progressive code over all hidden activations? If features have different meanings when they're positive versus when they're negative (imagining a sort of Toy Models of Superposition picture where features are a bunch of rays squeezed in around a central point), it seems like something weird is going on if your negative hidden activations are informative.
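For reference, here's roughly how I'm picturing the TopK forward pass and that Multi-TopK objective - a minimal sketch with made-up sizes (d_model=512, d_hidden=4096, k=32), not the paper's implementation:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096, k=32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def reconstruct(self, x, k):
        # Keep only the k largest pre-activations per example; zero the rest.
        pre = self.enc(x)
        topk = torch.topk(pre, k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.dec(latents)

    def multi_topk_loss(self, x):
        # L(k) + L(4k)/8, where L(k) is reconstruction MSE using the top-k latents.
        loss_k = ((self.reconstruct(x, self.k) - x) ** 2).mean()
        loss_4k = ((self.reconstruct(x, 4 * self.k) - x) ** 2).mean()
        return loss_k + loss_4k / 8

sae = TopKSAE()
x = torch.randn(8, 512)  # stand-in for a batch of residual-stream activations
print(sae.multi_topk_loss(x))
```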
> [Tells complicated, indirect story about how to wind up with a corrigible AI]
> "Corrigibility is, at its heart, a relatively simple concept"
I'm not saying the default strategy of bumbling forward and hoping that we figure out tool AI as we go has a literal 0% chance of working. But from the tone of this post and the previous table-of-contents post, I was expecting a more direct statement of what sort of functional properties you mean by "corrigibility," and I feel like I got more of a "we'll know it when we see it" approach.
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered:
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
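To make sure we're answering the same question, here's a toy sketch of the framing being asked about: vanilla REINFORCE on a made-up 3-armed bandit, where the reward does nothing except scale the size of each gradient step.

```python
import torch

logits = torch.zeros(3, requires_grad=True)   # 3-armed bandit policy parameters
opt = torch.optim.SGD([logits], lr=0.1)
true_rewards = torch.tensor([0.1, 0.5, 0.9])  # made-up per-arm rewards

for _ in range(2000):
    probs = torch.softmax(logits, dim=0)
    a = torch.multinomial(probs, 1).item()
    R = true_rewards[a] + 0.1 * torch.randn(())  # noisy reward for the chosen arm

    # The update direction is d/dtheta log pi(a); the reward R only rescales
    # how big a step gets taken on this particular datapoint.
    loss = -R.detach() * torch.log(probs[a])
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # usually concentrates on the highest-reward arm
```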
For readers familiar with Markov chain Monte Carlo, you can probably fill in the blanks now that I've primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don't want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it's going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that's very similar is to just take smaller steps when going uphill than when going downhill.
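Here's a toy sketch of those two rules side by side, just to make the parallel concrete. The 1-D double-well landscape, step size, and temperature are all made-up illustrative choices:

```python
import math
import random

def E(x):
    # Double-well energy with a tilt, so the well near x = -1 is the global minimum.
    return (x**2 - 1) ** 2 + 0.5 * x

def metropolis(steps=10_000, temp=0.2):
    """Always take downhill steps; take uphill steps with probability exp(-dE/T)."""
    x = 2.0
    for _ in range(steps):
        direction = random.choice([-1.0, 1.0])
        dE = E(x + 0.3 * direction) - E(x)
        if dE <= 0 or random.random() < math.exp(-dE / temp):
            x += 0.3 * direction
    return x

def smaller_steps_uphill(steps=10_000, temp=0.2):
    """Always move, but shrink the step by exp(-dE/T) when it goes uphill, so each
    step's *expected* displacement matches the Metropolis walk's."""
    x = 2.0
    for _ in range(steps):
        direction = random.choice([-1.0, 1.0])
        dE = E(x + 0.3 * direction) - E(x)
        step = 0.3 if dE <= 0 else 0.3 * math.exp(-dE / temp)
        x += step * direction
    return x

print(metropolis(), smaller_steps_uphill())  # both wander with a bias toward low energy
```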
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you're supposed to take based on stochastically sampled directions, then an interesting question is: when can you infer an energy function that's implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there's no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what's going on as a Markov chain Monte Carlo procedure plus some extra stuff?
I don't know, these seem like interesting questions in learning theory. If you search for questions like "under what conditions does the REINFORCE algorithm find a global optimum," you find papers like this one that don't talk about MCMC, so maybe I've lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape that you're doing a Markov chain Monte Carlo walk around.