Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment
Right, the point of quantilizers is not to make generative models safer. It's to be safer than non-generative models (in cases where the training distribution is in fact safe and you don't need to filter very hard to succeed at the task).
I expect the purely statistical safety/filtering tradeoff to actually be pretty unimportant. More important are the vulnerabilities that come from the training distribution actually not being safe in the first place. The performance cost also does seem pretty important, but could potentially be sidestepped (for example, by training a student model on filtered data) if safety were actually solved.
My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.
Two interesting differences between the approaches discussed here and in my linked post:
This is helpful, thanks for summarizing the differences! I definitely agree on the first one.
On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section).
That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that trajectory doesn't involve suddenly switching from human-like to AGI-like models. I don't personally find this very comforting (it seems totally plausible to me that there's a continuous path from "median human" to "very dangerous misaligned model" in model-space), but it does at least seem better than directly asking for a superhuman model.
Thanks to Evan Hubinger for discussions about quantilizers, and to James Lucassen for discussions about conditioned generative models. Many of these ideas are discussed in Jessica Taylor's Quantilizers: A Safer Alternative to Maximizers for Limited Optimization: this post just expands on a particular thread of ideas in that paper. Throughout I’ll refer to sections of the paper. I have some remaining confusion about the “targeted impact” section, and would appreciate clarifications/corrections!
Abstract
This post explores the relationship between quantilizers and generative models. My main takeaways are:
Definitions
I’ll follow the setup in “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization” (Section 1, Definition 1) and say that a quantilizer is a model with a base distribution γ over outputs, a quantile q ∈ [0, 1], and an ordering over outputs.
A quantilizer returns a random output from the top q fraction of outputs from γ according to the ordering. So for instance a 0-quantilizer is a utility maximizer, a 0.1-quantilizer returns outputs in the best 10%, etc.
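To make the definition concrete, here is a minimal sketch of a q-quantilizer over a finite base distribution. It is an illustration rather than code from the paper: the function name `quantilize`, the use of a numeric `utility` function to stand in for the ordering, and the choice to split the outcome that straddles the q cutoff are all assumptions. It requires q > 0, so the maximizer case q = 0 is only approached as a limit.

```python
import random


def quantilize(outcomes, probs, utility, q, rng=random):
    """Sample from the top-q probability mass of a finite base distribution.

    outcomes, probs: the base distribution gamma (probs should sum to 1).
    utility: hypothetical numeric score standing in for the ordering.
    q: fraction of gamma's probability mass to keep, with 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank outcomes from best to worst under the ordering.
    ranked = sorted(zip(outcomes, probs), key=lambda op: utility(op[0]), reverse=True)
    kept, weights, mass = [], [], 0.0
    for outcome, p in ranked:
        if mass >= q - 1e-12:  # top-q mass collected (tolerance for float error)
            break
        w = min(p, q - mass)  # truncate the outcome that straddles the cutoff
        kept.append(outcome)
        weights.append(w)
        mass += w
    # Within the kept region, sample in proportion to the base probabilities.
    return rng.choices(kept, weights=weights, k=1)[0]
```

With q = 1 this reduces to sampling from γ directly; shrinking q concentrates the output on the highest-ranked outcomes.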
Where does the base distribution come from?
One challenge with quantilizers is that they require a base distribution γ that we trust. That is, we want to be able to sample from γ many times and never hit bad outputs. Unfortunately, we don’t know how to explicitly write this distribution down.
A natural alternative to explicit construction is to train a generative model on the distribution of actions humans take (Section 3.1). Then for any given task, γ can be written as
γ_task = P(action | good human attempts task)
This seems quite doable. Train GPT-N on text written by humans, then prompt it with the task:
The output will probably be text describing how a good person would accomplish that task!
Unfortunately we might get unlucky and sample a person who isn’t so good (but thinks of themselves as good). That still satisfies the prompt. Robustly getting the agent we want seems hard, but might be possible in the limit where we ask for agents that are generally plausible in the training distribution of the generative model.
Power vs Safety
Armed with our base distribution, we can start quantilizing. Naively we can do this by sampling 1/q times from γ_task, then returning the top output. This produces a tradeoff between capabilities and safety.
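The naive scheme can be sketched as a best-of-n sampler that draws about 1/q samples and keeps the best one. This assumes a hypothetical `sample_base` callable standing in for one draw from the task-conditioned base distribution (for example, one completion from the generative model) and a numeric `utility` standing in for the ordering; neither name comes from the paper. Best-of-n is only an approximation, since it returns the best of the draws rather than a uniformly random draw from the top q fraction.

```python
import math
import random


def best_of_n_quantilizer(sample_base, utility, q, rng=random):
    """Approximate a q-quantilizer: draw ceil(1/q) base samples, return the best.

    sample_base: hypothetical callable taking an rng and returning one sample
                 from the base distribution.
    utility: hypothetical numeric score standing in for the ordering.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    n = math.ceil(1 / q)  # q = 1 gives a single sample, small q gives many
    candidates = [sample_base(rng) for _ in range(n)]
    return max(candidates, key=utility)
```

The number of base-model calls grows as 1/q, which is one concrete form of the performance cost of filtering harder.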
Single Sampler
In the limit q→1 we’re just returning actions a good human would take. That is, our quantilizer just samples once and returns that sample as output, so it’s no more powerful than a random good person.
Such a model is really unlikely to take over the world, but it’s also really unlikely to solve quantum chemistry problems. This is because both of these outcomes are unlikely according to the base distribution.
We can improve things a bit by conditioning on being an unusually capable person, e.g.
Now we can get top-end human performance out, though at the cost that we have increased the odds of bad outcomes.
For instance, the model might decide to role-play a biologist who invents some extremely positive technology that happens to have unintended negative consequences (e.g. making engineered pandemics more likely). That’s not a problem unique to AI though, just a “giving humans more capabilities” problem.
Many Samples
In the limit q≪1 we’re sampling many times and returning the best action we find. Quantilizing like this makes actions more powerful, but at the cost of making them less likely according to the base distribution. This potentially makes them more dangerous.
For instance, instead of role-playing an extremely capable biologist who is likely to be alive today, the model can now role-play even more capable biologists who are unlikely to exist. Or it can role-play that same biologist on their best day, having the best insight of their life. Or, even worse, it can role-play a deceptive AI that’s pretending to be an extremely capable (and good) human biologist.
We now have at least two dangers:
Targeted Impact
The general tradeoff above is that as we make the quantilizer more powerful (e.g. reduce q, or condition on more capabilities) we make it more dangerous by pulling it away from the action-space we think of as generally safe.
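One way to state this tradeoff quantitatively, as a sketch under assumptions not spelled out in this post: write Q_q for the output distribution of the q-quantilizer and let c be any nonnegative cost function over outputs (both symbols are introduced here for illustration). Because the quantilizer only reweights outputs that γ already produces, it can raise the probability of any single output by at most a factor of 1/q, so the expected cost can grow by at most that factor:

```latex
Q_q(a) \le \frac{\gamma(a)}{q} \quad \text{for all } a
\qquad \Longrightarrow \qquad
\mathbb{E}_{a \sim Q_q}[c(a)] \le \frac{1}{q}\, \mathbb{E}_{a \sim \gamma}[c(a)]
```

The bound is only as good as γ: lowering q loosens it, and it offers little comfort if the base distribution already puts meaningful probability on bad outputs.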
Section 2.2 refers to a technique for quantilizing that avoids this tradeoff called Targeted Impact, but I have not been able to follow the argument. It seems to produce the same quantilizing algorithm (because the condition at the end of the section is exactly the same as the one in Section 2.1), and so I’m confused about where the improvement comes from.
I’m probably missing something, and would really appreciate hearing/seeing a more detailed explanation of this approach!
Summary
My understanding right now is:
Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment. That said, I suspect I'm missing something important about the "Targeted Impact" approach, and am keen to understand that better.