Thomas Kwa

Engineer at METR.

Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments

I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
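
To make the distinction concrete, here's a minimal sketch (hypothetical numbers and a deliberately simple fit, not data from either paper) of the difference between a defense that only shifts the attack-success curve down and one that changes the power law exponent:

```python
# Fit attack success rate (ASR) as a power law in the number of attempts n,
# ASR(n) ≈ a * n^b, and compare the fitted exponent b with and without a
# defense. All numbers below are invented for illustration.
import numpy as np

def fit_power_law(n, asr):
    """Least-squares fit of log(asr) = log(a) + b*log(n); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(n), np.log(asr), 1)
    return np.exp(intercept), slope

n = np.array([1, 10, 100, 1000, 10000])
asr_baseline = np.array([0.01, 0.05, 0.20, 0.55, 0.90])   # hypothetical
asr_defended = np.array([0.002, 0.01, 0.04, 0.11, 0.18])  # hypothetical

_, b_baseline = fit_power_law(n, asr_baseline)
_, b_defended = fit_power_law(n, asr_defended)

# A similar exponent with a lower intercept means the defense only shifts the
# curve down; a clearly smaller exponent would mean it changes the power law.
print(f"baseline exponent {b_baseline:.2f}, defended exponent {b_defended:.2f}")
```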

For context, I just trialed at METR and talked to various people there, but this take is my own.

I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").

The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.

  • In a world where elicitation quality has very reliable scaling laws, we would observe diminishing returns to better scaffolds. Elicitation quality is predictable: ideally an additive term on top of model quality, though more likely it requires some additional information about the model. It is rare to discover a new scaffold that 2x's the performance of an already well-tested model.
  • In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.

It is NOT KNOWN what world we are in (worst-case assumptions would put us in World 2, though I'm optimistic we're closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don't seem to be in World 2 either, there are endless tricks we can add to make evals more thorough, some of which are already being used: evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
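
As a concrete sketch of the kind of fit I mean (invented data, a deliberately crude additive model, and my own variable names rather than anything METR actually uses):

```python
# Model benchmark score as (compute term) + (per-scaffold elicitation offset),
# then check how much the additive model leaves unexplained. Hypothetical data.
import numpy as np

# (log10 pretraining compute, scaffold id, measured score) -- all invented
runs = [
    (22.0, 0, 0.31), (22.0, 1, 0.38), (23.0, 0, 0.45),
    (23.0, 1, 0.52), (24.0, 0, 0.58), (24.0, 1, 0.66),
]
log_compute = np.array([r[0] for r in runs])
scaffold    = np.array([r[1] for r in runs], dtype=float)
score       = np.array([r[2] for r in runs])

# Design matrix [1, log_compute, scaffold indicator]: elicitation enters as an
# additive offset on top of a compute scaling term.
X = np.column_stack([np.ones_like(log_compute), log_compute, scaffold])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
residual = score - X @ coef

print("compute slope:", coef[1], "scaffold offset:", coef[2])
print("max |residual|:", np.abs(residual).max())
# Small, stable residuals across scaffolds and models look like World 1
# (elicitation is predictable); large scaffold-dependent residuals, or new
# scaffolds landing far outside the fitted error bars, look like World 2.
```

The real version of this would span many models, scaffolds, and task families, and the question is whether new elicitation methods keep landing inside the error bars of the fitted law.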

What's the most important technical question in AI safety right now?

This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and for further grants supporting its development into a published work with new theory and experiments. @Adrià Garriga-alonso was also a great help in writing the paper and interfacing with the review process.

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.

You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear, so you may not even lose much utility from this emphasis on avoiding suffering.

It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization's leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties, and 14 civilians died anyway, fewer than there could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 of them being avoidance of casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.

I'm not thinking of a specific task here, but I think there are two sources of hope. One is that humans are agentic above and beyond what is required to do novel science, e.g. we have biological drives, goals other than doing the science, often the desire to use any means to achieve our goals rather than whitelisted means, and the ability and desire to stop people from interrupting us. Another is that learning how to safely operate agents at a slightly superhuman level will be progress towards safely operating nanotech-capable agents, which could also require control, oversight, steering, or some other technique. I don't think limiting agency will be sufficient unless the problem is easy, in which case it would have other possible solutions.

I'm glad to see this post curated. It seems increasingly likely that it will be useful to carefully construct agents that have only the agency required to accomplish a task, and the ideas here seem like the first steps.

I agree, there were some good papers, and mechinterp as a field is definitely more advanced. What I meant to say was that many of the mechinterp papers accepted to the conference weren't very good.

Quick takes from ICML 2024 in Vienna:

  • In the main conference, there were tons of papers mentioning safety/alignment, but few of them were good, as alignment has become a buzzword. Many mechinterp papers at the conference from people outside the rationalist/EA sphere were no more advanced than where the EAs were in 2022. [edit: wording]
  • Lots of progress on debate. On the empirical side, a debate paper got an oral. On the theory side, Jonah Brown-Cohen of DeepMind proved that debate can be efficient even when the thing being debated is stochastic, a version of this paper from last year. Apparently there has been some progress on obfuscated arguments too.
  • The Next Generation of AI Safety Workshop was kind of a mishmash of various topics associated with safety. Most of them were not related to x-risk, but there was interesting work on unlearning and other topics.
    • The Causal Incentives Group at Deepmind developed a quantitative measure of goal-directedness, which seems promising for evals.
    • Reception to my Catastrophic Goodhart paper was decent. An information theorist said there were good theoretical reasons the two settings we studied (KL divergence and best-of-n) behaved similarly; a short note on the standard identity relating the two follows this list.
    • OpenAI gave a disappointing safety presentation at NGAIS touting their new technique of rules-based rewards, which is a variant of constitutional AI and seems really unambitious.
  • The mechinterp workshop often had higher-quality papers than the main conference. It was completely full. Posters were right next to each other and the room was so packed during talks they didn't let people in.
    • I missed a lot of the workshop, so I need to read some posters before having takes.
  • My opinions on the state of published AI safety work:
    • Mechinterp is progressing but continues to need feedback loops, either from benchmarks (I'm excited about people building on our paper InterpBench) or downstream tasks where mechinterp outperforms fine-tuning alone.
    • Most of the danger from AI comes from goal-directed agents and instrumental convergence. There is little research now because we don't have agents yet. In 1-3 years, foundation model agents will be good enough to study, and we need to be ready with the right questions and theoretical frameworks.
    • We still do not know enough about AI safety to make policy recommendations about specific techniques companies should apply.
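
(A note on the best-of-n / KL bullet above, which is my own gloss rather than the information theorist's argument: the standard way I know of to put best-of-n on the KL scale is the bound KL(BoN ‖ base) ≤ log n − (n−1)/n, which holds with equality when rewards are effectively continuous. Best-of-1000, for instance, corresponds to only about log(1000) − 0.999 ≈ 5.9 nats of optimization pressure, which is what lets results in the two settings be compared on a common axis.)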

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
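
To spell out what I mean by that second sentence, here is a toy update with invented numbers:

```python
# Toy Bayes update, just to name the three quantities explicitly.
prior = 0.10                  # P(H): credence in the hypothesis before the evidence
p_e_given_h = 0.80            # P(E | H)
p_e_given_not_h = 0.20        # P(E | not H)

p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h   # P(E) = 0.26
posterior = prior * p_e_given_h / p_e                       # ≈ 0.31
print(f"prior {prior:.2f} -> posterior {posterior:.2f}")
```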
