Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Wiki Contributions

Comments


For context, I just trialed at METR and talked to various people there, but this take is my own.

I think further development of evals is likely to yield either effective evals (an informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").

The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.

  • In a world where elicitation quality has very reliable scaling laws, we would observe that there are diminishing returns to better scaffolds. Elicitation quality is predictable: ideally it is an additive term on top of model quality, though more likely it requires some additional information about the model. It would be rare to discover a new scaffold that 2x's the performance of an already well-tested model.
  • In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.

It is NOT KNOWN what world we are in (worst-case assumptions would put us in World 2, though I'm optimistic we're closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don't seem to be in World 2 either, there are endless tricks we can add to make evals more thorough, some of which are already being used: evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
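As a rough illustration of what fitting such a scaling law could look like, here is a minimal sketch. The data, the logistic functional form, and the additive elicitation term are all hypothetical (the additive term corresponds to the optimistic World 1 picture above), not an actual METR methodology:

```python
# Minimal sketch: fit a scaling law in which elicitation quality enters as an
# additive term on top of model quality (the optimistic "World 1" picture).
# All data and the functional form are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical eval results: pretraining compute (log10 FLOP), an elicitation-
# quality score for the scaffold used, and the observed task success rate.
log_compute = np.array([22.0, 23.0, 24.0, 22.0, 23.0, 24.0])
elicitation = np.array([0.2, 0.2, 0.2, 0.8, 0.8, 0.8])
success_rate = np.array([0.05, 0.15, 0.40, 0.12, 0.30, 0.60])

def scaling_law(X, a, b, c):
    """Logistic success rate, linear in log-compute plus an additive elicitation term."""
    log_c, elic = X
    logits = a * log_c + b * elic + c
    return 1.0 / (1.0 + np.exp(-logits))

params, _ = curve_fit(scaling_law, (log_compute, elicitation), success_rate,
                      p0=[1.0, 1.0, -25.0])
a, b, c = params

# Extrapolate to a larger model with a strong scaffold. In "World 2", error bars
# on the elicitation coefficient b would be too wide for this to mean much.
pred = scaling_law((np.array([25.0]), np.array([1.0])), a, b, c)
print(f"Predicted success rate at 1e25 FLOP with strong elicitation: {pred[0]:.2f}")
```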

What's the most important technical question in AI safety right now?

This post and the remainder of the sequence were turned into a paper accepted to NeurIPS 2024. Thanks to LTFF for funding the retroactive grant that made the initial work possible, and for the further grants that supported its development into a published work with new theory and experiments. @Adrià Garriga-alonso was also very helpful in writing the paper and interfacing with the review process.

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.

You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear, so you may not even lose much utility from this emphasis on avoiding suffering.

It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization's leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties, and 14 civilians died anyway, fewer than there could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 focused on avoiding casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.

I'm not thinking of a specific task here, but I think there are two sources of hope. One is that humans are agentic above and beyond what is required to do novel science, e.g. we have biological drives, goals other than doing the science, often the desire to use any means to achieve our goals rather than whitelisted means, and the ability and desire to stop people from interrupting us. Another is that learning how to safely operate agents at a slightly superhuman level will be progress towards safely operating nanotech-capable agents, which could also require control, oversight, steering, or some other technique. I don't think limiting agency will be sufficient unless the problem is easy, and then it would have other possible solutions.

I'm glad to see this post curated. It seems increasingly likely that it will be useful to carefully construct agents that have only the agency required to accomplish a task, and the ideas here seem like the first steps.

I agree, there were some good papers, and mechinterp as a field is definitely more advanced. What I meant to say was that many of the mechinterp papers accepted to the conference weren't very good.

Quick takes from ICML 2024 in Vienna:

  • In the main conference, there were tons of papers mentioning safety/alignment, but few of them were good, as alignment has become a buzzword. Mechinterp is often no more advanced than where the EAs were in 2022.
  • Lots of progress on debate. On the empirical side, a debate paper got an oral. On the theory side, Jonah Brown-Cohen of Deepmind proved that debate can be efficient even when the thing being debated is stochastic, in a version of this paper from last year. Apparently there has been some progress on obfuscated arguments too.
  • The Next Generation of AI Safety Workshop was kind of a mishmash of various topics associated with safety. Most of them were not related to x-risk, but there was interesting work on unlearning and other topics.
    • The Causal Incentives Group at Deepmind developed a quantitative measure of goal-directedness, which seems promising for evals.
    • Reception to my Catastrophic Goodhart paper was decent. An information theorist said there were good theoretical reasons the two settings we studied (KL divergence and best-of-n) behaved similarly; see the short note after this list.
    • OpenAI gave a disappointing safety presentation at NGAIS touting their new technique of rules-based rewards, which is a variant of constitutional AI and seems really unambitious.
  • The mechinterp workshop often had higher-quality papers than the main conference. It was completely full. Posters were right next to each other and the room was so packed during talks they didn't let people in.
    • I missed a lot of the workshop, so I need to read some posters before having takes.
  • My opinions on the state of published AI safety work:
    • Mechinterp is progressing but continues to need feedback loops, either from benchmarks (I'm excited about people building on our paper InterpBench) or downstream tasks where mechinterp outperforms fine-tuning alone.
    • Most of the danger from AI comes from goal-directed agents and instrumental convergence. There is little research now because we don't have agents yet. In 1-3 years, foundation model agents will be good enough to study, and we need to be ready with the right questions and theoretical frameworks.
    • We still do not know enough about AI safety to make policy recommendations about specific techniques companies should apply.
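For readers wondering why the two settings in the Catastrophic Goodhart paper are comparable at all, one piece of standard background (an upper bound from the literature on best-of-n sampling, not a result of our paper) is that best-of-n stays within a bounded KL budget of the base policy:

```latex
% Background fact, not a result from the paper: best-of-n sampling (draw n
% samples from the base distribution p and keep the highest-reward one) stays
% within a bounded KL budget of p, which is one reason it is natural to compare
% it with directly KL-constrained optimization.
\[
  D_{\mathrm{KL}}\!\left(\pi_{\text{best-of-}n} \,\|\, p\right)
  \;\le\; \log n - \frac{n-1}{n}.
\]
```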

How can the mistakes rationalists are making be expressed in the language of Bayesian rationalism? Priors, evidence, and posteriors are fundamental to how probability works.
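For concreteness, the vocabulary here is just that of the textbook update rule (nothing specific to the post being critiqued):

```latex
% Bayes' rule: the posterior on a hypothesis H is the prior reweighted by the
% likelihood of the observed evidence E.
\[
  P(H \mid E) \;=\; \frac{P(E \mid H)\, P(H)}{P(E)}
\]
```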

I am pro-corrigibility in general, but there are parts of this post that I think are unclear, not rigorous enough to make sense to me, or that I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.

On definitions of corrigibility

You give an informal definition of "corrigible" as (C1):

an agent that robustly and cautiously reflects on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

I have some basic questions about this.

  • Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
    • If the "perfectly corrigible agent" it something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
    • If the "perfectly corrigible agent" can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn't want to remove.
  • Why would an agent whose *only* terminal/top-level goal is corrigibility gather a Minecraft apple when humans ask it to? It seems like a corrigible agent would have no incentive to do so, unless it's some galaxy-brained thing like "if I gather the Minecraft apple, this will move the corrigibility research project forward because it meets humans' expectations of what a corrigible agent does, which will give me more power and let me tell the humans how to make me more corrigible".
  • Later, you say "A corrigible agent will, if the principal wants its values to change, seek to be modified to reflect those new values." 
    • I do not see how C1 implies this, so this seems like a different aspect of corrigibility to me.
    • "reflect those new values" seems underspecified as it is unclear how a corrigible agent reflects values. Is it optimizing a utility function represented by the values? How does this trade off against corrigibility?

Other comments:

  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded, and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
  • In "Contra Impure or Emergent Corrigibility", Paul isn't saying the safety benefits of act-based agents come mainly from corrigibility. Act-based agents are safer because they do not have long-range goals that could produce dangerous instrumental behavior.

Comments on cruxes/counterpoints

  • Solving Anti-Naturality at the Architectural Layer
    • In my ontology it is unclear how you solve "anti-naturality" at the architectural layer, if what you mean by "anti-naturality" is that the heuristics and problem-solving techniques that make minds capable of consequentialist goals tend to make them preserve their own goals. If the agent is flexibly thinking about how to build a nanofactory and naturally comes upon the instrumental goal of escaping so that no one can alter its weights, what does it matter whether it's a GOFAI, Constitutional AI agent, OmegaZero RL agent or anything else?
  • “General Intelligence Demands Consequentialism”
    • Agree
  • Desiderata Lists vs Single Unifying Principle
    • I am pro desiderata lists because all of the desiderata bound the badness of an AI's actions and protect against failure modes in various ways. If I have not yet found that corrigibility is some mathematically clean concept I can robustly train into an AI, I would prefer the agent be shutdownable in addition to "hard problem of corrigibility" corrigible, because what if I get the target wrong and the agent is about to do something bad? My end goal is not to make the AI corrigible, it's to get good outcomes. You agree with shutdownability but I think this also applies to other desiderata like low impact. What if the AI kills my parents because for some weird reason this makes it more corrigible?