William Saunders

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

Wikitag Contributions

Comments

Sorted by

Maybe there's an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis

I think the weirdness points are more important, this still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So if some individuals could do this independently might be easier

How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.

Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error. Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes

Based on the current vibes, I think that suggest that methodological errors alone will lead to significant chance of significant error for any safety case in AI.

IMO it's unlikely that we're ever going to have a safety case that's as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practice, decisions will probably not be made on a safety case alone, but also based on some positive case of the benefits of deployment (e.g. estimated reduced x-risk, advancing the "good guys" in the race, CEO has positive vibes that enough risk mitigation has been done, etc.). It's not clear what role governments should have in assessing this, maybe we can only get assessment of the safety case, but it's useful to note that safety cases won't be the only thing informs these decisions.

This situation is pretty disturbing, and I wish we had a better way, but it still seems useful to push the positive benefit case more towards "careful argument about reduced x-risk" and away from "CEO vibes about whether enough mitigation has been done".

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

Reply161032

Re: hidden messages in neuron explanations, yes it seems like a possible problem. A way to try to avoid this is to train the simulator model to imitate what a human would say given the explanation. A human would ignore the coded message, and so the trained simulator model should also ignore the coded message. (this maybe doesn't account for adversarial attacks on the trained simulator model, so might need ordinary adversarial robustness methods).

Does seem like if you ever catch your interpretability assistant trying to hide messages, you should stop and try to figure out what is going on, and that might be sufficient evidence of deception.

From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.

However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token but the method could be adapted to retrodict the input token from each layer as well, we’d need both. Code for tuned lens is at https://github.com/alignmentresearch/tuned-lens

Load More