I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.
Thank you for your work there. Curious what specifically prompted you to post this now; presumably your leaving OpenAI and wanting to communicate that somehow?
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a "no comment" or lack of response or something to that degree implies a "yes" with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered in severance agreements which cover the existence of the NDA itself are unlawful.)
I think it is safe to infer, from the conspicuous and repeated silence of ex-OA employees when asked whether they signed an NDA that also included a gag order about the NDA itself, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so are probably even less protected than usual).
EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and that violating it forfeits all equity, including fully vested equity. Since one would assume Ilya Sutskever received many PPUs and would be holding onto them as much as possible, this implies Sutskever might have had literally billions of dollars at stake in how he quit and what he then, say, tweeted... (PPUs which can only be sold in the annual OA-controlled tender offer.)
It turns out there’s a very clear reason for that. I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it. If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars. ... While nondisclosure agreements aren’t unusual in highly competitive Silicon Valley, putting an employee’s already-vested equity at risk for declining or violating one is. For workers at startups like OpenAI, equity is a vital form of compensation, one that can dwarf the salary they make. Threatening that potentially life-changing money is a very effective way to keep former employees quiet. (OpenAI did not respond to a request for comment.)
By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be seeing. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn't paint OA in a good light for it.
I am confused. (And I am grateful to William for at least saying this much, given the climate!)
From discussion with Logan Riggs (Eleuther), who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers goes through some linear transformations, so the layers aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, and 2) calculating virtual weights between neurons in different layers.
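For concreteness, here is a minimal sketch of the naive virtual-weight calculation that this would interfere with. The names and shapes are my own illustration (following the usual d_model convention), not code from any existing repo:

```python
import torch

def naive_virtual_weights(W_out_i: torch.Tensor, W_in_j: torch.Tensor) -> torch.Tensor:
    """Virtual weights between every neuron in an earlier MLP layer i and every
    neuron in a later MLP layer j, ignoring any change of basis between layers.

    W_out_i: (d_model, n_neurons_i) -- directions layer i's neurons write into the residual stream
    W_in_j:  (n_neurons_j, d_model) -- directions layer j's neurons read from the residual stream
    Entry [b, a] says how strongly neuron a (layer i) writes into what neuron b (layer j) reads.
    """
    return W_in_j @ W_out_i  # shape (n_neurons_j, n_neurons_i)
```

If the residual stream basis drifts between layers i and j, these dot products are taken across mismatched bases, which is exactly the problem.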
However, we could try correcting for this using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token, but the method could be adapted to retrodict the input token from each layer as well; we’d need both. Code for the tuned lens is at https://github.com/alignmentresearch/tuned-lens
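A rough sketch of what the correction might look like, assuming we have already extracted an affine translator (matrix A_l, bias b_l) per layer from a trained tuned lens and use the final-layer residual basis as the common basis. The function names are hypothetical, and the treatment of reading vs. writing directions is just one plausible choice:

```python
import torch

def corrected_virtual_weight(w_out_i: torch.Tensor, A_i: torch.Tensor,
                             w_in_j: torch.Tensor, A_j: torch.Tensor) -> torch.Tensor:
    """Virtual weight between an earlier neuron's output direction (layer i) and a
    later neuron's input direction (layer j), after mapping both into the
    final-layer basis via the tuned-lens translator matrices A_i, A_j.

    If activations map as h_final ~ A_l @ h_l (+ bias), a direction written at
    layer i appears in the final basis as A_i @ w_out_i, while a direction read
    at layer j transforms with the inverse-transpose of A_j.
    """
    w_out_common = A_i @ w_out_i                     # writing direction -> final basis
    w_in_common = torch.linalg.solve(A_j.T, w_in_j)  # reading direction -> final basis
    return w_in_common @ w_out_common

def corrected_embedding_view(w_out_i: torch.Tensor, A_i: torch.Tensor,
                             W_U: torch.Tensor) -> torch.Tensor:
    """Embedding-space view of a layer-i neuron: translate its output direction
    into the final basis, then through the unembedding W_U (d_model, vocab) to
    see which tokens it promotes or suppresses."""
    return (A_i @ w_out_i) @ W_U
```

(The translator biases are dropped here since they shift activations rather than directions; whether that is the right call is itself part of what one would need to check.)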