Thanks to Neel Nanda, Madeleine Chang, Michael Chen, Oliver Zhang, and Aidan O’Gara for their early feedback on this post.

Summary: Alignment researchers could record the intermediate outputs of their work in anticipation of using the data to train future AI models. These models could eventually accelerate alignment research by imitating the workflows and thinking styles of these researchers, which could be useful for many applications. The training data could come from document and code version histories, screen recordings, keystrokes, recordings of virtual and in-person meetings, and running verbal commentary during the researchers’ work. Although some data collection methods risk being too inconvenient or distracting to researchers, many are likely worth the cost.

There’s been a fair amount of discussion in the AI safety community about using AI to accelerate AI alignment research. In theory, AI could be used to quickly write up polished articles from rough notes, automate repetitive tasks, write code, and even facilitate idea generation. In the limit of capabilities, an AI model could even act as a fully independent researcher, pursuing novel research directions, reasoning effectively, and communicating its findings comprehensively to humans.

This idea has been covered elsewhere, so I’ll try not to retread too much old ground here. Please see this post for a compilation of many different researchers’ takes on the issue, or section 4 of OpenAI’s recent blog post, Our Approach to Alignment Research.

Personally, I’m fairly optimistic about AI-aided alignment research being a promising line of inquiry. However, I haven’t seen much discussion about preparing for this possibility by acquiring one of the most crucial resources for training AI - data. In this post, I argue that future alignment-boosting AIs would benefit greatly from domain-specific data derived from AI alignment research processes (which I’ll refer to as alignment data for short).

Proposal and rationale

Alongside compute and the discovery of useful algorithms, data is one of the most important bottlenecks for training modern AI. If we want to train a model to do alignment research like a human can, I expect that a dataset of humans doing alignment research would make this a lot easier.

To be sure, some forms of alignment data, like blog posts and papers about AI alignment, could be derived easily from existing resources. However, data about the process that AI alignment researchers take to generate their research could be even more valuable. Consider the research from Wei et al. showing that prompting large language models to “think step-by-step” can significantly improve their performance. This is especially relevant in academic research, where years of thought can go into a ten-page paper; we can hardly expect to get good results by simply typing “Write the abstract of a paper that solves the AI alignment problem” into an LLM.

These datasets could be derived from a huge variety of sources, which I’ll discuss more later in this post. A few of the most promising are probably Google Doc edit histories, screen recordings, and conversations between alignment researchers.

Imitative AI may be safer than goal-driven AI

An alternative to training a purely generative model would be to lean mostly on RL techniques, training a model to optimize for positive human feedback on its AI alignment research, perhaps fine-tuned from a model trained only on a general text corpus. Since a reinforcement learner would likely need much less domain-specific data upfront than a generative model, the recommendations in this post would matter less under that approach. However, I believe generative models are less likely to attempt reward hacking and deception than more “goal-directed” AI.

Models in the style of DALL-E 2 or the original GPT-3, which deontologically try to “do things a human would do” rather than trying to maximize reward, are probably unlikely to take highly unusual and exploitative actions, even when scaled up to greater capabilities. The main caveat is that human-imitating AIs would likely asymptote at or below a human level of performance; however, given enough training data, a fully-trained model may still be able to do research orders of magnitude more quickly than human researchers just by virtue of having greater numbers (i.e. arbitrary numbers of copies) and speed.

Even if we do end up having to train a reward-based model to help solve alignment, we could still use these datasets to train the model to get as close as possible to being able to do its own research, then get it over the final hump with human feedback. Intuitively, I expect that the less we train a model on human feedback, the higher its potential capabilities will likely get before it learns to game the reward and starts taking deceptive or dangerous actions.

Creating a comparative advantage for alignment research

As far as I know, there are no researchers collecting data on their own research processes with the intention of training an AI model in the future - not in AI alignment, not in AI capabilities, and not in any other scientific fields either. If alignment researchers took the lead in this area, it could give them a comparative advantage in the quality and quantity of their training data once AI technology has advanced enough to effectively use it.

This could help to address a potential issue with relying on AI to solve the alignment problem: once AI technology has reached a level that could significantly accelerate scientific research, it’s not clear that it would be used to accelerate AI alignment research in particular, and not, say, AI capabilities research. By collecting alignment data, we can make AI-accelerated alignment research more feasible without providing other scientific fields with a similar advantage.

Of course, the more data we manage to collect, the greater our comparative advantage will be. This suggests that those who expect longer AI timelines will be more excited about the strategy in this post, since if they’re right, we’ll have more time to collect data.

Potential uses for the data

Here is a more comprehensive list of potential uses for alignment data, roughly ordered by technical feasibility:[1]

  • Sharing techniques among human researchers: This particular application wouldn’t even involve AI. If researchers shared recordings of their work, such as coding, writing, or group discussions, they could help others to understand the job and learn valuable thought processes. In fact, Neel Nanda did something like this recently.
  • Serving as a writing assistant: An AI trained on alignment data could assist with crafting clear and concise prose, allowing researchers to focus more on the ideas and less on the wording. This could mean editing existing writing or writing up a full post from rough notes. (This is already possible - I used ChatGPT to make many parts of this post more concise and focused.)
  • Acting as a sounding board: An AI trained on alignment data from experienced researchers could provide feedback and advice to aspiring alignment researchers, helping to alleviate the shortage of mentors in the field. While it may be difficult to achieve the level of complex and thoughtful feedback provided by a real human, it may still be useful to develop AI models that can give more generic feedback. For example, “I think your argument would be clearer if you gave a specific example of this abstract concept,” or “you could try using this mechanistic interpretability technique to understand the model’s behavior here.”
  • Fully automating research: To me, the most promising use of alignment data is to train an AI that can fully automate alignment research, potentially working much faster and more efficiently than a human and able to be copied as many times as necessary. Maybe an AI with these capabilities would already be highly unsafe, but I think the risk goes down if it’s trained to purely imitate human behavior rather than to pursue a goal. Future alignment research may also make it possible to train AI using methods that incorporate human feedback more safely than basic RLHF.

Of course, applications will vary depending on which form of alignment data we use. I’ll go into that in the next section.

Some data collection methods

How might alignment researchers record their work? I list a few possible sources of alignment data below.[2]

Posts and papers

There are many publicly available documents about alignment written by researchers on platforms such as LessWrong, the Alignment Forum, the EA Forum, arXiv, and other websites. The comments on alignment-related posts could also be useful for teaching an AI to judge and iterate on its own ideas. However, while these documents may provide some useful data for training AI models, they do not offer much insight into the step-by-step process of research. On the plus side, they already exist and would be relatively trivial to collect - in fact, a dataset just like this has already been compiled (see this post).

Version history (code)

Code repositories on platforms like GitHub have already been used to train AI models like OpenAI Codex. While tools like Codex could potentially be useful for alignment research, creating a dataset of code written by alignment researchers is likely less important than creating a dataset of alignment-related documents and communications (discussed in the next section). Most crucial insights are likely to be found in the latter, and code completion models like Codex already exist and are likely to continue improving regardless of our efforts. Moreover, writing "AI alignment code" and writing "AI capabilities code" are probably similar in many ways, so AI coding ability in both areas is likely to improve concurrently.

Version history (documents + other forms of communication)

Many alignment researchers use Google Docs and related products like Slides and Sheets on a daily basis. These products have a built-in version history system that tracks edits on a line-by-line basis, which could potentially be used to generate a dataset for training AI models.[3]

For example, an AI model could be trained to look at the state of a document midway through the editing process and guess what deletions or insertions the author will make next. This approach could encourage the AI to follow a logical series of steps in refining its ideas, rather than simply generating a fully formed document word-by-word like GPT does. We could also include comments from document reviewers in the dataset, which would allow the AI to practice incorporating human feedback into its writing, as a real researcher would.
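
To make this concrete, here is a minimal sketch of how a sequence of document snapshots might be turned into (document state, next edit) training pairs. The snapshot texts and the `EditExample` structure are purely illustrative - this isn’t an existing pipeline, and a real dataset would need far richer context - but it shows the basic shape of the data.

```python
import difflib
from dataclasses import dataclass

@dataclass
class EditExample:
    """One training example: the document so far, plus the next edit made."""
    doc_state: str   # full document text at revision t
    operation: str   # 'insert', 'delete', or 'replace'
    span: tuple      # (start, end) character offsets in doc_state
    new_text: str    # text inserted (empty for pure deletions)

def examples_from_revisions(revisions):
    """Turn an ordered list of document snapshots into (state, next-edit) pairs.

    `revisions` is assumed to be a chronological list of full-text snapshots,
    e.g. exported from a version-history system.
    """
    examples = []
    for before, after in zip(revisions, revisions[1:]):
        matcher = difflib.SequenceMatcher(a=before, b=after)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            examples.append(EditExample(
                doc_state=before,
                operation=tag,
                span=(i1, i2),
                new_text=after[j1:j2],
            ))
    return examples

if __name__ == "__main__":
    snapshots = [
        "Rough idea: imitation may be safer than RL?",
        "Rough idea: imitation may be safer than RL, because it avoids reward hacking.",
        "Claim: imitation learning may be safer than RL, because it avoids reward hacking.",
    ]
    for ex in examples_from_revisions(snapshots):
        print(ex.operation, ex.span, repr(ex.new_text))
```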

There are already many Google Docs written by alignment researchers that could be retroactively included in this dataset, making it possible to take advantage of past work. To prepare for the possibility of using document version histories in a training dataset, researchers could be encouraged to do their work in Google Workspace or another product with version history. It may also be helpful to encourage researchers to write down their rough thoughts at the beginning of the writing process and refine them as they develop the document, so that the AI can better replicate their thinking process. However, if these changes would significantly hinder researchers' productivity, they may not be worth it.

To ensure that this data is available when we need it, it may be necessary to develop a secure and convenient way to back up Google edit histories. For example, someone could create a bot that logs the edit history of any document shared with it by a researcher. It might also be possible to track even smaller changes, such as single-character deletions and insertions, which would be a step closer to full keystroke and screen recording (as discussed later).
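
As a rough illustration, the Google Drive API already exposes per-document revision metadata, so a logging bot might poll something like the sketch below. This assumes the standard google-api-python-client package and OAuth credentials obtained through the usual flow (not shown), and note that Drive revisions are coarse snapshots rather than the fine-grained edit history shown in the Docs UI, so this only approximates what such a bot would need.

```python
# A rough sketch, assuming google-api-python-client and OAuth credentials
# (`creds`) obtained elsewhere; `file_id` is a Google Doc shared with the bot.
from googleapiclient.discovery import build

def list_revisions(creds, file_id):
    """Return metadata for each stored revision of one shared document."""
    drive = build("drive", "v3", credentials=creds)
    response = drive.revisions().list(
        fileId=file_id,
        fields="revisions(id,modifiedTime)",
    ).execute()
    # Each entry gives a revision id and timestamp; a backup bot could
    # periodically export the corresponding revision contents and archive them.
    return response.get("revisions", [])
```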

In addition to Google Docs, researchers could incorporate Slack messages, email, and other forms of written communication into alignment datasets without making any special changes to how they work.

Screen recordings and keystrokes

Of course, besides Google Docs and messaging apps, researchers use a wide range of applications on a daily basis, including web browsers, coding IDEs, and productivity apps such as Asana. Activity on these products isn’t usually recorded. However, if we recorded researchers' screens and/or keystrokes while they worked, we could capture all of this and more, providing an even more detailed and comprehensive view of the research process.[4]

Screen recordings have already been used to train AIs, including OpenAI’s Video PreTraining model, which learned to play Minecraft from YouTube videos. Recording keystrokes alongside this video would make it even easier for an AI to learn how to use a computer to do effective research.
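
As a sketch of how lightweight this could be, the snippet below pairs a keystroke log with periodic screenshots using the third-party pynput and mss packages. The file names and capture interval are arbitrary choices for illustration, and any real deployment would need informed consent, encryption at rest, and an obvious pause switch.

```python
# A minimal sketch of simultaneous keystroke and screen capture, assuming the
# third-party `pynput` and `mss` packages are installed. Not a hardened tool.
import time
import threading
import mss
from pynput import keyboard

LOG_PATH = "keystrokes.log"          # illustrative output paths
SCREENSHOT_EVERY_SECONDS = 5.0

def on_press(key):
    """Append each keypress with a timestamp to a plain-text log."""
    with open(LOG_PATH, "a") as f:
        f.write(f"{time.time():.3f}\t{key}\n")

def capture_screen_forever():
    """Save a numbered screenshot of the primary monitor every few seconds."""
    with mss.mss() as sct:
        frame = 0
        while True:
            sct.shot(mon=1, output=f"screen_{frame:06d}.png")
            frame += 1
            time.sleep(SCREENSHOT_EVERY_SECONDS)

if __name__ == "__main__":
    threading.Thread(target=capture_screen_forever, daemon=True).start()
    with keyboard.Listener(on_press=on_press) as listener:
        listener.join()   # record until the process is stopped
```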

It’s important to note that collecting screen recordings and keystrokes is likely to be more intrusive than simply gathering document version histories, and some researchers may feel uncomfortable or distracted by this.

Meetings

Meetings could be a valuable source of alignment data, as they often have very high information density. During meetings, researchers ask questions, provide feedback to each other, and try to explain their ideas as comprehensively as they can. This data could provide valuable information for training models to work collaboratively with human or other AI researchers.

Meeting recordings might consist of video (especially if a whiteboard is involved) or just audio. It’s easy to record virtual meetings, and setting up a recording device to document in-person meetings probably wouldn’t be difficult either.

Like the collection methods mentioned in the last section, meeting recordings can be intrusive, and the idea of saving private conversations may feel especially creepy. Researchers definitely shouldn’t expect everybody they meet with to take part in this collection process, and if they do ask other meeting participants for consent to be recorded, they should be extremely clear about what the recording will be used for. I’ll go into this issue in more detail in the section on security and privacy concerns.

Running verbal commentary

As I linked earlier, Neel Nanda recently recorded himself working on a toy research problem in a Python notebook, where he trained a toy transformer without positional embeddings to see if it could rederive them. While this video was originally intended to help other human researchers, I expect that recordings like this could be even more valuable for an AI. The previously-mentioned dataset of screen recordings and/or keystrokes could be even more useful if it were accompanied by explanations of what the researcher is doing. This could even help improve interpretability if an AI were trained to generate similar commentary as it works.

This form of data collection would probably add the most overhead of any method I’ve mentioned, and for most researchers it would be best to do only occasionally, if at all. On the other hand, I imagine some researchers would find it helpful to explain their thoughts verbally, sort of like rubber duck debugging. For his part, Neel estimated that narrating his thoughts while doing an hour of his normal work would take him 10-20% longer. Creating a good tutorial (choosing a good toy problem, making a usable tutorial notebook, and editing the video) takes him longer still, but I expect those extra steps would matter less for an ML training dataset.

Security/privacy concerns

One potential objection to this idea is that making extensive recordings of AI alignment research could pose risks to the security and privacy of both researchers and institutions. Starting a program to collect researchers’ data could make it easier for a hacker or internal leaker to do a lot of damage. Especially within the context of AI alignment research, there could be much sensitive information - say, research that improves AI interpretability while also heightening a model’s capabilities - that we should take pains to protect.

These concerns are more relevant for some applications than for others. If we were going to release a model on the Internet that any would-be AI safety researcher could use to get advice on their research, it would be bad if someone could prompt the model into explaining Anthropic’s top-secret research, giving them Paul Christiano’s social security number, or realistically simulating dozens of conversations with Eliezer Yudkowsky to find the best way to convince him of something. On the other hand, if we had a group of fully unsupervised, human-aligned, artificially-intelligent researchers without write access to the Internet who would share their research with a small group of highly-trusted human operators, this may not matter quite as much.

I expect that while some alignment researchers would be willing to contribute to alignment datasets in one form or another, others would find the idea of being constantly recorded unsettling or intrusive. It would be important to avoid creating a culture, either within specific organizations or in the AI alignment community at large, where taking part in these measures is expected or required. If AI alignment researchers are perceived as recording themselves excessively or inappropriately, it could harm the field's reputation and discourage potential researchers from joining. (“Not only do AI safety people believe these weird things about technology, they all have to sign up for constant surveillance by a panopticon? No thanks!”)

If the alignment community were to create a centralized data collection program (an idea I’ll explain in the conclusion), it would be important to build trust with the researchers who do choose to opt in by providing clear guidelines on how the collected data will be stored and protected, minimizing the number of people with access to the data, and giving users control over when and under what conditions they would want to release their data to an organization that wants to use it. (This might mean letting users independently handle the storage of their alignment data.)

Conclusion

While the use of AI to accelerate AI alignment research may still be a ways off, it’s not too early for the alignment community to start thinking about what kinds of datasets would be helpful for this purpose and how we can collect them.

If you are an alignment researcher and this post has convinced you that collecting alignment data is worth doing, you could start collecting your own data right now! For instance, you could make recordings of some of your meetings, start recording your screen as you work, or switch to Google Docs if you don’t already use it. Of course, please keep in mind the privacy and safety concerns raised above.

If the AI alignment community really buys into this idea, the next step up from collecting data on a per-researcher basis would be to coordinate the alignment data collection project under the banner of a single organization. Such an organization could benefit researchers by providing dedicated tools to collect alignment data, setting standards and best practices for data collection, and possibly storing some alignment data in a centralized database. If you’d be interested in participating in a large-scale project like this, either as an individual data-collector or as an organizer, please comment below or DM me so I can gauge interest! While I’m not sure I’m in the best position to take the lead on this, I’ll willingly be the point-person to connect interested parties to one another.

I’d love to hear what others think about these ideas. Do you think collecting data now to train future AIs to accelerate alignment research is a viable strategy? Are there other use cases or collection methods for alignment data that I haven’t mentioned here? Do you have concerns about security or privacy that I haven’t addressed? Please share your thoughts in the comments.


 

  1. ^

    For more on potential use cases for the data, see Results from a survey on tool use and workflows in alignment research. It goes over the pain points experienced by real alignment researchers in great depth, in an effort to figure out which kinds of tools are most useful.

  2. ^

    In this list, I try to err on the side of too much data - for instance, I favor video over audio, and audio over text, even though the text is likely what we’ll end up using. We can post-process a data source into a more digestible format for the ML model we happen to be training, but we can’t get back data once we’ve thrown it away, and we can always buy more storage space if we need to.

  3. ^

    For full disclosure, I currently happen to work at Google (though I don’t work anywhere near the Docs team, and as a low-level engineer I’m not really incentivized to get people to use the product). I haven’t used any inside information from Google to inform this post.

  4. ^

    Eye-tracking data, synced to screen recordings and/or keystrokes, is another potential source of data. Information on which parts of the screen a researcher finds important could conceivably be useful when training an AI, and eye-tracking data may not be very difficult to collect (it can be recorded by an unobtrusive bar at the bottom of a monitor). Still, I relegated eye-tracking to a footnote because the tools are expensive and the data would likely be of minimal use.
