Why Is No One Trying To Align Profit Incentives With Alignment Research?

Prometheus

A whole lot of Alignment work seems to be resource-constrained. Many funders have talked about how they were only able to give grants to a small percentage of projects and work they found promising. Many researchers also receive a small fraction of what they could make in the for-profit sector (Netflix recently offered $900k for an ML position). The pipeline of recruiting talent, training, and hiring could be greatly accelerated if it wasn’t contingent on continuing to receive nonprofit donations.

Possible Ideas

AI Auditing Companies

We’ve already seen a bit of this with ARC’s eval of GPT4, but why isn’t there more of this? Many companies will/are training their own models, or else using existing models in a way beyond what they were intended. Even starting with non-cutting-edge models could provide insight and train people to have the proper Security Mindset and understanding to audit larger ones. Furthermore, there has been a push to regulate and make this a required practice. The possibility of this regulation being made into law will likely be contingent on the infrastructure for it already existing. And it makes sense to take action toward this now, if we want those auditing teams to be as useful as possible, and not merely satisfy a governmental requirement. Existential concerns would also be taken more seriously by a company that has already built a reputation for auditing models.

Evals reporting

Companies don’t want to see their models doing things that weren’t intended (example, giving people credit card information, as was just recently demonstrated). And as time goes on, companies will want some way of showcasing their models have been rigorously tested. Audit reports covering a large, diverse set of vulnerabilities is something many will probably want.

Red teaming

Jailbreaking has been a common practice, done by a wide number of people after a model is released. Like an Evals Report, many will want a separate entity that can red team their models, the same way many tech companies hire an external cybersecurity company to provide a similar service.

Alignment as a service

This could bring in new talent and incentives toward building better understanding and talent to handle alignment. These services would be smaller scale, and would not tackle some of the “core problems” of alignment, but might provide pieces to the puzzle. Solving alignment may not be one big problem, but actually a thousand smaller problems. This gives market feedback, where the better approaches succeed more often than the worse approaches. Over time, this might steer us in a direction of actually coming up with solutions that can be scaled.

Offer procedures to better align models

Many companies will likely not know how to get their models to do the things they want them to, and they will want assistance to do it. This could start by assisting companies with basic RLHF, but might evolve to developing better methods. The better methods would be adopted by competing Alignment providers, who would also search for even better methods to provide.

Caveat: might accelerate surface-level alignment, but just further a false sense of security.

Alignment as a Product

This isn’t the ideal approach, but one still worth considering. Develop new proprietary strategies for aligning models, but don’t release them to the public. Instead, show the results of what these new strategies can do to companies, and sell them the strategy as a product. This might involve NDAs, which is why it is not an ideal approach. But an alignment strategy existing under an NDA is better than no strategy at all.

Mech Interp as a Service

This is perhaps not yet in reach, but might be in time. Many will want to better understand how their models are working. A team of mechanistic interpretability researchers could be given access to the model, and dive into gaining a better understanding of its architecture and what it’s actually doing, providing a full report of their findings as a service. This might also steer Mech Interp toward methods that have actual predictive value.

Caveat: I’m not too confident about Mech Interp being useful for safety, with the downside that it might be useful for capabilities.

Governance Consultation as a Service

Many politicians and policy makers are currently overwhelmed with a problem they have little technical understanding of. A consultation service would provide them with the expertise and security understanding to offer policy advice that would actually be useful. The current situation seems to be taking experts who are already severely time-constrained, and getting their advice for free. I think many would pay for this service, since there are demands for legislation, and they don’t have the understanding to do it on their own.

Alignment Training as a Service

Offering to train workers currently at AI companies to understand security concerns, alignment strategies, and other problems might be desired by many companies. An independent company could train workers to better understand concepts that many are probably not used to dealing with.

Future Endowment Fund

This is the one that’s the furthest away from normal ideas, but I’d love it if more people tried to hack a solution to this. The biggest issue is that the value from alignment research has a time delay. This solution could be something like a Promise of Future Equity contract. Those that do research would receive a promised future share in the Fund, as would investors. Companies that use anything that was funded by the Endowment would sign something like a Promise of Future Returns, delegating a share of the returns of any model that used the strategy to the fund. This way, people who were also working on alignment strategies that only had a 5% chance of working would still get reimbursement for their work. Those working on strategies with a calculated higher chance of working would get a greater share. The Trustees would be members of the community who are highly credible, and who have deep levels of insight about AI.

If you are interested in making progress on any of these endeavors, feel free to message me. I’ve worked in Cybersecurity, so I have a good understanding of how the auditing pipeline normally works at such companies.

If you have any disagreements with some of these approaches (which I’m sure some do), feel free to tell me why I’m wrong in the comments.

AI ALIGNMENT FORUM
AF

10