Safe Predictive Agents with Joint Scoring Rules

I'm interested in figuring out what a realistic training regime would look like that leverages this. Some thoughts:

Maybe this lends itself nicely to market-making? It's pretty natural to imagine lots of traders competing with each other to predict what the market will believe at the end and rewarding the traders based on their relative performance rather than their absolute performance (in fact that's pretty much how real markets work!). I'd be really interested in seeing a concrete fleshed-out proposal there.
Is there some way to incorporate these ideas into pre-training? The thing that's weird there is that the model in fact has no ability to control anything during the pre-training process itself—it's just a question of whether the model learns to think of its objective as one which involves generalizing to predicting futures/counterfactuals that could then be influenced by its own actions. So the problem there is that the behavior we're worried about doesn't arise from a direct incentive during training, so it's not clear that this is that helpful in that case, though maybe I'm missing something.

I think the tie-in to market-making, and other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.

I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the fewest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of existing clusters.

I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then, there's no need to ancitipate/reject a ton of counterarguments based on potential lies, rather arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero sum prediction for that result.

As an aside, I also did some early work on decision markets, distinct from your post on market-making, since the Othman and Sandholm had an impossibility result for those too. However, but the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But the question then arises of why use a decision market in the first place instead of just querying experts?

With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pretraining, then as long as it's myopic you might be able to train the performativity out of it in safely controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.

To incorporate into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training process. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. A danger arises that the model learns only to not predict performatively in those scenarios (same with untraining afterwards only applying to the controlled environments), though I think that's a somewhat unnatural generalization.

[-]Jeremy Gillen1y20

To me it seems like one important application of this work is to understanding and fixing the futachy hack in FixDT and in Logical Inductor decision theory. But I'm not sure whether your results can transfer to these settings, because of the requirement that the agents have the same beliefs.

Is there a reason we can't make duplicate traders in LI and have their trades be zero-sum?

I'm generally confused about this. Do you have thoughts?

[-]Rubi J. Hudson1y20

Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate, but I don't see how those changes would break anything else.

[-]Rubi J. Hudson1y20

I'll take a look at the linked posts and let you know my thoughts soon!

[-]harsimony1y10

This is super cool stuff, thank you for posting!

I may have missed this, but do these scoring rules prevent agents from trying to make the environment more un-predictable? In other words, if you're competing against other predictors, it may make sense to influence the world to be more random and harder to understand.

I think this prediction market type issue has been discussed elsewhere but I can't find a name for it.

[-]Rubi J. Hudson1y10

Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.

[-]harsimony1y00

Oh that makes sense!

If the predictors can influence the world in addition to making a prediction, they would also have an incentive to change the world in ways that make their predictions more accurate than their opponents right? For example, if everyone else thinks Bob is going to win the presidency, one of the predictors can bribe Bob to drop out and then bet on Alice winning the presidency.

Is there work on this? To be fair, it seems like every AI safety proposal has to deal with something like this.

[-]Rubi J. Hudson1y10

Yes, if predictors can influence the world in addition to making a prediction, they can go make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.

AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

24

Safe Predictive Agents with Joint Scoring Rules

24

Predictive Agents and Performative Prediction

Model, Definitions, and Related Literature

Theoretical Results

Different Beliefs

Efficient Search

Stochastic Choice

Unconditional Predictions

Empirical Results

Future Work

Appendix

Main Theorem

Stochastic Choice Conditions