Frontpage

6

4 min read

6

[Epistemic status: the idea is simple and intuitive, so I wouldn't be surprised if it has appeared before. If so, I would appreciate pointers.]

Eliezer defines optimization power as the ability to produce solutions higher on the preference ordering. This feels right, but it would be nice to convert this intuition into a score that can be measured, a universal metric that allows one to meaningfully state that AlphaZero is better at Go than Deep Blue was at chess. In this post I try accomplishing just that. I start by giving a candidate metric and some of its properties. I then discuss how it can be approximately computed in complex settings, and describe a somewhat paradoxical implication of adopting this definition for repeated games. 

For simplicity I will assume a deterministic, finite-action-space environment (extending everything to MDPs with uncountable actions doesn't seem to be a fundamental obstacle). The action space is , the outcome space is , utility of an agent under question is . Because is deterministic I will just write for brevity. Note that we can represent sequential decision problems in this framework (e.g. Sudoku), elements of A would then be vectors of individual actions. We write n(S) for the size of set S.

Definition. Suppose we observe an agent with utility function u(a) take an action , then we say this agent has a C-score of:

Intuitively, C-score is inversely proportional to the fraction of outcomes that are as good for the agent as the one it managed to obtain.  One interpretation of this formula is that the agent is competing against a bot that just picks a random action (random-bot), and then is log probability of losing in this competition. 

Some properties of C(u,a):

  • C-score is independent of computational power employed
  • It's impossible to compute C-score without knowing the utility function
  • Improving from being in top 1% outcomes to top 0.1% is as "hard" as improving from 10% to 1%  
  • One can use Monte-Carlo to generate lower bounds of the form "w.h.p. C-score of this agent is at least Y" 
  • Strategy of taking a random action produces the same score in any setting (under some assumptions it might also be approximately true for the strategy of trying k random moves and picking the best one)

Approximating C-score

Can we estimate C-score for complex environments and agents? For example, suppose we have an agent playing chess against a known distribution of opponents (or, more simply, against random-bot), and utility function is the probability of winning the game (action space then is a set of policies, i.e. mappings from state to action). Can we measure C-score of this agent without using an unrealistic amount of compute?

Clearly, the naive Monte-Carlo approach I mentioned above is a no-go in this scenario: the probability that a randomly sampled action would be in the set is tiny (if the agent is any good), and we will never sample this set.

I have a couple of potential ideas here:

  • In case of having access to a better optimizer than the one being measured,  one can use rare event sampling, that is bias the sampling procedure towards good policies and then account for this bias when computing the C-score. This approach nicely fits with the intuition "you need to be at least about as smart as the agent to measure how smart it is"
  • If the game is DP/MDP and the agent employs value function estimation for picking moves, one can try looking at how much the agent improves when enhanced with MCTS, and then try inferring C-score from it. I don't have good explanations for why and how this can work, only an intuition that this is a meaningful approach to try.

Repeated game paradox

I'll use a very simple game to illustrate this issue. An agent picks a number between 1 and 10 and utility of the agent equals to the number chosen. This game is repeated T times, so agent's total utility is the sum of numbers it returned in all T games. 

If T=1 and the agent picks 8, we have (there are 3/10 actions that yield utility of at least 8)

If T=2 and the agent picks 8 twice, we get

,

Thus an agent that finds a good solution once at t=1, and then repeats the same action until the end, obtains higher and higher scores for bigger T. This feels like an artifact of the definition (coming from how volumes grow with dimensions), rather than the agent being a genuinely better optimizer. Is there a score formula that has similar properties but doesn't suffer from this paradox?

Thanks to Linda, Joar and Chris for helpful discussions that led to this post.

New Comment
5 comments, sorted by Click to highlight new comments since:

Some undesirable properties of C-score:

It depends on how the space of actions are represented. If a set of very similar actions that achieve the same utility for the agent are merged into one action, this will change the agent's C-score.

It does not depend on the magnitudes of the agent's preferences, only on their orderings. Compare 2 agents: the first has 3 available actions, which would give it utilities 0, .9, and 1, respectively, and it picks the action that would give it utility .9. The second has 3 available actions, which would give it utilities 0, .1, and 1, respectively, and it picks the action that would give it utility .1. Intuitively, the first agent is a more successful optimizer, but both agents have the same C-score.

I agree with the first point, and I don't have solid solutions to this. There's also the fact that some games are easier to optimize than others (name a number game I described at the end vs. chess), and this complexity is impossible to capture while staying computation-agnostic. Maybe one can use the length of the shortest proof that taking action a leads to utility u(a) to account for these issues..

The second point is more controversial, my intuition is that first agent is an equally good optimizer, even if it did better in terms of payoffs. Also, at least in the setting of deterministic games, utility functions are arbitrary up to encoding the same preference orderings (once randomness is introduced this stops being true)

I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.

"Note that we can represent sequential decision problems in this framework (e.g. Sudoku), elements of A would then be vectors of individual actions."

Unless the environment is deterministic, you want to consider policies rather than vectors of actions. On a related note, instead of considering a uniform distribution over actions, we might consider a uniform distribution over programs for a prefix-free universal Turing machine. This solves your repeated game paradox in the sense that, the program that always picks 9 will have some finite probability and will do better than your agent for any T, so your agent's score will be bounded.

The "rare event sampling" link is broken.