[Epistemic status: strong intuitions that I have had for a long time, but not yet compared with other people's, and I am not sure how well I can convey them.]
So, a lot of agent foundations research seems to be built on the premise that agenthood is about maximizing the expected value of some utility function, or about taking actions that bear some relation to some utility function, or something like that. I think that this premise is wrong except in special cases, such as when there is only one agent in the world. I don't know exactly how to explain this, so I will just say some stuff.
The Ultimatum Game is a standard example of a game in which there are many Pareto optimal outcomes, and no clear way to choose between them. So if we imagine two "perfectly rational" agents playing the Ultimatum Game against each other, what happens? I think that this question is meaningless. OK, so what happens in real life? People use their notions of fairness to resolve situations like the Ultimatum Game. Fairness is a part of human values, so essentially the answer is that an agent's values are used to resolve the ambiguity of multiple Pareto optimal outcomes.
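To make "many Pareto optimal outcomes" concrete, here is a minimal toy sketch (my own illustration, with a $10 pot and whole-dollar offers chosen just for the example): an Ultimatum Game where a rejected offer pays both players nothing. Enumerating the outcomes shows that every accepted split is Pareto optimal, so the payoffs alone cannot single out an answer.

```python
# Toy Ultimatum Game over a $10 pot, offers in whole dollars.
# An outcome is a pair (proposer_payout, responder_payout); rejection pays (0, 0).

POT = 10

# Every accepted split, plus the rejection outcome.
outcomes = [(x, POT - x) for x in range(POT + 1)] + [(0, 0)]

def pareto_optimal(outcomes):
    """Keep outcomes that cannot be improved for one player without hurting the other."""
    def dominates(b, a):
        # b is at least as good for both players and strictly better for one.
        return b[0] >= a[0] and b[1] >= a[1] and b != a
    return [a for a in outcomes if not any(dominates(b, a) for b in outcomes)]

print(pareto_optimal(outcomes))
# Prints all eleven accepted splits, (0, 10) through (10, 0): each one is
# Pareto optimal, and the payoff structure gives no way to choose among them.
```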
But wait! In the classical understanding, a utility function is supposed to encode all information about an agent's values. So if a notion of fairness is relevant to a real-life Ultimatum Game played for money, then supposedly the utility function is not simply identical to the monetary payouts, and the game is not a true Ultimatum Game at all. But then what is a true Ultimatum Game? Does such a mythical beast even exist? Eliezer had to invent a fairly far-fetched scenario before he found something that he was willing to call the "true Prisoner's dilemma".
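For reference, here is the standard payoff structure that is supposed to make a Prisoner's Dilemma a dilemma at all (generic textbook numbers of my choosing, not the specific payoffs from Eliezer's scenario): defecting strictly dominates cooperating for each player, yet mutual defection leaves both players worse off than mutual cooperation.

```python
# Canonical Prisoner's Dilemma payoffs (higher is better), indexed by
# (my_action, their_action) -> (my_payoff, their_payoff).
# Generic textbook numbers, chosen only to satisfy T > R > P > S.

C, D = "cooperate", "defect"
payoff = {
    (C, C): (3, 3),  # reward for mutual cooperation (R)
    (C, D): (0, 5),  # sucker's payoff vs. temptation (S, T)
    (D, C): (5, 0),  # temptation vs. sucker's payoff (T, S)
    (D, D): (1, 1),  # punishment for mutual defection (P)
}

# Defection strictly dominates: whatever the other player does, I do better by defecting.
for their_action in (C, D):
    assert payoff[(D, their_action)][0] > payoff[(C, their_action)][0]

# Yet mutual defection is worse for both players than mutual cooperation.
assert payoff[(C, C)][0] > payoff[(D, D)][0]
assert payoff[(C, C)][1] > payoff[(D, D)][1]
print("defect dominates cooperate, but (C, C) Pareto-dominates (D, D)")
```

If those payoffs really exhausted each player's values, the dominance argument would seem to settle the matter at "defect".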
But even in the "true Prisoner's dilemma", the utility function does not appear to capture all of Eliezer's values -- he still seems motivated to say that "cooperate" is the right answer, based on symmetry and hope, which again look like human values. So I propose that in fact there is no "true Prisoner's dilemma", simply because calling something a dilemma is asking you to resolve it with your own values, while your own values are supposedly already encapsulated by the very utility function the dilemma is posed over.
I propose instead that agenthood is a kind of robustness to these sorts of games: a continual supply of values sufficient to escape from any logical argument purporting to prove that there is no determinate answer as to what action you should take. There is no "pure essence of rational agenthood" that we can simply graft onto the "utility function" of human values to make an aligned AI, because human values are not only a function on world-states to be learned but also part of the decision-making process itself. This seems to suggest radically different approaches to the alignment problem than those pursued so far, but I am not sure what they are yet.