Hi Charlie, Thanks for your thoughtful feedback and comments! If we may, we think we actually agree more than we disagree. By “definitionally accurate”, we don’t necessarily mean that a group of randomly selected humans is better than AI at explicitly defining or articulating human values, or better at translating those values into actions in any given situation. We might call that “empirical accuracy” – that is, under certain empirical conditions (time pressure, the expertise and background of the empirical sample, the incentive structure of the empirical task, the dependent measure, etc.), humans can be inaccurate about their underlying values and the implications of those values for real-world decisions. Rather, by “definitional accuracy” we mean that for something to be a human value, it must actually be held by humans, and for an AI action or decision to be aligned with human values, it must be deemed desirable by humans. That is, if no human agrees with or endorses an AI action as appropriate – even under best-case empirical conditions – then it is definitionally not in line with human values. Human input will thus be needed in some capacity to verify or confirm alignment. (On reflection, we can see how the term “accuracy” in this context may be misleading. We could instead have stated: “humans are definitionally the source of information about human values, even if it may be challenging to elicit this information from humans.”)

Now, this human input could (and probably should) go beyond introspection and preference judgments, which, as you point out, can be unreliable. It could instead involve expert judgment from humans with diverse cultural backgrounds, deliberation and/or negotiation, and incentives that encourage deep, reflective thinking rather than snap judgments or reliance on heuristics. It could also involve AI assistance to help counter human biases, find common ground, and consider the logical consequences of communicated values.

Let’s think about this a bit more concretely. Imagine humans are defining a constitution for an AI to follow. Rather than having humans sit down and generate the constitution from scratch, perhaps it would be better for carefully selected and trained humans to answer and deliberate about a series of questions, which the AI then translates into a proposed constitution, along with illustrations of how that constitution would play out in different scenarios. The humans could then review this proposal. Disagreement with the proposal could surface values that the AI did not pick up on from the initial human input, or trade-offs between values or exceptions that the humans failed to communicate. The humans could provide critiques that the AI would use to revise the constitution. The AI could also push back on critiques that seem logically inconsistent with the initial input or with other critiques, prompting the humans to re-examine their reasoning. Contrast this process with one in which the AI comes up with its own constitution without input from the humans whose values it is intended to represent. We contend that even as AI gets really smart, humans ultimately need to be in the loop to determine whether or not a constitution is aligned and reasonable.
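To make the shape of this loop concrete, here is a minimal, purely illustrative sketch in Python. Every name in it (elicit_answers, draft_constitution, human_review, revise) is a hypothetical stand-in rather than anything we have built; the point is only the control flow, in which endorsement can come only from the humans.

```python
# Illustrative sketch of the deliberation loop described above.
# All functions are hypothetical placeholders, not a real system or API.

def elicit_answers(questions):
    """Stand-in for trained, diverse humans answering and deliberating
    about a series of questions."""
    return [f"deliberated answer to: {q}" for q in questions]

def draft_constitution(answers):
    """Stand-in for the AI translating human input into a proposed
    constitution, plus illustrations of how it would play out in
    concrete scenarios."""
    return "proposed constitution", ["scenario 1 behavior", "scenario 2 behavior"]

def human_review(constitution, scenarios):
    """Stand-in for human critique. Disagreement surfaces values the AI
    missed, or trade-offs and exceptions the humans failed to communicate.
    Returns an empty list once the humans endorse the proposal."""
    return []  # toy stub: endorse immediately

def revise(constitution, critiques):
    """Stand-in for the AI revising the draft, or pushing back on
    critiques that are logically inconsistent with earlier input."""
    return constitution

def align_constitution(questions, max_rounds=5):
    answers = elicit_answers(questions)
    constitution, scenarios = draft_constitution(answers)
    for _ in range(max_rounds):
        critiques = human_review(constitution, scenarios)
        if not critiques:          # endorsement (the "definitional" check)
            return constitution    # can only be issued by the humans
        constitution = revise(constitution, critiques)
    return None  # no endorsement within budget: not verified as aligned

print(align_constitution(["What should the AI never do?"]))
```

Note that in this sketch the AI never certifies its own output: the loop terminates successfully only on human endorsement, which is the "definitional" point above.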

~ Sophie Bridgers (on behalf of the authors)