We may not know precisely what form wise AI will take—but it must surely be preferable to folly.

Paper Authors:

Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Highlights

• We examine the why and the how of building wise AI
• Wisdom helps humans to navigate intractable problems through object-level strategies (for managing problems) and metacognitive strategies (for managing object-level strategies)
• Wise AI, through improved metacognition, would be more robust to new environments, explainable to users, cooperative in pursuing shared goals, and safe in avoiding both prosaic and catastrophic failures
• We suggest several approaches to benchmarking wisdom, training wise reasoning strategies, and adapting AI architecture for metacognition

Abstract

Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom.
Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one's thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one's knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety.
By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations.

Notes on this summary

Some quotes have been reformatted (adding paragraph breaks, dot points, references changed to footnotes, ect.).

I've made a few minor edits to the quotes. For example, I remove the word "secondly" when I'm only quoting one element of a list and adjusting the grammar when converting sentences into a list.

Whilst I've tried as hard as possible to represent the views in the original (including making some updates in response to feedback from one of the authors), I can't guarantee perfect accuracy.

By default, assume that quotes (indicated by a grey bar on the left) are from the paper.

Why I Wrote This Summary

Firstly, I thought the framing of metacognition as a key component of wisdom missing from current AI systems was insightful and the resulting analysis fruitful.

Secondly, this paper contains some ideas similar to those I discussed in Some Preliminary Notes on the Promise of a Wisdom Explosion. In particular, the authors talk about a "virtuous cycle" in relation to wisdom in the final paragraphs:

By simultaneously promoting robust, explainable, cooperative, and safe AI, these qualities are likely to amplify one another:
Robustness will facilitate cooperation (by improving confidence from counterparties in its long-term commitments) and safety (by avoiding novel failure modes^[1]).
Explainability will facilitate robustness (by making it easier to human users to intervene in transparent processes) and cooperation (by communicating its reasoning in a way that is checkable by counterparties).
Cooperation will facilitate explainability (by using accurate theory-of-mind about its users) and safety (by collaboratively implementing values shared within dyads, organizations, and societies).
Wise reasoning, therefore, can lead to a virtuous cycle in AI agents, just as it does in humans.

Summary:

What is wisdom?

Examples of Human Wisdom:

Willa’s children are bitterly arguing about money. Willa draws on her life experience to show them why they should instead compromise in the short term and prioritize their sibling relationship in the long term.
"Daphne is a world-class cardiologist. Nonetheless, she consults with a much more junior colleague when she recognizes that the colleague knows more about a patient’s history than she does"
"Ron is a political consultant who formulates possible scenarios to ensure his candidate will win. To help generate scenarios, he not only imagines best case scenarios, but also imagines that his client has lost the election and considers possible reasons that might have contributed to the loss."

What are some theories of human wisdom?

For a more detailed account, see the table on page 5 of the paper.

Five component theories:

• Balance theory: "Deploying knowledge and skills to achieve the common good by"
• Berlin Wisdom Model: "Expertise in important and difficult matters of life"
• MORE Life Experience Model: "Gaining psychological resources via reflection, to cope with life challenges"
• Three-Dimensional Model: "Acquiring and reflecting on life experience to cultivate personality traits"
• Wise Reasoning Model: "Using context-sensitive reasoning to manage important social challenges"

Two consensus models

• Common wisdom model: "A style of social-cognitive processing" involving morality and metacognition
• Integrative Model: "A behavioural repertoire"

These consensus models attempt to find common themes.

Would AI wisdom resemble human wisdom?

Potential differences:

• AIs have different computational constraints. Humans need to "economize scarce cognitive resources" which incentivizes us to use heuristics more.
• Humans exist in a society that allows us to "outsource... cognition to the social environment" such as through division of labor.

Reasons why human and AI wisdom might converge:

• Resource difference might be "more a matter of degree than kind"
• Heuristics are often about handling a lack of information rather than computational constraints
• AI's might "join our (social) milieu"

Definition of wisdom^[2]:

If life were a series of textbook problems, we would not need to be wise. There would be a correct answer, the requisite information for calculating it would be available, and natural selection would have ruthlessly driven humans to find those answers.

We define wisdom functionally as the ability to successfully navigate intractable problems— those that do not lend themselves to analytic techniques due to unlearnable probability distributions or incommensurable values^[3].

What kinds of intractable problems?

Incommensurable: It features ambiguous goals or values that cannot be compared with one another^[4].
Transformative: The outcome of the decision might change one’s preferences, leading to a clash between one’s present and future values
Radically uncertain. We might not be able to exhaustively list the possible outcomes or assign probabilities to them in a principled way^[5].
Chaotic. The data-generating process may have a strong nonlinearity or dependency on initial conditions, making it fundamentally unpredictable^[6]^[7].
Non-stationary. The underlying process may be changing over time, making the probability distribution unlearnable.
Out-of-distribution. The situation is novel, going beyond one’s experience or available data.
Computationally explosive. The optimal response could be calculated with infinite or infeasibly large computational resources, but this is not possible due to resource constraints

This seems like a reasonable definition to use, though I have to admit I find the term "intractable problems" to be a bit strong for the examples they provided. For example, Daphne putting aside her ego to consult a junior colleague doesn't quite match what I'd describe as overcoming an "intractable" problem^[8].

Two types of strategies for managing this:

1) Task-level strategies ("used to manage the problem itself") such as heuristics or narratives.

2) Metacognitive strategies ("used to flexibly manage those task-level strategies")^[9]

They argue that although AI has made lots of progress with task-level strategies, it often neglects metacognitive strategies^[10]. For this reason, their paper focuses on the latter.

Why do the authors believe current AI falls short in metacognition?

The authors provide some specific examples of where they believe AI systems fall short:

1) Struggling to understand their goals (“mission awareness”^[11])
2) Exhibiting overconfidence^[12]
3) Failing to appreciate the limits of their capabilities and context (e.g., stating they can access real-time information or take actions in the physical world^[11])

They label this "metacognitive myopia"^[13].

Why build wise AI?:

First, it is not clear what the alternative is. Compared to halting all progress on AI, building wise AI may introduce added risks alongside added benefits. But compared to the status quo—advancing task-level capabilities at a breakneck pace with little effort to develop wise metacognition—the attempt to make machines intellectually humble, context-sensitive, and adept at balancing viewpoints seems clearly preferable.

Concrete Benefits

	Why the authors argue that wise AI might provide this benefit
Robustness	Reliability over similar inputs: it'd be unwise to choose "excessively inconsistent" strategies: - Comment: I guess? Unclear how strong we should expect that effect to be though. Bias: identifying deficiencies in the data and either gathering more data or correcting for that bias Inflexibility: adjusting its confidence based on the situation
Co-operation^[14]	Slightly edited quotes Resolving conflicts among (object-level) strategies: e.g., when accuracy cues diverge Assessing the appropriateness (of object-level strategies): e.g., whether one can evaluate a chain of argumentation Seeking appropriate inputs: e.g., knowing the capabilities of the other counterparty This last point is particularly important for cooperative AI, which could overestimate the abilities of humans or lack common ground such as a shared emotional system.
Safety	The authors argue that wise reasoning provides an alternative to aligning AI to values^[15]^[16]: e.g., "one object-level strategy may be a bias toward inaction (not executing an action if it risks harm according to one of several possibly conflicting human norms), which in turn requires metacognitive regulation (learning what those conflicting perspectives are and avoiding overconfidence)" However, they argue that this isn't sufficient, as it doesn't address all the social questions of alignment, both in terms of design decisions ("Who should we align AI to? Should we increase the average human well-being, its sum, or care for the whole biosphere? Why assume today's values are the right ones?") and how these AI systems fit into a broader society (specifically how they can be channeled by institutions like governments and markets to allow our values to evolve towards a "shared reflective equilibrium".
Explainability	Metacognition seem to play a role in assisting humans to justifying their decisions. Presumably it should assist with helping AI to explain its decisions as well^[17]?

Comparison to Alignment:

They identify three main conceptual problems for alignment:

Humans don't uniformly prioritize following norms^[18]
Norms vary sharply across cultures
Even if norms were uniform, they may not be morally correct

Given these conceptual problems, alignment may not be a feasible or even desirable engineering goal. The fundamental challenge is how AI agents can live among us—and for this, implementing wise AI reasoning may be a more promising approach. Aligning AI systems to the right metacognitive strategies rather than to the “right” values might be both conceptually cleaner and more practically feasible.

It seems plausible that there might be more agreement on meta-cognitive strategies than values, however, I still expect there to be sufficient disagreement to make this a challenge.

Inaction example

"Task-level strategies may include heuristics such as a bias toward inaction: When in doubt about whether a candidate action could produce harm according to one of several possibly conflicting human norms, by default do not execute the action. Yet wise metacognitive monitoring and control will be crucial for regulating such task-level strategies. In the ‘inaction bias’ strategy, for example, a requirement is to learn what those conflicting perspectives are and to avoid overconfidence"

Possible effects on instrumental convergence

In the final section they suggest that building machines wiser than humans might prevent instrumental convergence^[19] as "empirically, humans with wise metacognition show greater orientation toward the common good". I have to admit skepticism as I believe in the orthogonality thesis and I see no reason to believe it wouldn't apply to wisdom as well. That said, activating latents that improve wisdom might also improve alignment, even if it is far from a complete solution.

Further comments on alignment

"With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals"

Benchmarking Wisdom:

The paper discusses potential for benchmarking AI wisdom. They seem to be in favor of starting with tasks that measure wise reasoning in humans and scoring their reflections based on predefined criteria. It's worth noting that these criteria can be about reasoning processes rather than the outcome they reach.

This could potentially be fruitful, however, I do worry that it might be fairly easy for AI's to learn to Goodhart here - apply metacognition in a way that is fairly shallow, but sufficient to satisfy the human raters.

The author may not be in disagreement: whilst they see benchmarking as a "crucial start" they also assert that " there is no substitute for interaction with the real world". This leads them to suggest a slow rollout to give us time to evaluate whether their decisions really were wise.

One worry I have is that sometimes wisdom involves just knowing what to do without being able to explain it. This might be problematic for attempts to evaluate wisdom by evaluating the wisdom of a person's reasoning^[20].

Challenges with benchmarks

Memorization: Benchmark results can be inflated by memorizing patterns in a way that doesn't generalize outside of the training distribution

Evaluating the process is hard: They claim wisdom depends on the underlying reasoning rather than just success^[21]. Reasoning is harder to evaluate than the correct answer.

Producing a Realistic Context: It may be challenging to produce artificial examples as the AI might have access to much more information in the real world

Building Wise AI:

Proposal A^[22] (two training steps):
Train a model for wise strategy selection directly (e.g., to correctly identify when to be intellectually humble)
Train them to use those strategies correctly (e.g., to carry out intellectual humble behavior).
Proposal B (one training step):
Evaluate whether models are able to plausibly explain their metacognitive strategies in benchmark cases
If this is the case, then simultaneously train strategies and outputs (e.g., training the model to identify the situation as one that calls for intellectual humility and to reason accordingly^[23]).

Final Thoughts

The authors address the point that attempting to build wise AI have negative externalities either due to malicious use or due to the project failing. In response they write:

If the alternative were halting all AI progress, building wise AI would introduce added risks. But compared to the status quo—advancing capabilities at a breakneck pace without wise metacognition—the attempt to make machines intellectually humble, context-adaptable, and adept at balancing viewpoints seems clearly preferable.

One final point of difference I'd like to mention: The authors seem to primarily imagine wise AIs acting directly in the world^[24]. In contrast, my primary interest is in wise AI advisors working in concert with humans.

Why am I focused specifically on wise AI advisors?

I'm personally focused on cybernetic/centaur systems that combine AI advisors with humans because this allows the humans to compensate for the weaknesses of the AI.

This has a few key advantages:

• It provides an additional layer of safety/security.

• It allows us to benefit from such systems earlier than we would be able to otherwise

• If we decide advisors are insufficient and that we want to train autonomously acting wise agents, AI advisors could help us with that.

Whilst the possibility of training wise AI has been previously discussed in the academic literature, I am hopeful that this paper will turn out to be a landmark. Given the credibility of the authors and the quality of the work, it's plausible to me that it will play a key role in causing artificial wisdom blossom into its own sub-field of ML.

I really hope this is the case because I suspect that worlds where this happens are much more likely to result in good outcomes than worlds where this does not happen. To quote the authors:

We may not know precisely what form wise AI will take—but it must surely be preferable to folly.

Read the full, original paper

Highlights of additional content in the full paper

• A table summarising psychological approaches to wisdom
• A list of ideas of how to engineer wiser AI
• A list of outstanding questions

^{^}
Johnson, B. (2022). Metacognition for artificial intelligence system safety: An approach to safe and desired behavior. Safety Science, 151, 105743.
^{^}
For the purposes of this paper... the authors aren't claiming to make a universal definition.
^{^}
See the collapsable section immediately underneath for a larger list.
^{^}
Walasek, L., & Brown, G. D. (2023). Incomparability and incommensurability in choice: No common currency of value? Perspectives on Psychological Science, 17456916231192828.
^{^}
Kay, J., & King, M. (2020). Radical uncertainty: Decision-making beyond the numbers. New York, NY: Norton.
^{^}
They seem to be pointing to Knightian uncertainty
^{^}
Lorenz, E. (1993). The essence of chaos. Seattle, WA: University of Washington Press.
^{^}
Prof. Grossmann (personal correspondance): "You are right that the term "intractable problem" is complicated and our group has debated it for a while (different disciplines favoured different jargon). Our examples were chiefly for highlighting the metacognitive benefits for wise decision-making."
^{^}
This table is copied from the paper.
^{^}
They use examples we discussed earlier to help justify their focus on metacognition. Whilst the Willa example might not initially appear related to metacognition, I suspect that the authors see this as related to "perspective seeking", one of the six metacognitive processes they highlight.
^{^}
Li, Y., Huang, Y., Lin, Y., Wu, S., Wan, Y., & Sun, L. (2024). I think, therefore I am: Awareness in Large Language Models. arXiv preprint arXiv:2401.17882.
^{^}
Cash, T. N., Oppenheimer, D. M., & Christie, S. Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments. Preprint.
^{^}
Scholten, F., Rebholz, T. R., & Hütter, M. (2024). Metacognitive myopia in Large Language Models. arXiv preprint arXiv:2408.05568.
^{^}
An older version of the paper suggested: "Wisdom could enable the design of structures (such as constitutions, markets, and organizations) that enhance cooperation in society".
^{^}
The original paper said: "It can be incredibly challenging to "exhaustively specify goals in advance". Humans handle this by using goal hierarchies and wisdom could assist AI's in navigating this"
^{^}
A previous version of the paper claimed: "Perhaps the greatest risk is currently systems not working well enough. Machine metacognition could be useful for this. In particular, "AIs with appropriately calibrated confidence can target the most likely safety risks; appropriate self-models would help AIs to anticipate potential failures; and continual monitoring of its performance would facilitate recognition of high-risk moments and permit learning from experience."
^{^}
I agree that metacognition seems important for explanability, but my intuition is that wise decisions are often challenging or even impossible to make legible. See Tentatively against making AIs 'wise', which won a runner up prize in the AI Impacts Essay competition on the Automation of Wisdom and Philosophy.
The authors acknowledge the possibility that most attempts at introspection may fail to observe what really produced the decision, as opposed to merely producing an inference/story. Nonetheless, they assert that these inferences are in fact useful.
^{^}
The first sentence of this section reads "First, humans are not even aligned with each other". This is confusing since the second paragraph seems to suggest that their point is more about humans not always following norms, which is what I've summarised their point as.
^{^}
This paper doesn't use the term "instrumental convergence", so this statement involves a slight bit of interpretation on my part.
^{^}
Prof. Grossmann (personal correspondance): "I also don't think most philosophical or contemporary definitions of human wisdom in behavioural sciences would primarily focus on "intuition" - I even have evidence from a wide range of countries where most cultures consider a "wise" decision strategy to chiefly rely on deliberation"
^{^}
This is less significant in my worldview as I see wisdom as often being just about knowing the right answer without knowing why you know.
^{^}
The labels "Proposal A" and "Proposal B" aren't in the paper.
^{^}
For example, Lampinen, A. K., Roy, N., Dasgupta, I., Chan, S. C., Tam, A., Mcclelland, J., ... & Hill, F. (2022, June). Tell me why! explanations support learning relational and causal structure. In International Conference on Machine Learning (pp. 11868-11890).
^{^}
Prof. Grossmann (personal correspondance): "I like the idea of wise advisors. I don't think the argument in our paper is against it -it all depends on how humans will use the technology (and there are several papers on the role of metacognition for discerning when to rely on decision-aids/AI advisors, too)."
^{^}
Eliezer Yudkowsky's view seems to be that this specification pretty much has to be exhaustive, though others are less pessimistic about partial alignment.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

10

Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.

10

Paper Authors:

Highlights

Abstract

Notes on this summary

Why I Wrote This Summary

Summary:

What is wisdom?

Examples of Human Wisdom:

What are some theories of human wisdom?

Would AI wisdom resemble human wisdom?

Definition of wisdom^[2]:

What kinds of intractable problems?

Two types of strategies for managing this:

Why do the authors believe current AI falls short in metacognition?

Why build wise AI?:

Concrete Benefits

Comparison to Alignment:

Inaction example

Possible effects on instrumental convergence

Further comments on alignment

Benchmarking Wisdom:

Challenges with benchmarks

Building Wise AI:

Final Thoughts

Why am I focused specifically on wise AI advisors?

Highlights of additional content in the full paper

10

Summary: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.

10

Paper Authors:

Highlights

Abstract

Notes on this summary

Why I Wrote This Summary

Summary:

What is wisdom?

Examples of Human Wisdom:

What are some theories of human wisdom?

Would AI wisdom resemble human wisdom?

Definition of wisdom[2]:

What kinds of intractable problems?

Two types of strategies for managing this:

Why do the authors believe current AI falls short in metacognition?

Why build wise AI?:

Concrete Benefits

Comparison to Alignment:

Inaction example

Possible effects on instrumental convergence

Further comments on alignment

Benchmarking Wisdom:

Challenges with benchmarks

Building Wise AI:

Final Thoughts

Why am I focused specifically on wise AI advisors?

Highlights of additional content in the full paper

Definition of wisdom^[2]: