Introduction to Towards Causal Foundations of Safe AGI

tom4everitt; Lewis Hammond; Francis Rhys Ward; RyanCarey; James Fox; mattmacdermott; sbenthall

By Tom Everitt, Lewis Hammond, Rhys Ward, Ryan Carey, James Fox, Sebastian Benthall, Matt MacDermott and Shreshth Malik representing the Causal Incentives Working Group. Thanks also to Toby Shevlane, MH Tessler, Aliya Ahmad, Zac Kenton, Maria Loks-Thompson, and Alexis Bellot.

Over the next few years, society, organisations, and individuals will face a number of fundamental questions stemming from the rise of advanced AI systems:

How do we make sure that advanced AI systems do what we want them to do (the alignment problem)?
What makes a system safe enough to develop and deploy, and what constitutes sufficient evidence of that?
How do we preserve our autonomy and control as decision making is increasingly delegated to digital assistants?

A causal perspective on agency provides conceptual tools for navigating the above questions, as we’ll explain in this sequence of blog posts. We will try to minimise and explain jargon, so that the sequence is accessible to researchers from a range of backgrounds.

Agency

First, by agent we mean a goal-directed system that acts as if it is trying to steer the world in some particular direction(s). Examples include animals, humans, and organisations (more on agents in a subsequent post). Understanding agents is key to the above questions. Artificial agents are widely considered the primary existential threat from AGI-level technology, whether they emerge spontaneously or through deliberate design. Among the myriad risks to our existence, highly capable agents pose a distinct danger, because many goals can be achieved more effectively by accumulating influence over the world. Whereas an asteroid moving towards Earth isn’t intending to harm humans and won’t resist redirection, misaligned agents might be distinctly adversarial and active threats.

Second, the preservation of human agency is critical in the approaching technological transition, for both individuals and collectives. Concerns have already been raised that manipulative social media algorithms and content recommenders undermine users’ ability to focus on their long-term goals. More powerful assistants could exacerbate this. And as more decision-making is delegated to AI systems, the ability of society to set its own trajectory comes into question.

Human agency can also be nurtured and protected. Helping people to help themselves is less paternalistic than directly fulfilling their desires, and fostering empowerment may be less contingent on complete alignment than direct satisfaction of individual preferences. Indeed, self-determination theory provides evidence that humans intrinsically value agency, and some human rights can be interpreted as “protections of our normative agency”.

Third, artificial agents might themselves eventually constitute moral patients. A clearer understanding of agency could help us refine our moral intuitions and avoid unethical actions. Some ethical dilemmas might be possible to avoid altogether by only designing artificial systems that lack moral patienthood.

Key questions

One hope for our research is that it would build up a theory of agency. Such a theory would ideally answer questions such as:

What are the possible kinds of agents that can be created, and along what dimensions can they differ? The agents we’ve seen so far primarily include animals, humans, and human organisations, but the range of possible goal-directed systems is likely much larger than that.
Emergence: how are agents created? For example, when might a large language model become agentic? When does a system of agents become a “meta-agent”, such as an organisation?
Disempowerment: how is agency lost? How do we preserve and nurture human agency?
What are the ethical demands posed by various types of systems and agents?
How to recognise agents and measure agency? A concrete operationalization would help us to detect agency in artificial systems, and agency loss in humans.
How to predict agent behaviour? What behaviour is incentivised and how do agents generalise to new situations? If we understand the impact of the behaviour, we may also be able to anticipate danger.
What are the possible relationships between agents? Which are harmful and which are beneficial?
How do we shape agents, to make them safe, fair, and beneficial?

Causality

Causality is helpful for understanding agents. Philosophers have been interested in causality for a long time, not just because the exact relationship between a cause and an effect is intellectually intriguing, but because it underpins so many other concepts, many of which are relevant to understanding agents and designing safe AGI.

For example, both influence and response are causal concepts. We want agents that influence the world in positive ways, and respond appropriately to instructions. A range of other relevant concepts also build on causality:

Agency, since a goal-directed system is one where the system’s goals are driving (causing) its behaviour.
Intention, which relates to the reasons for (causes of) action, and to goal-directed means-end influence. Intent is an important concept when ascribing legal and moral responsibility.
Harm, manipulation, and deception, which consider how your well-being, actions, or beliefs were influenced, and which are usually taken to be intentional.
Fairness, which concerns how e.g. a hiring decision responds to personal attributes such as gender or race, and how those attributes influence the outcome.
Robust generalisation to changes in the environment is much easier for agents with a causal model of their environment.
Hypotheticals / counterfactuals, as the alternative world deviates from the actual world based on one or more causal interventions.
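
To make "response" and "counterfactual" slightly more concrete, here is a minimal Python sketch of a toy hiring decision. Everything in it is an assumption made for illustration: the function names, the `use_attribute` flag, and the scoring rule are hypothetical and are not taken from any of the papers mentioned in this sequence. The idea is simply that asking whether a decision responds to an attribute amounts to intervening on that attribute while holding everything else fixed, and checking whether the outcome changes.

```python
# Toy illustration only: a hypothetical hiring rule, not a model from the literature.

def decision(attribute: int, skill: int, use_attribute: bool) -> int:
    """Return 1 (hire) or 0 (reject) under a made-up scoring rule."""
    # Hypothetical rule: the attribute adds a bonus only if the rule uses it.
    bonus = 1 if (use_attribute and attribute == 1) else 0
    return 1 if skill + bonus >= 2 else 0


def responds_to_attribute(skill: int, use_attribute: bool) -> bool:
    """Intervene on the attribute while holding skill fixed, and
    report whether the decision changes (a counterfactual contrast)."""
    return decision(0, skill, use_attribute) != decision(1, skill, use_attribute)


if __name__ == "__main__":
    for skill in (0, 1, 2):
        print(skill, responds_to_attribute(skill, True), responds_to_attribute(skill, False))
```

Under this made-up rule, the decision responds to the attribute only for a borderline candidate (`skill == 1`) and only when the rule actually uses the attribute. Real fairness analyses are far more subtle, since attributes can also influence outcomes indirectly through other variables.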

The tree of causality

The rest of this sequence will explain in more detail how these concepts are grounded in causality, and the research this has led to. We hope this will enable and motivate other researchers to join our effort of building a formal theory of safe A(G)I based on causal foundations. Much of our recent work fits into this vision. For example, in discovering agents and reasoning about causality in games, we developed a better understanding of how to represent various aspects of reality with causal models. With the agent incentives paper, we showed how such models can be analysed to reveal safety-relevant properties. And with path-specific objectives, we illustrated how this kind of analysis can inspire improved designs.

We hope this will complement other research directions crucial to safe AGI, like scalable alignment, dangerous capability evaluations, robustness, interpretability, ethics, policy and governance, forecasting, agent foundations, and risk mapping.

Conclusions

We hope that a causality-based understanding of agency and related aspects will help designers of AI systems by clarifying the space of possibilities for agents, and how to avoid especially risky configurations. It may give regulators a better picture of what to look out for, and what should count as sufficient evidence of safety. It may help us all decide what behaviour is acceptable towards what kinds of systems. And last but not least, it may help individuals understand what it is that they seek to preserve and enhance in their interactions with artificially intelligent systems.

In the next post, we explain causality and causal models in more detail, covering Pearl’s different causal models, and how they can be generalised to account for the presence of one or more agents.

AI ALIGNMENT FORUM