Causal Incentives Working Group


We are a group of researchers interested in using causal models to understand agent incentives, in order to design safe and fair AI algorithms.

If you are interested in collaborating on any related problems, feel free to reach out to us.



Agent Incentives: A Causal Perspective (AI:ACP): presents sound and complete graphical criteria for four incentive concepts: value of information, value of control, response incentives, and control incentives.
T. Everitt*, R. Carey*, E. Langlois*, P.A. Ortega, S. Legg.

How RL Agents Behave When Their Actions Are Modified: studies how user interventions affect the learning of different RL algorithms.
E. Langlois, T. Everitt.

Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice (video): introduces a notion of subgames in multi-agent (causal) influence diagrams, alongside classic equilibrium refinements. The paper also reports on pycid.
L. Hammond, J. Fox, T. Everitt, A. Abate, M. Wooldridge.

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective (summary): analyzes various reward tampering (aka “wireheading”) problems with causal influence diagrams.
T. Everitt, M. Hutter, R. Kumar, V. Krakovna.
Synthese, 2021

Modeling AGI safety frameworks with causal influence diagrams
T. Everitt, R. Kumar, V. Krakovna, S. Legg.
IJCAI AI Safety Workshop, 2019

The Incentives that Shape Behavior (summary): superseded by AI:ACP.
R. Carey*, E. Langlois*, T. Everitt, S. Legg.

Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings (summary): superseded by AI:ACP.
T. Everitt, P.A. Ortega, E. Barnes, S. Legg.

(* denotes equal contribution)


pycid: A Python implementation of causal influence diagrams, built on pgmpy.
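To illustrate what a causal influence diagram encodes, here is a minimal plain-Python sketch (this is an illustrative data structure, not pycid's actual API): a directed acyclic graph whose nodes are typed as chance, decision, or utility variables. The class and method names (`InfluenceDiagram`, `observations`, `influences_utility`) are hypothetical.

```python
# Illustrative sketch of a causal influence diagram as a typed DAG.
# Not pycid's API; class and method names are made up for this example.
from collections import defaultdict


class InfluenceDiagram:
    """A DAG whose nodes are chance, decision, or utility variables."""

    def __init__(self, edges, decisions, utilities):
        self.decisions = set(decisions)
        self.utilities = set(utilities)
        self.parents = defaultdict(set)
        self.children = defaultdict(set)
        for u, v in edges:
            self.parents[v].add(u)
            self.children[u].add(v)

    def observations(self, decision):
        """Parents of a decision node: what the agent sees when choosing."""
        return set(self.parents[decision])

    def influences_utility(self, node):
        """True if some utility node is reachable from `node` along directed edges."""
        stack, seen = [node], {node}
        while stack:
            n = stack.pop()
            for c in self.children[n]:
                if c in self.utilities:
                    return True
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return False


# A one-decision diagram: chance node S is observed by decision D,
# and both S and D feed into the utility U.
cid = InfluenceDiagram(
    edges=[("S", "D"), ("S", "U"), ("D", "U")],
    decisions=["D"],
    utilities=["U"],
)
print(cid.observations("D"))        # {'S'}: the decision observes S
print(cid.influences_utility("D"))  # True: D has a directed path to U
```

In pycid itself, diagrams are built on top of pgmpy's Bayesian network machinery, so they additionally carry conditional probability distributions and support the incentive analyses described in the papers above.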

CID LaTeX package: A package for drawing professional-looking influence diagrams; see the tutorial.

Working group members