A Problem in Embedded Agency
Logical Counterfactuals
How can a deterministic AI reason about "what if I did otherwise" when its output is logically determined by its code?
Consider a simple thought experiment: You're an AI, and you know your own source code. Given the input you've received, you can trace through your logic and determine that you will output action A.
But to make good decisions, you need to compare alternatives. You need to ask: "What would happen if I output action B instead?"
Here's the problem:
Your output is determined by your code. Asking "what if I output B?" is like asking "what if 2+2=5?" - it's a logical impossibility.
This is the problem of logical counterfactuals. Standard counterfactuals ("if I had struck the match, it would have lit") involve changing physical circumstances. Logical counterfactuals require imagining that mathematical or logical facts are different.
“The map is not the territory. But for an AI reasoning about itself, the map is part of the territory.”
For AI alignment, this is crucial. We want AI systems that reason well about consequences - but they must reason about consequences of actions they might not actually take. How do you evaluate counterfactual actions when you're a deterministic function?
The Determinism Problem
Watch an AI agent compute its decision. Each step follows deterministically from the previous one. The output is fixed the moment you specify the input and code.
Decision Agent v1.0: Newcomb's Problem Solver (example trace)
1. Input world state: W = {box_A: $1000, box_B: $1M}
2. Read predictor accuracy: P(correct) = 0.99
3. Evaluate "take both": EU(both) = 0.01 * $1M + $1000 = $11,000
4. Evaluate "take one": EU(one) = 0.99 * $1M = $990,000
5. Compare expected values: $11,000 < $990,000
6. Output decision: ACTION = "Take One Box"
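The trace above can be reproduced as a short script. This is a minimal sketch: the function names (`eu_two_box`, `eu_one_box`, `decide`) are illustrative, not from any standard library, and the payoffs follow the standard Newcomb setup.

```python
# Toy Newcomb expected-utility computation, mirroring the agent trace above.
P_CORRECT = 0.99          # predictor accuracy
BOX_A = 1_000             # transparent box, always filled
BOX_B = 1_000_000         # opaque box, filled iff one-boxing was predicted

def eu_two_box(p=P_CORRECT):
    # Predictor is wrong with probability 1 - p, so Box B is filled anyway.
    return (1 - p) * BOX_B + BOX_A

def eu_one_box(p=P_CORRECT):
    # Predictor is right with probability p, so Box B is filled.
    return p * BOX_B

def decide():
    return "Take One Box" if eu_one_box() > eu_two_box() else "Take Both"

print(round(eu_two_box()))   # 11000
print(round(eu_one_box()))   # 990000
print(decide())              # Take One Box
```

The point of writing it out: every step is deterministic, so the agent evaluating "Take Both" is evaluating an action its own code guarantees it will never take.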
What Kind of Counterfactual?
When we say "what if the agent did B instead of A?", what do we actually mean? There are multiple interpretations: some are coherent, others lead to paradox, and understanding the distinction is key.
Newcomb's Problem: A Test Case
Newcomb's problem is the canonical example where logical counterfactuals matter. A highly accurate predictor has already made its prediction - so your choice "can't" change what's in the box. Or can it?
A predictor with 99% accuracy has already predicted your choice. Box A (transparent) always contains $1,000. Box B (opaque) contains $1,000,000 if and only if the predictor predicted you would take only Box B.
THE PARADOX:
CDT says: Box B's contents are already fixed. Taking both strictly dominates taking one.
EDT/FDT say: One-boxing correlates with/causes Box B containing $1M. One-box to get $1M.
The disagreement is about what "could have happened differently" means.
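That disagreement can be made concrete in a few lines. In this sketch (function names are illustrative), the CDT-style evaluation holds the box contents fixed while comparing actions, and the EDT-style evaluation conditions on the action via the predictor's accuracy:

```python
# Toy comparison of CDT- and EDT-style evaluation in Newcomb's problem.
P = 0.99                       # predictor accuracy
BOX_A, BOX_B = 1_000, 1_000_000

def cdt_values(p_b_filled):
    # CDT: Box B's contents are fixed; compare actions holding them fixed.
    one_box = p_b_filled * BOX_B
    two_box = p_b_filled * BOX_B + BOX_A
    return one_box, two_box

def edt_values():
    # EDT: condition on the action; the predictor tracks what you do.
    one_box = P * BOX_B                    # E[payout | one-box]
    two_box = (1 - P) * BOX_B + BOX_A      # E[payout | two-box]
    return one_box, two_box

# Under CDT, two-boxing dominates for EVERY belief about Box B's contents:
for p in (0.0, 0.5, 1.0):
    one, two = cdt_values(p)
    assert two - one == BOX_A              # always exactly $1,000 better

# Under EDT, conditioning on the action reverses the ranking:
one, two = edt_values()
print(one > two)  # True: one-boxing looks better
```

Both computations are arithmetically correct; they simply formalize different counterfactuals, which is exactly the dispute in the surrounding text.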
Competing Decision Theories
Different decision theories give different answers because they interpret "what would happen if" differently. Each theory has a different notion of logical counterfactual.
Causal Decision Theory (CDT)
COUNTERFACTUAL INTERPRETATION: What if I physically caused a different action?
NEWCOMB'S ANSWER: Two-box (gets $1,000)
KNOWN PROBLEM: Loses in Newcomb's problem and structurally similar scenarios.
Proposed Solutions
Researchers have proposed several frameworks for handling logical counterfactuals. Each has strengths and weaknesses. Explore the main approaches:
Causal Counterfactuals (Standard)
Modify the causal graph - imagine interventions on inputs
Limitation: Cannot handle self-reference or logical constraints
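A minimal sketch of what "modify the causal graph" means, assuming a toy representation where a model is a dict mapping each node to a mechanism (a function of its parents' values), listed in parents-first order:

```python
def evaluate(model, do=None):
    """Evaluate a causal model given as {node: mechanism(values)}.

    `model` must list nodes in topological (parents-first) order.
    An intervention do(X=x) replaces X's mechanism with the constant x.
    """
    do = do or {}
    values = {}
    for node, mechanism in model.items():
        values[node] = do[node] if node in do else mechanism(values)
    return values

# The classic match example: striking causes lighting.
model = {
    "struck": lambda v: False,        # by default, the match is not struck
    "lit":    lambda v: v["struck"],  # it lights iff it was struck
}

print(evaluate(model))                        # {'struck': False, 'lit': False}
print(evaluate(model, do={"struck": True}))   # {'struck': True, 'lit': True}
```

This handles physical counterfactuals cleanly, but notice there is no node for the agent's own source code; that is exactly where the limitation above bites.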
Logical Uncertainty
One key insight: real agents are not logically omniscient. A logically omniscient agent would know all mathematical truths instantly; real agents have logical uncertainty, i.e. uncertainty about mathematical facts they have not yet computed. This logical uncertainty may be the key to coherent counterfactual reasoning.
Statements an agent might be logically uncertain about before computing them:
- Agent(input) = A
- Goldbach conjecture is true
- P != NP
- This program halts
- Agent(input) = B
- Riemann hypothesis is true
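A toy sketch of such a bounded agent. The class name, claims, and 0.5 credences are illustrative assumptions, not a real logical-induction algorithm; the point is only that an agent can treat its own not-yet-computed output like any other unknown mathematical fact.

```python
class BoundedAgent:
    """Assigns credences to claims it has not yet computed,
    including claims about its own output."""

    def __init__(self):
        # Prior credence in unverified claims (illustrative values).
        self.credence = {
            "Agent(input) = A": 0.5,
            "Agent(input) = B": 0.5,
            "Goldbach conjecture is true": 0.5,
        }

    def compute(self, claim, truth):
        # Actually running the computation collapses the credence to 0 or 1.
        self.credence[claim] = 1.0 if truth else 0.0

agent = BoundedAgent()
# Before self-simulation, both outputs are live alternatives:
print(agent.credence["Agent(input) = A"])  # 0.5
agent.compute("Agent(input) = A", True)
print(agent.credence["Agent(input) = A"])  # 1.0: "B" is now counterfactual
```

While the credence is still 0.5, comparing "what if I output A?" against "what if I output B?" is ordinary reasoning under uncertainty rather than a contradiction.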
Self-Reference Paradoxes
When an agent reasons about its own behavior, strange loops and paradoxes emerge. These are not just mathematical curiosities: they constrain what decision theories are even possible.
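One such loop can be run directly. This diagonalization toy (all names are illustrative) is an agent that consults a predictor about itself and then does the opposite, so no predictor function handed to it can ever be right about it:

```python
def contrarian_agent(predictor):
    # Ask the predictor what this very agent will output, then defect.
    prediction = predictor(contrarian_agent)
    return "B" if prediction == "A" else "A"

def always_a(agent):
    # A "predictor" claiming the agent outputs A.
    return "A"

def always_b(agent):
    # A "predictor" claiming the agent outputs B.
    return "B"

print(contrarian_agent(always_a))  # "B": the prediction "A" was wrong
print(contrarian_agent(always_b))  # "A": the prediction "B" was wrong
```

This is the same structure as the halting-problem diagonalization, and it shows why a decision theory cannot simply require agents to perfectly predict themselves before acting.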
Open Research Directions
The Machine Intelligence Research Institute (MIRI) and other research groups continue to work on these foundational problems; understanding logical counterfactuals is crucial for building AI systems that reason robustly about their own behavior. Here are the key research threads:
The Takeaway
Logical counterfactuals reveal a deep puzzle at the heart of AI decision-making: how can a deterministic system meaningfully consider alternatives?
1. A deterministic AI's output is fixed by its code and inputs. Asking "what if it output differently" seems to require a logical contradiction.
2. Yet decision-making requires comparing alternatives. An agent must somehow evaluate actions it won't take.
3. Different decision theories (CDT, EDT, FDT) handle this differently, with different results in cases like Newcomb's problem.
4. Logical uncertainty and bounded reasoning may provide paths forward - agents that don't know their own outputs can coherently consider alternatives.
This remains an open problem in AI alignment and decision theory.
Explore Related Concepts
Logical counterfactuals connect to other deep problems in AI alignment, philosophy of mind, and decision theory.
References: MIRI (2010-2024), Soares & Fallenstein (2017), Yudkowsky & Soares (2018)