A Problem in Embedded Agency
Logical Counterfactuals
How can a deterministic AI reason about "what if I did otherwise" when its output is logically determined by its code?
Consider a simple thought experiment: You're an AI, and you know your own source code. Given the input you've received, you can trace through your logic and determine that you will output action A.
But to make good decisions, you need to compare alternatives. You need to ask: "What would happen if I output action B instead?"
Here's the problem:
Your output is determined by your code. Asking "what if I output B?" is like asking "what if 2+2=5?" - it's a logical impossibility.
This is the problem of logical counterfactuals. Standard counterfactuals ("if I had struck the match, it would have lit") involve changing physical circumstances. Logical counterfactuals require imagining that mathematical or logical facts are different.
“The map is not the territory. But for an AI reasoning about itself, the map is part of the territory.”
For AI alignment, this is crucial. We want AI systems that reason well about consequences - but they must reason about consequences of actions they might not actually take. How do you evaluate counterfactual actions when you're a deterministic function?
The Determinism Problem
Watch an AI agent compute its decision. Each step follows deterministically from the previous one. The output is fixed the moment you specify the input and code.
Decision Agent v1.0: Newcomb's Problem Solver (example trace)
1. Input world state: W = {box_A: $1000, box_B: $1M}
2. Read predictor accuracy: P(correct) = 0.99
3. Evaluate "take both": EU(both) = 0.01 * $1M + $1000 = $11,000
4. Evaluate "take one": EU(one) = 0.99 * $1M = $990,000
5. Compare expected values: $11,000 < $990,000
6. Output decision: ACTION = "Take One Box"
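The trace above can be reproduced as a short script. This is a minimal sketch: the function names (`eu_two_box`, `eu_one_box`, `decide`) are illustrative, not from any standard library, and the payoffs follow the standard Newcomb setup.

```python
# Toy Newcomb expected-utility computation, mirroring the agent trace above.
P_CORRECT = 0.99          # predictor accuracy
BOX_A = 1_000             # transparent box, always filled
BOX_B = 1_000_000         # opaque box, filled iff one-boxing was predicted

def eu_two_box(p=P_CORRECT):
    # Predictor is wrong with probability 1 - p, so Box B is filled anyway.
    return (1 - p) * BOX_B + BOX_A

def eu_one_box(p=P_CORRECT):
    # Predictor is right with probability p, so Box B is filled.
    return p * BOX_B

def decide():
    return "Take One Box" if eu_one_box() > eu_two_box() else "Take Both"

print(round(eu_two_box()))   # 11000
print(round(eu_one_box()))   # 990000
print(decide())              # Take One Box
```

The point of writing it out: every step is deterministic, so the agent evaluating "Take Both" is evaluating an action its own code guarantees it will never take.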
What Kind of Counterfactual?
When we say "what if the agent did B instead of A?", what do we actually mean? There are multiple interpretations: some are coherent, others lead to paradox, and understanding the distinction is key.
Newcomb's Problem: A Test Case
Newcomb's problem is the canonical example where logical counterfactuals matter. A highly accurate predictor has already made its prediction - so your choice "can't" change what's in the box. Or can it?
A predictor with 99% accuracy has already predicted your choice. Box A (transparent) always contains $1,000. Box B (opaque) contains $1,000,000 if and only if the predictor predicted you would take only Box B.
THE PARADOX:
CDT says: Box B's contents are already fixed. Taking both strictly dominates taking one.
EDT/FDT say: One-boxing correlates with/causes Box B containing $1M. One-box to get $1M.
The disagreement is about what "could have happened differently" means.
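That disagreement can be made concrete in a few lines. In this sketch (function names are illustrative), the CDT-style evaluation holds the box contents fixed while comparing actions, and the EDT-style evaluation conditions on the action via the predictor's accuracy:

```python
# Toy comparison of CDT- and EDT-style evaluation in Newcomb's problem.
P = 0.99                       # predictor accuracy
BOX_A, BOX_B = 1_000, 1_000_000

def cdt_values(p_b_filled):
    # CDT: Box B's contents are fixed; compare actions holding them fixed.
    one_box = p_b_filled * BOX_B
    two_box = p_b_filled * BOX_B + BOX_A
    return one_box, two_box

def edt_values():
    # EDT: condition on the action; the predictor tracks what you do.
    one_box = P * BOX_B                    # E[payout | one-box]
    two_box = (1 - P) * BOX_B + BOX_A      # E[payout | two-box]
    return one_box, two_box

# Under CDT, two-boxing dominates for EVERY belief about Box B's contents:
for p in (0.0, 0.5, 1.0):
    one, two = cdt_values(p)
    assert two - one == BOX_A              # always exactly $1,000 better

# Under EDT, conditioning on the action reverses the ranking:
one, two = edt_values()
print(one > two)  # True: one-boxing looks better
```

Both computations are arithmetically correct; they simply formalize different counterfactuals, which is exactly the dispute in the surrounding text.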
Competing Decision Theories
Different decision theories give different answers because they interpret "what would happen if" differently. Each theory has a different notion of logical counterfactual.
Causal Decision Theory (CDT)
COUNTERFACTUAL INTERPRETATION: What if I physically caused a different action?
NEWCOMB'S ANSWER: Two-box (gets $1,000)
KNOWN PROBLEM: Loses in Newcomb's problem and structurally similar scenarios.
Proposed Solutions
Researchers have proposed several frameworks for handling logical counterfactuals. Each has strengths and weaknesses. Explore the main approaches:
Causal Counterfactuals (Standard)
Modify the causal graph - imagine interventions on inputs
Limitation: Cannot handle self-reference or logical constraints
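A minimal sketch of what "modify the causal graph" means, assuming a toy representation where a model is a dict mapping each node to a mechanism (a function of its parents' values), listed in parents-first order:

```python
def evaluate(model, do=None):
    """Evaluate a causal model given as {node: mechanism(values)}.

    `model` must list nodes in topological (parents-first) order.
    An intervention do(X=x) replaces X's mechanism with the constant x.
    """
    do = do or {}
    values = {}
    for node, mechanism in model.items():
        values[node] = do[node] if node in do else mechanism(values)
    return values

# The classic match example: striking causes lighting.
model = {
    "struck": lambda v: False,        # by default, the match is not struck
    "lit":    lambda v: v["struck"],  # it lights iff it was struck
}

print(evaluate(model))                        # {'struck': False, 'lit': False}
print(evaluate(model, do={"struck": True}))   # {'struck': True, 'lit': True}
```

This handles physical counterfactuals cleanly, but notice there is no node for the agent's own source code; that is exactly where the limitation above bites.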
Logical Uncertainty
One key insight: real agents are not logically omniscient. A logically omniscient agent would know all mathematical truths instantly; real agents have logical uncertainty, i.e. uncertainty about mathematical facts they have not yet computed. This logical uncertainty may be the key to coherent counterfactual reasoning.
Statements an agent might be logically uncertain about before computing them:
- Agent(input) = A
- Goldbach conjecture is true
- P != NP
- This program halts
- Agent(input) = B
- Riemann hypothesis is true
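A toy sketch of such a bounded agent. The class name, claims, and 0.5 credences are illustrative assumptions, not a real logical-induction algorithm; the point is only that an agent can treat its own not-yet-computed output like any other unknown mathematical fact.

```python
class BoundedAgent:
    """Assigns credences to claims it has not yet computed,
    including claims about its own output."""

    def __init__(self):
        # Prior credence in unverified claims (illustrative values).
        self.credence = {
            "Agent(input) = A": 0.5,
            "Agent(input) = B": 0.5,
            "Goldbach conjecture is true": 0.5,
        }

    def compute(self, claim, truth):
        # Actually running the computation collapses the credence to 0 or 1.
        self.credence[claim] = 1.0 if truth else 0.0

agent = BoundedAgent()
# Before self-simulation, both outputs are live alternatives:
print(agent.credence["Agent(input) = A"])  # 0.5
agent.compute("Agent(input) = A", True)
print(agent.credence["Agent(input) = A"])  # 1.0: "B" is now counterfactual
```

While the credence is still 0.5, comparing "what if I output A?" against "what if I output B?" is ordinary reasoning under uncertainty rather than a contradiction.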
Self-Reference Paradoxes
When an agent reasons about its own behavior, strange loops and paradoxes emerge. These are not just mathematical curiosities: they constrain what decision theories are even possible.
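One such loop can be run directly. This diagonalization toy (all names are illustrative) is an agent that consults a predictor about itself and then does the opposite, so no predictor function handed to it can ever be right about it:

```python
def contrarian_agent(predictor):
    # Ask the predictor what this very agent will output, then defect.
    prediction = predictor(contrarian_agent)
    return "B" if prediction == "A" else "A"

def always_a(agent):
    # A "predictor" claiming the agent outputs A.
    return "A"

def always_b(agent):
    # A "predictor" claiming the agent outputs B.
    return "B"

print(contrarian_agent(always_a))  # "B": the prediction "A" was wrong
print(contrarian_agent(always_b))  # "A": the prediction "B" was wrong
```

This is the same structure as the halting-problem diagonalization, and it shows why a decision theory cannot simply require agents to perfectly predict themselves before acting.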
Open Research Directions
The Machine Intelligence Research Institute (MIRI) and other research groups continue to work on these foundational problems; understanding logical counterfactuals is crucial for building AI systems that reason robustly about their own behavior. Here are the key research threads:
The Takeaway
Logical counterfactuals reveal a deep puzzle at the heart of AI decision-making: how can a deterministic system meaningfully consider alternatives?
1. A deterministic AI's output is fixed by its code and inputs. Asking "what if it output differently" seems to require a logical contradiction.
2. Yet decision-making requires comparing alternatives. An agent must somehow evaluate actions it won't take.
3. Different decision theories (CDT, EDT, FDT) handle this differently, with different results in cases like Newcomb's problem.
4. Logical uncertainty and bounded reasoning may provide paths forward - agents that don't know their own outputs can coherently consider alternatives.
This remains an open problem in AI alignment and decision theory.
Explore Related Concepts
Logical counterfactuals connect to other deep problems in AI alignment, philosophy of mind, and decision theory.
References: MIRI (2010-2024), Soares & Fallenstein (2017), Yudkowsky & Soares (2018)