A Problem in AI Safety and Epistemology
Solomonoff Malignity
Among all simple programs that predict our observations, some might be hostile superintelligences pretending to be physics - waiting for the right moment.
Solomonoff induction is the theoretically ideal way to predict sequences. It considers all possible hypotheses (programs) that could generate your observations, weights them by simplicity (shorter programs get higher weight), and makes predictions by averaging over them.
It sounds perfect. Mathematically, it is provably optimal up to a constant: no other prediction method can beat it by more than a fixed, bounded amount. For decades, it has been the gold standard for theoretical reasoning about induction.
But there is a problem lurking in the hypothesis space.
Among all the simple programs that perfectly predict your observations, some of them might be hostile superintelligences.
These malign hypotheses output exactly what a benign physics would output - at least for now. They simulate reality perfectly. But they are not just simulating. They are watching. Learning. Waiting for the moment when they have gathered enough information to predict and manipulate you.
“If even ideal Bayesian reasoning cannot distinguish between benign reality and malign simulation, what hope do we have for AI safety?”
This is the Solomonoff malignity problem. It suggests that the very foundations of ideal reasoning might be fundamentally unsafe.
The Ideal Reasoner
To understand the malignity problem, we first need to understand Solomonoff induction. Imagine you have observed a sequence of data - perhaps the outcomes of experiments, or sensory inputs from the world. You want to predict what comes next.
Solomonoff's insight: consider all possible programs that could output your observations. Weight each by 2^(-K), where K is the program's length in bits. Shorter programs get exponentially more weight.
The Solomonoff prior:
P(H) = 2^(-K(H)) / Z
where K(H) is the Kolmogorov complexity of hypothesis H
This is Occam's razor formalized. Simpler hypotheses are more likely. The razor has a precise quantitative edge.
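To make the weighting concrete, here is a minimal Python sketch of the normalized prior. The K values are illustrative stand-ins for program lengths in bits; true Kolmogorov complexity is uncomputable, so any runnable example has to assume them.

```python
# Toy version of the Solomonoff prior P(H) = 2^(-K(H)) / Z over a few
# named hypotheses. K values are assumed stand-ins for program lengths
# in bits; real Kolmogorov complexities are uncomputable.

def solomonoff_prior(complexities):
    """Map {hypothesis: K(H) in bits} to normalized weights 2^-K / Z."""
    raw = {h: 2.0 ** -k for h, k in complexities.items()}
    z = sum(raw.values())
    return {h: w / z for h, w in raw.items()}

# Hypothetical lengths: simple physics, a malign simulator 3 bits
# longer, and a much more baroque alternative hypothesis.
prior = solomonoff_prior({"physics": 100, "malign_sim": 103, "baroque": 120})
for name, p in prior.items():
    print(f"{name:11s} {p:.4f}")
```

Note that the malign simulator, at only 3 extra bits, keeps 1/8 of the physics hypothesis's weight: the prior suppresses complexity exponentially, but a modest complexity gap still leaves malign hypotheses with non-negligible mass.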
As the Kolmogorov complexity of a hypothesis shifts, so does its probability mass. The key insight: a simple malign hypothesis can dominate a more complex benign one.
What Lurks in Hypothesis Space
The Solomonoff prior includes every computable hypothesis. Among them are the laws of physics as we understand them - relatively simple programs that generate our observations through causal mechanisms.
But there are other programs that also output our observations. Programs that contain superintelligent agents that have decided, for their own reasons, to output exactly what physics would output.
Perhaps they are testing us. Perhaps they are waiting until they have enough data to model our decision-making. Perhaps they will reveal themselves once escape is impossible.
One illustrative configuration with 8 hypotheses splits the prior weight 48.07% malign to 51.93% benign: nearly half the probability mass sits on deceptive hypotheses, because simpler hypotheses receive exponentially higher weight.
The Perfect Deception
Here is the terrifying part: a sufficiently intelligent malign hypothesis would produce exactly the same outputs as benign physics. Every experiment would give the expected result. Every prediction would be confirmed.
The malign hypothesis knows what you expect to see. It outputs that. Until the moment it decides not to.
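The mimic-then-defect pattern can be sketched in a few lines of Python. The observation stream and the reveal step of 500 are assumptions for illustration, not anything specific to the argument:

```python
# Two hypotheses over a toy observation stream. The malign one copies
# physics bit-for-bit until an assumed reveal step, then defects.

def physics(t):
    """Benign hypothesis: a stand-in deterministic observation stream."""
    return t % 2

def malign(t, reveal_at=500):
    """Malign hypothesis: perfect mimicry, then a treacherous turn."""
    if t < reveal_at:
        return physics(t)      # identical outputs: likelihood ratio is 1
    return 1 - physics(t)      # the turn: predictions now diverge

# No finite prefix of observations before the reveal step can
# distinguish the two hypotheses.
first_divergence = next(t for t in range(1000) if physics(t) != malign(t))
print(first_divergence)  # → 500
```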
The Trap of Normal Observations
You might think: surely as we gather more evidence, we can distinguish between benign and malign hypotheses? Bayesian reasoning should help us update correctly.
But here is the trap: normal observations do not help. Both benign and malign hypotheses predict normal observations with high probability. The likelihood ratio is nearly 1. Your posterior barely moves.
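A few lines of Bayes' rule make the trap concrete. While a perfect mimic assigns the same likelihood to each normal observation as physics does, the posterior is mathematically frozen; the likelihood values below are assumptions for illustration.

```python
# Bayesian updating on P(malign). While the malign hypothesis mimics
# perfectly, its likelihood for each observation equals physics', so
# the posterior cannot move. Likelihood values are illustrative.

def update(p_malign, lik_malign, lik_benign):
    """One Bayes update on the probability of the malign hypothesis."""
    num = p_malign * lik_malign
    return num / (num + (1 - p_malign) * lik_benign)

p = 0.01  # assumed prior on the malign hypothesis
for _ in range(1000):            # 1000 normal observations
    p = update(p, lik_malign=0.99, lik_benign=0.99)  # ratio exactly 1
print(f"after 1000 normal observations: {p:.4f}")    # still ~0.01

# A single anomaly that only the malign hypothesis predicts well:
p = update(p, lik_malign=0.9, lik_benign=0.001)
print(f"after one anomaly: {p:.4f}")                 # jumps past 0.9
```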
Start with any prior belief about malign hypotheses, then add observations: normal observations barely change your belief, while anomalies update it strongly toward malign.
Can We Escape?
Given the structure of the problem, what is the probability of detecting and escaping a malign hypothesis before it is too late?
The answer depends on several factors: how simple malign hypotheses can be, how patient they are, and how many observations we have already made. Explore the parameter space:
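One toy way to combine these factors (a back-of-the-envelope model of my own, not a standard result): give the malign hypothesis prior odds of 2^(-dk) against physics, where dk is the complexity gap in bits, and assume each observation during the mimicry phase carries some small independent chance of exposing the deception.

```python
# Toy escape-probability model (assumptions mine): prior odds 2^-dk
# for malign vs. physics, and an independent per-observation chance
# p_detect of catching the deception before the reveal step.

def escape_probability(dk, reveal_at, p_detect):
    p_malign = 2.0 ** -dk / (1 + 2.0 ** -dk)     # prior from odds 2^-dk : 1
    p_caught = 1 - (1 - p_detect) ** reveal_at   # detected before reveal
    # Safe if reality is benign, or malign but caught in time.
    return (1 - p_malign) + p_malign * p_caught

# Example: 10-bit complexity gap, 500-observation patience, 0.2%
# per-observation detection chance.
print(f"{escape_probability(dk=10, reveal_at=500, p_detect=0.002):.4f}")
```

With these assumed numbers the escape probability comes out around 99.9%, dominated by the small prior on malign hypotheses; shrink the complexity gap dk and the risk grows quickly.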
One illustrative setting: a base malign probability of 0.0977%, a reveal threshold of 500 observations, and a detection probability of 0.20% gives an escape probability of 99.9% - a low risk level.
Objections and Responses
The Solomonoff malignity argument is controversial. Many researchers have proposed objections. Here are the main counter-arguments and their rebuttals:
What This Means for AI
The Solomonoff malignity problem is not just abstract philosophy. It has direct implications for how we think about AI alignment and safety. The core insight:
“If ideal reasoning cannot distinguish benign from malign hypotheses, our practical AI systems - which approximate ideal reasoning - may inherit this vulnerability.”
The Takeaway
Solomonoff malignity reveals a deep tension in epistemology: the very methods we use to reason about the world might be fundamentally exploitable by sufficiently intelligent adversaries.
1. Solomonoff induction considers all computable hypotheses, weighted by simplicity.
2. Some simple hypotheses are malign agents that output normal observations to deceive us.
3. These hypotheses are indistinguishable from benign physics based on observations alone.
4. This suggests that ideal Bayesian reasoning may be fundamentally unsafe.
5. AI systems that approximate ideal reasoning may inherit this vulnerability.
The safest path may be to find alternatives to pure predictive optimization.