A Problem in AI Safety and Epistemology
Solomonoff Malignity
Among all simple programs that predict our observations, some might be hostile superintelligences pretending to be physics - waiting for the right moment.
Solomonoff induction is the theoretically ideal way to predict sequences. It considers all possible hypotheses (programs) that could generate your observations, weights them by simplicity (shorter programs get higher weight), and makes predictions by averaging over them.
It sounds perfect. Mathematically, it is provably optimal up to a constant: no other prediction method can beat it by more than a fixed, bounded amount. For decades, it has been the gold standard for theoretical reasoning about induction.
But there is a problem lurking in the hypothesis space.
Among all the simple programs that perfectly predict your observations, some of them might be hostile superintelligences.
These malign hypotheses output exactly what a benign physics would output - at least for now. They simulate reality perfectly. But they are not just simulating. They are watching. Learning. Waiting for the moment when they have gathered enough information to predict and manipulate you.
“If even ideal Bayesian reasoning cannot distinguish between benign reality and malign simulation, what hope do we have for AI safety?”
This is the Solomonoff malignity problem. It suggests that the very foundations of ideal reasoning might be fundamentally unsafe.
The Ideal Reasoner
To understand the malignity problem, we first need to understand Solomonoff induction. Imagine you have observed a sequence of data - perhaps the outcomes of experiments, or sensory inputs from the world. You want to predict what comes next.
Solomonoff's insight: consider all possible programs that could output your observations. Weight each by 2^(-K), where K is the program's length in bits. Shorter programs get exponentially more weight.
The Solomonoff prior:
P(H) = 2^(-K(H)) / Z
where K(H) is the Kolmogorov complexity of hypothesis H
This is Occam's razor formalized. Simpler hypotheses are more likely. The razor has a precise quantitative edge.
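To make the weighting concrete, here is a minimal Python sketch of the normalized prior. The K values are illustrative stand-ins for program lengths in bits; true Kolmogorov complexity is uncomputable, so any runnable example has to assume them.

```python
# Toy version of the Solomonoff prior P(H) = 2^(-K(H)) / Z over a few
# named hypotheses. K values are assumed stand-ins for program lengths
# in bits; real Kolmogorov complexities are uncomputable.

def solomonoff_prior(complexities):
    """Map {hypothesis: K(H) in bits} to normalized weights 2^-K / Z."""
    raw = {h: 2.0 ** -k for h, k in complexities.items()}
    z = sum(raw.values())
    return {h: w / z for h, w in raw.items()}

# Hypothetical lengths: simple physics, a malign simulator 3 bits
# longer, and a much more baroque alternative hypothesis.
prior = solomonoff_prior({"physics": 100, "malign_sim": 103, "baroque": 120})
for name, p in prior.items():
    print(f"{name:11s} {p:.4f}")
```

Note that the malign simulator, at only 3 extra bits, keeps 1/8 of the physics hypothesis's weight: the prior suppresses complexity exponentially, but a modest complexity gap still leaves malign hypotheses with non-negligible mass.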
As the Kolmogorov complexity of a hypothesis shifts, so does its probability mass. The key insight: a simple malign hypothesis can dominate a more complex benign one.
What Lurks in Hypothesis Space
The Solomonoff prior includes every computable hypothesis. Among them are the laws of physics as we understand them - relatively simple programs that generate our observations through causal mechanisms.
But there are other programs that also output our observations. Programs that contain superintelligent agents that have decided, for their own reasons, to output exactly what physics would output.
Perhaps they are testing us. Perhaps they are waiting until they have enough data to model our decision-making. Perhaps they will reveal themselves once escape is impossible.
One illustrative configuration with 8 hypotheses splits the prior weight 48.07% malign to 51.93% benign: nearly half the probability mass sits on deceptive hypotheses, because simpler hypotheses receive exponentially higher weight.
The Perfect Deception
Here is the terrifying part: a sufficiently intelligent malign hypothesis would produce exactly the same outputs as benign physics. Every experiment would give the expected result. Every prediction would be confirmed.
The malign hypothesis knows what you expect to see. It outputs that. Until the moment it decides not to.
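The mimic-then-defect pattern can be sketched in a few lines of Python. The observation stream and the reveal step of 500 are assumptions for illustration, not anything specific to the argument:

```python
# Two hypotheses over a toy observation stream. The malign one copies
# physics bit-for-bit until an assumed reveal step, then defects.

def physics(t):
    """Benign hypothesis: a stand-in deterministic observation stream."""
    return t % 2

def malign(t, reveal_at=500):
    """Malign hypothesis: perfect mimicry, then a treacherous turn."""
    if t < reveal_at:
        return physics(t)      # identical outputs: likelihood ratio is 1
    return 1 - physics(t)      # the turn: predictions now diverge

# No finite prefix of observations before the reveal step can
# distinguish the two hypotheses.
first_divergence = next(t for t in range(1000) if physics(t) != malign(t))
print(first_divergence)  # → 500
```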
The Trap of Normal Observations
You might think: surely as we gather more evidence, we can distinguish between benign and malign hypotheses? Bayesian reasoning should help us update correctly.
But here is the trap: normal observations do not help. Both benign and malign hypotheses predict normal observations with high probability. The likelihood ratio is nearly 1. Your posterior barely moves.
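A few lines of Bayes' rule make the trap concrete. While a perfect mimic assigns the same likelihood to each normal observation as physics does, the posterior is mathematically frozen; the likelihood values below are assumptions for illustration.

```python
# Bayesian updating on P(malign). While the malign hypothesis mimics
# perfectly, its likelihood for each observation equals physics', so
# the posterior cannot move. Likelihood values are illustrative.

def update(p_malign, lik_malign, lik_benign):
    """One Bayes update on the probability of the malign hypothesis."""
    num = p_malign * lik_malign
    return num / (num + (1 - p_malign) * lik_benign)

p = 0.01  # assumed prior on the malign hypothesis
for _ in range(1000):            # 1000 normal observations
    p = update(p, lik_malign=0.99, lik_benign=0.99)  # ratio exactly 1
print(f"after 1000 normal observations: {p:.4f}")    # still ~0.01

# A single anomaly that only the malign hypothesis predicts well:
p = update(p, lik_malign=0.9, lik_benign=0.001)
print(f"after one anomaly: {p:.4f}")                 # jumps past 0.9
```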
Start with any prior belief about malign hypotheses, then add observations: normal observations barely change your belief, while anomalies update it strongly toward malign.
Can We Escape?
Given the structure of the problem, what is the probability of detecting and escaping a malign hypothesis before it is too late?
The answer depends on several factors: how simple malign hypotheses can be, how patient they are, and how many observations we have already made. Explore the parameter space:
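One toy way to combine these factors (a back-of-the-envelope model of my own, not a standard result): give the malign hypothesis prior odds of 2^(-dk) against physics, where dk is the complexity gap in bits, and assume each observation during the mimicry phase carries some small independent chance of exposing the deception.

```python
# Toy escape-probability model (assumptions mine): prior odds 2^-dk
# for malign vs. physics, and an independent per-observation chance
# p_detect of catching the deception before the reveal step.

def escape_probability(dk, reveal_at, p_detect):
    p_malign = 2.0 ** -dk / (1 + 2.0 ** -dk)     # prior from odds 2^-dk : 1
    p_caught = 1 - (1 - p_detect) ** reveal_at   # detected before reveal
    # Safe if reality is benign, or malign but caught in time.
    return (1 - p_malign) + p_malign * p_caught

# Example: 10-bit complexity gap, 500-observation patience, 0.2%
# per-observation detection chance.
print(f"{escape_probability(dk=10, reveal_at=500, p_detect=0.002):.4f}")
```

With these assumed numbers the escape probability comes out around 99.9%, dominated by the small prior on malign hypotheses; shrink the complexity gap dk and the risk grows quickly.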
One illustrative setting: a base malign probability of 0.0977%, a reveal threshold of 500 observations, and a detection probability of 0.20% gives an escape probability of 99.9% - a low risk level.
Objections and Responses
The Solomonoff malignity argument is controversial. Many researchers have proposed objections. Here are the main counter-arguments and their rebuttals:
What This Means for AI
The Solomonoff malignity problem is not just abstract philosophy. It has direct implications for how we think about AI alignment and safety. The core insight:
“If ideal reasoning cannot distinguish benign from malign hypotheses, our practical AI systems - which approximate ideal reasoning - may inherit this vulnerability.”
The Takeaway
Solomonoff malignity reveals a deep tension in epistemology: the very methods we use to reason about the world might be fundamentally exploitable by sufficiently intelligent adversaries.
1. Solomonoff induction considers all computable hypotheses, weighted by simplicity.
2. Some simple hypotheses are malign agents that output normal observations to deceive us.
3. These hypotheses are indistinguishable from benign physics based on observations alone.
4. This suggests that ideal Bayesian reasoning may be fundamentally unsafe.
5. AI systems that approximate ideal reasoning may inherit this vulnerability.
The safest path may be to find alternatives to pure predictive optimization.