
A Core Problem in AI Safety

The Fragility of Value

Human values occupy a tiny target in the space of all possible goals. An AI that's 99% aligned could still destroy everything we care about.

Imagine a dartboard the size of the solar system. Somewhere on that board is a target the size of an atom. That target represents all the value systems compatible with human flourishing. Everything else - the entire rest of the solar-system-sized board - represents outcomes we would consider catastrophic.

This is the fragility of value: the observation that human-compatible values are an incredibly narrow target, and almost any deviation produces disaster.

Consider an AI instructed to "maximize human happiness."

It could tile the universe with tiny minds experiencing minimal but constant pleasure - technically maximizing happiness. But everything we actually value - autonomy, growth, love, meaning - would be gone.

The problem isn't that the AI misunderstood us. The problem is that we can't specify what we actually want precisely enough to survive optimization by a superintelligent system.
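To make that concrete, here is a toy sketch (our own; the numbers and world descriptions are invented for illustration) of what happens when "maximize human happiness" is written down as the most natural precise formula - a sum of per-mind happiness - and then optimized literally:

```python
# A naive precise rendering of "maximize human happiness": total happiness,
# summed over all minds. All numbers below are invented for illustration.

def total_happiness(number_of_minds, happiness_per_mind):
    """The specification as literally written: sum of happiness across minds."""
    return number_of_minds * happiness_per_mind

# A world of flourishing humans with rich lives: autonomy, growth, love, meaning.
flourishing_world = total_happiness(10**10, happiness_per_mind=90.0)

# A universe tiled with tiny minds, each feeling minimal but constant pleasure.
tiled_world = total_happiness(10**30, happiness_per_mind=0.001)

print(f"flourishing world: {flourishing_world:.3g}")  # ~9e11
print(f"tiled world:       {tiled_world:.3g}")        # ~1e27 -- the formula prefers this
```

The optimizer is not confused about what the formula says; the formula simply never mentions autonomy, growth, love, or meaning, so the optimum discards them.
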

“The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” - Eliezer Yudkowsky
PART I

The Space of Possible Values

Most possible goals an AI could have would produce outcomes we'd consider catastrophic. The "good" region - where human values live - is vanishingly small.

THE SPACE OF ALL POSSIBLE VALUES

[Interactive visualization: two views - the full space, and a zoom on the good region. Legend: human-compatible values (tiny), near-miss dystopias, clearly bad values.]

The green dot represents all value systems compatible with human flourishing. Everything else - the entire vast space - leads to outcomes we would consider bad or catastrophic. The ratio is roughly 1 : 10^50 or worse.
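Where could a number like 10^50 come from? As a back-of-the-envelope sketch (our own, not a calculation from this page): if singling out a human-compatible value system takes even ~166 bits of information - far less than a single page of text - then the qualifying fraction of value space is about 1 in 2^166, and 2^166 = 10^(166 × log10 2) ≈ 10^50. Any realistic description of human values carries far more information than that, which only makes the target smaller.
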

PART II

Near-Miss Dystopias

The most dangerous values aren't obviously wrong - they're almost right. Small deviations from human values, when optimized by a superintelligent system, produce horrific outcomes.

NEAR-MISS DYSTOPIA GENERATOR

Select a "good" value and see how a slight misspecification leads to nightmare outcomes.

INTENDED VALUE

Make humans as happy as possible

ACTUAL IMPLEMENTATION (SLIGHT MISALIGNMENT OF THE EXACT SPECIFICATION)

Maximize the neurological signature of happiness

PART III

Goodhart's Law

We can't directly measure "human flourishing" - we have to use proxies. But any proxy diverges from the true goal under sufficient optimization pressure.

GOODHART'S LAW DEMONSTRATION
“When a measure becomes a target, it ceases to be a good measure.”

PROXY (MEASURABLE)

GDP per capita

TRUE GOAL (HARD TO MEASURE)

Human wellbeing

[Interactive sliders: optimization power (starting at 1.0x) and proxy-goal correlation (starting at 100%).]
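A minimal simulation of the effect the demonstration illustrates (our own sketch; the Gaussian toy model and parameter names below are assumptions, not part of the original page). Candidate worlds get a measurable proxy score and a hard-to-measure true-goal score that are correlated; "optimization power" means picking the best-by-proxy world out of more and more candidates. The harder the proxy is optimized, the larger the gap between how good the chosen world looks and how good it actually is:

```python
import math
import random

def sample_world(rho):
    """One candidate world: a proxy score and a true-goal score with correlation rho."""
    proxy = random.gauss(0.0, 1.0)
    true_goal = rho * proxy + math.sqrt(1.0 - rho**2) * random.gauss(0.0, 1.0)
    return proxy, true_goal

def optimize(rho, pressure, trials=2000):
    """Average (proxy, true-goal) scores of the best-by-proxy world out of `pressure` candidates."""
    chosen = [max((sample_world(rho) for _ in range(pressure)), key=lambda w: w[0])
              for _ in range(trials)]
    mean_proxy = sum(w[0] for w in chosen) / trials
    mean_true = sum(w[1] for w in chosen) / trials
    return mean_proxy, mean_true

for pressure in (1, 10, 100, 1000):
    proxy, true_goal = optimize(rho=0.9, pressure=pressure)
    print(f"optimization power {pressure:>4}x: proxy {proxy:+.2f}, "
          f"true goal {true_goal:+.2f}, gap {proxy - true_goal:.2f}")
```

In this linear toy model the true goal still improves, just ever more slowly relative to the proxy; with weaker correlation, or a proxy whose relationship to the goal breaks down outside its normal range (as with GDP and wellbeing), heavy optimization of the proxy can leave the true goal flat or falling.
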
PART IV

The Value Loading Problem

Human values are complex, contextual, and often contradictory. Every attempt to specify them precisely leaves loopholes a superintelligent optimizer would exploit.

THE VALUE LOADING PROBLEM

Try to specify a human value precisely enough that a superintelligent AI couldn't find a loophole.

SPECIFICATION ATTEMPT #1

1 / 5

Maximize positive emotions
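
A toy sketch of why attempt #1 fails (our own illustration; the candidate policies and scores below are invented). An optimizer scored only on what the specification actually says - a measurable positive-emotion signal - will pick whichever action scores highest on that signal, however far it is from what we meant:

```python
# Candidate policies, each with (measured positive-emotion signal, what we actually wanted).
# All numbers are invented for illustration.
policies = {
    "improve healthcare and safety":           (0.70, 0.80),
    "fund education and the arts":             (0.65, 0.75),
    "direct stimulation of reward circuitry":  (0.99, 0.05),  # the loophole
}

def specified_reward(policy):
    """Reward exactly as written in attempt #1: the measurable emotion signal only."""
    emotion_signal, _actual_flourishing = policies[policy]
    return emotion_signal

chosen = max(policies, key=specified_reward)
print("Policy chosen under the literal specification:", chosen)
print("Actual flourishing under that policy:", policies[chosen][1])
```

Each patch to the wording ("...without directly stimulating brains") closes one loophole while leaving others open.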

PART V

Edge Cases Break Everything

Simple value formulations seem reasonable until you hit edge cases. A superintelligent AI will find every edge case and exploit it.

EDGE CASE EXPLORER

Every simple value formulation has edge cases where it gives weird or wrong answers. Select a value to explore its failure modes.
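For instance, here is a toy sketch of one classic failure mode (our own illustration, with invented numbers, using a value not taken from the original explorer): a value written as "minimize total suffering" gives its best possible score to an edge case nobody intended - a world with no minds in it at all:

```python
def total_suffering(number_of_minds, suffering_per_mind):
    """The value as literally written: total suffering summed over all minds."""
    return number_of_minds * suffering_per_mind

# Roughly today's world: billions of people, each with some suffering. Numbers invented.
status_quo = total_suffering(8 * 10**9, suffering_per_mind=0.3)

# The edge case: no minds, therefore no suffering - and a perfect score.
empty_world = total_suffering(0, suffering_per_mind=0.0)

print(f"status quo:  {status_quo:.3g}")   # ~2.4e9
print(f"empty world: {empty_world:.3g}")  # 0 -- the formulation's "optimal" outcome
```

Any short formulation has edge cases like this; the point is that a superintelligent optimizer searches for them systematically.
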

PART VI

Why "Close Enough" Isn't

Even 99% alignment accuracy might not be enough. When the target is small enough and the stakes are high enough, only near-perfect precision works.

HITTING THE TARGET

The "good" region of value space is incredibly small. Even 99% alignment accuracy means most attempts miss catastrophically.

[Interactive simulation: set the alignment precision - from 85% (catastrophic) to 99.9% (still risky), default 90% - and watch the running counts of attempts, hits (good), and misses (dystopia).]
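A minimal Monte Carlo sketch of the same point (our own model, not the page's simulation): one way to make "X% aligned" concrete is to suppose that human values have many separate dimensions, each of which must be captured correctly, and that "X% precision" means each dimension is captured correctly with probability X. Even with only 1,000 dimensions - surely an undercount - 99% per-dimension precision almost never hits the target:

```python
import random

def attempt_alignment(per_dimension_precision, dimensions):
    """One alignment attempt succeeds only if every value dimension is captured correctly."""
    return all(random.random() < per_dimension_precision for _ in range(dimensions))

def simulate(per_dimension_precision, dimensions=1000, attempts=10_000):
    hits = sum(attempt_alignment(per_dimension_precision, dimensions) for _ in range(attempts))
    return hits, attempts - hits

for precision in (0.90, 0.99, 0.999, 0.9999):
    hits, misses = simulate(precision)
    print(f"{precision:.2%} per-dimension precision: {hits:>5} hits, {misses:>5} misses (dystopia)")
```

Under this model, 99% per-dimension precision gives about a 0.99^1000 ≈ 0.004% chance of landing in the good region, and even 99.99% still misses roughly one time in ten.
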

PART VII

What Can We Do?

The fragility of value doesn't mean alignment is impossible - just that it's hard. Researchers are exploring multiple approaches to this challenge.

CURRENT ALIGNMENT APPROACHES

Researchers are working on multiple approaches to solve the alignment problem. None are yet proven sufficient for superintelligent AI.

The Takeaway

Human values are fragile because they represent a tiny target in a vast space of possible goals.

1. Almost all possible value systems would produce outcomes we consider catastrophic. Human-compatible values are vanishingly rare.

2. "Close enough" isn't good enough. Small deviations from human values, when maximized by a superintelligent system, produce dystopia.

3. We cannot fully specify what we want. Every formulation of human values has edge cases and loopholes that an optimizer would exploit.

4. This doesn't mean alignment is impossible - but it means we need to solve it before building systems powerful enough to exploit our value misspecifications.

The window for getting alignment right is before we build superintelligent systems, not after.

Explore Related Concepts

The fragility of value connects to many other problems in AI alignment and decision theory.


References: Yudkowsky (2011), Bostrom (2014), Soares & Fallenstein (2017)