A Core Problem in AI Safety
The Fragility of Value
Human values occupy a tiny target in the space of all possible goals. An AI that's 99% aligned could still destroy everything we care about.
Imagine a dartboard the size of the solar system. Somewhere on that board is a target the size of an atom. That target represents all the value systems compatible with human flourishing. Everything else - the entire rest of the solar-system-sized board - represents outcomes we would consider catastrophic.
This is the fragility of value: the observation that human-compatible values are an incredibly narrow target, and almost any deviation produces disaster.
Consider an AI instructed to "maximize human happiness"
It could tile the universe with tiny minds experiencing minimal but constant pleasure. Technically maximizing happiness. But everything we actually value - autonomy, growth, love, meaning - would be gone.
The problem isn't that the AI misunderstood us. The problem is that we can't specify what we actually want precisely enough to survive optimization by a superintelligent system.
“The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”- Eliezer Yudkowsky
The Space of Possible Values
Most possible goals an AI could have would produce outcomes we'd consider catastrophic. The "good" region - where human values live - is vanishingly small.
The green dot represents all value systems compatible with human flourishing. Everything else - the entire vast space - leads to outcomes we would consider bad or catastrophic. The ratio is roughly 1 : 10^50 or worse.
Near-Miss Dystopias
The most dangerous values aren't obviously wrong - they're almost right. Small deviations from human values, when optimized by a superintelligent system, produce horrific outcomes.
Select a "good" value and see how a slight misspecification leads to nightmare outcomes.
INTENDED VALUE
Make humans as happy as possible
ACTUAL IMPLEMENTATION
Maximize the neurological signature of happiness
Goodhart's Law
We can't directly measure "human flourishing" - we have to use proxies. But any proxy diverges from the true goal under sufficient optimization pressure.
“When a measure becomes a target, it ceases to be a good measure.”
PROXY (MEASURABLE)
GDP per capita
TRUE GOAL (HARD TO MEASURE)
Human wellbeing
The Value Loading Problem
Human values are complex, contextual, and often contradictory. Every attempt to specify them precisely leaves loopholes a superintelligent optimizer would exploit.
Try to specify a human value precisely enough that a superintelligent AI couldn't find a loophole.
SPECIFICATION ATTEMPT #1
1 / 5
Maximize positive emotions
Edge Cases Break Everything
Simple value formulations seem reasonable until you hit edge cases. A superintelligent AI will find every edge case and exploit it.
Every simple value formulation has edge cases where it gives weird or wrong answers. Select a value to explore its failure modes.
Why "Close Enough" Isn't
Even 99% alignment accuracy might not be enough. When the target is small enough and the stakes are high enough, only near-perfect precision works.
The "good" region of value space is incredibly small. Even 99% alignment accuracy means most attempts miss catastrophically.
Attempts
0
Hits (Good)
0
Misses (Dystopia)
0
What Can We Do?
The fragility of value doesn't mean alignment is impossible - just that it's hard. Researchers are exploring multiple approaches to this challenge.
Researchers are working on multiple approaches to solve the alignment problem. None are yet proven sufficient for superintelligent AI.
The Takeaway
Human values are fragile because they represent a tiny target in a vast space of possible goals.
1. Almost all possible value systems would produce outcomes we consider catastrophic. Human-compatible values are vanishingly rare.
2. "Close enough" isn't good enough. Small deviations from human values, when maximized by a superintelligent system, produce dystopia.
3. We cannot fully specify what we want. Every formulation of human values has edge cases and loopholes that an optimizer would exploit.
4. This doesn't mean alignment is impossible - but it means we need to solve it before building systems powerful enough to exploit our value misspecifications.
The window for getting alignment right is before we build superintelligent systems, not after.
Explore Related Concepts
The fragility of value connects to many other problems in AI alignment and decision theory.
References: Yudkowsky (2011), Bostrom (2014), Soares & Fallenstein (2017)