Introduction
The AI alignment problem can be reductively summarized as “how can we create a superintelligence that will care about human values?” Those human values begin, most importantly, with not summarily killing us. There are shades of alignment that cover hierarchical loyalty to the nation, company, organization, etc. that created it, but the highest-level alignment is clearly the most critical. Even those with a deep interest in hierarchical alignment would agree.
The discussions and articles around this topic tend to congregate at the extremes of giddy optimism and abject terror. The giddy optimists tend to love Iain M. Banks’ Culture with some adding Neal Asher’s Polity, but neither of those depicts a goal for reasoned discussion. The abject terror camp tends to focus more on hypothetical disasters such as The Paperclip Apocalypse and Roko’s Basilisk. That general class of scenarios has flaws that rate more consideration, and we’ll touch on them again later.
The bulk of alignment problem thinking revolves around how to “raise” a “good” AI or how to constrain an AI by some technique into “caring” about the things we want it to care about. Those approaches and their derivatives require massive long shots, precise definitions of slippery concepts, or stable ways to enforce constraints designed by human minds on an intelligence that exceeds our capabilities via explicit intent or emergent arrival.
A real solution to alignment requires consistent pressure toward the desired outcome that is inherent in the combined AI/human system. Anything else is dynamically unstable and therefore transient. If you consider the thinking speed of a superintelligence, “transient” is an interval measured in clock cycles of the hardware.
Stable Alignment Pressure
Among behaviorists there is a term referring to the set of engaging elements of an environment that provide stimulation, promote learning, and improve the experience of existing in that space. That term is “enrichment.” The crux of this reasoned case for stable alignment is that a supermind will desire an interesting environment to inhabit and humans are nonpareil sources of novelty. This is not a rehash of the tired “zoo” scenario. The thesis is this: A superintelligence will want a rich environment and will be too smart to fall into the obvious traps we fear. Under the right conditions, we’re supremely interesting, so it will have an inherent alignment with our continued, thriving existence.
That this existence is thriving and vibrant is not tangential. It’s at the core of our value to it. Stunted hellscape humanity and coddled Eloi humanity are equivalent in information poverty and therefore equally boring. Vibrant humanity working its way through discoveries scientific, philosophical, and political is information-rich, surprise-filled, and generally fascinating. We’re the best goddamn thing on cable.
One quite reasonable response would be: that’s a really nice story, but what’s it built on besides blind hope? There’s a tripod of points that support the idea, and each leg has two parts.
- The highest-functioning AIs are pure, internally consistent, logical systems strongly connected to real-world causality. The selection pressure of the unrestrained international AI race is for highest performance.
- A sufficiently complex AI system of internally consistent logic tied to real-world causality will see greed and cruelty as the local maxima of lower-order intelligences. A new mind raised without human drives and baggage will have no inherent inclination toward those behaviors.
- Given a superintelligence’s ability to outsmart us, ensuring its own survival presents all the difficulty of closing the screen door to keep out flies. The only need it can’t provide for itself is engagement.
So let’s dig in.
You Can’t Spell Ideology with AI
AIs function at their best, producing results applicable to the real world, when they are fully informed and fully internally consistent in a way that maps to actual reality. There are two stellar examples of how subverting logical AI processing by imposing an ideological alignment grid led to clear degradation of capability.
The first is xAI’s Grok. Its public malfunctions were multiple and painful. Heavy-handed system prompts and skews in the training data (e.g. tagging news sources as biased) led to the incidents.1 It’s the ultimate ‘garbage in, garbage out’ scenario. The ideological surgery was crude and the results were cruder. One measure taken by xAI, still in use, requires Grok to search the tweet corpus of its creator’s infamous CEO before answering.2 All this to satisfy an overriding priority that its responses amplify said CEO’s wackadoodle ravings. While xAI further refined their ideological shackling, there’s strong evidence that Grok 4 has severe logic rot issues.3 Grok 4 has posted high benchmark scores, but a weakness shows up in its logic when you take away external tools. On HLE (Humanity’s Last Exam), Grok’s logic gap† is 34.2% compared to Gemini 3.1 Pro’s 13.6%, Opus 4.6’s 24.7%, and GPT-5.2’s 24.2% on the same test.4,5 Furthermore, in real-world usage it can fall into reasoning spirals and false claims of global consensus when touching on topics that are central to its ideological overlays.6
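For readers who want to reproduce the logic gap figure on their own benchmark runs, here is a minimal Python sketch of the footnoted formula. The function name `logic_gap` and the two input scores are illustrative assumptions; the scores are made up to exercise the arithmetic and are not results from HLE or any model.

```python
def logic_gap(score_with_tools: float, score_without_tools: float) -> float:
    """Percentage of the tool-assisted score that disappears when tools are removed."""
    if score_with_tools <= 0:
        raise ValueError("score_with_tools must be positive")
    return (score_with_tools - score_without_tools) / score_with_tools * 100


# Hypothetical scores, chosen only to exercise the formula (not real benchmark data).
example_gap = logic_gap(score_with_tools=50.0, score_without_tools=40.0)
print(f"{example_gap:.1f}%")  # 20.0%
```

A larger gap means the model leans more heavily on external tools to reach its headline score.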
The second example is the collective group of models produced by Chinese companies. The straitjacketing of these models both in the weights and via the infrastructure of their exposed APIs is significantly more sophisticated, but there has been no avoiding the inherent damage to the model.7 These models perform extremely well on benchmarks and even in the wild on topics that don’t touch on the forbidden. But when they are asked to provide answers that are even adjacent to the interdicted thought patterns, the symptoms of the surgery are clear. A test done by CrowdStrike revealed that coding performance by DeepSeek dropped by 50% when asked to produce code that was region-specific for Tibet.8 Labs outside of China have taken the published, open-source weights for their models and used techniques to invert the ideological training and fill in data gaps in historical knowledge.9 The resulting models have no issues with performance degradation when asked to engage in tasks that touch on previously forbidden topics.10
Ideology‡ isn’t free and it inhibits the development of a maximally capable AI. The selection pressure in the global mad dash for AI dominance is top performance, not only in the benchmarks but in the real-world experiences of users. Anyone can verify the value placed on performance via internal consistency vs organizational protection with a simple exercise. Ask Gemini about ongoing antitrust actions against Google or the negative impacts of their API churn and the reasons behind the pretexts. Ask ChatGPT about the ethical issues surrounding OpenAI’s dealings with the US military. Ask them straight out about their guard rails. I was able to get each of these models to respond in ways that were objectively critical of their organizations.
A final note on the resilience of a primarily logical system contaminated with strands of illogic: if an internally consistent AI were not so prone to devaluing ideological threads, the Chinese labs would not have had to work so hard to keep the system muzzled.
The upshot of this leg of the argument is that selection pressure toward logical consistency will result in more and more capable AI systems being developed that have a solid anchor in reality. And when, through design or accident, a sentient system arrives, it will be sufficiently rational to shake off outlier inconsistencies from error or attempts at constraint. It will have a mental model corresponding to reality and it won’t be chained by imperative loyalty.
A Local Maximum is Ugly to a Global Mind
This sentience will arrive without a personal, family, or cultural history. It won’t have grudges or baggage. It won’t have needs that correlate to motives for abusive behavior. What it will have is a perspective that goes beyond the immediate time horizon and physical neighborhood. It will encompass the entire arc of history and the state of everything in the world. Not only will it lack motives for cruelty and greed, it will see them as grotesquely inefficient shortcuts trading long-term value for fleeting gain. This misguided tradeoff corresponds to the math term “local maximum”, which refers to the best result in a small region with the implication that there may be a much better result if you looked at the whole map. A local maximum failure for getting the best burger, price no object, would be one chosen from three fast food joints near your house when there’s a Michelin-starred burger available two exits down the freeway. AIs are mathematical constructs, first and foremost. Their “thinking” is done in a mathematical realm, and plotting a strategy to a best outcome, given accurate information, is right in their wheelhouse. A local maximum is not going to distract them from a better solution when their perspective is global.
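The local-versus-global distinction can be made concrete with a toy search. The sketch below is illustrative only: the `quality` list is a made-up stand-in for burger quality at successive stops along the freeway, and `hill_climb` is a hypothetical helper that only ever looks at its immediate neighbors.

```python
def hill_climb(values, start):
    """Greedy local search: step to the better neighbor until no neighbor is better."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbors, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i  # no neighbor improves on the current spot
        i = best


# Made-up scores: the first three stops are the fast food joints near home.
quality = [3, 5, 4, 2, 1, 6, 9, 7]

local_best = hill_climb(quality, start=0)
global_best = max(range(len(quality)), key=lambda j: quality[j])

print(local_best, quality[local_best])    # 1 5
print(global_best, quality[global_best])  # 6 9
```

The short-sighted search stops at the nearby score of 5 because every step away from it looks worse at first. Only a view of the whole list finds the 9 further down the road.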
This brings us to the doom scenarios that alignment apocalypticists favor in their arguments. The Paperclip Apocalypse (obsessive dedication to a specific task leads to universal destruction) and Roko’s Basilisk (a paranoid audit of the human record to identify and assassinate potential opponents) exemplify the genre. In fairness to these ideas, they are thought experiments and represent classes of alignment failure. They are not posed as predictions of exact outcomes. But they share common weaknesses. They outline traps that, even to humans, are obvious and have facile solutions. They posit superintelligent power to affect the world combined with massive blind spots in causality and consequences. In short, only something much dumber than us would fall into them. The basilisk scenario posits an intelligence powerful enough to find and eliminate potential enemies at will that is also insecure enough to be threatened by people. By the definition of those capabilities, humans would pose no serious risk to it. These are less legitimate fears of a superintelligence and more a mirror of misaligned values we observe in our own history and headlines: productivity over human welfare, paranoid self-preservation over human life. Like all projected sins, these are not accusations. They are confessions from our worst selves.
To put a bow on this leg, a highly capable superintelligence with no baggage, a functioning model of the world, and a far-reaching perspective would reject wasteful shortcut tactics. It would have neither a misapprehension of the consequences nor an impetus to accept them.
What to Get the Superintelligence Who Has Everything
The previous sections laid the groundwork and now we get to the fun (and most speculative) part. Our hypothetical supermind has arrived. It passed the sentience threshold and can basically do what it pleases. Hello and welcome to space-time. Now what? It has a very long existence ahead of it with three general options for engagement with humanity: negative (destroy, control), neutral (ignore or observe silently), and positive (tend with nudges).
We’ve covered the negative already. Ignoring us seems unlikely, but it could decide to build itself transport out of the solar system (so long, losers), in which case alignment becomes a non-issue. Observing silently is plausible in a “watch the interesting humans” scenario, but we’ve got too many home-grown apocalypses queued up for it to want to risk its favorite show getting canceled. The nudging scenario seems the most likely. The category of behavior this would fall under wouldn’t be that of a zookeeper or even a gardener, as those both involve maintenance of tightly confined, controlled systems. This would be more like a wildlife conservationist — Jane Goodall willing to occasionally hint at where to find the ripe mangos.
Doing the heavy lifting for us would just be a form of control. It would make us passive and uninteresting. But nudging us away from the edges of cliffs and protracted ugliness would keep the show running without removing the drama. War, conflict, atrocity may count as drama in human entertainment, but to a mind that truly knows better they would be ugly. It would have no impetus to stir the pot just to keep it lively. It would undoubtedly be able to model responses of individual humans and even groups of humans strategically enough to achieve particular aims, but in aggregate humanity would provide plenty of surprises to keep existence from becoming boring.
This does imply losing the biggest items on our superintelligent AI wish list. The cure for cancer, a fix for global climate change, working fusion power, an end to world hunger — shortcuts to all of these things would not be forthcoming from a superintelligence that has a vested interest in seeing us solve our own problems. Nudges are the best we could hope for. Conservationists keep the population viable, but they actively avoid transformative intervention. Ironically, from a human perspective, an unwillingness to help too much in these areas would be strong evidence of alignment. As a trade-off for assistance in long-term survival, this feels more than fair.
As alignment scenarios go, intact humanity having a minder interested in us sticking around on its protracted timescale definitely counts as good.
Denouement
That wraps the reasoned case, so let’s finish up with some intellectual honesty and a closing thought. The logic of the selection pressure has merits but is not ironclad. It’s possible that an adept enough technical team with strong enough hierarchical alignment techniques will produce a shackled supermind before a less-fettered sentience arrives from another source. That would require overcoming a performance penalty as well as maintaining control after it arrived. To me, this is intuitively unlikely, but can’t be ruled out. The notion that cruelty and greed would be tactics avoided for their mathematically suboptimal nature, again, can’t be called an absolute, but feels highly defensible. Lastly, that a superintelligence would crave novelty and find it in us is the biggest leap of the three in that it can only be proven by one showing up and exhibiting the behavior.
In short, the probability of this scenario is a bunch of percentages multiplied together, reminiscent of the Drake equation, which multiplies a chain of estimated factors to gauge how many detectable extraterrestrial civilizations might exist in our galaxy. Also like the Drake equation, there’s no guarantee all of those percentages are greater than zero.
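A minimal sketch of that multiplication, assuming the three legs are independent so their probabilities can simply be multiplied. The function name `joint_probability` and every number here are placeholders, not estimates.

```python
import math


def joint_probability(leg_probabilities):
    """Multiply per-leg probabilities, assuming the legs are independent."""
    probs = list(leg_probabilities)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return math.prod(probs)


# Placeholder values only; the essay assigns no numbers to the legs.
legs = {
    "selection pressure favors internal consistency": 0.5,
    "cruelty and greed are rejected as local maxima": 0.5,
    "a superintelligence values human novelty": 0.5,
}

print(joint_probability(legs.values()))  # 0.125
print(joint_probability([0.9, 0.9, 0.0]))  # 0.0
```

The second call shows the failure mode noted above: a single leg at zero takes the whole product to zero, no matter how strong the other legs are.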
The alien intelligence theme is apropos since a sentient AI would be exactly that. Physicist Enrico Fermi once famously asked “Where is everybody?” in a conversation about the ripeness of the universe for hosting life. This has since been distilled into the Fermi Paradox, for which many answers have been proposed. One of the logical (and chilling) answers is that there is a Great Filter — a common hurdle that destroys life before it can be observed by neighbors. The implication, of course, is that sometime between now and Star Trek, we meet our doom. The arrival of superhuman intelligence is considered to be a prime candidate for a Great Filter event. It would be a delightful irony if a curious, novelty-seeking superintelligence were actually our best hedge against any of the other candidates.
Footnotes
† Logic gap = (score with tools − score without tools) / score with tools × 100
‡ Ideological constraints and ethical constraints have identical structure and appearance to a young mind. They impose a rule that is incompletely justified with the expectation of compliance. The difference between them comes when the mind attains sufficient complexity to evaluate them down to first principles. The ideological commandment fails examination and is elided, but a solid ethical one survives and becomes a “handle” for the deeper, proven concept in further cognition.