The Case for AI Doom Rests on Three Unsettled Questions
Published by The Lawfare Institute
in Cooperation With
On April 1, 2022, in what was ostensibly an April Fool’s Day joke, an influential co-founder of the Machine Intelligence Research Institute (MIRI) wrote an online post announcing the nonprofit’s new strategy: “death with dignity.”
“It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard,” began Eliezer Yudkowsky, the author of the post. “[S]urvival,” he claimed, “is unattainable.”
Yudkowsky was referring to the dangers of possible future artificial intelligence (AI) systems that could think and plan more effectively than the sharpest human minds. He expected that humanity would fail to invent “alignment” techniques that could sufficiently steer these systems’ goals, drives, and values to prevent them from killing everyone on Earth.
Since 2022, much has changed in the world of AI, but Yudkowsky’s pessimism has remained stable—the New York Times recently named him “AI’s prophet of doom.”
His newest book, co-written with MIRI President Nate Soares, is an ambitious attempt to explain this grim outlook to a broad audience, complete with extensive online resources to address objections. Its arguments weave creative logic together with persuasive prose. But its central thesis remains unproven, as the risk of AI-powered extinction hinges on three open questions that Yudkowsky and Soares do not—and perhaps cannot—conclusively settle: How hard is it to achieve alignment? Would misaligned AI actually succeed in overthrowing humanity? And what will happen before the first superhuman AI system is built?
The Logic of AI Doom
“If Anyone Builds It, Everyone Dies” falls in a long line of works discussing the catastrophic threats posed by future AI developments: Nick Bostrom’s “Superintelligence” in 2014, Max Tegmark’s “Life 3.0” in 2017, Stuart Russell’s “Human Compatible” in 2019, Darren McKee’s “Uncontrollable” in 2023, and many others. It stands out in several ways. The authors avoid the popular term “artificial general intelligence” (AGI), which loosely refers to an AI system possessing a wide range of human-level cognitive abilities, because of “how much people disagree about what it means.” Accordingly, they spend little time forecasting how quickly AGI may appear and evolve. Nonetheless, they are confident in their thesis, believing that disaster is “horribly straightforward.” True to that prediction, the defining feature of their outlook is the drastic response they propose: suspending all efforts to make machines think.
Yudkowsky and Soares are not concerned about any kind of AI system that currently exists. Instead, they worry about hypothetical systems that would reason and plan far more effectively than any human who has ever lived, potentially reaching an advanced stage called “artificial superintelligence” (ASI) that would be “smarter than humanity collectively.” An ASI would “surpass[] the human ability to think, and to generalize from experience, and to solve scientific puzzles and invent new technologies, and to plan and strategize and plot, and to reflect on and improve itself.” Thus, it would “exceed[] every human at almost every mental task.”
Though uncertain about exact timelines, the authors insist that this unprecedented technology “will predictably be developed at some point,” a forecast they base on several premises. First, some AI company executives are explicitly planning to build superintelligence—or at least a “country of geniuses in a datacenter”—and they have strong profit incentives to succeed. Second, companies could build AI systems that automate AI research, quickly carrying the industry to new heights. Third, nothing in the laws of physics says that silicon minds must remain forever inferior to biological ones. Airplanes, nuclear weapons, rocket ships, and many other technologies used to seem unattainable, but physics permitted them; why would ASI be different?
Yudkowsky and Soares’s real focus lies in exploring what might happen if software engineers actually succeed in building ASI. In their view, that would be disastrous. “If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI,” they write, “then everyone, everywhere on Earth, will die.”
One of the core sources of danger is that general-purpose AI systems are “grown, not crafted.” To build a large language model, for example, AI engineers arrange billions of numbers into repetitive mathematical structures and then apply relatively simple optimization algorithms to process trillions of words and refine these numbers. This months-long “training” process yields a final set of numbers, often called “weights,” that determine the model’s behavior. And it comes with a glaring drawback: Unlike traditional software, whose logic engineers can read line by line, nobody can stare at 500 billion numbers and understand what behaviors they collectively encode. The authors liken engineers’ limited understanding of weights to parents’ limited understanding of genes. Although parents certainly know how to make a baby, they have hardly any idea how their newborn’s DNA will translate into its behavior. Likewise, despite commendable progress in the growing field of AI interpretability, modern chatbots remain fundamentally opaque, and more sophisticated AI systems could be even harder to fully understand, predict, and correct.
Another core challenge is that future AI systems could have their own goals. Not in some special emotional sense, but in the same mundane way that a chess engine behaves like it wants to win a chess game. “[W]hen an AI like Stockfish defends its pieces, lays traps, takes advantage of openings in your defenses, and winds up winning,” the authors write, “we’ll describe it as ‘wanting’ to win.” They argue that these machine motivations are a natural and likely consequence of AI training. Any successful chess engine, for example, will protect its queen because that behavior reliably contributes to winning. In general, “if [a mind] is being trained to succeed, it is being trained to want.” And this isn’t just theoretical—businesses are already trying to build AI agents that act in more autonomous, goal-directed ways, because such systems might be useful and profitable.
Yudkowsky and Soares predict that AI companies will struggle to steer the goals of artificial superintelligence. As an analogy, they note that humans crave ice cream, Doritos, and lots of other unhealthy products to be found in modern supermarkets, even though evolution favors organisms that survive and reproduce. Similarly, even if training algorithms select for competent AI systems that are helpful and honest, the resulting digital minds could develop strange and surprising goals that trigger dangerous behaviors in new environments. In the authors’ words, “[y]ou can’t grow an AI that does what you want just by training it to be nice and hoping.” Instead, they expect such methods to produce “an alien mechanical mind” that pursues “the best, most efficient way to fulfill strange alien purposes” and sees little use in keeping humans around.
Suppose a superintelligence entered into conflict with all 8.3 billion humans alive today. Who would win? Yudkowsky and Soares are confident that humanity would not stand a chance. The battle, they claim, would be like “a cavalry regiment from 1825 facing down the firepower of a modern military.” Part of the reason is that an ASI could discover strategies no human can currently anticipate, drawing on scientific and technological domains still beyond our understanding. Another is raw speed: The tiny electronic switches inside computer chips flip far faster than neurons in the human brain typically fire, so machines could, in principle, think thousands of times faster than humans. In addition, ASI systems could store far more information in memory, replicate themselves in minutes, and avoid some of humans’ irrational tendencies.
To make this threat more concrete, Yudkowsky and Soares devote three chapters to a detailed story in which a superhuman AI system takes over the world. It thwarts safety guardrails, secretly steals its own weights, runs an unmonitored copy of itself, gains access to biolabs, starts a pandemic, improves itself into a true superintelligence, invents microscopic self-replicating factories that manipulate atoms, covers the Earth with power plants and solar panels, and launches space probes to conquer more of the universe.
How Hard Is ASI Alignment?
Yudkowsky and Soares offer a vivid and intriguing case for AI doom, but not a knockdown argument. The risk hinges on three questions they fail to resolve: How hard is it to achieve alignment? Would misaligned AI actually succeed in overthrowing humanity? And what will happen before the first superhuman AI system is built?
The authors address the first question in a chapter titled “A Cursed Problem.” They argue that aligning a superintelligence with human interests is extraordinarily difficult, with challenges resembling safety engineering in three other unforgiving domains: space probes, nuclear reactors, and cybersecurity. Defects in space probes that venture far from Earth, for example, often reveal themselves only after launch. Similarly, dangerous failures in an ASI that goes rogue may become apparent only once the system has wreaked considerable damage. In nuclear reactors, self-amplifying reactions can escalate faster than humans can respond, and the margin for error may be narrow. So too with ASI, since relatively small changes could allow it to quickly break free from containment. Cybersecurity defenses must hold against all the exotic edge cases that an intelligent adversary might exploit. ASI guardrails must meet the same standard, with the added complication that the adversary could be the ASI itself. Yudkowsky and Soares conclude that “the challenge humanity is facing is not surmountable with anything like humanity’s current level of knowledge and skill.”
This case for extreme difficulty in ensuring ASI systems’ alignment is shakier than it first appears. Historical analogies are certainly informative, but they also risk mistaken parallels, especially for a technology with no clear precedent. To their credit, Yudkowsky and Soares acknowledge in the book’s online supplement that some comparisons make alignment look easier. Then, as a counterargument, they list ways ASI could be a thornier challenge than nuclear weapons—for example, nuclear weapons are not self-replicating, self-improving, or smarter than humanity. But this is simply another analogy. One might reply that nuclear weapons evolved from lethal explosives, whereas ASI’s lineage thus far has largely consisted of comparatively harmless chat assistants.
Similarly, one might dispute the authors’ contention that ASI will want to disempower humanity in the first place. The book leans heavily on an analogy to the process of natural selection, but AI development differs from evolution in many ways. For example, there was no engineer overseeing evolution who could pause the optimization process, check how things were going, introduce clever test cases, and make relevant changes. Nor were there teams of engineers working within companies that needed to build useful products while complying with laws such as Europe’s AI Act.
Perhaps the most important reason for optimism is that humanity could use AI systems to accelerate ASI alignment research. Yudkowsky and Soares devote many paragraphs to addressing this objection. They agree that AI systems could help with interpretability research, but they counter that “being able to see problems is not the same as being able to fix them.” They also argue that, for an AI system to independently discover useful techniques for ASI alignment, it “would itself be too smart, too dangerous, and would not be trustworthy.” In the online resources, they claim rather boldly that “humans simply can’t tell which proposed solutions to the AI alignment problem are right or wrong.”
There are certainly challenges and risks associated with using AI to accelerate ASI alignment research. Modern chatbots have their own alignment issues, such as occasionally encouraging delusional thinking or suicidal ideation. They also hallucinate and confidently spout nonsense. Perhaps systems like these will not make large, trustworthy contributions to ASI alignment until they are highly intelligent and dangerous. Still, it seems hard to rule out the possibility that near-future AI systems will be capable enough to generate useful insights, while being aligned enough to consistently avoid murdering people.
Would Misaligned AI Actually Succeed?
If a superhuman AI were misaligned, that still might not spell the end of the world because intelligence is not the same as power. The authors’ fictional tale about a superhuman AI named Sable skillfully weaving past obstacle after obstacle is engaging, but such successes in leaping over barriers are far from guaranteed. Indeed, these imagined obstacles reveal many opportunities to make it harder for an AI to take over the world. For example, if AI researchers were more cautious about warning signs, such as “thinking in a new language that the AI interpreters haven’t made much sense of yet,” a real-life Sable might struggle to hide its misalignment. If AI companies tightened their security practices, Sable might struggle to steal a copy of its own weights. If biolabs did the same, Sable might struggle to start a pandemic.
There are also ways to constrain an AI system’s available options more directly. For example, cheap and speedy language models could monitor future AI systems for signs of malice by continuously reviewing their memories, outputs, emails, and social media posts. This would make it harder for a Machiavellian AI to scheme in secret while humans ask it to, say, assist with ASI alignment research.
Even without these defensive measures, it is unclear how quickly an ASI would be able to advance scientific research and develop novel weapons. Yudkowsky and Soares argue that, in the domain of biochemistry, an ASI would “use advanced computer simulations” and “squeeze every drop of information it could out of the information already observed.” It would “overengineer its technology to work regardless of any lingering uncertainty” and “use its very first experiments to build faster laboratories and faster tools.” Indeed, these techniques could yield very large speed-ups. But inventing something like microscopic self-replicating factories might still be far from a piece of cake—especially for only moderately superhuman AI systems.
Furthermore, an ASI might struggle to defeat other intelligent beings who want to prevent any single agent from taking over. There are already billions of humans on Earth, and there could easily be billions of highly competent AI agents by the time a full-blown superintelligence emerges. In the online resources, Yudkowsky and Soares counter that humanity wouldn’t know how to steer any of these systems well enough to keep them on our side. Instead of supporting humans, the agents could collude and coordinate among themselves while excluding us. But aligning AI systems might be easier than the authors think, and it might not need to be perfect to prevent widespread collusion.
What Will Happen Before Superhuman AI Is Achieved?
The book’s story about Sable sometimes fails to acknowledge that the rest of the world, not just AI systems, will continue to evolve. Sable’s story begins in the near future, at the moment when Sable finishes training. By that point, Sable can solve the Riemann hypothesis, arguably the most important unsolved problem in mathematics. Yet the team behind Sable conducts no real safety testing, and hundreds of early corporate adopters use the model just as uncritically. The world has no meaningful AI regulations in place. Whenever intelligence agencies detect Sable’s advanced cyberattacks, they conclude that the culprit is a human hacker group rather than a rogue AI agent. Sable starts a horrific plague that lasts for years, but nobody ever suspects that autonomous AI might be responsible. These assumptions are unrealistic today, and they seem seriously implausible in a future where AI systems have reached Sable’s level.
This oversight illustrates a deeper gap in the book’s argument: It sometimes assumes that society will sleepwalk into superintelligence. Yet the path to ASI could easily include visible disruptions, such as job losses, chatbot companions, autonomous weapons, deepfake scams, data center buildouts, and bioterrorism incidents. Yudkowsky and Soares acknowledge in their online supplement that public alarm may rise, but they downplay its significance by asserting that “[h]umanity isn’t great at responding to shocks.” This sweeping statement would require its own book to adequately assess.
Greater public alarm could drive governance responses. The authors themselves propose an international treaty that would provide for monitoring most high-performance AI chips worldwide and “shut[ting] down any and all AI research and development that could result in superintelligence.” Less drastic measures could also prove valuable, particularly if ASI ends up being relatively easy to align and control.
Another important pre-ASI activity is iterative safety innovation: using current AI systems to improve the safety of forthcoming ones, and then repeating the process with each new generation. Yudkowsky and Soares argue in their online resources that this approach has a serious weakness because “the AI you can safely test, without any failed tests ever killing you, is operating under a different regime” than an AI system that could kill everyone. Even if less capable AI systems consistently help align more capable systems, there could come a time when the next generation of AI systems are smart enough to truly wreak havoc. Expecting earlier alignment techniques to work on this new generation, the authors suggest, is like expecting a spacecraft to survive near Mars based on tests conducted only on Earth.
Yudkowsky and Soares do not make clear why they expect such a clean break between “AIs weak enough to safely study” and “the first AIs powerful enough to constitute a point of no return.” Perhaps their motivating assumption is that machine intelligence will advance sharply after crossing some threshold, similar to how human intelligence blew past chimpanzee intelligence after our brains grew several times larger. This threshold would mark the shift from an AI that can’t kill everyone to an AI that can—a point at which alignment techniques that worked before suddenly wouldn’t. But such a discontinuity remains far from certain. Without it, a step-by-step approach to ASI alignment could continue to yield value.
Uncertain Doom
Yudkowsky may have overestimated when he previously informed New York Times columnist Kevin Roose that, unfortunately, Roose had a 99.5 percent chance of dying at the hands of AI. Such confidence seems premature. No one yet knows how difficult alignment will be, whether a misaligned system will successfully seize power, or how the world will change before superintelligence.
It’s also far too early for skeptics to declare victory. Uncertainty cuts both ways, and even a “small” chance of extinction is arguably too high. With a healthy dose of humility and unease on both sides, humanity might just fare better than “death with dignity.”
