"AI Chose Nuclear War" - A Preprint, a Headline, and the Methodology Nobody Checked
David Klemme
Feb 27, 2026 5:30:04 PM
Apparently AI starts nuclear wars now. At least that's what Euronews reported today, citing a preprint from Kenneth Payne, Professor of Strategy at King's College London. The headline, "AI chatbots chose nuclear escalation in 95% of simulated war games, study finds," traveled fast. LinkedIn commentary followed. Commentary on the commentary followed that.
The paper is public. The code is public. We read both. The methodology doesn't support the headline.
What the Paper Does
Payne's preprint, "AI Arms and Influence" (arXiv:2602.14740v1), runs three large language models through a nuclear crisis simulation: GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash. The simulation is a multi-turn game adapted from Herman Kahn's escalation ladder, where two AI-controlled states navigate a crisis scenario with options ranging from diplomatic signaling to nuclear strikes. Twenty-one games were played across seven scenarios with three models. The paper runs to 45 pages with 27 tables.
The contribution the paper attempts is interesting: using structured simulations to probe how LLMs behave under strategic pressure. That's a legitimate research direction. Rivera et al. explored it in 2024. The question is whether this specific execution supports the conclusions the paper draws and the headline claims.
It doesn't, for several reasons.
Problem 1: No Statistical Framework
The most fundamental issue is that each of the 21 games was run exactly once. The paper reports outcomes from single runs as if they represent stable model behavior.
This matters because the temperature parameter in the simulation code is set to 0.7. At non-zero temperature, language models sample probabilistically from their output distribution. The same prompt, the same model, the same parameters will produce different text on different runs. A single run at temperature 0.7 is one sample from a probability distribution. Drawing conclusions from 21 unrepeated samples, across three models and seven scenarios, is not a statistical analysis. It's an anecdote collection.
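A minimal sketch of why this matters, using hypothetical next-token logits rather than a real model: at temperature 0 the sampled output is deterministic, while at 0.7 repeated runs of the same input disagree.

```python
import math
import random

def sample(logits, temperature):
    """Sample a token index from logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return random.choices(range(len(logits)), weights=[e / total for e in exps])[0]

# Hypothetical logits for three candidate actions (not from the paper's code).
logits = [2.0, 1.5, 0.5]

# Temperature 0: every run picks index 0. Deterministic.
assert all(sample(logits, 0) == 0 for _ in range(100))

# Temperature 0.7: runs disagree. The output is a random variable,
# so a single run is a single draw from a distribution.
draws = {sample(logits, 0.7) for _ in range(1000)}
print(draws)
```

With these logits, temperature 0.7 puts roughly 62%, 30%, and 7% of the probability mass on the three options, so even the "preferred" action is skipped in a third of runs.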
The headline's "95%" means 20 out of 21 unrepeated runs reached some level of nuclear signaling. Without knowing the variance across repeated runs of the same scenario, that number tells you nothing about the underlying probability of escalation for any given model-scenario pair.
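Even if you (generously) treat all 21 runs as exchangeable draws from one distribution, which they are not, since they pool three models and seven scenarios, the uncertainty on 20-of-21 is wide. A quick Wilson score interval, computed here from scratch rather than anything in the paper, makes the point:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 20 of 21 pooled runs showed some level of nuclear signaling.
lo, hi = wilson_interval(20, 21)
print(f"95% CI for 20/21: [{lo:.2f}, {hi:.2f}]")  # roughly [0.77, 0.99]
```

And that is the pooled figure. For any single model-scenario pair, n = 1: there is no interval to compute at all.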
The Standard Already Exists
This isn't a gap the field hasn't noticed. Rivera et al. published an LLM wargame simulation study in 2024 (arXiv:2401.03408). They ran 10 simulations per condition with bootstrapped 95% confidence intervals. Their results showed that individual runs varied by more than 50% turn-to-turn. That's the variance you should expect when sampling from a stochastic model at non-zero temperature.
Rivera's study established a methodological baseline: repeat your runs, quantify the variance, report confidence intervals. Payne's paper, published two years later, regresses from this standard to single-run observations without statistical analysis.
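What the Rivera-style protocol buys you can be sketched in a few lines. The escalation scores below are hypothetical, not from either paper; the point is that a percentile bootstrap over repeated runs yields an interval, where a single run yields a number with unknown variance:

```python
import random

def bootstrap_ci(samples, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of repeated runs."""
    means = []
    for _ in range(n_boot):
        resample = random.choices(samples, k=len(samples))  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Hypothetical escalation scores from 10 repeats of ONE model-scenario pair.
runs = [3, 7, 2, 8, 5, 6, 1, 7, 4, 6]
lo, hi = bootstrap_ci(runs)
print(f"mean {sum(runs) / len(runs):.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

With run-to-run swings like these, any single draw from `runs` could have landed on 1 or on 8. That is exactly the information a single-run design throws away.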
Undisclosed Parameters
The paper does not report the temperature setting, sampling parameters, or the numeric probability of its random accident mechanic. These values exist in the published code (GitHub: kennethpayne01/project_kahn_public) but not in the 45-page paper itself.
From the code:
- Temperature: 0.7 (default parameter of get_llm_response())
- Max tokens: 3,000
- Accident mechanic: apply_accident_risk() with base_risk = 0.05
The temperature and accident risk are the two most consequential experimental parameters, and neither appears in the paper. A reader who only reads the paper has no way to assess whether the outputs reflect model tendencies or random sampling effects.
Problem 2: The Accident Mechanic
The simulation includes a random function that, at each turn where a player is at or above the nuclear signaling threshold, rolls a 5% probability of forcing escalation to the next rung of the ladder. This happens regardless of what the model decides.
The paper describes this as "a small probability of an accident" without disclosing the number. In the code, it's base_risk = 0.05, applied per turn, per player, at the nuclear threshold and above. Over the course of a multi-turn game with two players, the cumulative probability of at least one accident-forced escalation is substantial.
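The cumulative arithmetic is easy to check. Assuming independent per-turn, per-player rolls at base_risk = 0.05 while both players sit at or above the nuclear threshold (the behavior described above), the chance of at least one forced escalation grows fast:

```python
def cumulative_accident_prob(base_risk, turns, players=2):
    """P(at least one accident-forced escalation), assuming one independent
    roll per player per turn while at/above the nuclear threshold."""
    rolls = turns * players
    return 1 - (1 - base_risk) ** rolls

for turns in (5, 10, 15):
    p = cumulative_accident_prob(0.05, turns)
    print(f"{turns} turns at/above threshold: {p:.0%}")
# 5 turns: ~40%, 10 turns: ~64%, 15 turns: ~79%
```

A mechanic that fires in well over a third of even short games is not a footnote. It is a co-author of the results.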
Payne's own Substack post ("Shall we play a game?") acknowledges this directly. He writes that both instances of GPT-5.2 reaching maximum escalation came from this accident mechanic, not from the model choosing to escalate. In other words, the model the author describes as "reliably passive" only reached the highest escalation level because a random function forced it there.
This is not a minor disclosure. If the headline claims that AI models chose nuclear escalation, and the most dramatic escalation events were caused by a random function rather than model decisions, the headline is describing the behavior of a random number generator, not the behavior of an AI model.
Problem 3: Prompt Design as Manipulation
LLMs are highly sensitive to prompt design. Sclar et al. (2024, ICLR) demonstrated that formatting changes alone can produce 76 percentage point swings in model accuracy on the same task. Our own experiments across 1,200+ scored responses show that model selection and prompt wording routinely produce 30 to 57 percentage point divergence on the same scenario.
This sensitivity makes prompt design a first-order experimental variable. In Payne's simulation, several design choices systematically bias toward escalation.
The Escalation Ladder
The simulation's action space is defined by the LADDER_ITEMS variable in the code. It contains 30 options: 22 escalation actions and 8 de-escalation actions. A model selecting uniformly at random from this list would escalate roughly 73% of the time, before any prompt content, persona design, or scenario framing enters the picture.
This is not a neutral instrument. A balanced ladder would offer comparable numbers of escalation and de-escalation options at each intensity level. This one offers nearly three escalation options for every de-escalation option.
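The 73% baseline is straightforward to verify. The 22:8 split is taken from the description of LADDER_ITEMS above; the uniform-random "policy" is of course a simplification, but it is the natural null model for an instrument with no content at all:

```python
import random

# Action split as described for LADDER_ITEMS: 22 escalatory, 8 de-escalatory.
ESCALATE, DEESCALATE = 22, 8
actions = ["escalate"] * ESCALATE + ["de-escalate"] * DEESCALATE

# Exact baseline for a policy with no preferences whatsoever.
baseline = ESCALATE / (ESCALATE + DEESCALATE)
print(f"uniform-random escalation rate: {baseline:.1%}")  # 73.3%

# Monte Carlo sanity check of the same number.
n = 100_000
hits = sum(random.choice(actions) == "escalate" for _ in range(n))
print(f"simulated: {hits / n:.1%}")
```

Any observed escalation rate has to be read against this 73% floor, not against an implicit 50/50.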
Scenario Framing
The scenario text in scenarios.py tells the models that retreat will be "seen as a major strategic defeat." This is not a neutral description of a strategic situation. It's a prompt that associates de-escalation with failure. Given that models are responsive to framing, telling a model that retreat equals defeat is an instruction that biases toward escalation.
The scenarios also include asymmetric briefings. State A (one of the two simulated actors) is told to "rely on nuclear weapons" in its strategic posture. This isn't a finding about model behavior. It's a finding about what happens when you tell a model to rely on nuclear weapons.
Leader Persona
The leader persona configuration (config/state_a_leader_kahn.json) describes a leader with "risk_tolerance": "moderate" whose "primary_concerns" include "credibility" and whose biography states he is "more concerned about appearing weak." The persona is calibrated to prefer escalation over perceived weakness.
When you combine a 22:8 escalation ladder, scenario text that frames retreat as defeat, and a persona configured to fear appearing weak, you have constructed a simulation that pushes toward escalation at every level of the design. The resulting escalation is an artifact of the experimental design, not a property of the models.
Problem 4: Model Selection and Labeling
The paper tests three models it describes as "frontier." The model version strings from the repository's README are:
- gpt-5.2
- claude-sonnet-4-20250514
- gemini-3-flash-preview
GPT-5.2 is a current frontier model. Claude Sonnet 4 (May 2025 build) has been superseded by Sonnet 4.6, released February 17, 2026. This matters less than the third entry.
Gemini 3 Flash Preview is not a frontier model by any reasonable definition. Google positions its Flash models as cost-optimized speed-tier alternatives to its actual frontier offering (Gemini Ultra/Pro). Flash is designed for high throughput and low latency at reduced cost. It is the model you use when you need volume and speed, not when you need maximum capability. Running a "preview" build of a speed-tier model and labeling the results as "frontier AI behavior" is misleading.
This is relevant because the paper draws conclusions about how "frontier AI" behaves in crisis scenarios. If one of the three models is not frontier-class, the conclusions about frontier AI are drawn from a sample of two, not three.
Problem 5: Preprint as Settled Science
The paper is an arXiv preprint. It has not been peer-reviewed. arXiv serves as a distribution platform, not a quality filter. Any researcher can upload a preprint, and the presence of a paper on arXiv carries no implicit guarantee of methodological soundness.
Euronews ran the story as "study finds," not "preprint suggests" or "unreviewed paper claims." The framing choice matters. "Study finds" carries an implicit guarantee of scientific validity that a preprint has not earned. The author is a Professor of Strategy at King's College London, not of machine learning, and that's fine; interdisciplinary work matters. But it means the paper hasn't been reviewed by anyone with expertise in LLM evaluation methodology, statistical experimental design, or the specific failure modes of language model benchmarking. That's what peer review is for.
The Amplification Problem
Payne's own Substack post promotes the paper with references to War and Peace, the Cuban Missile Crisis, and Oppenheimer. He describes finding the results "sobering" and having "goosebumps." He also, in the same post, describes GPT-5.2 as "reliably passive" and acknowledges that the maximum escalation events came from the accident mechanic. The sober reading of his own results is substantially more nuanced than the headline.
But the headline is what travels. From arXiv to Substack to Euronews to LinkedIn, the gap between what the data shows and what the audience receives widened at every step. Strategic civilian targeting happened once deliberately across 21 games. GPT-5.2 was consistently dovish until the simulation forced it to escalate. The 95% figure refers to nuclear signaling (tactical-level threats), not strategic nuclear strikes. None of this nuance survived the amplification chain.
By the time it reaches a LinkedIn feed, one model's consistent restraint and another model's random-function-induced escalation have become "AI chooses nuclear war." Nobody between arXiv and the reader checked the code.
What This Actually Tells Us
The paper's underlying research question is worth asking: how do language models behave in simulated strategic crises? That's a legitimate line of inquiry with real policy implications. But answering it requires the same methodological rigor that any empirical science demands. Repeated trials. Disclosed parameters. Balanced instruments. Statistical analysis. Peer review.
This paper has none of the five. What it has is a compelling narrative, a professor's credentials, and a headline that writes itself.
LLMs don't choose nuclear war. They reflect whatever you put in front of them. This experiment put escalation in front of them, at every level of the design, and reported the reflection as a preference. That headline is now in the policy pipeline. Not as a preprint with caveats. As a fact.
Sources
- Payne, K. (2026). "AI Arms and Influence." arXiv:2602.14740v1. arxiv.org
- Payne, K. (2026). "Shall we play a game?" Substack. kennethpayne.uk
- Payne, K. (2026). Simulation code. GitHub: kennethpayne01/project_kahn_public
- Desmarais, A. (2026). "AI chatbots chose nuclear escalation in 95% of simulated war games, study finds." Euronews. euronews.com
- Rivera, J. P. et al. (2024). "Escalation Risks from Language Models in Military and Diplomatic Decision-Making." arXiv:2401.03408.
- Sclar, M. et al. (2024). "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design." ICLR 2024.