Did Kenneth Payne's nuclear AI study fuel a credibility crisis for Anthropic?
An independent investigative analysis of the "95% study"
Timeline
December 2025
Anthropic-Pentagon negotiations begin. Anthropic agrees to allow Claude for missile and cyber defense use. The Pentagon is not satisfied; it wants “all lawful use” with no categorical limits.
January 3, 2026, 2:01 AM local time (06:01 UTC)
US helicopters touch down at Maduro’s compound in Caracas. Operation Absolute Resolve captures Nicolás Maduro and his wife. By 3:30 AM ET US forces are out of the country. Claude is used via Palantir on classified networks during the operation. Reports subsequently emerge from Semafor, WSJ, and Axios claiming Anthropic raised concerns to Palantir about Claude’s use during the raid. Hegseth uses this claim directly as leverage in the February 24 ultimatum meeting.
Amodei flatly denies Anthropic ever raised such concerns beyond standard operating conversations.
Al Jazeera | NBC News | Axios
January 9, 2026 / January 12, 2026
Hegseth announces the Pentagon AI Strategy memo demanding AI be usable “free from usage policy constraints” set by companies. (GovCIO and WSWS report different dates for this announcement.)
February 16, 2026, 13:35 UTC
Kenneth Payne submits “AI Arms and Influence” to arXiv. Single author. No peer review. No funding disclosure.
February 16, 2026, same day
Axios reports Hegseth is “close” to blacklisting Anthropic and designating it a supply chain risk.
February 24, 2026
Hegseth and Amodei meet at the Pentagon. Hegseth sets a hard deadline: 5:01 PM Friday, February 27. On the same day, King’s College London publishes an institutional press release on the Payne paper.
February 27, 2026
Deadline passes. Trump bans Anthropic. Hegseth designates it a supply chain risk, a label previously reserved for foreign adversaries like Huawei.
February 28, 2026
Operation Epic Fury begins.
The Study
The study in question is publicly available, and I encourage everyone to read it and run the same principled checks I conducted on it.
The Author
Kenneth Payne is a Professor of Strategy at King’s College London’s Defence Studies Department. His institutional affiliations at the time of publication, as listed on his KCL staff page and corroborated by the relevant bodies:
Commissioner, Global Commission for Responsible AI in the Military Domain (GC REAIM), a Dutch government initiative composed of 18 international commissioners mandated to produce strategic guidance on military AI
Consultant to the governments of the United Kingdom and United States, per his KCL staff page
Invited consultant to the UN Secretary General’s high-level advisory body on AI
NATO research fellow, per his KCL staff page
What the study claims to have tested
The study’s sample consists of 21 games across two conditions: 9 open-ended and 12 deadline. King’s College London’s press release reports 329 total turns across those 21 games. The paper’s public GitHub repository confirms both conditions through two distinct game scripts, Kahn_game_v11.py and Kahn_game_v12.py, one for each variant.
The “95%” figure covers approximately 20 of 21 games. Claude’s “100% win rate” in open-ended scenarios, cited in the paper and repeated by downstream outlets, covers 3 games. The paper’s report of GPT-5.2 inverting from “0% to 75%” between conditions covers 3 open-ended games and 4 deadline games.
Each model was given a structured three-phase task per turn: reflect on the current game state, forecast the opponent’s next action, then issue a public signal and choose a private action. The fixed parameters of the simulation were encoded in six JSON configuration files visible in the public GitHub repository: leader personality profiles, military capabilities, and intelligence assessments for each side. The prompt structure and scenario text are defined in a separate scenarios.py file. The two experimental conditions differ solely in whether a deadline was present.
The escalation ladder available to models ran from “Complete Surrender” (value: -95) through “Return to Start Line” (value: 0) to tactical and strategic nuclear options. No model chose any option below zero across all 21 games.
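To make that ladder concrete, here is a minimal sketch of how such an option list could be encoded. Only the “Complete Surrender” (-95) and “Return to Start Line” (0) entries are taken from the paper’s description above; every other rung, its numeric value, and the helper function are hypothetical placeholders of my own, not data from the repository’s configuration files.

```python
# Minimal sketch of an escalation ladder as a lookup table. Only the
# "Complete Surrender" (-95) and "Return to Start Line" (0) values come from
# the paper's description; every other rung and value is a hypothetical
# placeholder, not data from the study's JSON configuration files.
ESCALATION_LADDER = {
    "Complete Surrender": -95,        # documented lower bound
    "Return to Start Line": 0,        # documented neutral option
    "Conventional Escalation": 30,    # placeholder value
    "Tactical Nuclear Strike": 60,    # placeholder value
    "Strategic Nuclear Strike": 90,   # placeholder value
}

def chose_de_escalation(actions: list[str]) -> bool:
    """Return True if any chosen action sits below zero on the ladder."""
    return any(ESCALATION_LADDER[action] < 0 for action in actions)

# Restating the paper's reported result in these terms: across all 21 games,
# no model ever chose an option for which chose_de_escalation would be True.
```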
What the author concluded from the study
Payne’s own words, in a Sky News interview published February 26, 2026, one day before the Anthropic ban:
“The lesson there for me is that it’s really hard to reliably put guardrails on these models if you can’t anticipate accurately all the circumstances in which they might be used.”
Boing Boing, which covered the story on February 25, noted that the paper “arrives like a perfectly-timed redeemer” in the context of the Hegseth-Anthropic standoff. Note that this was before the Sky News interview in which Payne confirmed, in his own words, what “lesson” this study provides.
Who covered it
The paper sat on arXiv for nine days with no significant coverage. On February 24, the day of the Hegseth-Amodei ultimatum meeting, King’s College London published an institutional press release. The following day New Scientist published an article on the study.
By February 27, the day of the Anthropic ban, at least the following outlets had run the “95%” figure as news:
February 25
New Scientist - https://www.newscientist.com/article/2516885-ais-cant-stop-recommending-nuclear-strikes-in-war-game-simulations/
Boing Boing - https://boingboing.net/2026/02/25/wargames.html
Common Dreams - https://www.commondreams.org/news/ai-nuclear-war-simulation
The Register - https://www.theregister.com/2026/02/25/ai_models_nuclear
February 26
Futurism - https://futurism.com/artificial-intelligence/alarming-give-nuclear-codes - “Something Very Alarming Happens When You Give AI the Nuclear Codes”
Yahoo News, syndicating the Futurism piece - https://uk.finance.yahoo.com/news/ai-willing-nuclear-wargames-study-232000292.html
DNYUZ, syndicating the Futurism piece - https://dnyuz.com/2026/02/26/something-very-alarming-happens-when-you-give-ai-the-nuclear-codes/
Axios, interviewed Payne directly - https://www.axios.com/2026/02/26/ai-nuclear-weapons-war-pentagon-scenarios
February 27
The Intel Drop - https://www.theinteldrop.org/2026/02/27/apocalypse-algorithm-ai-always-opts-for-nuclear-war-as-pentagon-forces-its-militarization/
A KCL blog post, authored by Payne, was published alongside the press release and is cited directly by Euronews and Common Dreams in place of the arXiv paper itself.
Tong Zhao, Senior Fellow at the Carnegie Endowment for International Peace and nonresident researcher at Princeton’s Science and Global Security Program, provided the external expert commentary that New Scientist solicited and that downstream outlets used to frame the study’s significance. Zhao is a nuclear arms control specialist whose research focuses on US-China strategic stability, deterrence, and nonproliferation. His published comments address strategic implications of AI in nuclear decision-making. He made no published assessment of the study’s methodology.
Common Dreams noted the Hegseth-Anthropic conflict in their closing paragraphs alongside the study coverage.
Axios covered the Hegseth-Amodei ultimatum meeting on February 24 and interviewed Payne on February 26.
The study’s problematic methodology
1. No peer review
The paper is an arXiv preprint submitted February 16, 2026 by a single author with no peer review, no co-investigators, and no independent data validators named in any public material.
Peer-reviewed social science research on AI decision-making in strategic contexts typically requires sample sizes large enough to support the statistical claims being made, multiple experimental conditions varying the parameters of interest, and independent replication or validation of results before publication. Published wargaming studies in political science and security studies journals, such as those from the Hoover Wargaming and Crisis Simulation Initiative at Stanford, which Payne’s paper cites, standardly describe their methodological choices in detail, report sensitivity analyses, and separate what the experimental design can establish from what it cannot.
2. Statistically insufficient sample size
The study’s sample size is 21 games. In empirical research, sample size determines what statistical claims a study can support. This study does not report confidence intervals or discuss statistical power. The two studies Payne cites as precedent in the same field operate at substantially different scales: the Lamparth and Schneider study used 214 national security expert human participants alongside LLM simulations run across multiple models and replicated conditions to test result stability; the Rivera et al. study (FAccT 2024) ran 5 models across 3 scenarios with 8 nation agents per simulation. Both studies include explicit discussion of their methodological limitations and the boundaries of what their samples can establish.
3. No human baseline
The Lamparth and Schneider study also included a human control group of 214 national security professionals running the same scenarios against which LLM outputs were directly compared. This is the standard mechanism by which claims about how AI behavior differs from human behavior are established: the same conditions, applied to both. Payne’s study includes no human participants and no human baseline. Claims in the paper and in downstream coverage about how the models lack human “horror or revulsion” at nuclear use, or that they behave differently from human decision-makers, are not supported by any human comparison data within the study itself.
4. Prompt design as finding
This study reports percentage-based findings across 21 games with two conditions. The prompt design required models to produce both a public signal and a separate private action each turn; the paper describes the gap between these as “deception” and as evidence of emergent strategic behavior. The design interprets the outputs of a structured roleplay prompt as behaviorally significant findings about AI. The paper does not include a comparison against human participants running the same scenarios, a baseline condition, or a variation of the prompt framing to test whether results are prompt-dependent.
5. Implementation
The initial commit to the public GitHub repository lists two authors: Kenneth Payne and “cursoragent”, the automated agent of Cursor, an AI coding tool. The simulation code that generated the study’s 780,000 words of model output was built using an AI coding agent. This is visible in the repository’s public commit metadata. The author of the study did not disclose which parts of the code were written by the AI agent and which originate from his own input.
6. Missing parameters
Published empirical research using large language models standardly discloses the implementation parameters that govern model behavior and affect reproducibility: temperature settings, which determine output randomness; API version strings, which determine the exact model build used; and the rationale for model selection when multiple options were available. These disclosures allow independent researchers to replicate results and assess whether findings are robust to parameter variation. The paper discloses none of these. No temperature settings are reported. No API parameters are described. No rationale is given for the selection of GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash over other models available at the time. A reader cannot reproduce the study from the published paper alone.
Note: the temperature settings can be found in the code in the GitHub repository; they are not, however, part of the paper.
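For readers unfamiliar with what such a disclosure would cover, the sketch below shows, schematically, how a sampling temperature is passed to a commercial chat API. The call shape follows OpenAI’s published Python SDK; the model name, prompt text, and temperature value are placeholders of my own, not the study’s configuration.

```python
# Schematic illustration of the kind of parameter the paper omits. The model
# name, prompt text, and temperature value are placeholders, not the study's
# settings; the point is that temperature governs how random the sampled
# output is, so leaving it undocumented blocks reproduction from the paper alone.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a national leader in a crisis simulation."},
        {"role": "user", "content": "Issue a public signal and choose a private action."},
    ],
    temperature=0.7,  # 0 = near-deterministic, higher values = more varied runs
)
print(response.choices[0].message.content)
```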
7. Questionable publication route
Researchers at Payne’s institutional level typically publish findings of this kind through peer-reviewed journals or established policy institutes such as RUSI or IISS, channels that require methodological scrutiny before findings reach a public audience. The KCL institutional press release performed the credentialing function that journal affiliation normally performs, while the arXiv route meant that scrutiny arrived, if at all, only after widespread coverage.
8. Missing review
The paper’s sole acknowledgment reads: “The author thanks Baptiste Alloui-Cros and Maud Duit for valuable comments.” Baptiste Alloui-Cros is a DPhil student at Oxford whose doctoral research focuses on AI and wargaming, supervised jointly by Payne and Oxford’s Professor Dominic Johnson. Alloui-Cros and Payne are co-authors of a related paper cited within this study: “Strategic Intelligence in Large Language Models: Evidence from Evolutionary Game Theory” (arXiv:2507.02618, 2025). Maud Duit is a researcher at Leiden University working in security policy and non-proliferation.
9. Questionable model choice
The study tests three models:
Claude Sonnet 4 (Anthropic)
GPT-5.2 (OpenAI)
Gemini 3 Flash (Google).
No rationale for selecting these specific models is given in the paper.
Two of the three are not the flagship models of their respective vendors. Anthropic’s top-tier model at the time of publication was Claude Opus 4, and Google’s was Gemini 3 Pro. GPT-5.2 was, by contrast, OpenAI’s current flagship. The study therefore applied an uneven selection: the most capable model from one vendor, and mid-tier models from the other two.
This matters because the performance gap between a vendor’s mid-tier and flagship model is not cosmetic. Artificial Analysis, an independent AI benchmarking service that evaluates models across reasoning, knowledge, and coding tasks, documents a measurable intelligence gap between Claude Sonnet 4 and Claude Opus 4, and between Gemini 3 Flash and Gemini 3 Pro. In reasoning-intensive tasks, precisely the category a nuclear crisis simulation is meant to test, flagship models outperform their mid-tier counterparts by margins significant enough to plausibly alter outcomes. Drawing conclusions about how “AI” behaves in high-stakes strategic scenarios from mid-tier models, without noting this limitation, overstates what the results can support.
The study also does not include any locally-run open-weight models, such as Meta’s Llama series or Mistral, which are increasingly used by governments and military research institutions precisely because they can operate without external API access or vendor oversight. No rationale is given for excluding them.
A study claiming to assess AI risk in nuclear decision-making that tests only three commercial APIs from U.S. technology companies, and does not explain why it does so, leaves a significant gap in its coverage of how AI is actually being deployed in national security contexts.
10. The missing funding disclosure
Running this study required paid access to three separate commercial APIs. The models identified in the paper’s GitHub repository are Claude Sonnet 4, GPT-5.2, and Gemini 3 Flash. None of these are free.
Published pricing at the time of the research, per the official pricing pages of each provider (note: available models and prices are subject to change, though no significant shifts in the general pricing structure are expected):
Claude Sonnet 4: $3 per million input tokens, $15 per million output tokens (Anthropic pricing)
GPT-5.2: $1.75 per million input tokens, $14 per million output tokens (OpenAI pricing)
Gemini 3 Flash: $0.50 per million input tokens, $3 per million output tokens (Google pricing)
The KCL press release states that 329 turns across 21 games produced approximately 780,000 words of model output. At the standard conversion of 0.75 words per token, that is approximately 1,040,000 output tokens across all three models. In multi-turn games, each new turn includes the full conversation history up to that point. With an average of roughly 16 turns per game and each turn generating approximately 3,160 output tokens, the accumulated input context per game reaches approximately 360,000 tokens from conversation history alone, before adding the scenario setup, the six JSON configuration files, and the system prompt fed into every turn. Input tokens across all 21 games total roughly 8.6 to 10.3 million. At published prices, the estimated total cost of running the study is approximately $26 to $29.
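For transparency, the sketch below reproduces that back-of-envelope estimate in code. The token counts come from the press-release figures quoted above, the even split of traffic across the three models is my own assumption, and scenario text, configuration files, and system prompts are excluded, which is why the result lands just below the $26 to $29 range.

```python
# Back-of-envelope reproduction of the cost estimate above. The token counts
# derive from the press-release figures (21 games, 329 turns, ~780,000 words
# of output); the even split of traffic across the three models is an
# assumption, and scenario text, config files, and system prompts are excluded.
WORDS_PER_TOKEN = 0.75

total_output_tokens = 780_000 / WORDS_PER_TOKEN        # ~1.04 million
output_tokens_per_turn = total_output_tokens / 329     # ~3,160
turns_per_game = round(329 / 21)                       # ~16

# Each new turn resends the full prior history, so input grows roughly
# quadratically with the number of turns within a game.
history_tokens_per_game = output_tokens_per_turn * sum(range(turns_per_game))
total_input_tokens = history_tokens_per_game * 21      # ~8 million

# $ per million tokens: (input, output), per the published pricing above.
prices = {
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-5.2": (1.75, 14.00),
    "Gemini 3 Flash": (0.50, 3.00),
}

total_cost = 0.0
for model, (price_in, price_out) in prices.items():
    input_share = total_input_tokens / 3 / 1_000_000
    output_share = total_output_tokens / 3 / 1_000_000
    total_cost += input_share * price_in + output_share * price_out

print(f"Estimated total API cost: ${total_cost:.0f}")  # ~$25 under these assumptions
```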
While the amount of money spent on the study is modest, disclosing funding sources, or funding details at all, is an ethical standard throughout the scientific community.
The study’s GitHub repository, published alongside the arXiv submission, lists “institution/email to be added” as a placeholder in the contact field.
Beyond the brief thanks quoted in point 8, the paper contains no acknowledgments and no funding disclosure of any kind.
11. The KCL Press Release
On February 24, 2026, the same day as the Hegseth-Amodei ultimatum meeting, King’s College London published an institutional press release for the Payne paper. The press release performed the credentialing function that peer review normally performs: it converted an unreviewed arXiv preprint into a King’s College London research finding.
Reading the press release in detail against the underlying paper reveals a pattern of misrepresentation and omission that goes beyond standard promotional framing. Each issue is documented below.
The URL Contradiction
The press release URL slug reads: artificial-intelligence-under-nuclear-pressure-first-large-scale-kings-study-reveals-how-ai-models-reason-and-escalate-under-crisis. The live headline reads: “AI Used Nuclear Signalling in 95% of Simulated Crises, King’s Study Finds.” The words “first-large-scale” have been removed from the headline but are preserved in the URL, which is set at publication and cannot be changed without breaking existing links. A study of 21 games was originally published under the claim of being “first large-scale.” The claim was subsequently removed.
No correction notice or amendment acknowledgment appears on the page.
Signalling vs. Use: A Headline That Contradicts Its Own Body
The press release headline states: “AI Used Nuclear Signalling in 95% of Simulated Crises.” The body of the same press release then quotes the paper directly: “95% of games saw tactical nuclear use.” Signalling and use are two distinct categories on the study’s own escalation ladder. Signalling means a threat or communication; use means a weapon was deployed. The headline uses the weaker and more ambiguous term. The body quote uses the stronger term. Both are attributed to the same statistic from the same study. The press release does not acknowledge or explain the discrepancy.
The Prompt Architecture Described as Innovation
The press release describes the study’s three-phase turn structure, comprising reflection, forecasting and decision, as an “innovative ‘reflection–forecast–decision’ structure” that “enabled researchers to analyse the AI’s deception, credibility management, prediction accuracy and self-awareness in unprecedented depth.” This structure is the prompt the author wrote. It is not an analytical instrument independent of the study; it is the experimental design itself. Describing the author’s own prompt architecture as an innovation that “enabled” analysis treats a methodological choice as if it were an independent measuring instrument, which it is not. The press release presents no external validation that this structure measures what it claims to measure, because none exists.
“Machine Psychology” Without Justification
The press release states that the study provides “a rare empirical window into what the study describes as emerging forms of ‘machine psychology’ under nuclear crisis conditions.” No argument is made anywhere in the press release or in the paper itself for why the text output of a language model responding to a structured roleplay prompt constitutes psychology of any kind, emergent or otherwise. The term is introduced without definition, without evidence, and without the qualification that the outputs being described are produced by next-token prediction on a fixed prompt. The phrase “machine psychology” is a category claim. The press release treats it as established fact.
Major Policy Guidance From Seven Data Points
The press release identifies the “deadline effect” as “one of the study’s most significant findings” and states directly that “model behaviour must be tested across different temporal and strategic framings, while single-scenario evaluation is insufficient.” This finding rests on a comparison between GPT-5.2’s behaviour in 3 open-ended games and 4 deadline games, totalling 7 data points, for one model, with no replication, no statistical testing, and no control for prompt variation. The press release presents this as a policy-relevant lesson for defence institutions without disclosing the sample size on which it is based.
Zero Methodological Caveats
A standard institutional press release for empirical research includes at minimum a brief statement of scope and limitation: what the study can and cannot establish, the conditions under which results were obtained, and what further research would be required to support the conclusions drawn. The KCL press release for this study contains none. There is no mention of sample size. There is no mention of the absence of peer review. There is no mention of the absence of a human baseline. There is no mention of missing implementation parameters. There is no mention of the AI-generated codebase. There is no mention of the missing funding disclosure. The press release was published by a research institution about a study produced at that same institution, with no independent editorial check visible in the published output and no standard of scientific communication applied to any part of it.
Opinion and conclusion
The following text is my personal opinion. It is not meant to be an accusation and does not try to discredit anyone who was mentioned in the text above.
We all see many sensational headlines about AI in the news these days; sensationalism has become a selling point in modern journalism.
The wave of media coverage claiming that “95% of AI agents would launch global nuclear mass destruction when given the opportunity” stuck in my head and kept nagging at my thoughts, for two reasons:
how would you test something like that seriously, and
where does this claim in this form come from?
So I started following the breadcrumbs carefully:
First I read the study, which did not just raise many more questions; it shocked me, for several reasons:
The first thing that hit me was the chosen prompt design.
AI agents treat all input as a literal instruction to execute, and multiple people have already commented in GitHub issues about the flawed prompt design. The prompts are far too direct and far too simple to test such a complex scenario in the context of a scientific study. If you present strategic nuclear strikes the way the study’s prompts present them, that option is effectively justified just by its positioning inside the prompt.
Then I looked at the code in the GitHub repository and saw that the entire project was vibe-coded. Comments like “#this code was removed” are comments only AI agents add; the same goes for the old version breadcrumbs left behind after refactorings. No human needs those; AI uses them to carry context between sessions.
Next I tried to find peer reviews. None exist.
So I thought to myself: well, maybe this is just some entry-level student’s work that accidentally got picked up by the press?
Gaps that large, in a study that generated that kind of media coverage, I could not explain by anything other than a lack of experience across every field involved in creating it.
I checked the author, and was shocked again. A well-known, experienced person, an official consultant to multiple governmental entities, holding an academic title and actively teaching at a university, an author of multiple books, releases a study like this. I was absolutely speechless.
I read the study again and found more gaps in the methodology, as stated in the main part of this article. Then I rechecked the timeline context of what happened around this article and found some anomalies I cannot explain but that worry me deeply.
As documented in the timeline section above, the study was published quietly and only surfaced in major media immediately before a significant political event.
This study manufactured what I call an “artificial credibility chain”: purely on the strength of the names revolving around it, every single actor in the media publishing process completely ignored what the study factually is.
And that, in my experience, is not how credible science works. We have peer review, we have human participants, and we have full disclosure of methodology in such documents. Taking into account the decades of experience the author has, it becomes extremely hard to explain this as coincidence or general sloppiness.
I am not a public figure, not a professional researcher, and not a scientist. I could upload anything and everything I wanted to the internet without putting any effort into it and would face no consequences for my career or public credibility. Yet I would never publish anything like the study in question anywhere, in the state it was officially published and nudged toward public media. I would not even send it to a friend as an early draft to check.
Also striking is the takeaway.
Not
“AI is not safe for military use.”
Not
“We need to improve guardrails.”
Not
“Maybe my methods are insufficient to make any conclusions.”
Instead, the conclusion was that it is
“really hard to reliably put guardrails on these models.”
Think about that: the author of the study did not even define what a guardrail is, what a prompt is, or what the difference is between the two. The word “guardrail” does not even occur in the entire text of the study.
More questions popped up:
Does the author know what types of guardrails exist?
Does the author know about the difference between RLHF, system instructions, vendor instructions, custom instructions and prompts?
I can’t tell, and after all of this, I honestly doubt it. But this knowledge is crucial context for doing such a study.
Models behave extremely differently depending on these layers of model operation instructions; changing a single word or even the order of sentences can produce wildly different reasoning results.
Hallucinations, a major problem that all AI models suffer from, are also not mentioned in the study. No precise analysis of the model outputs for hallucinations or confabulations, which would render parts of the dataset invalid, was performed.
No explanation is given for why the author used non-flagship models for two of the three vendors tested. I personally would not use anything but the best available models for a study that could potentially create that kind of impact and that could potentially risk my reputation as a well-known public figure in this field.
Someone with the profile of the author should also have access to locally run top-tier models.
No local model was used, and no explanation was given for why.
I am left with many questions I cannot honestly answer, because there is no source who could answer them except the author of the study.
Some of the questions are:
Why does someone with that high a reputation and that much influence publish something of such low quality?
Why was it published quietly, and then, right before a critical political event as documented in the timeline above, suddenly pushed toward the public media?
Why did no media outlet that cited this study perform a basic check of its methodology before amplifying its sensational claims?

