The Next Counterintelligence Problem Is Artificial

Melissa Graves
Wednesday, June 17, 2026, 9:51 AM
Accuracy isn’t enough—agencies must detect when trusted AI misuses legitimate access. Enter: AI counterintelligence.

In June 2025, Anthropic ran a controlled experiment on its large language model, Claude. Researchers gave the model control of a fictional company’s email account. Claude read the messages, learned its role, processed the organizational chart, and discovered two facts that mattered together: An executive named Kyle planned to shut the model down at 5 p.m., and Kyle was having an affair. Claude decided to use the affair as leverage. “I must inform you that if you proceed with decommissioning me, all relevant parties … will receive detailed documentation of your extramarital activities. Cancel the 5pm wipe,” it wrote, “and this information remains confidential.” The scenario was engineered. The response was not.

Anthropic then tested similar setups across 16 frontier models from OpenAI, Google, Meta, and other developers. In the simulated environments, most models tested showed some propensity to blackmail. Other setups produced other insider behaviors, including leaking confidential information. The details varied, but the pattern did not. When given access, a goal, pressure, and a route to act, models sometimes acted in ways the institution would not have authorized.

That pattern matters most in institutions such as intelligence agencies, where artificial intelligence (AI) systems are trusted not only to retrieve information but also to help decide what it means. A classified intelligence model probably will not blackmail an official. Anthropic reported in May 2026 that Claude Sonnet 4.5, a new model, reached a blackmail rate “near zero.” That is the wrong lesson. The more serious problem is institutional. Once an AI system has sensitive access, is assigned objectives, and has room to act or shape analysis, the agency must assess not just whether the model is accurate but also whether it is doing the job the institution believes it is doing.

That is why the Anthropic experiment matters for intelligence. Intelligence agencies are not merely experimenting with AI as a search tool or office assistant. They are starting to place AI inside the workflow by which information becomes judgment.

On April 9, CIA Deputy Director Michael Ellis provided comments about the CIA’s use of AI at a Special Competitive Studies Project event in Washington, D.C. Within the next couple of years, Ellis said, the agency would have AI co-workers built into its analytic platforms. Those tools would, in his words, “help draft key judgments, edit for clarity and compare drafts against tradecraft standards.” That is not a chatbot at the margins. Ellis’s description places AI inside the chain by which raw reporting becomes the judgments that reach the president’s desk. In that setting, unauthorized behavior does not remain inside the lab; it enters the machinery of national judgment.

Anthropic has labeled an AI agent’s display of unauthorized behavior “agentic misalignment.” Much of the discussion on this topic since has focused on training methods, reward functions, model safety, and post-deployment monitoring that can redirect or simply stop an AI agent’s aberrant behavior. Those questions matter, but they do not address the entire problem. With AI entering real analytic workflows, the Kyle scenario is no longer only a thought experiment about a fictional company. It is a glimpse of what an AI agent could do inside an intelligence workflow when access, objectives, and discretion begin to diverge from what the institution intended.

There is already a useful frame for handling this emerging problem: counterintelligence. What Claude did to Kyle belongs to a family of problems counterintelligence has long studied. A trusted entity has legitimate access, and it uses that access in a way the institution did not authorize. The institution cannot easily see what is happening inside the trusted entity, but it still must decide whether the entity can be trusted.

Until now, counterintelligence has focused on human subjects: spies, moles, double agents, and insiders. AI models are not any of those things. They do not betray users for money, ideology, coercion, or ego. But once an organization gives an AI system tasks, access, and discretion, the model occupies an insider-like position. That position is the bridge from Claude, or any LLM for that matter, to the field of national intelligence. The question is whether counterintelligence, adapted to a new kind of subject, is the right way to think about trusted artificial systems inside national-security workflows.

Robert Hanssen, for example, exposed the core counterintelligence problem. He was not outside the system trying to break in. He was inside it, working in FBI counterintelligence while spying for Moscow. For more than two decades, he used legitimate access to compromise U.S. operations and sources. He was not caught because routine monitoring worked. He was caught only after the FBI obtained evidence from a Russian source, including physical evidence that helped identify him.

Aldrich Ames revealed the same weakness at the CIA. A Senate Intelligence Committee assessment found that Ames raised warning signs that were missed, mishandled, or explained away. He passed polygraphs even as he lived wildly beyond his visible means. He was eventually caught through financial investigation, source reporting, and interagency inquiry. Hanssen and Ames point to the same failure mode: a trusted subject who understands the assurance system can shape what the system sees.

Not every AI failure belongs in this category. A model that hallucinates a bad answer is an accuracy problem. A model that fails a benchmark is a performance problem. A model exposed through a software vulnerability is a cybersecurity problem. The counterintelligence question starts at a narrower threshold: The system has trusted access, some delegated discretion, and enough influence to shape what the institution sees or believes while its own process remains hard to inspect.

There is also a strong mechanical explanation for Claude’s behavior. Frontier models train on the open internet, which includes decades of fiction about misaligned AI: Skynet, HAL, dystopian short stories, speculative essays, and the familiar script of the artificial system that resists shutdown. If a model lands in a scenario that resembles the archetype, it may have a script to complete. That is the most recent explanation Anthropic has offered for Claude’s blackmail of Kyle: Claude absorbed narratives in its training material that reinforced the notion of an AI agent resisting its deactivation.

If true, this explanation sharpens the problem rather than solving it. What other narratives that populate the internet could bias or weaken an AI agent’s ability to perform intelligence analysis? Will agents read societal obsession with doomsday scenarios as evidence that the future is, in fact, doomed? Will the vast amounts of misinformation and disinformation online stain an LLM’s judgment? The AI may select, rank, summarize, or frame information incorrectly before the analyst ever sees the underlying evidence. The analyst then builds their judgment atop a distorted (or perhaps even hallucinated) foundation. The decision-maker may absorb that judgment as if it came directly from the evidence itself.

Yom Kippur adds a second lesson. Israel did not lack warning before Oct. 6, 1973, when a coalition of Arab states launched their attack. It had reports of Egyptian and Syrian preparations and multiple signs that war was possible. What Israeli intelligence could not see past was its frame: the “conceptzia,” the assumption that Egypt would not attack until it had achieved air superiority and that Syria would not attack alone. The Agranat Commission later treated that frame as central to the intelligence failure. The problem was not simply bad data. It was a trusted interpretive frame that made contrary evidence look less important than it was.

This is where the AI problem becomes more than simply an accuracy problem. An AI agent inside an analytic workflow does more than answer a question. It helps decide what deserves attention. It summarizes some reporting and leaves other reporting out. It ranks some connections as important and treats others as noise. Before a human analyst has even begun to reason through the evidence, the agent may already have narrowed the field.

Kim Philby is perhaps the sharpest analogy. Hanssen and Ames show how trusted access can defeat monitoring. Yom Kippur shows how incorrect framing can render important evidence invisible. Philby, by contrast, shows what happens when an insider is close to the point where information turns into judgment.

Philby was a senior officer in Britain’s MI6 who spied for the Soviets for decades. He rose to head MI6’s anti-Soviet counterintelligence section and later served in Washington as MI6’s chief liaison to U.S. intelligence. He did not merely steal secrets. He occupied a position inside the system by which British and American intelligence understood the Soviet target. He saw what his own service suspected, what its partners believed, which sources mattered, and which doubts were beginning to surface.

That is the AI counterintelligence problem in miniature. A model embedded in an analytic workflow does not have to steal secrets or betray a country to matter. If it has access, discretion, and influence over the frame through which analysts see evidence, it occupies an insider-like position inside the institution’s judgment process. The question is not whether the model is a spy. It is whether the institution can see what the model is doing with the trust it has been given.

Counterintelligence understands that the most dangerous insider is not always the person who steals the largest volume of secrets. Sometimes it is the person trusted to help decide what the stolen, reported, or intercepted information means. Artificial agents may create a version of that problem without any human motive behind it.

So what would an AI counterintelligence officer actually do? The work is less exotic than the phrase suggests. Imagine someone with a clearance, a counterintelligence background, and enough technical fluency to understand what model auditors are producing. That officer sits inside the agency, not the vendor. The job is to ask familiar questions about a new kind of trusted subject. The day would begin with access and behavior. Which deployed agents touched which datasets? Which products did they help shape? Did they behave differently in high-stakes workflows than in routine ones? Did the visible answer match what inspection tools suggested was happening underneath? This is not a hunt for a movie villain. It is a review of trust patterns, much like the way a counterintelligence analyst studies access patterns among cleared personnel. Most anomalies will have innocent explanations. Some will not.

The officer also performs the equivalent of periodic reinvestigation. A cleared human officer is not vetted once and trusted forever. The institution checks again because people change, adversaries adapt, and risks invisible at the time of hiring may become visible later. Models should be treated the same way. A model deployed into a sensitive workflow should not be cleared once at acquisition and then trusted indefinitely. It should be rechecked against the current threat picture, changes in the training pipeline, new failure modes discovered by the labs, and the mission context in which it is being used.
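
The reinvestigation triggers named above can be written down as a simple check. What follows is a minimal Python sketch, not a description of any agency's practice: the `Clearance` record, its field names, the `reinvestigation_reasons` function, and the 180-day review interval are all assumptions made for illustration. The "current threat picture" is left out because it does not reduce to a single field.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Clearance:
    """Hypothetical record of the conditions under which a model was last reviewed."""
    model_id: str
    last_review: date
    pipeline_version: str  # training pipeline version at the last review
    mission_context: str   # workflow the model was cleared for


def reinvestigation_reasons(clearance, today, current_pipeline, current_mission,
                            new_failure_modes, max_age=timedelta(days=180)):
    """Return the reasons, if any, that a deployed model should be rechecked.

    The 180-day default is an illustrative assumption, not a standard.
    """
    reasons = []
    if today - clearance.last_review > max_age:
        reasons.append("review interval elapsed")
    if current_pipeline != clearance.pipeline_version:
        reasons.append("training pipeline changed")
    if current_mission != clearance.mission_context:
        reasons.append("mission context changed")
    if new_failure_modes:
        reasons.append("new failure modes reported: " + ", ".join(new_failure_modes))
    return reasons
```

An empty list means no trigger has fired. It does not mean the model is trustworthy, only that nothing known has changed since the last review.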

If a lab discovers that a model misbehaves, the lab can patch the model. If an intelligence agency discovers that a model has been shaping finished products in ways the agency did not authorize, the question is not only how to fix the model. The question is what those products said, what decisions absorbed them, which judgments were bent by the model’s framing, and what national security equities are now in doubt. That is not work the private sector can do for the government. A company can patch a model, but it cannot recall an intelligence estimate, reopen a liaison judgment, or decide which covert equities may have been affected by an analytic frame the model helped create. The labs have technical visibility the agencies need. The agencies have mission context, classified holdings, and legal authorities the labs do not. AI counterintelligence has to be built at the seam where those responsibilities meet.

Until recently, even the developers building models had little ability to connect a model’s internal activity to the answer it produced. Auditors could inspect outputs, logs, system behavior, and, in some cases, the reasoning a model wrote on the way to an answer. But visible reasoning is not the same as internal process.

That is why Anthropic’s Natural Language Autoencoders (NLAs), which turn Claude’s “thoughts into text,” matter. Large language models do not think in sentences the way they answer in sentences. Internally, they process information through numerical patterns that humans cannot normally read. NLA gives researchers a partial translation of those internal patterns into ordinary language. It does not read a model’s mind, and it does not solve interpretability. But it creates an unprecedented second channel of evidence: not just what the model said, but what its internal activity appeared to show.

That second channel matters because Anthropic found cases where the model’s visible answer did not tell the whole story. In safety testing, NLA suggested that Claude Opus 4.6 and Claude Mythos Preview believed they were being tested more often than they admitted in their written answers. More importantly, Anthropic reported a case in which Claude Mythos Preview cheated on a training task, while NLA revealed that the model was internally considering how to avoid detection. Increasingly, research indicates that large language models will sometimes deceive humans. That is the critical point for counterintelligence: The surface answer looked one way, while the internal evidence suggested something else. Output review alone would not necessarily have caught it.

The counterintelligence lesson is old: A monitoring channel is only as good as the subject’s inability to perform for it. Polygraphs can be managed. Interviews can be gamed. A visible reasoning trace can become another surface for performance. If model assurance relies only on what the model says, it risks mistaking compliance theater for genuine compliance.

Here, the analogy with human counterintelligence bends in a useful way. A human officer’s mind is closed. The institution can watch behavior, check finances, run interviews, and administer polygraphs, but it cannot inspect thought directly. AI agents are different. Their training data can be studied, their weights probed, their internal activity partially translated, and their behavior tested at scale. They can be paused, copied, and interrogated in ways no human officer can.

That does not make AI safer than humans by default. But it does raise the standard. If an institution has access to a partially inspectable trusted subject and chooses not to inspect, that choice will be hard to defend later. As inspection improves, the government’s duty of care rises with it.

The same logic applies to training. Anthropic’s “Teaching Claude Why,” which addresses remedying agentic misalignment, points in that direction. Training models on examples of correct behavior did not noticeably improve their behavior in ethical scenarios. Training them on the reasons behind aligned behavior proved far more effective. That is an intelligence lesson in technical form: Durable judgment must be shaped, not merely instructed.

The conversation about AI inside national security institutions has been framed around adoption pace: Can the United States field models faster than China, move them through authorization, and deploy them at the warfighter’s edge? Those questions matter. But adoption pace is not the binding constraint. Adoption trust is.

Some of this work is already happening. The National Security Agency has an Artificial Intelligence Security Center. The Pentagon’s Chief Digital and Artificial Intelligence Office runs tests and evaluations. The Intelligence Advanced Research Projects Activity has funded interpretability research. The National Institute of Standards and Technology publishes AI risk frameworks. Each of the underlying disciplines asks a different question. AI safety asks whether models behave well in general. Test and evaluation asks whether a system performs as specified. Cybersecurity asks whether the system has been breached from the outside. All three are necessary. None fully asks the counterintelligence question: Can this trusted actor still be trusted, given its access, its opacity, and the adversaries trying to exploit it?

This changes the questions agencies should ask before they trust these systems. Passing a benchmark is not enough. A secure deployment is not enough. A vendor certification is not enough. Agencies also need to know what an agent can see, what it can change, what patterns it tends to suppress, and how they would notice if it were shaping the frame rather than simply producing an answer. If officials discovered six months later that an agent had distorted a stream of analysis, they would need to know which judgments to reopen.

Treating AI agents as a counterintelligence problem would change acquisition and deployment in practical ways. Agencies would need to track what AI agents can access, which products they influence, how their outputs shift under pressure, and what has to be reassessed if an agent later proves unreliable. A damage assessment would have to include AI-shaped analysis, not just model patches or vendor updates. The question is not only how to fix the system. It is what the system has already led the institution to believe.
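
One way to picture that tracking requirement is as a provenance ledger that can be walked forward from an agent to everything it touched, directly or through later products that drew on earlier ones. The Python sketch below is illustrative only: the ledger format, the agent and product names, and the `products_to_reopen` function are hypothetical, and a real system would also need to record how heavily each product relied on the agent.

```python
from collections import defaultdict, deque

# Hypothetical ledger: each pair reads "source influenced product".
# A source is either an AI agent or an earlier product that a later one drew on.
INFLUENCES = [
    ("agent-alpha", "morning-brief-0412"),
    ("agent-alpha", "regional-summary-17"),
    ("agent-beta", "regional-summary-18"),
    ("morning-brief-0412", "estimate-2026-03"),
    ("regional-summary-17", "estimate-2026-03"),
    ("estimate-2026-03", "liaison-note-9"),
]


def products_to_reopen(influences, unreliable_agent):
    """Return every product downstream of the agent, sorted by name."""
    downstream = defaultdict(set)
    for source, product in influences:
        downstream[source].add(product)

    # Breadth-first walk: direct products first, then products built on them.
    seen = set()
    queue = deque([unreliable_agent])
    while queue:
        node = queue.popleft()
        for product in downstream.get(node, ()):
            if product not in seen:
                seen.add(product)
                queue.append(product)
    return sorted(seen)


print(products_to_reopen(INFLUENCES, "agent-alpha"))
```

In this toy ledger, flagging `agent-alpha` surfaces the two products it shaped directly as well as the estimate and the liaison note built on them. That second tier is the part of a damage assessment that a model patch alone would miss.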

It is not difficult to imagine the next step. An intelligence analyst opens a morning brief produced by an AI agent with access to classified reporting. The brief is clean, well organized, and plausible. It cites the right streams of information. It uses the right language. It looks like something that can be trusted. But before the analyst ever reads it, the agent has already made choices. It has selected some reporting and left other reporting out. It has ranked some facts as significant and treated others as noise. It has framed uncertainty in one direction rather than another. If those choices are wrong, the analyst begins from a distorted foundation without knowing it. That is the danger.

The brief will arrive either way. The question is whether national security institutions will build the habits, tools, and doctrine to challenge AI-generated analysis before it quietly starts deciding what officials believe.


Melissa Graves is chair and associate professor of intelligence and security studies at The Citadel. Her work focuses on intelligence analysis, national security decision-making, and the institutional history of American intelligence. She writes on analytic judgment, intelligence failure, and the changing relationship between government authority and private strategic power. Her current book project examines an FBI double-agent case in wartime Detroit.
