Armed Conflict Cybersecurity & Tech

The Practical Role of ‘Test and Evaluation’ in Military AI

Robert Callahan
Tuesday, October 7, 2025, 8:00 AM

T&E can empower the U.S. military to use AI in line with the law of armed conflict.

Cybersecurity Operations at Port San Antonio (Maj. Christopher Vasquez, https://commons.wikimedia.org/wiki/File:Cybersecurity_Operations_at_Port_San_Antonio.jpg; Public Domain)

Published by The Lawfare Institute
in Cooperation With
Brookings

In June, the United States Military Academy hosted a workshop on artificial intelligence (AI) and the battlefield. The gathering prompted a key practical question: How can the U.S. military ensure the AI systems being fielded today comply with the enduring principles of the law of armed conflict (LOAC), even as efforts to establish a legal framework for AI and defense—like the West Point Manual on the Law of AI in Armed Conflict—are still underway?

While the legal and philosophical debates are vast, the challenge for those of us building and deploying these systems is a practical one. As a practitioner working at the intersection of AI and defense, I believe the answer lies in adapting industry best practices for AI governance to provide a rigorous, verifiable framework for the lawful use of AI.

The challenge is not without precedent. The advent of airpower in the early 20th century, and later precision-guided munitions, raised similarly profound questions about the application of LOAC principles. Those technologies forced an evolution in legal interpretation and the development of new tactics, techniques, and procedures to ensure lawful deployment. AI demands a similar, yet far more accelerated, evolution in our thinking—one that must be grounded in the verifiable discipline of test and evaluation (T&E).

Verifiable T&E is supporting AI adoption around the world by demonstrating that AI systems comply with law and regulations, particularly concerning safety, ethics, and the enablement of human-in-the-loop oversight. This is happening in a range of venues, including the European Union, the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC), and NATO.

I connect verifiable T&E to military AI and LOAC in a straightforward manner. First, I introduce the existing legal and regulatory requirements for military AI systems. Then, I provide a brief introduction to the T&E of AI systems before describing how certain LOAC principles (distinction, proportionality, and military necessity) create distinct challenges for AI systems, which lend themselves better to varying T&E approaches. Finally, I offer a three-part framework for conducting effective T&E of military AI systems. By leveraging domain-specific benchmarks, incorporating AI red teaming, and taking a lifecycle approach to T&E, the U.S. military can field AI systems that comply with LOAC principles. 

The Military AI Regulatory Landscape

Internationally, T&E is being deployed across a range of jurisdictions to ensure safe and responsible use of AI in conflict. The EU AI Act, while focused on AI systems used for civilian purposes, includes the goal of ensuring safety and protecting fundamental rights within the EU’s single market. The act itself is a comprehensive regulatory framework that classifies AI systems by risk, with high-risk systems subject to strict requirements for data governance, transparency, and human oversight.

Similarly, voluntary management systems like ISO/IEC 42001 guide organizations in the responsible development and use of AI in the enterprise context. These standards are applicable to organizations that use AI in high-stakes environments, which includes defense and critical infrastructure, and the U.S. military adopts ISO standards where appropriate.

NATO has focused extensively on engaging private industry in conversations about how to redefine and reimagine the use of military applications of AI in international conflict, and related guidance or guardrails. NATO first launched its initial AI Strategy in 2021; shortly thereafter, in 2022 NATO leaders endorsed the accompanying charter for the Defence Innovation Accelerator for the North Atlantic (DIANA), which aimed to foster AI-enabled innovation and included a network of test centers. Following this initial work, NATO released a “revised AI Strategy” on July 10, 2024, to account for rapid advances in AI technologies (including generative AI), with the aim of significantly improving NATO’s global guardrails for AI used in military contexts.

NATO’s revised AI Strategy sets out to accelerate the responsible adoption of AI by its members, and is guided by six “Principles of Responsible Use” focused on enhancing interoperability, protecting against adversarial use, and cultivating an “AI ready workforce.” The principles include lawfulness, responsibility and accountability, explainability and traceability, reliability, governability, and bias mitigation. The revised AI Strategy also aims to strengthen a network of AI stakeholders, including nontraditional defense suppliers, to secure innovative solutions and ensure all AI systems adhere to a verifiable framework of testing, evaluation, verification, and validation. This effort continues to gain traction; NATO DIANA’s 2025 cohort included 17 companies focused on data and information security, and five of the NATO Innovation Fund’s 13 portfolio companies are focused on AI or autonomy.

This emphasis on legal compliance and responsible use is reflected in the domestic policies of key NATO members like the United States. In the context of armed conflict specifically, U.S. military experts and legal scholars are working to adapt the principles of the LOAC to AI systems, with some arguing that a rigorous, verifiable framework for testing and evaluation is crucial for lawful deployment.

Long-standing U.S. policy and military regulations obligate legal reviews of new systems to ensure they can be used in compliance with the law. As is the case for traditional hardware-based systems, AI-enabled systems can use T&E to provide a practical method for conducting such a review. It shifts the core question from “Can an AI be trustworthy?” to the more practical and answerable question: “Can we prove this specific AI system, in this specific context, performs its task in a manner consistent with our legal obligations in the context of a battlefield?” The answer to this question depends in large part on the data used to test and validate the system. For policymakers and commanders, understanding this is key to deploying AI not just effectively, but lawfully.

Understanding AI T&E

Generally speaking, there are two kinds of AI on the market today: traditional AI and generative AI. Traditional AI focuses on making classifications or predictions based on training data. In contrast, generative AI moves beyond classification or prediction to generate representative content based on training data and systems. In the military context, a common application of traditional AI would be using computer vision algorithms to identify objects in images or video feeds, while a common application of generative AI application would be using large language models, or LLMs, to accelerate military planning.

Different kinds of AI require different T&E approaches based on their underlying technologies and how they interact with international law. While there are a range of treaties and other documents that outline international law, they can be organized into four fundamental principles: humanity, military necessity, distinction, and proportionality. Humanity is the first, and central, principle because it communicates the expectation that combatants behave humanely, even in situations that are not codified in international law. Human oversight of AI-enabled systems is crucial for meeting the principle of humanity. T&E—specifically T&E incorporating the remaining principles of military necessity, distinction, and proportionality—is the primary method for determining where and how that human oversight must be applied.

The T&E Challenge for the Principle of Distinction

The principle of distinction—the obligation to distinguish between combatants and civilians, and between military and civilian objects—is a particularly salient challenge for traditional AI systems. For these systems, such as those used for object recognition in intelligence, surveillance, and reconnaissance (ISR) platforms, the ability to comply with the principle of distinction is a direct function of training and testing data.

An AI model may perform with near-perfect accuracy in a sterile testing environment, but the chaos of the battlefield presents countless “edge cases.” A commercial vehicle can be made to look like a military truck; a combatant may not be in uniform. Consider a scenario where an AI-enabled ISR platform is tasked with monitoring a convoy of identical-looking commercial pickup trucks, a vehicle known to be used by both civilians and irregular forces in the region. Validating a system’s reliability in these situations requires a T&E process that can test the AI against thousands of ambiguous scenarios. This data-driven approach is, in effect, a new form of computational wargaming, a field using simulation to explore how actors respond to complex, high-stakes crises. A robust T&E process, built on sufficient volumes of operationally relevant data, would have specifically tested the AI against thousands of similarly ambiguous scenarios to validate its ability to distinguish legitimate military targets from protected civilian vehicles—with an extremely high degree of confidence.

The U.S. military should ensure that systems have been tested not just against common examples, but against a library of the most challenging and ambiguous situations they are likely to encounter. This comprehensive testing would provide a verifiable basis for trusting the system to generate lawful targeting recommendations to a human commander or a higher-level planning system.

The T&E Challenge for the Principles of Proportionality and Military Necessity

Challenges related to proportionality and military necessity, in particular, have become exponentially more complex with the advent of generative AI. These systems do not just classify data; they can generate novel courses of action, recommend strike packages, or optimize mission plans, requiring a more sophisticated T&E approach that uses “benchmarks” to evaluate the system’s reasoning across specific domains. This capability directly engages the LOAC principles of proportionality and military necessity, which require commanders to weigh the expected military advantage of an action against the risk of incidental harm to civilians.

How do you test an AI system for its adherence to proportionality? The answer cannot be found in a simple accuracy score. Instead, T&E must evolve to benchmark the reasoning behind the full AI system. This involves creating simulated scenarios where the system is presented with multiple tactical problems and evaluating its recommended solutions. Imagine, for example, a generative AI system recommending a course of action to strike a high-value command-and-control node located in a building. A human commander must assess if the strike is proportional. A properly benchmarked AI would have been tested on its ability to generate alternative options—such as using a smaller munition, a different angle of attack, or a nonkinetic method—that could achieve a similar military effect while significantly reducing the risk to a nearby school or medical clinic. Does the system consistently favor options that achieve the military objective with the least collateral damage? Can it correctly identify when the expected incidental harm would be excessive in relation to the military advantage and, therefore, recommend against a strike? 

Generative AI developers and their end users can answer these questions using custom benchmarks. Benchmarks incorporate two components. First, benchmarks explore system performance across a mixture of subjective and objective axes, including instruction-following, creativity, responsibility, reasoning, and factuality. Second, benchmarks apply these axes to specific domains of interest, ranging the breadth of human endeavors from coding to critical foreign policy decisions. The result is a matrixed measure of system performance across both function and domain, which is often compressed to percentage measures of performance.

A key technical challenge lies in quantifying the inherent trade-off between a system’s utility and its safety—its ability to assist with legitimate tasks while refusing to perform prohibited ones. This balancing act, which is at the heart of efforts like Scale’s FORTRESS model evaluation project, requires developing sophisticated, scenario-based benchmarks.

Recent collaboration between Scale AI and the Center for Strategic and International Studies tested and proved the importance of this domain-specific approach in foreign policy. The project, the Critical Foreign Policy Decisions Benchmark, evaluated how prominent LLMs responded to expert-crafted national security scenarios and found that the models exhibited distinct and often escalatory biases that varied significantly depending on the context and the nation involved. For instance, models were more likely to recommend that the United States or United Kingdom take escalatory actions than they were to recommend escalatory actions for China, Russia, or India. This underscores the fact that without rigorous, domain-specific testing, the inherent biases of off-the-shelf models could subtly steer users toward unintended strategic outcomes in AI-supported war games, analyses, or real-world planning efforts. 

This demonstrated tendency toward escalation for LLMs without well-calibrated T&E is precisely why benchmarking is an important technique for ensuring generative AI can aid lawful command decision-making, particularly in assessing the complex legal principles of proportionality and military necessity.

A Proposed Framework for Compliant AI Deployment

From a practitioner’s perspective, moving from theory to practice requires that the U.S. military adopt a comprehensive T&E framework. The following three principles provide a foundation for this approach:

1. Domain-Specific T&E Benchmarks: Generic, off-the-shelf AI models are not sufficient for the battlefield. An AI system’s understanding with LOAC is inextricably linked to its operational context. Therefore, T&E benchmarks must be domain specific. A system designed for maritime targeting requires testing against a different set of scenarios and data than one used for planning in dense urban terrain. This ensures that validation is not an abstract exercise but is tied directly to the system’s intended mission. Crucially, this provides commanders with a clear understanding of the system’s validated limits, enabling them to exercise more informed human judgment.

2. “AI Red Teaming” for AI Systems: The U.S. military has long used “red teams” to test and harden networks against cyberattacks. AI systems deployed in a military context would benefit from a similar approach, including operational red teaming via “Counter-AI.” The U.S. military should create and empower specialized AI red teams whose purpose is to design T&E scenarios—particularly those used in planning and decision support—that attempt to fool AI systems into violating LOAC principles and other military guidance. This adversarial testing is the best way to uncover hidden biases, vulnerabilities, and unexpected failure modes before the system is fielded. Practitioner-scholars like Lt. Col. Nathan Bastian, who has focused on applying analytical methods to military operations, have underscored the importance of this rigorous validation to build operator trust and ensure systems are truly ready for the complexities of the modern battlefield. The results of such red teaming provide operators with a functional “user’s guide” for specific weaknesses in AI systems, improving their ability to effectively supervise and identify untrustworthy or inaccurate outputs, and ultimately reducing the cognitive load on warfighters in high-stakes situations.

3. T&E as a Lifecycle, Not a Single Event: The battlefield is not static. Adversary tactics, techniques, and procedures evolve, and the operational environment changes. A one-time T&E check before deployment is therefore dangerously insufficient. Instead, T&E must be a continuous lifecycle, an approach that aligns with the modern policy frameworks for agile technology adoption embedded in the 2021 final report published by the National Security Commission on Artificial Intelligence. As new data is automatically gathered from operations and training exercises, it must be continuously used to test and revalidate AI models to ensure they remain robust, reliable, and compliant over time. This ensures that human trust in AI remains well-calibrated throughout the system’s lifecycle, preventing operator over-reliance on a system whose performance may have degraded against new threats. Implementing this continuous feedback loop presents its own operational challenges, requiring streamlined pathways to get data from the field back to developers without disrupting the tempo of operations.

***

For military leaders to trust AI in high-stakes environments, they need confidence that these systems will operate lawfully. That confidence cannot be based on theory alone; it must be earned through rigorous, continuous, and data-driven testing. By embracing a comprehensive test and evaluation framework, the U.S. military can provide its developers and warfighters with the practical tools they need to validate that their AI systems are not just effective but also trustworthy.


Rob Callahan is a Public Sector Deployment Strategist at Scale AI. He previously served in the U.S. Army.
}

Subscribe to Lawfare