Published by The Lawfare Institute
in Cooperation With
With the 2024 election campaign around the corner and public interest in artificial intelligence (AI) at fever pitch, policymakers, business leaders, and researchers have voiced concern about AI-enabled influence operations interfering in the democratic process. Potential concerns include propagandists using language models to write news articles, personalize propaganda or phishing emails to specific targets, falsify public opinion on social media or public comment systems, and even persuade targets via one-on-one chats.
Humans can also do all of these things at smaller scales, but discovering that machines do the writing would change how we defend ourselves and respond as a society. Trust and safety employees at platforms, analysts in government, and disinformation researchers in academia and civil society cannot combat influence operations without understanding the propagandists’ tactics, techniques, and procedures. Evidence that language models are being used (or the lack thereof) may help researchers prioritize their efforts, investigators learn where to look, platforms determine safer policies, and developers make decisions about how to release or safeguard their models. AI developers have a range of options, from keeping their models entirely closed, to allowing restricted use of the model through a user interface (like OpenAI’s ChatGPT), to making their models publicly downloadable (like Meta’s LLaMA). They can also impose guardrails that restrict the types of text that systems will output to users. Knowing whether and how language models are misused is critical for developing appropriate AI precautions and responses.
Finding these campaigns means confronting two challenges at once: identifying influence operations in general and detecting AI-generated text. If language models are used in covert propaganda campaigns, how would we know? Here, we outline five approaches for establishing whether language models lie at the heart of a disinformation campaign.
Approach 1: Human Detection of AI-Generated Text in the Wild
One obvious detection method would be to find networks of fake accounts on social media platforms and somehow conclusively identify the text as generated by a language model. Take, for example, an influence operation on a social media platform that uses thousands of fake accounts to post messages favoring a political candidate. Disinformation researchers and platforms frequently find such coordinated, fake accounts attempting to manipulate political debate. If investigators can tell these accounts are using language models, that is a smoking gun.
The issue is that language models often produce text that people struggle to distinguish from human-written content. Researchers have shown that earlier generations of language models could produce news articles that people rated as credible as news from mainstream sites. A new preprint paper from Stanford researchers shows that AI-generated text can persuade people on polarized political issues (including an assault weapon ban, a carbon tax, and a paid parental-leave program) as much as human-written text. Even if operations are detected, there are few clues as to whether the text is AI generated, which makes it difficult to figure out how discovered influence operations should factor into developing AI precautions and responses.
One way that people may be able to spot AI-generated text is through what are sometimes called guardrail messages. OpenAI and other AI developers have tried to build safeguards on top of their models to prevent the systems from outputting harmful content to users. With these guardrails in place, a model will often refuse to perform certain tasks, like generating hate speech. When it refuses tasks, the system typically returns phrases such as, “As a large language model, I cannot ….” These guardrail error messages are beginning to sprout up across the internet—from Amazon reviews to social media posts—indicating the use of language models.
In April, the anonymous open-source intelligence researcher with the Twitter handle @conspirator0 found thousands of accounts that posted, “I’m sorry, I cannot generate inappropriate or offensive content,” another common ChatGPT reply. John Scott-Railton, senior researcher at the Citizen Lab, likewise highlighted a variety of search terms that surface AI-generated content, including “violates OpenAI’s content policy,” which follows a similar pattern. Benjamin Strick from the Centre for Information Resilience tweeted about “an incredible amount of ChatGPT spam posted on Twitter” about Sudan.
These breadcrumbs provide evidence that language models (or chatbots, powered by language models) are being used deceptively on social media platforms—even if attribution has not yet been made to specific operators. However, they do not provide a long-term solution for detection. As time moves on, propagandists will likely filter out these error messages, making breadcrumbs that are detectable today more difficult to detect tomorrow. In fact, campaigns with these error messages are likely of the lowest quality along the spectrum: Like the pro-Chinese Communist Party Spamouflage Dragon campaign that posts copy-and-pasted repetitive text, often with little traction, operators that simply attach language models to bots without human oversight may be easiest to find but not representative of the broader landscape.
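The phrase searches described above can be sketched in a few lines of Python. This is a minimal illustration, not any researcher's actual tooling: the pattern list contains only the three phrases quoted in this section, and the function name and the (account, text) input format are assumptions made for the example.

```python
import re

# Refusal phrases quoted in the text above. A real investigation would need a
# longer list that is updated as model providers change their wording.
GUARDRAIL_PATTERNS = [
    r"as a large language model, i cannot",
    r"i[’']m sorry, i cannot generate inappropriate or offensive content",
    r"violates openai[’']s content policy",
]
GUARDRAIL_RE = re.compile("|".join(GUARDRAIL_PATTERNS), re.IGNORECASE)


def normalize(text):
    """Collapse runs of whitespace so line breaks do not hide a phrase."""
    return " ".join(text.split())


def flag_guardrail_posts(posts):
    """Return the (account, text) pairs whose text contains a known refusal phrase.

    `posts` is a hypothetical list of (account, text) tuples.
    """
    return [(account, text) for account, text in posts
            if GUARDRAIL_RE.search(normalize(text))]
```

A filter like this only catches operators who paste model output without reading it, which is the limitation the section describes: once propagandists strip the refusal messages, the search returns nothing.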
Approach 2: Automated Detection of AI-Generated Text
Boilerplate warning messages make for crummy disinformation, and they are dead giveaways that a campaign is using language models, or at least that it is claiming to. But if the text is being provided by a well-meaning organization, then that provider can be much more subtle with the clues they add to their model’s outputs.
Language models work by figuring out which words are most likely to come next. They choose one, then repeat the process for the next word. Sometimes they choose the fifth most likely word, or the third, or the hundredth. With enough words, patterns start to develop that AI detectors such as GPTZero and OpenAI’s classifier can use to flag AI-generated content. These tools are certainly helpful—they have already been used in investigations of content farms populated with AI-generated text—but they are not entirely reliable. Indeed, students have been falsely accused of using AI systems to write essays, based on these detection models.
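The word-by-word sampling described above can be illustrated with a toy example. The probabilities below are invented for illustration and do not come from any real model; the point is only that sampling picks the top word most often but not always, which is what leaves statistical patterns for detectors to find.

```python
import random

# Hypothetical next-word probabilities after a prompt such as "The election was".
# Real models score tens of thousands of candidate tokens at every step.
NEXT_WORD_PROBS = {
    "held": 0.40,
    "close": 0.25,
    "decided": 0.15,
    "postponed": 0.12,
    "contested": 0.08,
}


def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


def sample_counts(probs, n, seed=0):
    """Sample n times and count how often each word is chosen."""
    rng = random.Random(seed)
    counts = {word: 0 for word in probs}
    for _ in range(n):
        counts[sample_next_word(probs, rng)] += 1
    return counts
```

Calling `sample_counts(NEXT_WORD_PROBS, 2000)` shows the most likely word chosen most often, with the less likely words still appearing a meaningful share of the time.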
Another approach is for AI providers to shrink the amount of text needed for detection while increasing the certainty of detections using a technique called watermarking. Since language models don’t always choose the top word, the provider can embed patterns in the model’s word choice that don’t make a substantial difference to the output’s meaning. But those changes, or watermarks, allow the AI provider, or anyone who knows what they’re looking for, to detect the model’s text. Providers could apply the watermark to every output or only for users who exhibit suspicious behavior. It is even possible to have different watermarks for specific suspicious users so that the provider could identify both that a language model was used and which account it came from.
But watermarks are not foolproof. Propagandists can remove the watermark in a variety of ways. For one, the more they edit the text, the less of the watermark remains, so more text is needed for the same level of certainty. Propagandists could also use a less powerful but freely available language model to simply paraphrase the watermarked output, thereby removing the watermark. And of course, propagandists can bypass entirely those AI providers that use watermarks by operating their own models.
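To make the trade-off concrete, here is a toy sketch of one style of watermark: a keyed "green list" that nudges word choice, plus a statistical test for detecting it. This scheme is assumed for illustration and is not a description of how any particular provider watermarks text. The vocabulary, the key, and the way candidate words are drawn are all stand-ins for a real model.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(50)]  # toy vocabulary standing in for real words
GREEN_FRACTION = 0.5                  # share of words treated as "green" at each step
SECRET_KEY = "demo-key"               # hypothetical key known only to the provider


def is_green(prev_word, word, key=SECRET_KEY):
    """Keyed hash decides whether `word` is on the green list after `prev_word`."""
    digest = hashlib.sha256(f"{key}|{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION


def generate(n_words, rng, watermark):
    """Toy generator: each step draws 8 candidate words, preferring green ones if watermarking."""
    words = ["<s>"]
    for _ in range(n_words):
        candidates = rng.sample(VOCAB, 8)  # stand-in for the model's plausible next words
        if watermark:
            green = [w for w in candidates if is_green(words[-1], w)]
            candidates = green or candidates  # fall back if no candidate is green
        words.append(rng.choice(candidates))
    return words[1:]


def edit_words(words, fraction, rng):
    """Simulate human editing by replacing roughly `fraction` of the words at random."""
    return [rng.choice(VOCAB) if rng.random() < fraction else w for w in words]


def z_score(words):
    """Standard deviations by which the green-word count exceeds chance."""
    n = len(words)
    if n == 0:
        return 0.0
    prev, hits = "<s>", 0
    for word in words:
        hits += is_green(prev, word)
        prev = word
    expected = GREEN_FRACTION * n
    return (hits - expected) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
```

In this sketch, watermarked output scores far above chance, unwatermarked output scores near zero, and edited output falls in between. That is the effect described above: the more the text is edited, the weaker the signal, so more text is needed for the same level of certainty.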
AI-generated text will have some patterns that can give clues as to whether it was written by a human or machine, and techniques such as watermarking can make those clues very convincing. But over time, more sophisticated malicious actors who are looking to mask their behavior may learn to disguise themselves against these techniques. Those who are caught easily may be less sophisticated propagandists and those who are not so concerned about being caught in the first place.
Approach 3: Propagandists, in Their Own Voices
If detection through on-platform text grows increasingly difficult, off-platform routes will also be needed. Efforts that go directly to the humans behind the propaganda campaigns can give insight into the tactics, techniques, and procedures of these groups—perhaps showing not only that language models are being used by specific operations but also how and why they are being used.
Investigative journalists and researchers have played a critical role in shaping public understanding of troll farms and propaganda units, including through infiltration and interviews. Over the past few years, journalists and whistle-blowers have infiltrated a number of propaganda units by applying for public job postings. Back in 2015, Lyudmila Savchuk went undercover with the Internet Research Agency to draw attention to the group. More recently, Fontanka reporter Ksenia Klochkova did the same with the Cyber Front Z troll farm. These efforts have provided information about propagandists’ instructions, payment, and targets.
Interviews with the actors involved also help in understanding operational behavior. During the 2016 U.S. elections, a wave of fake news stories and websites stemmed from Macedonia. Interviews with some of the operators behind these accounts and pages helped establish that the operators were young internet users seeking profit in a struggling economy, rather than ideologically motivated or government-sponsored malicious actors. Oxford Internet Institute researchers Phil Howard and Mona Elswah interviewed employees of Russia Today (an overt group) to better understand its ideological mission and operations too.
Interviews with propagandists can also help test a number of assumptions about language models that are floating in the public square, such as their cost-saving potential. Moreover, such conversations can provide a falsifying benefit: If specific propagandists are interviewed and report that they are not using these tools, that is likely stronger evidence for non-use than a failure on the part of disinformation researchers to detect AI-generated text on platforms directly. Journalists and interviewers will, of course, have to verify their sourcing; dark PR firms may have an incentive to say they used language models to seem cutting edge, play on the anxiety around the threats from AI, and make themselves seem more effective than they truly are.
Despite the promise of infiltration and interviews, there are drawbacks. Journalists who infiltrate propaganda units, especially in authoritarian countries, could face personal safety issues. Employees who sit for interviews could face backlash from employers. And efforts to infiltrate or interview may not be able to uncover propaganda efforts at scale, though robust efforts by local journalists, and perhaps increased funding, could make a notable difference. Researchers could also uncover the use of language models for influence operations through means that are not face-to-face, like leaks from channels or discussion groups, or evidence of payments from propaganda organizations to language model providers.
Approach 4: Monitoring by AI Providers
Even before the popularity of language models, social media companies developed trust and safety teams to set policies for harmful behaviors and to ensure compliance with those rules. AI providers have some incentive to avoid the negative publicity from misuse of their models and are already starting to do the same. OpenAI, for example, has worked with external experts to red team their models for potential misuses and hired internal trust and safety employees for ongoing monitoring and to make iterative improvements. These teams have observed ways that people are “jailbreaking” ChatGPT to get around safeguards put in place to prevent it from producing harmful text. Developers can then update their guardrails based on the findings.
Trust and safety teams at AI providers could surface influence operations through multiple means. One way they could do so is through proactive monitoring: AI providers could put into place systems that detect if a user is creating a high volume of politically sensitive messages in a short time span, for instance, and flag that account for further review if so.
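The proactive monitoring idea above can be sketched as a sliding-window counter. Everything specific here is an assumption made for illustration: the keyword list is a crude stand-in for whatever classifier a provider might use, and the window length and threshold are arbitrary.

```python
import re
from collections import defaultdict, deque

WINDOW_SECONDS = 600   # hypothetical look-back window
MAX_SENSITIVE = 20     # hypothetical number of sensitive requests tolerated per window
SENSITIVE_TERMS = {"election", "ballot", "candidate", "vote"}  # crude stand-in for a classifier


def is_politically_sensitive(prompt):
    """Toy check: does the prompt mention any term on the keyword list?"""
    words = set(re.findall(r"[a-z']+", prompt.lower()))
    return bool(words & SENSITIVE_TERMS)


class VolumeMonitor:
    """Flags accounts that send many politically sensitive requests in a short span."""

    def __init__(self, window=WINDOW_SECONDS, threshold=MAX_SENSITIVE):
        self.window = window
        self.threshold = threshold
        self.recent = defaultdict(deque)  # user_id -> timestamps of sensitive requests
        self.flagged = set()              # accounts queued for human review

    def record(self, user_id, timestamp, prompt):
        """Log one request; return True if the account is flagged for review."""
        if is_politically_sensitive(prompt):
            timestamps = self.recent[user_id]
            timestamps.append(timestamp)
            # Drop requests that have fallen out of the look-back window.
            while timestamps and timestamp - timestamps[0] > self.window:
                timestamps.popleft()
            if len(timestamps) > self.threshold:
                self.flagged.add(user_id)
        return user_id in self.flagged
```

A flag here means only that an account deserves further review, as the text suggests. High volume alone does not show deception, since campaigns, newsrooms, and researchers may all legitimately generate political text in bulk.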
Providers can also look to platform takedowns for investigative leads. Companies like Facebook and Twitter frequently remove networks of fake accounts used for influence operations on the platforms. Historically, Twitter has posted content from these accounts in a hashed format (that does not expose the handle names of accounts with smaller followings) on a researcher archive to allow others to investigate the narratives and behaviors of the accounts. Facebook has partnered with independent researchers to do the same. If platforms make data from discovered campaigns available to industry collaborators (in a privacy-preserving way), AI providers could search this content within the logs of their own users to see if their models generated the text. This could also lead to new sources of information for investigators, as AI providers will have other information about users who sign up for their platforms.
Whether AI providers will be able to learn of influence operations through user logs is contingent on what data they collect from user interactions. Do AI providers store user logs or privatized records, and, if so, for how long? If they find user logs matching content from known influence operations, would they share that information with the public? Additional transparency from organizations that provide general-purpose AI tools, or access to them, will be critical to understand the feasibility of this route and set public expectations. And yet, beyond the short term, monitoring by AI providers may surface only a small subset of campaigns, as propagandists will be inclined to migrate to open-source tools or AI providers that do not impose these practices.
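One way the log search described above might work is for the platform to share keyed fingerprints of takedown posts rather than the posts themselves, and for the AI provider to fingerprint its own outputs the same way. The sketch below assumes that arrangement; the shared key, the log format, and the function names are hypothetical, and no platform or provider is known to use this exact scheme.

```python
import hashlib
import hmac
import re


def fingerprint(text, shared_key):
    """Keyed hash of lightly normalized text (lowercased, whitespace collapsed)."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def find_matches(takedown_fingerprints, provider_logs, shared_key):
    """Return provider accounts whose logged outputs match any takedown fingerprint.

    `provider_logs` is a hypothetical iterable of (account, output_text) pairs.
    """
    wanted = set(takedown_fingerprints)
    return sorted({account for account, output in provider_logs
                   if fingerprint(output, shared_key) in wanted})
```

Exact-match fingerprints like these break as soon as an operator edits a single word, so a workable system would need some form of fuzzy matching. It also depends on the provider keeping output logs at all, a question the next paragraph takes up.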
Approach 5: Intelligence Collection and State-Backed Hacking
As propagandists move toward providers or tools with less scrupulous policies, intelligence agencies may play a role in uncovering AI-generated information operations. U.S. Cyber Command, the National Security Agency (NSA), the CIA, or the FBI could collect just this sort of evidence by hacking into the propagandists’ systems to either observe the AI tools generating malicious outputs or collect emails or chat logs describing their workflows. In fact, Paul Nakasone, commander of U.S. Cyber Command and director of the NSA, recently told Politico that he’s “watching very carefully” to see whether Russians begin integrating generative AI tools into their disinformation efforts. Importantly, intelligence agencies that collect this information could also learn that an influence operation was not using AI generation, which could be similarly important to know.
The United States has become more open about its cyber offensives and has become more willing to release classified information to combat influence operations, as was seen in the lead-up to Russia’s full-scale invasion of Ukraine, when U.S. officials claimed they had evidence that Russia might attempt a “false flag” operation to justify a subsequent invasion. At the same time, there are many limitations to these operations, especially for data in the U.S. and relating to U.S. persons or companies. Threats from inside the country can be discovered, but, rightly, there is additional oversight and there are relatively few approved domestic targets. Those same restrictions do not apply to influence operations conducted by foreign agents on foreign infrastructure. For domestic influence operations, intelligence collection will be much more limited.
U.S.-government-led means of detection are likely to be resilient even as propagandists adapt their tactics. There is no simple platform change or automatic filter to apply in order to avoid intelligence agencies. But the issue of scale remains—intelligence agencies cannot catch every campaign and are not likely to publicly disclose every campaign that they detect.
The Challenges Ahead
These five methods are neither exhaustive nor mutually exclusive. However, at least three cross-cutting challenges remain.
First, detection methods that rely on analyzing the text itself may lose effectiveness as propagandists adapt their behavior. In the disinformation space, propagandists have already adapted in a number of ways to avoid detection, from using AI-generated images to bypass reverse-image searches to migrating efforts from more public social media platforms to private groups and encrypted channels. Down the road, propagandists are likely to move away from AI providers that actively engage in detection efforts, whether by creating their own models, relying on open-source tools, or using tools that explicitly decline to engage in such efforts.
Second, detecting the use of language models is more likely to shine a light on individual campaigns and specific types of attacks than on the broader set of operations that use them. For example, the campaigns that are easy to catch may not be representative of all campaigns. Intelligence agencies may be most likely to declassify evidence that exposes U.S. adversaries during a crisis. Journalists may be more likely to find propaganda campaigns that do not rely on language models, since campaigns that do use these tools could hire fewer employees. The research community, which lacks a strong understanding of how many influence operations are actually taking place, will need to be careful not to make inferences that over-index on the limited (and unrepresentative) set of public cases.
Finally, while surfacing campaigns may be short-term successes, longer-term solutions are going to require more than whack-a-mole discovery. If AI-generated text—through influence operations or otherwise—is used to flood social media platforms and reduce broader societal trust, then trust and safety teams will have to think through platform design changes to prevent malicious actors from capitalizing on AI-generated economies of scale. If language models are used to overwhelm public comment systems, government agencies and elected officials will need to update their processes as well. We are likely in the early stages of policy and norm development around AI-generated text, and early and robust efforts can help mitigate some of the longer-term impacts to the broader information environment.